Wednesday, August 10, 2011

Re: How to display and remove BOM in utf-8 encoded file

On Aug 10, 6:19 am, Tony Mechelynck <antoine.mechely...@gmail.com>
wrote:
> On 10/08/11 02:18, pansz wrote:
>
> > On Tue, Aug 9, 2011 at 11:13 PM, Tony Mechelynck
> > <antoine.mechely...@gmail.com>  wrote:
>
> >> That message is outdated. The BOM is supported in all Unicode encodings
> >> including UTF-8 by all "reasonably recent" browsers. It is also part of the
> >> HTML standard.
>
> > BOM is a standard for UCS2 or UTF-16, not for UTF-8.
>
> According to the Unicode FAQ, http://www.unicode.org/faq//utf_bom.html#bom4 (two successive FAQ
> questions) a BOM can be used in UTF-8 as well as in UTF-16 or UTF-32;
> but since UTF-8 doesn't have endianness variants, with UTF-8 it
> specifies encoding only, not endianness. BTW, "good" editors (including
> at least Vim and WordPad, possibly others) handle the BOM correctly,
> even in UTF-8. In fact, in my experience WordPad won't read UTF-8 text
> correctly _unless_ there is a BOM.
>
> However (about your next paragraph), when UTF-8 is fed "transparently"
> to a program which expects ASCII, and in particular to any program which
> expects #! at the start of a file, the BOM should not be used (see the
> 2nd FAQ question linked above, and also http://www.unicode.org/faq//utf_bom.html#bom10 "How I should deal with
> BOMs?", point 3).
>
>
>
> > BOM for utf-8 will cause problem for most programs which expect text
> > streams. gcc is a good example, most GNU CLI utilities will reject
> > utf-8 with BOM.
>
> I explicitly mentioned in the part you snipped that for some other kinds
> of text than HTML or CSS (such as, I said, source files and shell
> scripts) it is better to save the file without a BOM.
>
>
>
> > And, W3C validator will of course complain about it...
>
> ...with a warning, not an error; and Tidy won't.
>

W3C specifically recommends you do NOT use a BOM for UTF-8 on HTML/
XHTML/CSS documents. See http://www.w3.org/International/questions/qa-byte-order-mark#bomhow

While developing TOhtml, I ran into problems in some browsers when
using UTF-8 with BOM. If I remember correctly, browsers which actually
handle XHTML correctly, like Opera and Firefox, were interpreting the
BOM as characters appearing before the XML prolog <?xml..., which
makes the XML not well-formed, and therefore (somewhat correctly)
the browser bailed without rendering anything. Re-parsing the document
as HTML may of course allow these browsers to render the document
correctly, but according to the W3C link above, some user agents will
still have problems and attempt to render the BOM as characters instead
of treating it as an invisible mark.

For this reason, syntax/2html.vim contains (after opening the buffer for
the generated file):

" According to http://www.w3.org/TR/html4/charset.html#doc-char-set, the byte
" order mark is highly recommended on the web when using multibyte encodings. But,
" it is not a good idea to include it on UTF-8 files. Otherwise, let Vim
" determine when it is actually inserted.
if s:settings.vim_encoding == 'utf-8'
  setlocal nobomb
else
  setlocal bomb
endif
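
For an existing file, the same 'bomb' option can be queried and cleared by hand, which covers the "display and remove" part of the subject line. This is only a minimal sketch: it assumes the file is already loaded in the current buffer, that Vim detected the BOM when reading it (the usual case when 'fileencodings' starts with "ucs-bom"), and that you want to overwrite the file in place.

```vim
" Display: prints "bomb" if the buffer will be written with a BOM,
" or "nobomb" if it will not.
setlocal bomb?

" Remove: clear the buffer-local option, then write the file so that
" it is saved without the leading BOM bytes.
setlocal nobomb
write
```

Typed interactively, each command needs a leading colon, e.g. `:setlocal bomb?`.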

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
