Friday, November 9, 2012

Re: Vim BOMing out

On 09/11/12 23:39, Jay Heyl wrote:
> I have some files that came from an outside organization containing Byte
> Order Marks. Looking at these files with a hex editor I can see the BOM
> is that for a UTF-8 file. I don't think I configured the 'fileencodings'
> for Vim, but checking the variables it is using
> fileencodings=ucs-bom,utf-8,default,latin1. With this, Vim fails to read
> these files properly. I've seen oddly varying behavior as I try
> different things, but it usually changes the BOM to indicate UTF-16
> (big-endian). This results in improper display of many characters.
>
> If I change my configuration so 'encoding' is utf-8, then the file is
> displayed correctly, though the BOM sometimes shows up as UTF-16 in hex
> (<FE FF>) and other times as UTF-8 as normal, though funny looking,
> characters.
>
> Since I don't need to send these files back out anywhere and the BOM is
> just unnecessary junk to me, I've used the hex editor to get rid of them
> and Vim behaves like normal. But I'm still curious what is going on with
> Vim and the BOMs. Can anyone explain why Vim is apparently thinking
> these files are or should be UTF-16 when the BOM clearly indicates
> they're UTF-8? Or perhaps just suggest some better settings so Vim will
> behave in a logical manner in regards to file encoding?

In order to handle Unicode files correctly, Vim needs 'encoding' set to
a Unicode value such as UTF-8 (if set to UTF-16 or UCS-4, of any
endianness, Vim will use UTF-8 internally because null bytes terminate
C strings) or to GB18030 (which is not recommended except maybe for
CJK).

With 'fileencodings' [plural] set to "ucs-bom,utf-8,default,latin1", any
file starting with the hex bytes EF BB BF will get 'fileencoding'
[singular] set to "utf-8" and 'bomb' set to TRUE unless it is not a
valid UTF-8 file (see below). The BOM will not be visible while you edit
but it will be written back as you save the file.
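
As a rough analogue of how such a list is tried in order, here is a
minimal Python sketch. It is not Vim's implementation: the helper name
guess_encoding is made up for illustration, the "default" entry is
skipped, and the encoding names are Python codec names rather than the
values Vim would show in 'fileencoding'.

```python
import codecs

# BOM table. UTF-32 is checked before UTF-16 because FF FE 00 00
# also starts with the UTF-16 little-endian BOM FF FE.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes):
    """Return (encoding, has_bom) for raw file bytes (hypothetical helper)."""
    # Step 1, like "ucs-bom": a leading BOM proposes the encoding,
    # but only if the rest of the file is valid in that encoding.
    for bom, name in BOMS:
        if data.startswith(bom):
            try:
                data[len(bom):].decode(name)
                return name, True
            except UnicodeDecodeError:
                break  # BOM present but content invalid: try the next step
    # Step 2, like "utf-8": accept only if the whole file decodes.
    try:
        data.decode("utf-8")
        return "utf-8", False
    except UnicodeDecodeError:
        pass
    # Step 3, like "latin1": every byte value is valid, so this never fails.
    return "latin-1", False
```

In this sketch a file that starts with EF BB BF but contains an invalid
byte later on falls all the way through to latin-1, which is when the
BOM would show up as three odd characters.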

See http://vim.wikia.com/wiki/Working_with_Unicode (and the Vim help
tags listed there) for more details.

If Vim displays <feff> it doesn't necessarily mean it thinks the file is
in UTF-16; it means that, at that point in the file, there is the
codepoint U+FEFF ZERO WIDTH NO-BREAK SPACE which is deprecated (in
favour of U+2060 WORD JOINER) except when used as a byte order mark (or,
more precisely, an encoding mark, since UTF-8 is invariant for byte
order) placed at the very start of the file. That codepoint takes up
three bytes (hex EF BB BF) in UTF-8, two (little-endian FF FE or
big-endian FE FF) in UTF-16, four in UTF-32 (little-endian FF FE 00 00,
big-endian 00 00 FE FF, or FE FF 00 00 or 00 00 FF FE in the rarer 3412
and 2143 orderings respectively), but its Unicode scalar value is always
0xFEFF.
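
The byte sequences above can be checked with a few lines of Python. This
is only an illustration: bom_bytes is a made-up helper name, and the
codec names are Python's. The rarer 3412 and 2143 orderings have no
Python codec, so they are left out.

```python
BOM = "\ufeff"  # U+FEFF: the same scalar value in every encoding


def bom_bytes(codec: str) -> str:
    """Return the bytes of U+FEFF in the given codec as spaced hex."""
    return BOM.encode(codec).hex(" ").upper()


for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(codec, bom_bytes(codec))
```

The codecs with an explicit endianness are used on purpose: they encode
just the one codepoint and do not prepend a BOM of their own.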

If Vim displays the BOM as three funny-looking characters it means it
has *not* recognized the file as UTF-8: for instance in Latin1 you would
see ï»¿ (i-diaeresis, closing French quote, and Spanish inverted question
mark). In order for the file to be recognized as UTF-8 it must not
contain any byte sequence which would be illegal for UTF-8. This means
in particular that (in hex):
- bytes FE and FF are forbidden
- any byte in the range C0 to FD is the "leading byte" (the first byte)
of a multi-byte sequence whose total length in bytes is exactly equal to
the number of consecutive high-order one bits in the leading byte, and
whose other ("trailing") bytes are in the range 80 to BF
- trailing bytes may not appear elsewhere
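
The three rules above can be written down as a small Python check. This
is a sketch of the rules exactly as worded here, not Vim's code; the
function names are made up for illustration. Note that strict UTF-8 as
defined in RFC 3629 is narrower (it also rejects overlong forms,
surrogates, and the leading bytes C0, C1 and F5 and above), so Python's
own decoder will refuse some inputs that this check lets through.

```python
def leading_ones(byte: int) -> int:
    """Count the consecutive high-order one bits in a byte."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def looks_like_utf8(data: bytes) -> bool:
    """Apply the three rules listed above (hypothetical helper)."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # plain ASCII byte
            i += 1
        elif b >= 0xFE:              # FE and FF are forbidden
            return False
        elif b >= 0xC0:              # leading byte C0..FD
            length = leading_ones(b)
            trail = data[i + 1:i + length]
            if len(trail) != length - 1:
                return False         # sequence cut short by end of file
            if any(not 0x80 <= t <= 0xBF for t in trail):
                return False         # trailing byte out of range
            i += length
        else:                        # 80..BF outside any sequence
            return False
    return True
```

For example, the Latin1 bytes E9 74 E9 (the French word for "summer")
fail the check because E9 announces a three-byte sequence but is
followed by an ASCII "t", while the UTF-8 BOM EF BB BF followed by
ASCII text passes.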

In UTF-8, the BOM can be useful, harmful or indifferent depending on
where it is used:

- in a file beginning with #! it is harmful because it hides the magic
shebang and the name of the program which should handle the script
- similarly, it is harmful in any file to be used as input by a program
which doesn't know about BOMs
- it is useful in a file to be used as input by a program which knows
about it but would use a different setting if it weren't there: for
instance on Windows, if you want to use WordPad to edit a UTF-8 file (as
opposed to UTF-16le or Windows-1252) it had better start with a UTF-8 BOM
- in some filetypes it serves as a confirmation that the file is in
UTF-8. For instance HTML documents must have an encoding declaration, as
one or more of a BOM, an HTTP Content-Type header, and a
charset-declaring <meta> element
- for other filetypes, the BOM may be indifferent if the file would be
correctly interpreted by any program using it even if it didn't have a
BOM.
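
For the cases where the BOM is harmful, removing it does not need a hex
editor. Below is a minimal Python sketch that drops a leading UTF-8 BOM
from raw bytes; strip_utf8_bom is a made-up helper name, and only the
UTF-8 BOM is handled.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data  # no BOM: leave the bytes untouched
```

Inside Vim the same result comes from ":set nobomb" followed by ":w",
once the file has been read with the right 'fileencoding'. In Python,
decoding with the "utf-8-sig" codec also skips a leading BOM.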


Best regards,
Tony.
--
[Nuclear war] ... may not be desirable.
-- Edwin Meese III

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
