Rebbe Malachi: Re: enc,fenc (again!?)

Thursday, November 4, 2010

On 05/11/10 03:59, Alessandro Antonello wrote:
>> Since latin1 is an 8-bit encoding, it cannot give a "fail" signal:
>> fencs=ucs-bom,latin1,default,utf-8 means the same as fencs=ucs-bom,latin1
>> i.e. whenever there is no BOM, the file will be detected as Latin1 because
>> none of the 256 possible byte values, in any sequence, is invalid for Latin1
>> -- and if it is actually UTF-8 without BOM, anything above U+007F will appear
>> as two or more characters of gibberish.
>>
>> In the 'fileencodings' option, "ucs-bom", if present, should be first, and
>> an 8-bit encoding, if present, should be last (which means that at most one
>> 8-bit encoding should be used), because anything that comes after the first
>> 8-bit encoding will never be used.
>>
>> Setting fencs=ucs-bom,utf-8,latin1 means the following:
>>
>> 1) Is there a BOM at the very start of the file? Then setlocal bomb, "eat"
>> the BOM, and setlocal the corresponding Unicode 'fileencoding', otherwise
>> setlocal nobomb and:
>>
>> 2) Are the full contents of the file valid for UTF-8? (and note: 7-bit ASCII
>> is valid for both UTF-8 and Latin1 and is displayed the same in both) -- if
>> yes, setlocal fenc=utf-8; otherwise
>>
>> 3) Unconditionally setlocal fenc=latin1
>
> Hi!
>
> I see what you mean now.
>
> 1) Yes, I have a BOM at the start of every utf-8 file. You are saying that I
> should set 'fencs=ucs-bom,utf-8,latin1'. What would happen if I open a utf-8
> file from the command line using just 'gvim filename.ext'? Assuming that the
> file has a BOM, would Vim recognize the BOM and set 'fenc=utf-8'? What if I
> use 'gvim filename.ext' for a file in latin1 encoding and no BOM? Would Vim
> recognize that it has no BOM and set 'fenc=latin1'? Assuming that I have
> 'enc=latin1' defined.

If a file has a BOM, it is recognised by the first heuristic (ucs-bom),
and nothing else comes into play.

A Latin1 file never has a BOM, so the "ucs-bom" heuristic will fail and
the next heuristics will be tried in turn. If that file contains
bytes above 0x7F, it will almost always be found "invalid for UTF-8" by
the second (utf-8) heuristic, which will also fail (the rare exception
is a file whose high bytes happen to form valid UTF-8 sequences). The
latin1 heuristic cannot fail and the file will get the equivalent of
":setlocal nobomb fenc=latin1". The 'bomb' option is immaterial when
fenc=latin1, but it is set to false by the failing ucs-bom heuristic,
which cannot know whether or not a Unicode encoding will later be
detected for this file.
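
As a rough illustration of that fallback chain, here is a small Python
sketch. It is not Vim's actual code: the function name detect() and the
use of Python codec names (rather than Vim's own encoding names) are
assumptions made only for this example.

```python
import codecs

# BOM signatures, UTF-32 first so that its little-endian BOM
# (FF FE 00 00) is not mistaken for the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect(data: bytes):
    """Mimic fencs=ucs-bom,utf-8,latin1; return (encoding, has_bom)."""
    # 1) ucs-bom: succeeds only if the data starts with a known BOM.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, True
    # 2) utf-8: succeeds only if the whole content is valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8", False
    except UnicodeDecodeError:
        pass
    # 3) latin1: every byte sequence is valid, so this cannot fail.
    return "latin-1", False
```

With this sketch, detect("ção".encode("latin-1")) falls through to the
last step, while the same word encoded as UTF-8 stops at the second one.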

A file entirely in 7-bit ASCII is valid for both UTF-8 and Latin1, so
whichever of these is tried first will give a "success" signal, and that
encoding will be used as the file's 'fileencoding'. As long as the file
contains only 7-bit ASCII data, it makes no difference whether it is
read and written as UTF-8 (without BOM) or Latin1, since 7-bit ASCII is
represented identically in both.
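
A quick way to convince yourself of this outside Vim is the following
Python sketch (the variable names are only illustrative): the same
ASCII-only text gives identical bytes in both encodings, while a word
with accented letters does not.

```python
# ASCII-only text: identical bytes in UTF-8 and Latin1.
ascii_text = "set fileencodings=ucs-bom,utf-8,latin1"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("latin-1")

# As soon as a character above U+007F appears, the bytes differ.
accented = "ação"
differs = accented.encode("utf-8") != accented.encode("latin-1")

print(same_bytes, differs)
```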

If 'encoding' is set to latin1, however, you have a bigger problem: in
that case most Unicode codepoints (in fact, any codepoint above U+00FF)
cannot be represented in Vim's _internal memory_, and such
"unrepresentable" codepoints will be garbled, probably replaced by
inverted question marks or something like that.

See the code snippet I wrote in some earlier post in this thread, or the
page http://vim.wikia.com/Working_with_Unicode , about how to make sure
at startup (in your vimrc) that Vim can edit anything, and how to do it
cleanly (because if you change 'encoding' after Vim has started, once
some edit file(s) have been loaded in memory, there's a good chance the
data in memory for such file(s) will get hopelessly corrupt).

"Anything" includes, if necessary, a page like my own homepage,
http://users.skynet.be/antoine.mechelynck/index.htm , which contains
text not only in several "Western" languages, but also in Esperanto
(which is Latin script but not Latin1-compatible; however, the
"incompatible" accented letters don't appear in that page) and in
Russian, Arabic, Chinese and Japanese. It also includes a page like
http://users.skynet.be/antoine.mechelynck/other/imbecile.htm , which
contains a single sentence in many languages including Portuguese (I
don't know if from Portugal, Angola, Mozambique, Macau, Brazil, or which
combination of them).
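
The U+00FF limit can be illustrated with a Python analogy. This only
mimics the effect and is not how Vim converts text internally; the
helper name squeeze_into_latin1() is hypothetical, and Python
substitutes a plain "?" where Vim may show something else.

```python
def squeeze_into_latin1(text: str) -> str:
    """Force text through Latin1, replacing what does not fit by '?'."""
    return text.encode("latin-1", errors="replace").decode("latin-1")

# Portuguese accents are all below U+0100, so they survive.
portuguese = squeeze_into_latin1("ação")

# Esperanto "ĉ" (U+0109) and Cyrillic "Я" (U+042F) do not fit.
mixed = squeeze_into_latin1("ĉu Я")

print(portuguese, mixed)
```

The original characters cannot be recovered from the "?" placeholders,
which is why the loss matters once the text is held that way in memory.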

>
> 2) No, not all files are valid for both UTF-8 and Latin1. Some files are in
> Portuguese (Brazilian) with accents, cedillas, etc.

Latin1 files with accents, cedillas, etc. will not be accepted by Vim as
UTF-8 files, see my second paragraph from top, above, and the discussion
about the difference between 7-bit ASCII (which is valid in both) and
Latin1 with "higher-ASCII" characters in the hex range 80-FF (which isn't).
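
To make the mismatch concrete, here is a short Python sketch (the
helper is_valid_utf8() is a name invented for this example). It shows
that the Latin1 bytes of a Portuguese word are rejected as UTF-8, and
what the reverse mistake looks like when UTF-8 bytes are read as Latin1.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

word = "ação"
latin1_bytes = word.encode("latin-1")  # b'a\xe7\xe3o'
utf8_bytes = word.encode("utf-8")      # b'a\xc3\xa7\xc3\xa3o'

# Latin1 bytes in the 80-FF range do not form valid UTF-8 here.
latin1_ok_as_utf8 = is_valid_utf8(latin1_bytes)

# UTF-8 read as Latin1 never fails, but each accented letter
# turns into two characters of gibberish.
garbled = utf8_bytes.decode("latin-1")

print(latin1_ok_as_utf8, garbled)
```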

>
> Right now I don't trust Vim's automatic behavior. Almost all source
> files that I have are in latin1 encoding. This is why I put 'set enc=latin1'
> in my *.vimrc*, even on the Mac that is UTF-8 by default. Just a few XML files
> are in 'utf-8'. For these I always use '++enc=utf-8' when opening or creating them.

That won't work. ++enc=utf-8 changes 'fileencoding', not 'encoding' (and
rightly so, because of the risk of corrupting other already loaded
files' data by changing 'encoding'); if 'encoding' is set to Latin1, Vim
won't be able to represent in memory any codepoint above U+00FF. For
instance, you won't be able to load correctly into Vim either of the two
webpages I mentioned above if 'encoding' is set to Latin1. Not only will
the non-Latin1 letters not be visible, they will also be garbled in
Vim's memory.

>
> Don't get me wrong. I love Vim. I don't trust its behavior because I had
> problems in the past with utf-16le encoded files. Maybe I just didn't have the
> right configuration at that time. Since then I have used the configuration I showed.
> But I am open to your advice, and I'll try the way you said.
>
> Thanks again.
> Alessandro Antonello
>

If 'encoding' is set to UTF-8, Vim will correctly load and edit:
- any UTF-16 (be or le) file with BOM, if "ucs-bom" comes first in
'fileencodings';
- any UTF-16 (be or le) file without BOM, if read with the proper ++enc=
modifier. If the file is read as Latin1 by the "automatic" process,
about every second character will be shown as ^@ (a null); then reread
it with e.g. ":e ++enc=utf-16le" and all will be OK.
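
The ^@ effect in the second case can be reproduced with a few lines of
Python. This is only a sketch of the byte-level behaviour, not of Vim
itself, and the sample text "Vim" is arbitrary.

```python
# UTF-16LE without BOM: each ASCII letter is followed by a zero byte.
raw = "Vim".encode("utf-16-le")    # b'V\x00i\x00m\x00'

# Read as Latin1: nothing fails, but every second character is a NUL,
# which is what Vim displays as ^@.
as_latin1 = raw.decode("latin-1")
nul_count = as_latin1.count("\x00")

# Reread with the right encoding (the analogue of ":e ++enc=utf-16le").
reread = raw.decode("utf-16-le")

print(repr(as_latin1), nul_count, reread)
```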

Best regards,
Tony.
--
Acquaintance, n.:
A person whom we know well enough to borrow from, but not well
enough to lend to.
-- Ambrose Bierce, "The Devil's Dictionary"
