Sunday, July 6, 2014

Re: Encoding and Fileencoding of a latin1 file

See also http://vim.wikia.com/wiki/Working_with_Unicode

You seem to have read that article, which I wrote myself, so I'll try to
explain in more detail (I hope not in boring detail) the logic behind
it. Be sure to check the Vim help for anything that is still unclear.


'encoding' is a global option determining how Vim represents characters
in memory. The right place to set it is in your vimrc, BEFORE loading
any editfile. Once you have started opening a file, changing 'encoding'
makes the contents of ALL your current editfiles invalid, because it is
not possible to convert all the contents of all your loaded buffers from
one encoding to another as a result of your changing that option.
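
A minimal sketch of what I mean, assuming this goes near the top of your
vimrc (the has() guard is probably redundant on recent builds, which
always include multibyte support):

```vim
" Near the top of the vimrc, before any editfile is loaded.
if has('multi_byte')
  set encoding=utf-8
endif
```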


The :scriptencoding ex-command (not mentioned in that wiki page) tells
Vim to override 'encoding' for the purpose of reading the current
script. For instance if your vimrc is encoded in Windows-1252 you can use
scriptencoding Windows-1252
and any bytes between 0x80 and 0xFF in your script will be interpreted
as in Windows-1252 even after you set 'encoding' to UTF-8.
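
As a sketch (the abbreviation is only an illustration of mine, and I am
assuming the vimrc file itself is saved in Windows-1252), the relevant
part of such a vimrc could look like this. I put :scriptencoding after
the 'encoding' line on the assumption that this is the safest order,
since the conversion then targets the final value of 'encoding':

```vim
set encoding=utf-8
" Lines below this one are converted from Windows-1252 to 'encoding'.
scriptencoding Windows-1252
" Hypothetical example: in the vimrc on disk, the é is the single byte 0xE9.
iabbrev cafe café
```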


'fileencoding' (singular) is a local option. It says how the file in
question will be represented on disk. If 'encoding' is UTF-8
(recommended) and if your Vim can use iconv (i.e., has('iconv') returns 1,
i.e. you either have +iconv linked-in statically, or +iconv/dyn
compiled-in dynamically and the iconv or libiconv library found at
runtime), then any encoding can be translated to and from UTF-8, and Vim
can do just that when reading and writing. But note that if 'encoding'
is set to UTF-8, and you modify a file to put in it characters not
acceptable for that file's 'fileencoding', Vim will give you no error
signal as long as you don't save the file; so you can change the
'fileencoding' before or after you change the file contents: as long as
they agree when you write the file it's OK.
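
To check these things in your own Vim, something along these lines
should do (the output of course depends on your build and on the
current buffer):

```vim
" Prints 1 if iconv conversion is usable, 0 otherwise.
:echo has('iconv')
" Show the global in-memory encoding.
:set encoding?
" Show the on-disk encoding of the current buffer.
:setlocal fileencoding?
```

An empty 'fileencoding' means the file is written in the same encoding
as 'encoding'.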

If the file contains only bytes less than 0x80, it will be interpreted
identically in any of the following encodings (where those I'm writing
on one line are synonyms, equivalent for Vim with iconv), and in a
number of others:
- us-ascii
- latin1, iso-8859-1
- cp1252, Windows-1252
- latin9, iso-8859-15
- utf-8
so don't be afraid if Vim detects one of your Latin1 files (with no
accented characters, French guillemets, etc.) as being UTF-8. In fact,
with those contents, it could just as well be any of the encodings
mentioned above (or a number of others). If you want to be sure that a
given file remains Latin1 even if you add accented characters to it in
the future, be sure to add some non-ASCII characters in it now (e.g.,
for text, underline the main heading with a line of ÷÷÷÷÷÷÷÷÷÷÷ American
divided-by signs), then save it immediately with one of
:x ++enc=latin1
or
:setl fenc=latin1
:w
Similarly for Windows-1252 or iso-8859-15, but use a different non-ASCII
character, one that Latin1 lacks (the euro sign, for instance), since
the divided-by sign is the same byte in all three. On a side note,
sometimes I notice that I send an email with headers declaring it to be
8bit utf-8 and that it comes back to me as 7bit us-ascii; the body, in
that case, is byte-for-byte identical. (This one won't, because of the
divided-by signs above. Maybe it'll come back as quoted-printable utf-8,
or even as quoted-printable iso-8859-1.)

To convert a file from one encoding to another (e.g. Windows-1252 to
UTF-8, and assuming that both can be represented in your present
'encoding'), it is extremely easy with Vim (if has('iconv')
returns 1 of course), as follows:
:e ++enc=Windows-1252 filename
:setl fenc=utf-8
:w

You ask what it means to use ":setglobal fileencoding=utf-8". That tells
Vim what 'fileencoding' value to use when you create a new file which
didn't exist before. Or you could use ":setglobal
fileencoding=Windows-1252" which will create files by default in
Windows-1252 encoding, but of course in that case you will get a signal
at write-time (and not before) if you write in the file something that
has no representation in Windows-1252. See ":help local-options".
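
In a vimrc that could be, as a sketch (pick one value; utf-8 is only my
suggestion):

```vim
" Default 'fileencoding' for files which don't exist yet.
setglobal fileencoding=utf-8
```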


++enc=something (before the filename in a file-read or file-write
command such as :e or :saveas) tells Vim the 'fileencoding' to use for
this read or write. When reading, it also sets 'fileencoding' (locally)
for the file regardless of the 'fileencodings' heuristics. In spite of
its name, this ++enc modifier has NOTHING TO DO with 'encoding' but only
with 'fileencoding'.
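
A couple of examples of my own (the file names are made up):

```vim
" Read foo.txt as Latin1, bypassing the 'fileencodings' detection.
:e ++enc=latin1 foo.txt
" Write the buffer to a new file, encoded as UTF-8.
:w ++enc=utf-8 foo-utf8.txt
```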


'fileencodings' (plural) is a comma-separated list of values of
'fileencoding' (singular) to be tried when opening an editfile without
the ++enc modifier. They are tested from left to right in sequence:

- ucs-bom (if present) should be first. It will test the first few bytes
of the file against the possible representations of U+FEFF in the
various Unicode encodings. If found, and the rest of the file agrees
with that particular encoding, it will set 'bomb' to true and
'fileencoding' to the corresponding encoding. In that case the
heuristic ends there. Otherwise 'bomb' is set to false and the next
encoding is tried.

- Any multibyte encoding (for instance utf-8) tests the contents of the
file against the admissible character values for that encoding. If an
error is found, the test ends there (gives a "fail" result) and the next
encoding in sequence is tested. If the end of the file is reached with
no error (all bytes and byte sequences are acceptable for that
encoding), 'fileencoding' is set and the heuristic ends.

- An 8-bit encoding can never fail: it will set 'fileencoding' with no
test. IOW there should be at most one 8-bit encoding, and it should be
last. If there is more than one 8-bit encoding, Vim won't give an
error, it will just never try anything (not even a multibyte encoding,
if present) after the first 8-bit encoding.

- The value "default" is special: it means the value from your OS
locale, i.e. the value which 'encoding' had before sourcing any startup
script, even the system vimrc. It may be useful to put it last if you
don't already try an 8-bit encoding before that.
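
Putting that together, a reasonable value would be something like the
following sketch (as far as I know it is close to what Vim uses by
default when 'encoding' is a Unicode encoding):

```vim
" BOM check first, then the multibyte test, then one 8-bit fallback last.
set fileencodings=ucs-bom,utf-8,latin1
```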


Conclusion:
Vim has no built-in mechanism to tell Windows-1252, iso-8859-15 and
Latin1 apart. They are all 8-bit encodings, and sometimes one of the
former two is used for the latter. For each of your files, you will have
to know which is which and, if necessary, use the appropriate ++enc
modifier when reading it. This will set 'fileencoding' to what you tell
Vim, and the same encoding will be used when writing. Just make sure
that if you guess wrong, you notice it immediately, and read the file
again with another 'fileencoding' before you modify it.
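
For instance (file name made up), if some characters look wrong right
after opening, say ¤ where you expected a euro sign:

```vim
:e ++enc=latin1 letter.txt
" Wrong guess: re-read the still unmodified file with another encoding.
:e ++enc=iso-8859-15
" Confirm what will be used when writing.
:setlocal fileencoding?
```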



Best regards,
Tony.
