Saturday, July 5, 2014

Re: Encoding and Fileencoding of a latin1 file

On Saturday, July 5, 2014 3:06:50 AM UTC-5, rameo wrote:
> Ben,
>
> Try to write these french words in a file with a latin1 fileencoding:
> bœuf, cœur, manœuvre, œil
> (beef, heart, manoeuvre, eye)
>

When I try writing this in latin1, I get:


"test.txt"
"test.txt" CONVERSION ERROR in line 1; 1L, 30C written
"test.txt" CONVERSION ERROR in line 1; 1L, 30C written

It looks like this is not latin1 at all.

Indeed, looking it up at http://en.wikipedia.org/wiki/ISO/IEC_8859-1 shows that
the 'œ' character is not in Latin1 at all. It is in the Windows-1252 encoding, a
superset of Latin1 ( http://en.wikipedia.org/wiki/Windows-1252 ).

> Close this file.
> Set encoding to utf-8 in your vimrc.
> Open the file.
>
> Encoding is utf-8
> Fileencoding is latin1 (:set fileencoding?), converted is written after the file name.
> But all words have squares.
> (The same file is visualized well in notepad+++, recognized as latin1)

Here fileencoding is set to latin1 because your "fileencodings" option has
latin1 as the final fallback if UTF-8 detection fails (which it will for this
file). The file itself is not in latin1, so reading it as latin1 failed on byte
140 (0x8C), which is not a printable character in latin1 but is defined as 'œ'
in Windows-1252. Thus the blank squares.

>
> Btw you asked me how I check encoding and fileencoding of a file?
> I have this in my statusline:
> set statusline+=%2*\ E:%{&fileencoding?&fileencoding:&encoding}
> set statusline+=%2*\ F:%{&fileencoding?&fileencoding:&fileencoding}
>

That's good, it will tell you what encoding the file will be saved in, and also
what encoding it got read in. I said earlier 'encoding' doesn't affect the
writing of a file, but that's a simplification. It is used in place of
'fileencoding' if that option is not set, as you apparently have learned.
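
One caveat, as far as I know: a string used directly as a condition in Vim
script is converted to a number, so a name like "latin1" may count as false.
A minimal sketch of a variant that tests for an empty 'fileencoding'
explicitly. The E:/F: layout here is my assumption about what you want to see:

```vim
" E: the internal 'encoding' Vim is using.
set statusline+=%2*\ E:%{&encoding}
" F: the buffer's 'fileencoding', or 'encoding' when 'fileencoding' is empty.
set statusline+=%2*\ F:%{empty(&fileencoding)?&encoding:&fileencoding}
```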

> Years ago I had also problems with utf-8 and switched back to latin1 encoding.
> These days I switched again to utf-8 and after a while it messed up again my files (p.e. my vimrc file).
>
> A question:
> Why should there be an encoding and fileencoding? Why not put them together?

Because they are two different concepts. 'encoding' is a Vim internal thing.
Really I have no idea why Vim has this option at all. Every other program out
there gives you no control (and you need no control) over the encoding used
internally to represent data. 'fenc' is really the only thing you should be
using for manipulating files.

> If a file is a latin1 file: encoding and fileencoding has to be in latin1.

Wrong. You can use latin1 (or Windows-1252) files just fine regardless of your
encoding option, as long as the characters within the file can all be
represented in your chosen encoding. UTF-8 should be pretty much universally
usable.
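
As a minimal sketch of that point, assuming 'encoding' is utf-8 and the current
buffer was read with fileencoding=latin1, only 'fileencoding' needs to change
to convert the file on disk:

```vim
" Show how the buffer was read, which is also how it would be written.
:setlocal fileencoding?
" Change only the on-disk encoding; the text in the buffer stays the same.
:setlocal fileencoding=utf-8
" Vim converts from 'encoding' to 'fileencoding' while writing.
:write
```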

> If a file is an utf-8 file: encoding and fileencoding has to be in utf-8.

Probably correct, I wouldn't want to mess with weird encodings that still
support all the characters in utf-8. Vim uses utf-8 internally for ANY unicode
encoding.

> Without "Conversion" written after a file name.

That "conversion" message just tells you the fileencoding differs from the
internal encoding, so Vim had to convert the bytes. It is never a problem.

> And in the Config file a user can then indicate whether a new file should be
> in utf-8 or any other encoding, something like this:
> let NewFileEncoding = "utf-8"

That would be "setglobal fileencoding=utf-8"

So, here's the real question:

Why does Vim pretend that it read a Windows-1252 file with a Latin1 fileencoding
when 'encoding' is set to "Latin1"?

Windows-1252 is commonly mistaken for Latin1. Windows systems use it by default
in place of Latin1 actually. Vim is set so that a default Windows installation
will "just work". Thus when Vim reads files that Windows pretends are Latin1,
Vim also must pretend they are Latin1 when using default settings.

When Vim must actually do encoding conversions however, it does NOT treat
Windows-1252 as Latin1.

I'm not sure if it was an oversight, or if it is just assumed that users know
what they are doing when they set their 'encoding' to a non-default value, but
when 'encoding' is UTF-8, Vim actually pays strict attention to the file
encoding. Probably it is because Vim must actually convert the file content to
its internal encoding and writing dozens of exceptions and special cases would
be prohibitive. Regardless, in your case, I would change your 'fileencodings'
option to include the Windows-1252 encoding rather than Latin1. Or, you could
manually override the encoding selection for that file.
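
The manual override could look roughly like this, run from the buffer that
shows the squares. The name cp1252 is an assumption here and may need to be
spelled differently on your system:

```vim
" Re-read the current file, forcing Windows-1252 instead of auto-detection.
:edit ++enc=cp1252
" Confirm which encoding the buffer now reports.
:setlocal fileencoding?
```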

Using Windows-1252 depends on your system. For Windows, the proper value for
your 'fileencoding' and 'fileencodings' options would be simply "cp1252". On
Linux systems, it changes to "8bit-cp1252".
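
Put together as a vimrc sketch, using the names above. The ucs-bom and utf-8
entries ahead of the fallback are my assumption about a sensible order, not
something taken from your configuration:

```vim
" Assumes 'encoding' has already been set to utf-8 earlier in the vimrc.
if has('win32')
  " Windows: fall back to the Windows-1252 code page by its plain name.
  set fileencodings=ucs-bom,utf-8,cp1252
else
  " Elsewhere: use the Vim-specific 8bit- prefix for the same code page.
  set fileencodings=ucs-bom,utf-8,8bit-cp1252
endif
```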
