Monday, October 4, 2010

Re: How to set utf-8 locally (for a buffer) on loading the file

On Mon, 4 Oct 2010, esquifit wrote:

> On 4 Oct, 15:42, Ben Fritz wrote:
>>
>> You can also set fileencoding manually after a file read, so that you
>> can convert it to a different encoding when writing the file. You
>> will probably want this new encoding in your fileencodings option so
>> it can be detected,
>
> If I set fileencoding manually, I see no changes on the screen. What
> exactly does this option control?

It controls the encoding used when writing the file.


> According to the help: "Sets the character encoding for the file of
> this buffer." But honestly, I don't get it. What does the statement
> mean? I have a number of fundamental questions about the subject of
> (vim and) encoding:
>
> 1) As far as I know, there is no information stored with a text file
> about the encoding in which its bytes make sense as text. An
> editor makes a guess when opening and displaying the file, based on the
> first N bytes, on certain patterns, etc., but in the end it is always a
> guess, and sometimes the editor gets it wrong. Is this right?

If you tell the editor to only ever consider certain encodings, you can
improve its "guess". Also, various Unicode formats support a Byte-Order
Mark (BOM). This is common with UTF-16, and discouraged with UTF-8[1].
The BOM prevents the need for guessing, but so does explicitly
specifying what character sets you want to use.

[1] http://www.unicode.org/faq/utf_bom.html#bom4
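
To make the BOM point concrete, here is a minimal sketch in Python (not
Vim script, and not how Vim itself is implemented). The helper name
sniff_bom is made up for illustration; the BOM constants come from
Python's standard codecs module.

```python
import codecs

# Known BOMs. The UTF-32 entries come first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None  # no BOM: the reader has to guess or be told
```

When this returns None, nothing in the file itself names the encoding,
which is exactly the situation where restricting the candidate
encodings helps.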


> 2) When a file is loaded from disk into vim, what exactly happens
> with the bytes? Is there any option in vim that influences this
> process? My guess is that the editor interprets the original sequence
> of bytes (as on disk) according to the rules of some character
> encoding; for vim, this would be the value of the 'encoding' option.
> Is this correct?

Vim uses the 'fileencodings' option ('fencs' for short) to choose, unless
it is explicitly given a ++enc= argument. See :help 'fileencodings' for
the sequence of what Vim tries, and :help ++opt for how to use ++enc=.
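
As a rough analogy only (Python, with a hypothetical helper name; Vim's
real rules are in :help 'fileencodings'), trying a list of candidate
encodings in order looks something like this. The first candidate that
decodes the bytes without error wins.

```python
def decode_with_candidates(data: bytes, candidates=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # these bytes are not valid in enc, try the next one
    raise ValueError("no candidate encoding fits these bytes")
```

An 8-bit encoding such as latin-1 accepts every byte sequence, so it
only makes sense as the last candidate; anything listed after it would
never be tried.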


> 3) Based on these rules, the editor knows when to take one or two or
> more bytes to build a single *character*, and if more than one, in
> which order. From that, the editor has decided which *characters*
> (not bytes) the text contains. So for example, the sequence 1A 2B F3
> E5 66 could be interpreted as
> (1A 2B) (F3) (E5 66) according to encoding 1
> (2B 1A) (E5 F3 66) according to encoding 2
> where each () group represents a 'character' in the respective
> encoding. Thus, according to encoding 1 one would have for example:
> "small a", "capital z" and "digit 8", whereas according to encoding 2
> one would have "question mark" and "small u umlaut". Is this
> description correct?

Yes, that's roughly it.
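
A small illustration with real encodings, sketched in Python (the byte
values are my own example, not the ones from your message): the same
bytes give different characters depending on the encoding used to read
them, and byte order matters too.

```python
raw = bytes([0xC3, 0xA9])

# Read as UTF-8, the two bytes form a single character.
as_utf8 = raw.decode("utf-8")

# Read as Latin-1, every byte is its own character, so there are two.
as_latin1 = raw.decode("latin-1")

# Byte order: the same two bytes are different characters in
# little-endian and big-endian UTF-16.
pair = bytes([0x41, 0x00])
little_endian = pair.decode("utf-16-le")
big_endian = pair.decode("utf-16-be")
```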


> 4) What decides how the bytes are displayed on the screen? My
> understanding is that the font now comes into play; to each
> *character*, a glyph is provided by the font, and this is what is
> displayed on the screen. Is this description correct?

Oversimplified, but yes. In some encodings (e.g. Unicode), there are
also "combining characters"[2]. Text in right-to-left scripts needs
bidirectional layout. For scripts that have letters whose
shapes differ depending on their context, there is also "shaping" (e.g.
Arabic[3], or Urdu[4]). Other characters might have different glyphs
depending on the locale (e.g. Simplified or Traditional Chinese
characters[5]).

This is handled either by Gvim itself or, for Vim in a terminal, by the
terminal emulator. Most of these things aren't well supported,
particularly in terminal Vim, because Vim frequently relies on the
assumption that characters can be arranged in a grid (there are many
discussions on this list, if you're interested).

[2] http://en.wikipedia.org/wiki/Combining_character
[3] Arabic: http://www.w3.org/International/tests/tests-html-css/tests-webfonts/generate?test=5
[4] Urdu: http://www.w3.org/International/tests/tests-html-css/tests-webfonts/generate?test=6
[5] http://en.wikipedia.org/wiki/Help:Multilingual_support_(East_Asian)

[*] (generally interesting tests) http://www.w3.org/International/tests/tests-html-css/list-fonts


> If yes, how can I in vim change the way vim interprets the sequence of
> bytes according to a different encoding? Is it necessary to reload the
> file?

Yes, it's necessary to reload. Once fully loaded in a buffer, the
characters are characters (not bytes).


> If I use 'set fileencoding=blah', no change is visible, whereas
> when I use ':e ++enc=blah', the displayed glyphs do change. This is
> probably due to the fact that ':e ++enc' effectively reloads the
> sequence from disk (or rereads the original sequence of bytes from
> memory), and in doing so it resolves the bytes into characters
> according to the newly specified character encoding. On the other
> hand, 'set fileencoding=blah' does not seem to reload/reread
> anything. What is the effect of this option?

(as above: the encoding that will be written to disk)


> I have a couple of ideas, but I would first like to know the answer to the
> following question.
>
> 5) What happens when I type something on the keyboard? This is a
> similar situation to reading from the disk; in the end, it is about a
> sequence of bytes being inserted at some place in the file; there is
> also the need to interpret them as characters and look for glyphs on
> some font to represent them (in case the file is being displayed or
> printed). Also in this case I would expect some option in vim to
> control how the bytes sent by my keyboard are to be interpreted. Which
> are these options?

In Gvim, input is handled by the underlying GUI library. In a
terminal emulator, see:
:help mbyte-terminal

> Is it the current value of 'encoding'? Or of 'fileencoding'? or of
> 'termencoding'? And when? Only on terminals, or also in the GUI? And does
> it make a difference whether I am on Win32 or on *nix? Or if I use GTK
> or not? Or if I use Cygwin or not?

To oversimplify: the only option that differs significantly between
OSes (Win32 vs. *nix vs. Cygwin) is 'ff'/'ffs' (the end-of-line format),
which isn't really even discussed in your email.

The defaults for the rest ('tenc','fenc','fencs','enc') generally depend
on the locale (external to Vim -- can affect 'enc', and by association
'fencs') or on being in Gvim vs. Vim (the GTK+ Gvim defaults 'tenc' to
utf-8).

Each option's help text describes its defaults.


> 6) What happens when the file is written to disk (:w)? My guess is:
> after reading the bytes, resolving them into characters, finding
> a glyph for each character and displaying it on the screen, the
> editor works exclusively 'on characters, not on bytes'. According to
> this, when writing back to disk, the editor would then
> reverse-engineer the characters into bytes according to the rules of
> some encoding option. What would be this option, 'encoding',
> 'fileencoding', something derived from 'fileencodings', what?

'fileencoding' if set, otherwise 'encoding'.
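
Sketched in Python, under the assumption that the write step is nothing
more than "encode the characters with the chosen encoding" (the function
and its parameters are hypothetical; they only borrow the names of the
Vim options):

```python
def bytes_to_write(text: str, fileencoding: str = "", encoding: str = "utf-8") -> bytes:
    """Encode the buffer's characters for writing to disk.

    An empty fileencoding means "not set", so encoding is used instead.
    """
    target = fileencoding or encoding
    return text.encode(target)
```

The characters in the buffer are the same in every call; only the bytes
that end up on disk differ. That is why setting 'fileencoding' changes
nothing on the screen.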


> As you see, too many basic questions that cannot be answered with
> 'fileencoding: Sets the character encoding for the file of this
> buffer'.

But that's just the summary: If it's set, fileencoding does exactly
that: it sets the character encoding for the file (on disk) of this
buffer.

The text after that (in :help 'fileencoding') explains what happens if
you don't choose something explicitly. Vim tries to pick a reasonable
default. If you use 'encoding=utf-8', that default is usually what you
want. If you don't, Vim has to fall back on more heuristic approaches
(it "guesses").

--
Best,
Ben

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
