Sunday, November 3, 2013

Re: Save default font on Gvim on Windows 7.

On 03/11/13 14:56, John Joche wrote:
> OK. Thank you for your help...
>
> I can put the command
>
> set guifont=Lucida_Console:h12:cDEFAULT
>
> inside C:\Users\JSonderson\_gvimrc and this font family, font size,
> and character set are loaded each time I start gvim.
>
> However a question still remains, that is, how come UTF-8 is not on the
> list of character sets?

tl;dr: see the last paragraph before your next question

UTF-8 is one of the ways to represent Unicode in memory. Unicode is
the Universal Character Set, a superset of all character sets known to
computer software.

The following encodings can represent all Unicode codepoints ("characters"):
- UTF-8, with between 1 and 4 bytes per character (originally up to 6
bytes had been foreseen, but it was later decided that codepoints above
U+10FFFF would never be assigned). UTF-8 has the property that the 128
US-ASCII characters are represented by one byte, in exactly the same
way as in US-ASCII, Latin1, and most other ASCII-derived encodings.
(EBCDIC is of course a world apart.) See the example just after this
list to watch those bytes from within Vim.
- UTF-16, with one or two 2-byte words per character;
- UTF-32 (aka UCS-4), with one 4-byte doubleword per character;
- GB18030, with 1, 2 or 4 bytes per character, but biased in favour of
Chinese (it is the current official standard encoding of the PRC).
Conversion between GB18030 and the other three is possible but not
trivial, and requires bulky tables. The iconv utility can usually do
it, and so can Vim if built with +iconv, or with +iconv/dyn when it
can find the iconv or libiconv library.
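
You can watch these byte sequences from inside Vim itself: with
'encoding' set to utf-8, put the cursor on a character and use the
Normal-mode commands g8 and ga. The output in the comments below is
approximately what current versions print; treat it as a sketch:

    :set encoding=utf-8
    " with the cursor on é in Normal mode:
    g8    " UTF-8 bytes of the character:  c3 a9
    ga    " its codepoint value:  <é> 233,  Hex 00e9,  Octal 351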

UTF-16 and UTF-32 can be big-endian (the default) or little-endian
(e.g. UTF-16le). UTF-32 even supports the rarely used 3412 and 2143
byte orderings, but I'm not sure Vim knows about them.
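
If Vim guesses the encoding or endianness of an existing file wrong,
you can override the guess for a single read with the ++enc modifier
(the filename here is just an example):

    :e ++enc=utf-16le somefile.txt
    :e ++enc=utf-16be somefile.txt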

Internally, Vim represents UTF-16 and UTF-32 as UTF-8 in memory,
because the NUL codepoint is a null word in UTF-16 and a null
doubleword in UTF-32, and the many other null bytes in such files
would play havoc with Vim's use of null-terminated C strings. In
UTF-8, on the other hand, nothing other than the NUL codepoint U+0000
itself may validly include a null byte in its representation.
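
A two-character example makes the problem concrete; these are the
bytes on disk for the string "Hi" in three encodings:

    UTF-8:     48 69
    UTF-16le:  48 00 69 00                <- every second byte is null
    UTF-32le:  48 00 00 00  69 00 00 00   <- three nulls per character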

With some filetypes, it is possible to tell user applications which
Unicode encoding and endianness to use by placing the codepoint U+FEFF
at the very start of the file. That codepoint is usually called the
BOM (byte-order mark), but it can even identify UTF-8, which has no
endianness variants. It is supported for at least HTML and CSS; it is
not recognized (and should not be present) in executable scripts in
UTF-8, especially those whose first line starts with #!. I've been
caught by that in the past, and now I know better.
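
In Vim the BOM is controlled per buffer by the 'bomb' option, and it
is honoured at read time when "ucs-bom" comes first in
'fileencodings'. A minimal sketch:

    :set fileencodings=ucs-bom,utf-8,latin1   " check for a BOM first
    :setlocal bomb     " write a BOM when saving this buffer
    :setlocal nobomb   " ...or strip it, e.g. for #! scripts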

Note that when Windows people say "Unicode" they usually mean UTF-16le.
That, for instance, is how one must decode the sentence "The file is
not in UTF-8, it's in Unicode" (which, taken literally, is nonsense)
in the mouth of a Microsoft engineer.

You set the 'encoding' option, preferably near the top of your vimrc,
to tell Vim how characters are to be represented in memory. The
advantage of ":set enc=utf-8" is that it lets Vim represent in memory
any character of any charset known to computer people. Using e.g.
Latin1 as your 'encoding' value, on the other hand, only lets Vim
represent the 256 characters of the Latin1 charset, which are also the
first 256 codepoints (U+0000 to U+00FF) of Unicode.
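
As a minimal sketch of that kind of vimrc fragment (the has() guard
and the 'termencoding' fallback are defensive additions of mine, not
something you strictly need):

    if has('multi_byte')
      if &termencoding == ''
        let &termencoding = &encoding   " remember the terminal's charset
      endif
      set encoding=utf-8                " internal representation
      set fileencodings=ucs-bom,utf-8,latin1  " read-time detection order
    endif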

See also http://vim.wikia.com/wiki/Working_with_Unicode

All of the above is independent of the 'guifont' setting. Why is there
nothing relating to Unicode among the :cXX values of the Windows
'guifont' setting? I'm not sure. Either :cDEFAULT means Unicode, or
else it's a Windows mystery.
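
For what it's worth, you can always query the value you ended up with,
and name a charset explicitly; cANSI is, as far as I know, one of the
charset names the Win32 GUI accepts:

    :set guifont?                              " show the current value
    :set guifont=Lucida_Console:h12:cANSI      " explicit charset
    :set guifont=Lucida_Console:h12:cDEFAULT   " let Windows decide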

>
> Isn't the character set something separate from the font anyways?

Yes, it is; but each font file has glyphs for only a certain set of
languages, and usually not for all the Unicode codepoints which are
defined: there is an enormous number of them.

>
> What's the difference between character set and character encoding?

Not much. In most situations they can be used as synonyms. When they
are not synonymous, the character set is the array of characters, and
the character encoding is the exact manner in which those characters
are represented (by how many bytes, and which ones) in memory, on
disk, on tape, etc. Sometimes one word is used for the other: e.g. in
HTTP or mail headers, the Content-Type line uses "charset=" to tell
the receiving application which encoding is used in the document.
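
For instance, this typical header line names an encoding even though
the attribute is spelled "charset=":

    Content-Type: text/plain; charset=UTF-8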

Unicode can be regarded as one abstract character set with room for
more than a million characters (originally two thousand million, but
the number was later reduced), which at the moment can be represented
in at least 8 different encodings if all byte-ordering variants are
counted. Not all the Unicode "slots" have been assigned yet; some are
reserved "for private use" and others have been blocked as
"noncharacters". For details, see http://www.unicode.org/ and in
particular http://www.unicode.org/charts/

>
> How can I display the actual character set which is being used when I
> use the DEFAULT setting?

You don't. Either the font has a glyph for the character you're trying
to display (and you should see that glyph), or it doesn't (and you
should see some placeholder glyph instead, e.g. an empty frame or a
reverse-video question mark).

>
> Thanks.
>
>

Best regards,
Tony.
--
Love in your heart wasn't put there to stay.
Love isn't love 'til you give it away.
-- Oscar Hammerstein II
