Tuesday, July 27, 2021

Re: unicode: UTF / UCS

'Johannes Köhler' via vim_use wrote:
> *disorientation*
>
> The unix _manpage_ utf-8 describes unicode with 2-byte encoding. But
> _wikipedia_ indicates also 1-byte unicode
> with ascii compatibility.

(If I remember correctly) the first versions of Unicode had only a
2-byte encoding, so that (part of the) manpage is very old.



> Furthermore, be interested myself in the filesystem behavior
> and unicode with ucs-2. Is it possible to use a linux
> filesystem with 2-byte unicode encoding on principle.

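I'm not very familiar with Linux, but filesystems shouldn't have anything
to do with text file encodings: as far as I know they just store the
bytes, and interpreting them is up to the programs that read the file.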
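
To illustrate what I mean, here is a minimal sketch in Python (my choice
of language, and the file names and the sample word are made up for the
example). It writes the same text to two files, once as UTF-8 and once
as UTF-16, and reads the raw bytes back. The filesystem stores whatever
bytes it is given; only the program reading the file needs to know the
encoding.

```python
import os
import tempfile

text = "K\u00f6hler"  # the same six characters go into both files

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for encoding in ("utf-8", "utf-16-le"):
        # Hypothetical file name, only for this demonstration.
        path = os.path.join(tmp, encoding + ".txt")

        # Write raw bytes: the filesystem is never told the encoding.
        with open(path, "wb") as f:
            f.write(text.encode(encoding))

        # Read the raw bytes back; they are exactly what was written.
        with open(path, "rb") as f:
            raw = f.read()

        # Decoding only works because *we* remember the encoding.
        results[encoding] = (len(raw), raw.decode(encoding))

for encoding, (size, decoded) in results.items():
    print(encoding, size, "bytes ->", decoded)
```

The two files differ in size, but both live on the same filesystem
without any special configuration.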



> Due to the cause that linux creates a 2-byte file
> (1-byte character & 1-byte EOF) when creating it with
> touch, and inserting one character into it with vim.

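I think it's vim that adds the extra byte (see :help 'fixendofline'), not
the touch program or Linux. Strictly speaking that byte is an end-of-line
(newline) character terminating the last line, not an EOF marker; touch
on its own creates an empty, 0-byte file.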
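
To see where the second byte comes from, here is a small sketch in
Python. It does not run touch or vim; it only writes the bytes I assume
each of them would produce (an empty file, then one character followed
by a newline) and checks the resulting sizes. The helper name and file
name are invented for the example.

```python
import os
import tempfile

def size_of(path):
    """Hypothetical helper: return the size of a file in bytes."""
    return os.path.getsize(path)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Roughly what `touch` does for a new file: create it empty.
    open(path, "wb").close()
    empty_size = size_of(path)

    # What an editor writes when it terminates the last line:
    # the character "a" followed by a newline byte (0x0A).
    with open(path, "wb") as f:
        f.write(b"a\n")
    with_eol_size = size_of(path)

    # The same character without the final newline.
    with open(path, "wb") as f:
        f.write(b"a")
    without_eol_size = size_of(path)

print(empty_size, with_eol_size, without_eol_size)
```

So the "2-byte file" is one byte of text plus one newline byte, not a
2-byte encoding of the character.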



> The bottom line is a 1-byte ascii file... Or a 1-byte
> unicode with ascii compatibility (that what i meant with
> endian abuse appearance).

I haven't understood this or other parts of the first message, but
you're probably thinking too far ahead; these issues likely have
nothing to do with endianness.



> Present, i study autodidactic with electric circuits and
> the logical behavior. With that in mind it should be
> faster to use 2-byte all over instead of a 1-byte, 2-byte
> decision with the encoder, decoder.
>

It's not that simple, unfortunately. UTF-16 (let's leave aside UCS-2, it
shouldn't matter) cannot be assumed to always use two bytes per
character: characters outside the Basic Multilingual Plane take four
bytes (a surrogate pair). Some tests also indicated that UTF-8 usually
ends up being better overall (utf8everywhere.org is certainly worth a
look; I don't remember whether I agreed with it completely, but it is for
sure an interesting document).
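
As a quick illustration of the variable lengths, this Python sketch
encodes a few sample characters (picked by me for the example) in UTF-8
and in UTF-16 and counts the bytes. I use the "utf-16-le" codec so that
no byte order mark is added to the count.

```python
# Sample characters chosen for illustration:
# U+0041 (A), U+00F6 (o with diaeresis), U+20AC (euro sign),
# U+1F600 (an emoji outside the Basic Multilingual Plane).
samples = ["\u0041", "\u00f6", "\u20ac", "\U0001F600"]

# Map each character to (UTF-8 byte count, UTF-16 byte count).
sizes = {
    ch: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
    for ch in samples
}

for ch, (utf8_len, utf16_len) in sizes.items():
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, "
          f"UTF-16 = {utf16_len} bytes")
```

ASCII characters take one byte in UTF-8 and two in UTF-16, while the
last sample takes four bytes in both, so neither encoding gives a fixed
size per character.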



All in all, it's nice that you want to understand how things work at the
lower levels, and it's quite fun to know. For text files these days,
though, that means reading the Unicode specification, at least its first
parts; other sources are quite likely to cause more confusion than
clarity. To make sense of the varied things you can run into on the web
and in other information sources, you'll probably also need to know some
of the earlier history of Unicode and of the older encodings / character
sets.


Kind regards,
Gabriele
