Sunday, August 1, 2021

Re: unicode: UTF / UCS

Hi Johannes,

On Saturday, 2021-07-31 12:37:08 +0200, 'Johannes Köhler' via vim_use wrote:

> > It's not that simple unfortunately, UTF-16 (let's leave aside UCS-2, it
> > shouldn't matter) cannot be assumed to always have two bytes per
>
> UCS: _Uni_versal _Cod_ed Character Set
>
> In my mind, UCS is the mathematical quantum and UTF the
> encoding/decoding function using this:
> magnitudes: 16(32)bit
> plurality: charset / coded character

You are confusing things.

UCS-4 and UTF-32 (effectively its subset, limited to code points up to
U+10FFFF) can hold, respectively encode, every assigned Unicode
character as a direct representation of its code point.
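
To illustrate the "direct representation" point, here is a minimal
Python sketch (Python and the helper name utf32_units are my choice for
illustration only). It unpacks the 32-bit units of a UTF-32 byte string
and compares them with the code points of the same text:

```python
import struct

def utf32_units(text):
    """Return the 32-bit code units of text encoded as big-endian UTF-32."""
    # An explicit byte order ("-be") keeps Python from prepending a BOM.
    data = text.encode("utf-32-be")
    return list(struct.unpack(">%dI" % (len(data) // 4), data))

# 'A', 'a' with diaeresis, and one character outside the BMP (U+1F600).
sample = "A\u00e4\U0001F600"
units = utf32_units(sample)
code_points = [ord(ch) for ch in sample]

# Each UTF-32 code unit is simply the code point itself.
print(units == code_points)
```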

UCS-2 is a 2-byte fixed-width encoding capable of representing 65536
code points, that is, just the Unicode Basic Multilingual Plane (BMP).

UTF-16 can encode the entire Unicode character range. For the first 64k
code points it is identical to UCS-2, except that it reserves the range
U+D800..U+DFFF for surrogates: the "escape sequences" that, used in
pairs, represent characters of higher planes.
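
As a sketch of how a surrogate pair is formed, assuming the standard
UTF-16 arithmetic (subtract 0x10000, then split the remaining 20 bits
into two 10-bit halves); the function name utf16_code_units is made up
for this example:

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encodable characters")
    if code_point < 0x10000:
        # BMP: one code unit, same value as in UCS-2.
        return [code_point]
    # Higher planes: 20 bits split across a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

print([hex(u) for u in utf16_code_units(0x00E4)])   # BMP character
print([hex(u) for u in utf16_code_units(0x1F600)])  # needs a surrogate pair
```

A text made only of BMP characters therefore looks the same in UCS-2 and
UTF-16; anything beyond the BMP takes two 16-bit units, i.e. 4 bytes.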


> Assuming that the data of the hdd partition tables (e.g.UID),
> used by the operating system, are encoded in 16bit Unicode.
> Well, my inferring thoughts were that UCS-2 is a
> hardware encoding, UTF-8 for ASCII purpose, UTF-32 a
> high level programmer attitude and UTF-16 the real unicode.

That's all nonsense. Really.

> In the end that means, the controller is made for 2-byte.
> The old ASCII code needs 7bit and probably one for
> sth., now than UTF-8 has to work with a different endian.

There is no endianness in UTF-8. Unless your hardware has less than
8 bits per word.
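
A small Python sketch of the difference (the variable names are just for
this example): UTF-16 has a little-endian and a big-endian byte order
for the same code unit, while UTF-8 defines a single byte sequence.

```python
text = "\u00e4"  # 'a' with diaeresis, U+00E4

# UTF-16: the same 16-bit code unit, stored in two possible byte orders.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# UTF-8: a sequence of single bytes in a fixed order, nothing to swap.
utf8 = text.encode("utf-8")

print(utf16_le.hex(), utf16_be.hex(), utf8.hex())
print(utf16_le == utf16_be[::-1])
```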

> And... why should i use a deprecated ASCII scheme
> at my system, when i can have lots of advantage
> using utf-16 (e.g. control/hash functions). It fells
> like utf-8 is a "work around" wrapper for
> the ASCII scheme...

UTF-8 is an efficient encoding that for Unicode code points <128 (which
are identical to ASCII, itself a subset of Unicode) needs only 1 byte
per character, whereas UTF-16 needs at least 2 bytes for each
character.
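
For a rough comparison of sizes, a minimal Python sketch (encoded_sizes
is a hypothetical helper, and the sample strings are arbitrary):

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of text for each encoding."""
    # Explicit byte orders avoid counting a BOM in the UTF-16/32 sizes.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for sample in ("Hello", "\u00e4", "\u20ac", "\U0001F600"):
    print(ascii(sample), encoded_sizes(sample))
```

Pure ASCII text is half the size in UTF-8 compared to UTF-16. Some BMP
characters, such as U+20AC, take 3 bytes in UTF-8 but 2 in UTF-16, and
characters outside the BMP take 4 bytes in both.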

UTF-16 is a workaround for those who wanted Unicode and started off with
UCS-2 but then realized there's more than just BMP.
Or, UTF-16 is the devil's work:
https://robert.ocallahan.org/2008/01/string-theory_08.html

Eike

--
OpenPGP/GnuPG encrypted mail preferred in all private communication.
GPG key 0x6A6CD5B765632D3A - 2265 D7F3 A7B0 95CC 3918 630B 6A6C D5B7 6563 2D3A
Use LibreOffice! https://www.libreoffice.org/

To view this discussion on the web visit https://groups.google.com/d/msgid/vim_use/YQdSsnKXRVsDvhFt%40kulungile.erack.de.