Thursday, August 5, 2021

Re: unicode: UTF / UCS

Thanks to Eike and Tony for the time and effort you put into your replies!

I was confused after reading the Unicode documentation,
where the UTF-32 codepoint space appears to be locally extensible with 1) blocks
within planes or 2) whole planes... And intuitively
I had in mind that UTF-8 exists for backward compatibility with the US-ASCII codespace.
The "use case" of a bash script in US-ASCII suggested the same to
me. Q: Does bash read a script file much like binary data (given that I am
not allowed to use a BOM)? That is, does Linux apply no charset decoding
when reading it?

Then there are partition tables, which should be readable
on different systems and are encoded in UTF-16/UCS-2.

This implied to me that UCS-2 is a new standard for independent,
decentralized 2-byte charsets, and that UTF is the local interpreting
process...

In the end it doesn't matter, because the Linux decoder seems to
handle many cases (e.g. it creates a 1-byte-per-character UTF-8 file, just like US-ASCII,
until I use a character outside ASCII), and therefore my files should
remain readable as 1-byte UTF-8 for my lifetime.
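The point about files staying "1-byte" until a non-ASCII character appears can be checked with a small Python sketch. The helper name same_bytes_in_ascii_and_utf8 is made up for this illustration; it only compares the two encodings of a string and says nothing about how any particular editor or shell behaves.

```python
def same_bytes_in_ascii_and_utf8(text):
    """Return True if text encodes to identical bytes in US-ASCII and UTF-8."""
    try:
        ascii_bytes = text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character is outside the 7-bit ASCII range.
        return False
    return ascii_bytes == text.encode("utf-8")


if __name__ == "__main__":
    for sample in ["echo hello\n", "caf\u00e9\n"]:
        utf8 = sample.encode("utf-8")
        print(repr(sample), len(sample), "chars ->", len(utf8), "UTF-8 bytes,",
              "ASCII-identical" if same_bytes_in_ascii_and_utf8(sample) else "not pure ASCII")
```

A pure-ASCII file is therefore already a valid UTF-8 file, byte for byte; only characters above U+007F make the two differ.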
 
But attention! The modern "Android smartphone" philosophy
has brainwashed me: stay up to date with your software and hardware
at all costs, or else you are no longer with us (community, life, etc.).
So I got _paranoid_ when I knew that a newer charset encoding had existed
for years and my system fell back to the deprecated one ... *take it as a joke*

sincerely
-kefko

... 
http://www.johannes-koehler.de

antoine.m...@gmail.com wrote on Monday, August 2, 2021 at 13:59:26 UTC+2:
As some have said above, UTF-8 is a variable-length encoding, which
encodes 7-bit ASCII characters exactly like us-ascii, and characters
(codepoints) above U+007F in two or more bytes, each of them with the
high bit set. Originally Unicode was foreseen to be able to go as far
up as U+7FFFFFFF, but when UTF-16 was crafted and surrogate codepoints
were assigned it was decided that codepoints higher than U+10FFFF
would never mean anything (and U+F0000 to U+10FFFF are "for private
use" anyway, i.e. transmitter and receiver have to agree on the
values, which are not defined by Unicode). The Wikipedia page about it
is well-written and I recommend reading it.
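As a rough illustration of the paragraph above, the following Python sketch encodes a few characters to UTF-8 and checks the two properties mentioned: ASCII stays one byte, and every byte of a multi-byte sequence has the high bit set. The helper names (utf8_bytes, all_high_bits_set, is_valid_codepoint) are invented for this example.

```python
MAX_CODEPOINT = 0x10FFFF  # highest codepoint Unicode will ever assign


def utf8_bytes(ch):
    """Encode a single character to its UTF-8 byte sequence."""
    return ch.encode("utf-8")


def all_high_bits_set(encoded):
    """True if every byte in the sequence has bit 7 (0x80) set."""
    return all(b & 0x80 for b in encoded)


def is_valid_codepoint(cp):
    """True if Python accepts cp as a Unicode codepoint."""
    try:
        chr(cp)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # U+0041, U+00E9, U+20AC, U+1F600: 1, 2, 3 and 4 bytes respectively.
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        encoded = utf8_bytes(ch)
        hex_form = " ".join(f"{b:02x}" for b in encoded)
        print(f"U+{ord(ch):04X} -> {hex_form} ({len(encoded)} byte(s))")
```

Python's own chr() enforces the same upper limit, so is_valid_codepoint(MAX_CODEPOINT + 1) is false.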

The so-called "byte order mark" U+FEFF ZERO-WIDTH NO-BREAK SPACE
should more appropriately be called an "encoding mark": it can
discriminate most Unicode encodings and endiannesses from each other,
including UTF-8, which has no byte-order ambiguity. At the head of a
UTF-8 file (e.g. an HTML file or CSS script, whose syntaxes expressly
support it), it means "This is UTF-8". However some programs which
expect only US-ASCII will choke if they get a file headed by a BOM:
for instance a #! "executable script" header will not be recognized if
it is preceded by a BOM, so if you want to start your first line by
#!/bin/bash or #!/bin/env python the file may be in UTF-8 (which
encodes the 128 ASCII characters just like us-ascii) but without BOM.
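To make the BOM behaviour concrete, here is a minimal Python sketch using the BOM constants from the standard codecs module. The function names (sniff_bom, strip_utf8_bom, starts_with_shebang) are hypothetical helpers for this example; the shebang check only mimics the "first two bytes must be #!" rule and is not how any kernel is actually implemented.

```python
import codecs

# Longer BOMs first: the UTF-32-LE mark begins with the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding announced by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def strip_utf8_bom(data):
    """Remove a leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def starts_with_shebang(data):
    """True if the very first two bytes are '#!'."""
    return data.startswith(b"#!")


if __name__ == "__main__":
    script = codecs.BOM_UTF8 + b"#!/bin/bash\necho hi\n"
    print(sniff_bom(script), starts_with_shebang(script))
    print(starts_with_shebang(strip_utf8_bom(script)))
```

With the BOM in place the file no longer begins with the two bytes "#!", which is why the script header is not recognized; after stripping it, the same bytes are plain ASCII-compatible UTF-8.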

See:
https://en.wikipedia.org/wiki/Unicode
https://en.wikipedia.org/wiki/UTF-8
and beware that the Microsoft Windows documentation usually says
"Unicode" when what it means is "UTF-16" which represents each
codepoint in one, or sometimes two, 16-bit words.
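The "one, or sometimes two, 16-bit words" remark can be sketched in Python as follows. Both function names are made up for this example: utf16_units asks Python's codec for the code units, and surrogate_pair recomputes the two-word case by hand so the arithmetic is visible.

```python
def utf16_units(ch):
    """Return the 16-bit code units of a character in UTF-16 (big-endian view)."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def surrogate_pair(cp):
    """Compute the surrogate pair for a codepoint in U+10000..U+10FFFF by hand."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for ch in ["A", "\u20ac", "\U0001F600"]:
        units = " ".join(f"{u:04X}" for u in utf16_units(ch))
        print(f"U+{ord(ch):04X} -> {units}")
```

Codepoints up to U+FFFF fit in a single word; anything above is split into a high and a low surrogate, which is also why U+10FFFF is the ceiling.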

Best regards,
Tony.

--
--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
