Rebbe Malachi: Re: using regexp to search for Unicode code points and properties

Warning: this reply is in UTF-8.

On 20/08/09 15:47, Brian Anderson wrote:
>
> I'm interested in learning how to use regular expressions in Vi(m) to
> search for Unicode code points.
>
> In a book about regexp, it describes how to search for Unicode code
> points by various means, and for various programming languages.

There are various kinds of regexps, for various languages (such as Perl)
and programs (such as egrep). Vim has its own. Each is subtly different
from the others.

The only authoritative documentation about Vim is the help that comes
with Vim. It is extremely complete. The "chapter" about regular
expressions (usually named "patterns" in Vim parlance) is accessed by

:help pattern.txt

IMHO the most useful part of that starts at

:help pattern-overview

>
> The book describes searching for a specific Unicode code point as \u2122
> or \x{2122}.
>
> From what I've seen in the Vim help files, \u is to identify uppercase
> characters, not Unicode code points, and \x is for hexadecimal digits.

Yes and no. In Vim, you can search for, let's say, a Cyrillic lowercase
soft sign with any of:

/ь
(using the soft sign directly e.g. if you have a Russian keyboard or use
a keymap)

/\%u044c

(see ":help /\%u")

/[\u044c]

(see a little below ":help /\]"; this form is usually valid only in
'nocompatible' mode)

Or the same with ? instead of / to search backwards, or fь Fь tь or Tь
(with no carriage return needed), to search only within the current line.
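
If you need the \%u form for a character you cannot type, a small script
can produce it. The following is only a sketch in Python (not Vim script);
the function name vim_codepoint_atom is made up for this example. It
assumes that \%u takes up to four hex digits and \%U up to eight, as
described under ":help /\%u" and ":help /\%U".

```python
def vim_codepoint_atom(ch: str) -> str:
    """Return a Vim pattern atom that matches the single character ch.

    Hypothetical helper, not part of Vim: it only formats the code point.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    cp = ord(ch)
    if cp <= 0xFFFF:
        # \%u takes up to 4 hex digits
        return "\\%u{:04x}".format(cp)
    # \%U takes up to 8 hex digits, for code points above U+FFFF
    return "\\%U{:08x}".format(cp)


# Cyrillic small letter soft sign, U+044C
print("/" + vim_codepoint_atom("\u044c"))  # prints /\%u044c
```

The printed text can be typed or pasted at the Vim command line as-is.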

>
> The book also talks about using Unicode property or categories in the
> search. The book indicates there are 30 Unicode categories, grouped into
> 7 super-categories.
> For example, \p{Ll} would find any lowercase letter that has an
> uppercase variant, and \p{Lo} any letter or ideograph that does not have
> lowercase and uppercase variants.
>
> Unicode blocks are defined as \p{IsGreekExtended}. Blocks consist of a
> single range of code points. Example: searching for any code point
> between U+0000...U+007F can be found with \p{InBasicLatin}.
>
> Unicode script is \p{Greek}. Each Unicode code point is part of only one
> Unicode script. So if I wanted to search for any Greek letter, I'd use
> \p{Greek}.
>
> Unicode grapheme is \X or \P{M}. This would be either single codepoints
> (U+00E0 Latin small letter a with grave accent) or combined codepoints
> (U+0061 Latin small letter a + U+0300 combining grave accent).
>
> Help on any of these, either in examples or where to look in the help
> files, welcome.

None of these apply to Vim. To search for anything between U+0000 and
U+007F inclusive, you could use

/[\x00-\x7F]

though (with Vim) it might be more prudent to use

/[\x00\x01-\x7F]

since Vim usually stores a NUL byte in memory as 0x0A, so that it does not
terminate the C string representing the line. Vim cannot search for
collections of more than 256 (or is it 257?) consecutive byte values,
however, so this construct is useless if you want to search for "any CJK
hanzi / kanji / hanja".
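
One possible workaround, which I have not verified across Vim versions,
is to split a large range into several [] collections of at most 256 code
points each and join them with \|. The Python sketch below only generates
such a pattern; the name vim_chunked_range and the chunk size are
assumptions for this example, and it handles only code points up to
U+FFFF. Newer Vim versions may not have the restriction at all.

```python
def vim_chunked_range(first: int, last: int, chunk: int = 256) -> str:
    """Build a Vim pattern matching any code point in first..last inclusive.

    The range is split into [] collections of at most `chunk` code points,
    joined with \\| and wrapped in \\%( ... \\). Hypothetical helper.
    """
    if not (0 <= first <= last <= 0xFFFF):
        raise ValueError("this sketch only handles U+0000..U+FFFF")
    parts = []
    start = first
    while start <= last:
        end = min(start + chunk - 1, last)
        parts.append("[\\u{:04x}-\\u{:04x}]".format(start, end))
        start = end + 1
    return "\\%(" + "\\|".join(parts) + "\\)"


# Two chunks of 256 code points each
print(vim_chunked_range(0x4E00, 0x4FFF))
# prints \%([\u4e00-\u4eff]\|[\u4f00-\u4fff]\)
```

For the whole U+4E00..U+9FFF block this gives 82 alternatives, so the
resulting pattern is long and may be slow.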

To search for either U+00E0 or U+0061 U+0300, use

/\%xE0\|\%x61\%u0300
or even
/à\|a\%u0300

where \| means "or" (so this tells Vim explicitly to search for _either_
a precomposed a-grave _or_ a small-a followed by a combining-grave). If
in the middle of a longer regexp, you may wish to wrap that within \%(
and \) to avoid having the \| apply to what comes before or after the
a-grave.
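
For more than a handful of letters, the "either form" pattern can be
generated rather than typed. This is a rough Python sketch, not anything
Vim provides: vim_either_form is a made-up name, and it relies on Python's
unicodedata module to find the decomposed form. It assumes the base
character is an ordinary letter that needs no escaping in a Vim pattern,
and that the combining marks are below U+10000.

```python
import unicodedata


def vim_either_form(ch: str) -> str:
    """Return a Vim pattern matching ch in precomposed or decomposed form.

    Hypothetical helper. If ch has no canonical decomposition, it is
    returned unchanged.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    if decomposed == ch:
        return ch
    base, marks = decomposed[0], decomposed[1:]
    # Each combining mark becomes a \%uXXXX atom
    mark_atoms = "".join("\\%u{:04x}".format(ord(m)) for m in marks)
    # Wrap in \%( ... \) so the \| does not leak into a longer pattern
    return "\\%(" + ch + "\\|" + base + mark_atoms + "\\)"


# U+00E0, Latin small letter a with grave
print(vim_either_form("\u00e0"))  # prints \%(à\|a\%u0300\)
```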

Or you may wish to run a "normalization pass" first, such as

:%s/a\%u0300/à/g

but it might be unwieldy if there are many different such possible
characters to normalize.

The only pattern atom specifically relevant for combining characters is
\Z, which tells Vim to ignore any possible combining characters found in
the text. This might be useful for Arabic or Hebrew text, where short
vowels (represented by combining characters) are optional, but it would
be less useful for Latin, where "combined" a-acute, a-grave,
a-circumflex, a-umlaut, a-ball, a-ogonek, a-macron and a-breve all would
be lumped together with plain a, but "precomposed" á à â ä å ą ā ă would
each be regarded as different (from each other and from plain a).
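
The difference can be illustrated outside Vim. The Python sketch below is
only an analogy for what \Z does, not Vim's implementation: it drops every
character with a nonzero canonical combining class, which approximates
"ignore combining characters". The name strip_combining is made up for
this example.

```python
import unicodedata


def strip_combining(text: str) -> str:
    """Remove characters whose canonical combining class is nonzero.

    An approximation of "ignoring combining characters"; Vim's own
    definition of a composing character may differ.
    """
    return "".join(c for c in text if not unicodedata.combining(c))


combined = "a\u0300"      # a followed by combining grave
precomposed = "\u00e0"    # single code point a-grave

print(strip_combining(combined) == "a")             # prints True
print(strip_combining(precomposed) == precomposed)  # prints True
```

The combined form collapses to plain a, while the precomposed form is
untouched, which is why the two are treated differently.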

>
> Thanks.
>
> Brian

Best regards,
Tony.
--
You buttered your bread, now lie in it.

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Sunday, August 30, 2009
