Thursday, August 20, 2009

Re: using regexp to search for Unicode code points and properties

Hi,

Brian Anderson wrote:
> I read through the help files on /\%u, but now I have a question about
> searching for composing or combining characters.
>
> I have a Cyrillic text, using UTF-8 as the encoding, and the characters
> are appearing correctly on the screen.
>
> When I select a character and press ga, it gives me the decimal (1073),
> hex (0431), and octal (2061) numbers. I can then use /\%u0431 in a
> search, to find this code point.
>
> When I press g8 on that same character, it shows 'd0 b1'. I understand
> that the combining character follows the base character, so 'd0' is the
> base and 'b1' is the combining, but how would I search for:
> - only the base character (d0), whether there are any combining
> characters or not
> - only the combining character (b1), attached to any base character
> - both the base + combining character (d0+b1)

No, g8 shows the UTF-8 encoding of the character under the cursor. If
the character is composed of one base character and one or more
composing characters, their byte sequences are separated by a plus sign.
So the 'd0 b1' you see are just the two bytes of a single, simple
character, not a base character followed by a combining one.
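
As a quick check outside Vim, the relationship between the numbers ga
reports and the bytes g8 reports can be sketched in Python. The variable
names are mine, and the combining acute accent (U+0301) is only an
illustrative choice of a real composing character:

```python
import unicodedata

ch = "\u0431"  # CYRILLIC SMALL LETTER BE, the character from the question

# The numbers ga reports: decimal, hex and octal of the code point
code_point = ord(ch)
ga_values = (code_point, format(code_point, "04x"), format(code_point, "o"))

# The bytes g8 reports: the UTF-8 encoding of that one code point
g8_bytes = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))

# A real composing character is a separate code point with its own bytes
composed = ch + "\u0301"  # base + COMBINING ACUTE ACCENT (illustrative)
parts = [" ".join(f"{b:02x}" for b in c.encode("utf-8")) for c in composed]
g8_composed = " + ".join(parts)

# Canonical combining class: 0 for a base character, non-zero for U+0301
classes = (unicodedata.combining(ch), unicodedata.combining("\u0301"))

print(ga_values)    # (1073, '0431', '2061')
print(g8_bytes)     # d0 b1
print(g8_composed)  # d0 b1 + cc 81
print(classes)      # (0, 230)
```

So 'd0 b1' is one code point spread over two bytes, while a composed
character would show a second byte group after a plus sign.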

> I've tried /\%ud0b1, /\%uD0B1, /\%ud0/\%ub1, and several others, but
> nothing has worked.

I don't think you can search for the UTF-8 byte sequence representing a
character if 'encoding' is set to utf-8; at least /\%xd0\%xb1/ did not
work for me.
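
One likely explanation, which I am assuming rather than having checked
in Vim's source: with a UTF-8 'encoding' the \%x items match characters,
not bytes, so \%xd0 looks for U+00D0 and \%xb1 for U+00B1. The same
distinction between code points and bytes can be sketched in Python
(variable names are hypothetical):

```python
text = "\u0431"  # the Cyrillic letter from the question

# In decoded text, code points 0xD0 and 0xB1 are the characters "Ð" and
# "±", so looking for them does not find U+0431.
as_code_points = "\u00d0\u00b1"
found_in_text = as_code_points in text

# In the encoded bytes, d0 b1 is exactly the UTF-8 form of U+0431.
found_in_bytes = b"\xd0\xb1" in text.encode("utf-8")

print(found_in_text, found_in_bytes)  # False True
```

Searching by code point, as in /\%u0431, stays on the character level
and avoids the problem.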

Regards,
Jürgen

--
Sometimes I think the surest sign that intelligent life exists elsewhere
in the universe is that none of it has tried to contact us. (Calvin)

