Saturday, August 18, 2012

Re: Searching for combined characters

Dominique Pellé <dominique.pelle@gmail.com> wrote:

|Dominique Pellé <dominique.pelle@gmail.com> wrote:
|> On Fri, Aug 17, 2012 at 9:48 PM, Steffen Daode Nurpmeso
|> <sdaoden@gmail.com> wrote:
|>> On the unicode@unicode list there was a thread on combining characters,
|>> (Why no combining-character form for U+00F8?), and it turns out that
|>> vim(1) isn't capable of performing a normalized search either!?
|>> E.g., given a file
|>>
|>> |é
|>> |é
|>> |e
|>>
|>> which is (except empty lines stripped)
|>>
|>> |00000000 0a c3 a9 0a 65 cc 81 0a 65 0a 0a |....e...e..|
|>> |0000000b
|>>
|>> then with 'Vi IMproved 7.3 (2010 Aug 15, compiled Jan 7 2011 14:27:00)',
|>> old but never failed, pretty stripped, but very Unicode friendly,
|>>
|>> /\%xe9\|e\%u0301\|e
|>>
|>> finds the first and the last, and
|>>
|>> /[=\%xE9=]
|>>
|>> finds the second and the third, which is wrong.
|>> Searching for \%u0301 will find the second, but \.\%u0301 won't.
|>> \Ze will also find the second and the third.
|>>
|>> Should i update? Or what is the state of Unicode normalization
|>> support for searching and replacement? Will it be implemented?
|>> Am i missing something?
|>> Thank you and ciao,
|>>
|>> --steffen
|>
|>
|> Maybe you're interested in this patch:
|> ---
|> Patch 7.3.259
|> Problem: Equivalence classes only work for latin characters.
|> Solution: Add the Unicode equivalence characters. (Dominique Pelle)
|> Files: runtime/doc/pattern.txt, src/regexp.c, src/testdir/test44.in,
|> src/testdir/test44.ok
|> ---

A big one!

|>
|> In your example, all 3 lines match with Vim-7.3.633 when I do:
|>
|> /[[=e=]]

Yes, indeed - also for my old vim!
(Which is 3744RSS after one week of work, whereas .633 is 6004RSS
after startup, compiled with the same flags. And the vi which
creates this is 756, which is a very unfair comparison. Anyway.)

|> See :help \[==\]
|>
|> -- Dominique
|
|I suppose that you wanted to match only the 1st and 2nd lines
|ignoring only combining character differences (rather than ignoring
|all diacritics in my above suggestion). I'm not sure we can do that.
|But it would be a useful addition.
|
|I saw ":help /\\Z" which seems close, but it does not do that.

I've reread pattern.txt now, and it is in fact clearly
documented (the cat\Z examples after :help E68, line 536).
So i was too stupid to type [[=e=]] and typed [=e=] instead.
Still, i don't understand why i can't search for "e\%u0301".
And why does searching for "\%xe9\|\%u0301\|e" find all three?
Isn't that a bug? And "\%u0301WHATEVER" still finds the second!
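
For comparison, here is a rough sketch of the kind of normalized search
asked about in this thread, done outside of Vim. It is only an
illustration in Python using the standard unicodedata module, not
anything Vim itself does, and the helper name normalized_matches is made
up for this example. The three lines are the non-empty ones from the
hexdump quoted above.

```python
import unicodedata

# The three non-empty lines from the hexdump quoted above, as UTF-8 bytes:
# precomposed U+00E9, then "e" + combining U+0301, then a plain "e".
raw = b"\xc3\xa9\n" b"e\xcc\x81\n" b"e\n"
lines = raw.decode("utf-8").splitlines()


def normalized_matches(needle, haystack_lines, form="NFC"):
    """Hypothetical helper: 1-based numbers of the lines that contain
    needle once both sides are brought to the same normalization form."""
    wanted = unicodedata.normalize(form, needle)
    return [
        number
        for number, line in enumerate(haystack_lines, 1)
        if wanted in unicodedata.normalize(form, line)
    ]


# Both spellings of e-acute find the first and second line under NFC.
print(normalized_matches("\u00e9", lines))      # [1, 2]
print(normalized_matches("e\u0301", lines))     # [1, 2]

# A plain "e" finds only the third line under NFC ...
print(normalized_matches("e", lines))           # [3]

# ... but all three under NFD, because the decomposed form starts with "e".
print(normalized_matches("e", lines, "NFD"))    # [1, 2, 3]
```

The choice of form matters: composing first (NFC) keeps a bare "e" from
matching an accented one, whereas decomposing (NFD) with a plain
substring test does not.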

|-- Dominique

Again, thank you, and happy hacking!

--steffen

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
