Tuesday, June 23, 2015

Re: Search for character that doesn't have a combining character?

2015-06-24 0:02 GMT+03:00 Ben Fritz <fritzophrenic@gmail.com>:
> On Tuesday, June 23, 2015 at 3:35:50 PM UTC-5, Ben Fritz wrote:
>> I'm working on a custom command to add strikethrough to text, using the Unicode COMBINING LONG STROKE OVERLAY, 0x0336.
>>
>> In this command, I want to apply a strikethrough to a character only if that character is not already struck through.
>>
>> This pattern fails in three ways: it doesn't match *anything* with regexpengine set to 2, it does not match an unadorned character immediately before a struck-through base character, and for some reason it *does* match the last combining character in a word:
>>
>> [^\u0336]\%u0336\@!
>>
>> This pattern also fails, because for some reason it matches base characters that are already struck through (although at least it behaves the same way in both engines):
>>
>> [^\u0336][^\u0336]\@=
>>
>> What is the correct way to do this?
>>
>> Full command (attempted):
>>
>> '<,'>s;\%#=1\%V[^\u0336]\%u0336\@!;\=submatch(0)."\u0336";g
>>
>> Note that I'm also limiting the match to the visual selection, which is why I'm trying to use the :s command for simplicity.
>
> My next attempt is to do two passes: first remove the combining character everywhere in the visual selection, then add it to the entire visual selection.
>
> But my patterns for this task either don't match at all or remove the base character along with the combining character! Even this doesn't work; it removes the base character:
>
> echo join(split(getline('.'), "\u0336"),"")

I was about to suggest using tr(), but:

echo tr("o\u0336", "o", "t") is# "o\u0336"
echo tr("o\u0336", "\u0336", "t") is# "o\u0336"
echo tr("o\u0336", "o\u0336", "t") is# "t"

Apparently tr() treats a "character" as a Unicode code point *together
with* all of the combining characters that follow it.

I would say here that:

1. The regexp engines need proper `\p` support, as in Perl/PCRE. Or, at
least, a flag that is the opposite of \Z: one that tells the regexp
engine to treat all Unicode code points the same way, combining or not.
2. tr() must *always* use "one character is one Unicode code point"
when &encoding is Unicode. It is too low-level a tool to care about
character classes, and especially to join code points together.
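
Until something like that exists, one possible workaround is to avoid
both the regexp engines and tr() for the membership test. The sketch
below is only an illustration under stated assumptions: the function
name StrikeThrough is made up, 'encoding' is assumed to be utf-8, and it
relies on split() with '\zs' keeping combining characters attached to
their base character. The check itself uses stridx(), which compares
bytes and so does not care about combining characters.

```vim
" Hypothetical helper: add U+0336 to every character of a:text that
" does not already carry it. Assumes 'encoding' is utf-8.
function! StrikeThrough(text) abort
  let l:stroke = "\u0336"
  let l:out = ''
  " Assumption: split() with '\zs' yields one item per base character,
  " with any combining characters still attached to it.
  for l:ch in split(a:text, '\zs')
    " stridx() is a plain byte search, so it finds the stroke even
    " though it is a combining character.
    let l:out .= stridx(l:ch, l:stroke) >= 0 ? l:ch : l:ch . l:stroke
  endfor
  return l:out
endfunction
```

It could be applied to the current line with something like
:call setline('.', StrikeThrough(getline('.'))). Note that this handles
whole lines rather than a visual selection, so it does not replace the
:s command directly.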

--
--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

---
You received this message because you are subscribed to the Google Groups "vim_use" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vim_use+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
