Tuesday, June 18, 2013

Re: Dealing with empty strings in regexp.

LCD 47 <lcd047@gmail.com> wrote:
> On 18 June 2013, Paul Isambert <zappathustra@free.fr> wrote:
> > Hello all,
> >
> > The following issue has been recently discussed on the Lua mailing list:
> > http://lua-users.org/lists/lua-l/2013-04/msg00812.html
> >
> > (It has also been independently raised on the LuaTeX list:
> > http://tug.org/pipermail/luatex/2013-June/004418.html)
> >
> > If I understand correctly, any string can be represented with
> > interspersed empty substrings. E.g. "abc" is really "ϵaϵbϵcϵ", where
> > "ϵ" is the empty string. Now, there seem to be two ways to deal with
> > those empty strings in regexps, especially regarding the "*" operator:
>
> You're making up a metaphysics of empty substrings. I humbly submit
> that there is no such thing in the programming languages you mention
> (don't know about Lua though).

As I've already said, the empty strings were just meant to capture the
differences between languages. I did not mean to imply that those
substrings have any kind of reality.
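
To make the notation concrete, here is a minimal Python sketch. The
helper names (mark_empty_positions, empty_match_positions) are made up
for illustration; the only point is that a string of n characters
offers n+1 positions where an empty match can occur.

```python
import re

def mark_empty_positions(text, marker="ϵ"):
    """Hypothetical helper: write `marker` at every position where an
    empty match could occur, i.e. before, between and after the characters."""
    return marker.join([""] + list(text) + [""])

def empty_match_positions(text):
    """Offsets at which the empty pattern matches: 0 through len(text)."""
    return [m.start() for m in re.finditer("", text)]

print(mark_empty_positions("abc"))   # ϵaϵbϵcϵ
print(empty_match_positions("abc"))  # [0, 1, 2, 3]
```

The languages compared below differ only in which of these positions
they report as separate empty matches during a global substitution.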

> > - The Perl way: "X*" matches as many "X" as possible, and does not
> > include the following empty string.
>
> $ echo -n abc | perl -pe 's/[ac]*/($&)/g'
> (a)()b(c)()
>
> The key to understanding this is to keep in mind that:
>
> (1) "*" is greedy; and
> (2) "/g" (documented in perlre together with "/c") is described as
> "Global matching, and keep the Current position after failed matching."
>
> Try something like this if you want the gory details:
>
> $ echo -n abc | perl -Mre=debug -ne 's/[ac]*/($&)/g'
>
> > - The Python (or sed) way: "X*" matches as many "X" as possible, and
> > includes the following empty string.
> >
> > Starting empty strings are always included. So, the Perl way gives (I
> > use Ruby, since I can't speak Perl):
> >
> > puts 'abc'.gsub(/[ac]*/, '(\0)')
> > # returns "(a)()b(c)()", really "(ϵa)(ϵ)b(ϵc)(ϵ)"
>
> Same thing with Ruby: there's a current position pointer, keeping
> track of the current match.
>
> > And the Python way:
> >
> > import re
> > print re.sub(re.compile('([ac]*)'), '(\\1)', 'abc')
> > # returns "(a)b(c)", really "(ϵaϵ)b(ϵcϵ)"
>
> With Python, re.sub() "return[s] the string obtained by replacing
> the leftmost non-overlapping occurrences of pattern in string by the
> replacement repl". It's the same thing, except for an optimisation:
> "empty matches are included in the result unless they touch the
> beginning of another match".
>
> > (Note that adding "$" to the patterns doesn't change anything.)
> >
> > Now, VimL works in the Perl way, except that "*" includes the empty
> > string if it is the last one in the string:
> >
> > echo substitute('abc', '[ac]*', '(\0)', 'g')
> > " returns "(a)()b(c)", really "(ϵa)(ϵ)b(ϵcϵ)"
>
> Again the same thing, except the optimisation above is applied only
> at the end of the string.

Yes. My question simply was: is it consistent to optimize only at the
end?
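
The three behaviours can be put side by side with a small Python
sketch. It is a toy left-to-right scanner, not a description of how
any of the engines is implemented, and the function name wrap_matches
and the policy names "keep", "drop" and "drop_end" are made up for
this example. It only assumes that the difference lies in what happens
to an empty match that touches the end of the previous match.

```python
import re

def wrap_matches(pattern, text, policy):
    """Wrap each match of `pattern` in parentheses, scanning left to right.

    policy (hypothetical names):
      "keep"     - keep every empty match           (Perl/Ruby output above)
      "drop"     - drop an empty match that touches
                   the previous match                (Python output above)
      "drop_end" - drop it only at end of string     (Vim output above)
    """
    if policy not in ("keep", "drop", "drop_end"):
        raise ValueError("unknown policy: %r" % (policy,))
    rx = re.compile(pattern)
    out = []
    pos = 0
    prev_end = None  # offset where the previous non-empty match ended
    while pos <= len(text):
        m = rx.match(text, pos)
        if m is not None and m.end() > pos:
            # Non-empty match: wrap it and continue right after it.
            out.append("(" + m.group() + ")")
            prev_end = pos = m.end()
            continue
        if m is not None:
            # Empty match at `pos`: decide whether to report it.
            touches = (prev_end == pos)
            at_end = (pos == len(text))
            dropped = touches and (
                policy == "drop" or (policy == "drop_end" and at_end)
            )
            if not dropped:
                out.append("()")
        # Copy one character (nothing at end of string) and move on,
        # so that the scan cannot loop forever on an empty match.
        out.append(text[pos:pos + 1])
        pos += 1
    return "".join(out)

for policy in ("keep", "drop", "drop_end"):
    print(policy, wrap_matches("[ac]*", "abc", policy))
# keep (a)()b(c)()
# drop (a)b(c)
# drop_end (a)()b(c)
```

Seen this way, "drop_end" is the only policy whose rule depends on
where in the string the empty match occurs, which is what the question
is about.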

> > As far as I'm concerned, I find the Perl way quite counter-intuitive,
> > but what I'm interested in here is whether VimL is consistent or not.
> > I.e., shouldn't it work clearly one way or the other?
>
> You came up with the concept of "ϵ", you fix its limitations. :)

The "metaphysics of empty substrings", the "concept of ϵ"... please, I
know I'm French, but that doesn't mean I subscribe to French Theory! :)

> My conclusion to the above comparison is that Vim should apply the
> same optimisation in full, that is, kill the empty matches that touch
> the beginning of another match. As far as I can tell, that would be
> safe for both the old and the new regexp engines.

I prefer it that way too. But I'd prefer no optimization rather than
conditional optimization, as is the case now.

Best,
Paul
