Monday, December 28, 2015

Re: Apparent regex bug involving `\_.' (and other newline-matching constructs)

On Mon, Dec 28, 2015 at 12:20 PM, Bram Moolenaar <Bram@moolenaar.net> wrote:
>
> Brett Stahlman wrote:
>
>> > > Given a file containing the following 2 lines...
>> > > 1a3
>> > > 123xyz
>> > >
>> > > ...try the following tests, and note the unexpected results.
>> > >
>> > > Case 1.1:
>> > > call cursor(1, 1)
>> > > echo searchpos('\%(\([a-z]\)\|\_.\)\{-}xyz', 'pcW')
>> > > => [1, 1, 2]
>> > >
>> > > Case 1.2:
>> > > call cursor(1, 2)
>> > > echo searchpos('\%(\([a-z]\)\|\_.\)\{-}xyz', 'pcW')
>> > > => [2, 1, 1]
>> > > Question: Why does the \_. not permit earlier match at cursor pos (1, 2)?
>> > > Note: Clearly, submatch should be 2, not 1, but this error is simply a
>> > > consequence of the first error: since match doesn't begin on 1st line,
>> > > the "a" at cursor pos can't be captured.
>> >
>> > This is because of the 'c' flag in 'cpoptions'. The Vi-compatible way
>> > of searching is to start at the first column and skip over the match.
>> > Then take the first match after the start position.
>>
>> If this is how it works, then I would have assumed it would have skipped the
>> match it returned for Case 1.1 (at starting position 1,1). But perhaps not
>> skipping the match at column 1 had something to do with (from help on 'cpo')
>> "...but not further than the start of the next line"? If so, the help text
>> isn't very clear in this case. It seems to be describing search
>> "continuation", and my tests were for an isolated search beginning at an
>> arbitrary buffer position. Also, the term "next line" is a bit misleading:
>> in this case, it seems to refer to what would have been the next line of a
>> *previous* search. But I guess the Vi designers didn't want to complicate
>> the implementation by maintaining the state needed to differentiate between
>> a subsequent search for the same pattern without intervening cursor movement
>> and a new search...
>>
>> >
>> > > Case 1.3:
>> > > call cursor(1, 3)
>> > > echo searchpos('\%(\([a-z]\)\|\_.\)\{-}xyz', 'pcW')
>> > > => [2, 1, 1]
>> > > Note: Why isn't a match found at cursor pos (1, 3)?
>> > >
>> > > Repeat these tests with a \zs in the pattern, and note how the capture
>> > > is matched unconditionally...
>> > >
>> > > Case 2.1:
>> > > call cursor(1, 1)
>> > > echo searchpos('\%(\([a-z]\)\|\_.\)\{-}\zsxyz', 'pcW')
>> > > => [2, 4, 2]
>> > >
>> > > Case 2.2:
>> > > call cursor(1, 2)
>> > > echo searchpos('\%(\([a-z]\)\|\_.\)\{-}\zsxyz', 'pcW')
>> > > => [2, 4, 2]
>> > >
>> > > Case 2.3:
>> > > call cursor(1, 3)
>> > > echo searchpos('\%(\([a-z]\)\|\_.\)\{-}\zsxyz', 'pcW')
>> > > => [2, 4, 2]
>> > > Note: Submatch should be 1, not 2, here. It's as though the \zs forces the
>> > > capture to match unconditionally.
>> > >
>> > > Points to note... Originally, I thought the error had to do with the 'p'
>> > > flag, but that appears not to be the case: the submatch errors are simply a
>> > > consequence of the incorrectly determined start locations. Also, it appears
>> > > the results would have been the same with * as they were with \{-}.
>> > > Finally, the unexpected behavior is not limited to \_., but is seen even
>> > > when (e.g.) explicit \n is used.
>> >
>> > After removing 'c' from 'cpoptions', does it work as you expect?
>>
>> Not as I expected, but the first 3 tests, at least, work as I now expect.
>>
>> Case 3.3, however, makes no sense to me now. It returns...
>> => [2, 4, 2]
>> ...even though there's nothing to match the [a-z]. If I change the "1a3" to
>> "123", it returns...
>> => [2, 4, 1]
>> ...which tells me that the parens were capturing the "a" *before* the start
>> position, in spite of the 'W' flag prohibiting wrap. This tells me that the
>> search must be starting before the cursor position, most likely at the start
>> of the cursor line. I would not have expected that a forward search with no
>> lookbehind of any sort could find anything prior to the starting cursor
>> position. But I guess it's not really finding a match prior to the cursor
>> position - just checking to see what needs to be skipped? But with &cpo no
>> longer containing 'c', and the 'c' flag passed to searchpos(), why would it
>> even need this sort of "skip-over" test prior to cursor position?
>
> The search always starts in the first column. Then when a match is
> found and it's before the cursor, another search is done at the next
> position.

Interesting. So, if I understand correctly, that could result in a lot
of redundant searches when the pattern appears multiple times on the
same line before the start position. For example, with the following
text and cursor position...

123 123 123 123 <cursor> 123 123 123

A search for "123" would have to try and discard 4 matches before
finding one to return; a subsequent search from the new location would
have to discard 5 matches, a subsequent search would discard 6
matches, and so on... Although this could be inefficient in certain
pathological, long-line scenarios, the bigger issue is the effect it
has on the returned 'submatch' value when the 'p' flag is used.
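
To make the counting concrete, here is a rough Python model of the
strategy as I understand it from the description above. It is only a
sketch, not Vim's actual implementation: the function name
forward_search and the accept_at_cursor parameter (standing in for the
'c' flag of searchpos()) are made up for illustration, and columns are
0-based.

```python
import re

def forward_search(line, pattern, cursor, accept_at_cursor=False):
    """Model of a forward search that always scans from column 0.

    Returns (match_start, discarded), where discarded is the number of
    matches that were found and thrown away because they did not lie
    after the cursor. accept_at_cursor is a stand-in for the 'c' flag.
    """
    regex = re.compile(pattern)
    pos = 0          # the scan always begins in the first column
    discarded = 0
    while pos <= len(line):
        m = regex.search(line, pos)
        if m is None:
            break
        start = m.start()
        if start > cursor or (accept_at_cursor and start == cursor):
            return start, discarded
        # Match is at or before the cursor: discard it and retry from
        # the next column (the behaviour without 'c' in 'cpoptions').
        discarded += 1
        pos = start + 1
    return None, discarded

line = "123 123 123 123 123 123 123"
print(forward_search(line, "123", 16, accept_at_cursor=True))
print(forward_search(line, "123", 16))
print(forward_search(line, "123", 20))
```

With the cursor on the fifth "123" (0-based column 16), the three
calls discard 4, 5 and 6 matches respectively, which is the growth
described above.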


> The Vi-compatible behavior is to continue after the matched pattern.
> When 'c' is removed from 'cpo', it searches from the next column.
>
> With the \zs the search in the first column returns a position after the
> start position, thus it's a match. Without the \zs the column would be
> the first column.
>
> I can see this is not what you expect or what you want. We can add
> another flag to actually start at the search start position.

I guess that makes sense; either that, or perhaps alter the existing
implementation to ensure that a capture can't capture anything before
the starting location (unless the capture occurs in a look-behind
context).
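
The capture effect can be illustrated with a small Python sketch. This
is an analogy under stated assumptions, not Vim's regex engine: Python
has no \zs, so a second group around "xyz" stands in for the reported
match start, [\s\S] stands in for \_., and the helper name
capture_from is hypothetical. Offsets are 0-based into the whole text.

```python
import re

# Two-line buffer from the test cases, joined with a newline.
TEXT = "1a3\n123xyz"

# Group 1 plays the role of \([a-z]\); group 2 marks where \zs would
# put the reported start of the match.
PATTERN = re.compile(r"(?:([a-z])|[\s\S])*?(xyz)")

def capture_from(start):
    """Attempt a match beginning the scan at offset `start`.

    Returns (text captured by group 1, offset of "xyz").
    """
    m = PATTERN.search(TEXT, start)
    return m.group(1), m.start(2)

# Scan begins in the first column: the "a" before the cursor is captured.
print(capture_from(0))
# Scan begins at the "3" (cursor position in Case 2.3): nothing captured.
print(capture_from(2))
```

Both calls report the same position for "xyz", but only the scan that
starts in the first column fills the capture group, which is the
difference between the submatch value I got and the one I expected.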

Thanks,
Brett S.

>
> --
> Computers are not intelligent. They only think they are.
>
> /// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\\
> /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
> \\\ an exciting new programming language -- http://www.Zimbu.org ///
> \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

