Friday, January 29, 2016

Re: "Exploding" a paragraph into individual lines



On Friday, January 29, 2016, Salman Halim <salmanhalim@gmail.com> wrote:


On Jan 28, 2016 11:56 PM, "Chris Collision" <cfcollision@gmail.com> wrote:
>
>
>
> On Thu, Jan 28, 2016 at 8:04 PM, Rik <amphiboly@gmail.com> wrote:
>>
>> On Thursday, January 28, 2016 at 10:56:59 PM UTC-5, Chris Collision wrote:
>> > Well-formatted text isn't the same thing as natural language; I do not think this is doomed to fail.  Is there a way to use paragraph / sentence motions to do this "exploding"?  I have played around for a few minutes but have not had much luck.  Perhaps an expert can take this farther.
>> >
>> >
>> > On Thu, Jan 28, 2016 at 7:40 PM, Rik <amph...@gmail.com> wrote:
>> > On Wednesday, January 27, 2016 at 12:49:39 PM UTC-5, Chris Lott wrote:
>> >
>> > > I'd like to take a paragraph like the following:
>> > >
>> > > This is a paragraph. Wow! What do I do now?
>> > >
>> > > And break it into individual lines, ala:
>> > >
>> > > This is a paragraph.
>> > > Wow!
>> > > What do I do now?
>> > >
>> > > StackExchange revealed this regex that seems to work well matching the proper lines:
>> > > [.!?][])"']*\($\|[ ]\)
>> > >
>> > > So I can do this:
>> > > :%s/[.!?][])"']*\($\|[ ]\)/XXX\r\r/g
>> > >
>> > > But obviously need something where XXX is!
>> > >
>> > > c
>> >
>> >
>> >
>> > Regex cannot handle the complexity of natural language, and thus you are doomed to fail, Mr.
>> >
>> >
>> >
>> > Lott.
>> >
>> >
>> >
>> > --
>> >
>> > Rik
>> >
>> >
>> >
>> >
>> >
>> > --
>> >
>> > --
>> >
>> > You received this message from the "vim_use" maillist.
>> >
>> > Do not top-post! Type your reply below the text you are replying to.
>> >
>> > For more information, visit http://www.vim.org/maillist.php
>> >
>> >
>> >
>> > ---
>> >
>> > You received this message because you are subscribed to the Google Groups "vim_use" group.
>> >
>> > To unsubscribe from this group and stop receiving emails from it, send an email to vim_use+u...@googlegroups.com.
>> >
>> > For more options, visit https://groups.google.com/d/optout.
>> >
>> >
>> >
>> >
>> >
>> > --
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > -Collision
>> > @cfcollision | http://idontevenownatelevision.com/ | http://tinyletter.com/collision | 503.997.1907
>>
>> 1. Please do not top-post.
>>
>> That would be parsed by the proposed regex as
>>   1.
>>   Please do not top-post.
>>
>> Doomed to fail, Mr. Lott.
>>
>> That would be parsed as:
>>   Doomed to fail, Mr.
>>   Lott.
>>
>> Do you see any problem here?
>>
>> --
>> rik
>>
>
>
>
> Please forgive me for top-posting: I forgot that gmail does that by default.  Won't happen again.  Would this incredibly difficult counterexample of yours be solved by the venerable convention of using two spaces after sentence-ending punctuation?
>
>
>
> --
> -Collision
> @cfcollision | http://idontevenownatelevision.com/ | http://tinyletter.com/collision | 503.997.1907
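
As a rough illustration of the two-space question above, here is a minimal Python sketch (Python rather than Vim so it can be run and checked in isolation). It assumes the paragraph is a single string and that the writer consistently put two or more spaces after each sentence. The names `TWO_SPACE_END` and `explode_two_space` are made up for this example.

```python
import re

# Sentence-ending punctuation, optional closing brackets/quotes, then either
# two or more spaces or the end of the text. A single space does not count.
TWO_SPACE_END = re.compile(r"""([.!?][\])"']*)(?: {2,}|$)""")


def explode_two_space(paragraph: str) -> list[str]:
    """Break a paragraph into sentences, trusting the two-space convention."""
    # Put the captured punctuation back and replace the spaces with a newline.
    marked = TWO_SPACE_END.sub(lambda m: m.group(1) + "\n", paragraph.strip())
    return [line for line in marked.split("\n") if line]


print(explode_two_space("Doomed to fail, Mr. Lott.  Do you see any problem here?"))
```

Under that assumption "Mr. Lott." stays on one line, because "Mr." is followed by a single space. Text typed with single spaces is not split at all, which is the weakness of relying on this convention.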

The convention of two spaces after sentences isn't something you can hold most people to these days. It might be easier to just rejoin lines that end in common abbreviations such as Mr., Dr., etc.

Of course, "etc." is the more interesting case because it may or may not end a sentence. Typically, if it's not the end of a sentence, it's followed by a comma. And, if it is the end of a sentence, it is followed by whitespace and a capital letter.

You may also have to contend with Ave., Blvd., St. (Saint or Street). I think it will require more logic than is afforded by a single substitute operation, as suggested before.
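
A minimal Python sketch of the two-pass idea described above, assuming a short hand-picked abbreviation list. The first pass is the pattern from the original question with the punctuation captured and put back (in Vim terms, wrapping that part in \( \) and using \1 where XXX is). The second pass rejoins lines that ended in an abbreviation. All names here are hypothetical, and Ave., Blvd. and St. are deliberately left out because they need more logic than this.

```python
import re

# Python port of the pattern from the original question; the punctuation
# is captured so it can be written back before the line break.
SENTENCE_END = re.compile(r"""([.!?][\])"']*)(?:$|[ ])""")

# Illustrative lists only: titles never end a sentence here, while "etc."
# is decided by what follows it.
ALWAYS_REJOIN = {"Mr.", "Mrs.", "Ms.", "Dr."}
AMBIGUOUS = {"etc."}


def naive_explode(paragraph: str) -> list[str]:
    """First pass: break after every sentence-like ending."""
    marked = SENTENCE_END.sub(lambda m: m.group(1) + "\n", paragraph)
    return [line.strip() for line in marked.split("\n") if line.strip()]


def should_rejoin(line: str, next_piece: str) -> bool:
    """Decide whether the break after `line` was a false sentence end."""
    last_word = line.split()[-1]
    if last_word in ALWAYS_REJOIN:
        return True
    if last_word in AMBIGUOUS:
        # A following capital letter suggests a real sentence end.
        return not next_piece[0].isupper()
    return False


def explode_and_rejoin(paragraph: str) -> list[str]:
    """Second pass: glue back lines that ended in a known abbreviation."""
    lines: list[str] = []
    for piece in naive_explode(paragraph):
        if lines and should_rejoin(lines[-1], piece):
            lines[-1] += " " + piece
        else:
            lines.append(piece)
    return lines


print(naive_explode("Doomed to fail, Mr. Lott."))
print(explode_and_rejoin("Doomed to fail, Mr. Lott. Do you see any problem here?"))
```

The capital-letter test for "etc." is only the heuristic from the paragraph above; a proper noun after "etc." would still fool it.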

Suddenly, the two-space option looks pretty good. :)


Most of those abbreviations begin with a capital letter and are at most four letters long. A sentence may certainly end in something like "Bob.", but leaving such cases unsplit may be an acceptable cost. Then you need to look out for "e.g.", "i.e.", "viz." and a few more which I can't remember off the top of my head. That should be possible with lookbehind, something like this (untested since I'm AFC):

s#\v%(%(%(\u\l{1,3}|i\.e|e\.g|viz)\.?)@<![.?!]\)?)@<=#\r\r#g
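
For anyone who wants to experiment with the same heuristic outside Vim, here is a rough Python sketch of the idea. It is not a translation of the command above. Python's `re` only supports fixed-width lookbehind, so the "is the preceding word an abbreviation?" check is done in ordinary code instead. The sketch also requires whitespace or end of text after the punctuation, an added assumption so that the inner period of "e.g." is not treated as a sentence end. The names are hypothetical.

```python
import re

# Candidate sentence end: ., ? or !, an optional ")", then whitespace or
# end of text (the whitespace requirement is this sketch's assumption).
CANDIDATE = re.compile(r"[.?!]\)?(?:\s+|$)")

# Text just before the punctuation that looks like an abbreviation: a capital
# followed by 1-3 lowercase letters, or i.e / e.g / viz.
ABBREV_BEFORE = re.compile(r"(?:\b[A-Z][a-z]{1,3}|\bi\.e|\be\.g|\bviz)$")


def explode_skipping_abbrevs(paragraph: str) -> list[str]:
    """Split at candidate sentence ends unless an abbreviation precedes them."""
    lines: list[str] = []
    start = 0
    for match in CANDIDATE.finditer(paragraph):
        before = paragraph[start:match.start()]
        if ABBREV_BEFORE.search(before):
            continue  # looks like "Mr." or "i.e.", so do not break here
        lines.append(paragraph[start:match.end()].strip())
        start = match.end()
    tail = paragraph[start:].strip()
    if tail:
        lines.append(tail)
    return lines


print(explode_skipping_abbrevs("I met Mr. Smith today. He was late, i.e. not on time. Odd!"))
print(explode_skipping_abbrevs("Ask Bob. He knows."))
```

The second call shows the trade-off mentioned above: "Bob." looks like an abbreviation, so that sentence is not split off.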

