Sunday, September 30, 2012

Re: Matching/Sorting line terminations

#!/usr/bin/env python
# Sort lines by their words read right-to-left (last word first).
import re
from sys import stdin, stdout
from optparse import OptionParser

word_re = re.compile(r'\w+')

parser = OptionParser(
    usage="%prog [options] [infile [outfile]]",
    )
parser.add_option("-i", "--ignore-case",
    help="Ignore case",
    dest="ignore_case",
    action="store_true",
    default=False,
    )
parser.add_option("-w", "--words",
    help="Number of words to require as the same",
    dest="words",
    type="int",
    action="store",
    default=0,
    )

options, args = parser.parse_args()

if args:
    infile = open(args.pop(0))
else:
    infile = stdin

if args:
    outfile = open(args.pop(0), "w")  # the output file must be opened for writing
else:
    outfile = stdout

lines = {} # {reversed-words-tuple: [original-lines]}
for line in infile:
    key = reversed(word_re.findall(line))
    if options.ignore_case:
        key = (s.upper() for s in key)
    key = tuple(key)
    if options.words:
        # only the last N words decide which group a line belongs to
        key = key[:options.words]
    lines.setdefault(key, []).append(line)

# with -w, emit only the groups in which more than one line shares the key
for key, group in sorted(lines.items()):
    if len(group) > 1 or not options.words:
        for line in group:
            outfile.write(line)

On 09/30/12 11:14, jbl wrote:
> The problem is this: I have a large file of poetry in alphabetical
> order sorted on the last term in each line. I post an excerpt in
> sample1 below. I want to sort it so that lines that share, say, the
> last two terms (on the right) with the last two terms of any other
> line are in one group, those lines that share the last three terms in
> another and so on

Well, a quick little Python script seems to do the grunt-work for me:

##############################
import re
r = re.compile(r'\w+')
# sort the lines by their words, compared last-word-first
print(''.join(sorted(
    open("raw.txt"),
    key=lambda s: tuple(reversed(r.findall(s)))
    )))
##############################

That is case-sensitive. It's a small bit more if you want it
case-insensitive:

##############################
import re
r = re.compile(r'\w+')
# same as above, but upper-case each word before comparing
print(''.join(sorted(
    open("raw.txt"),
    key=lambda s: tuple(w.upper() for w in reversed(r.findall(s)))
    )))
##############################
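
To see what that sort key does, here is a small sketch in Python 3 syntax that works on an in-memory list instead of raw.txt. The sample lines and the reverse_key helper name are made up for illustration; they are not from the original poem file.

```python
import re

WORD_RE = re.compile(r'\w+')

def reverse_key(line, ignore_case=False):
    """Return the line's words as a tuple, last word first."""
    words = WORD_RE.findall(line)
    if ignore_case:
        words = [w.upper() for w in words]
    return tuple(reversed(words))

# Hypothetical sample lines, not taken from the original file
sample = [
    "over the dark sea\n",
    "beside the bright sea\n",
    "under the Dark Sea\n",
    "into the dark night\n",
]

# Lines ending in the same words end up next to each other
for line in sorted(sample, key=lambda s: reverse_key(s, ignore_case=True)):
    print(line, end='')
```

With ignore_case=True the two "dark sea" lines land next to each other. Without it, the capitalized "Sea" compares lower than any lowercase word, so that line drifts away from its neighbours.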

> and so on up to seven places. But it must be possible to generalize
> that somehow.

One of the tricky aspects of this is how you treat (or ignore)
differing punctuation. If you want the same words, but allowing for
varying punctuation, it's a lot more complex. That said, it sounded
like a fun afternoon challenge, so I threw together & attached a
quick program that accommodates all your options and
case-insensitivity needs :-)
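
As a rough illustration of the punctuation point: a \w+ pattern only picks up runs of letters, digits and underscores, so commas and periods never reach the key. A minimal sketch in Python 3 syntax with made-up lines (the words_key name is just for this example). Note that an apostrophe splits a word in two, which may or may not be what you want.

```python
import re

word_re = re.compile(r'\w+')

def words_key(line):
    # Punctuation is not matched by \w, so it drops out of the key
    return tuple(reversed(word_re.findall(line)))

# Hypothetical lines that differ only in punctuation and case
a = "and so, the night fell.\n"
b = "And so the night fell!\n"

print(words_key(a))
print(words_key(a) == tuple(w.lower() for w in words_key(b)))

# An apostrophe is not a word character either, so "o'er" becomes two words
print(word_re.findall("o'er the lea"))
```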

It can be called on a pair of files, or you can pipe stdin into it and
it will write to stdout, in case you want to call it from Vim with

:%! python revsort.py

That filters the whole buffer; swap the % for a narrower range (such
as a visual selection, :'<,'>!) to operate on just a sub-range of your
file. Alternatively, from a command-line, you can use

python revsort.py infile.txt outfile.txt

or, if you only want those where the last N words match, you can do
things like

python revsort.py -w 3 infile.txt outfile.txt
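
The idea behind -w can be sketched as a standalone function: truncate each reversed-word key to N entries, bucket the lines by that key, and keep only the buckets holding more than one line. The group_by_last_words name and the sample lines are invented for this example, written in Python 3 syntax; it is a sketch of the approach, not the attached script itself.

```python
import re
from collections import defaultdict

word_re = re.compile(r'\w+')

def group_by_last_words(lines, n, ignore_case=False):
    """Return sorted (key, lines) pairs for lines sharing their last n words."""
    groups = defaultdict(list)
    for line in lines:
        words = word_re.findall(line)
        if ignore_case:
            words = [w.upper() for w in words]
        # Reverse first, then truncate, so the key holds the LAST n words
        key = tuple(reversed(words))[:n]
        groups[key].append(line)
    # Only keep keys shared by at least two lines
    return sorted((k, v) for k, v in groups.items() if len(v) > 1)

# Hypothetical sample lines
sample = [
    "over the dark sea\n",
    "under the dark sea\n",
    "beside the bright sea\n",
    "into the dark night\n",
]

for key, group in group_by_last_words(sample, 2):
    print(key, group)
```

Here only the two "dark sea" lines survive for n=2, since no other pair of lines shares its last two words.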

It was kinda fun, and hopefully the code is easy to follow. This
does assume that you have Python installed on your machine. It was
tested on 2.6, but should run on 2.4-2.7, and possibly on 3.x as well.

-tim





