Trick for analyzing Project Gutenberg texts

mdl | Dec 13 2006

Hi everyone,

I've always been intrigued by Project Gutenberg's online texts, but until now I have not known how to digest the massive hunks of plain text that each file represents. It may be great to have Emerson, Thoreau's [URL="http://www.gutenberg.org/etext/205"]Walden[/URL], or Samuel Pepys' [URL="http://www.gutenberg.org/etext/4200"]Diary[/URL] in plain text on my hard drive, but there's no way I'm going to sit at my computer and read them. And you will never find me sitting on the subway, squinting at a Project Gutenberg text on my iPod's small screen. So these neglected texts have languished on my hard drive, gathering dust.

But recently, and to my great surprise and delight, I've discovered the ancient and elegant technology of the UNIX command line. And I've come to realize that a few basic commands (especially grep and tr) can make mincemeat of these massive bodies of text, offering all sorts of ways to search and analyze them.

The simplest is grep. Let's say I have the Diary of Samuel Pepys on my hard drive (pepys.txt) and I want to search it for all appearances of the word "ale." I would simply type:

grep -inw ale pepys.txt | less

-i = case insensitive
-n = print line numbers next to each result
-w = searches just for "ale" as a single word; will not include other words that contain the pattern "ale," such as "alehouse" or "hale"
less = allows one to scroll through the results one page at a time
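
To see what -w actually changes, here is a small sketch that runs the same search with and without it. The sample file is an invented four-line stand-in for pepys.txt (not real diary text), and the variable names are just for illustration.

```shell
# Build a tiny stand-in for pepys.txt (invented lines, not real diary text).
sample=$(mktemp)
printf '%s\n' \
  'Drank a pot of ale at noon.' \
  'Thence to the alehouse.' \
  'He looked hale and well.' \
  'ALE again at supper.' > "$sample"

# Without -w, "alehouse" and "hale" match too; -c counts matching lines.
loose=$(grep -ic ale "$sample")

# With -w, only lines where "ale" stands alone as a word are counted.
strict=$(grep -icw ale "$sample")

echo "without -w: $loose lines; with -w: $strict lines"

# The search from the post, run on the sample file.
grep -inw ale "$sample"

rm -f "$sample"
```

The -c flag is only there to make the difference easy to read off; drop it to see the matching lines themselves.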

The grep command prints every line in the etext in which the word ale appears, together with its line number. If line 2555 looks interesting, I can jump to it in the text (from within less) by typing:

!less +2555 pepys.txt

This opens the full text at that line, so I can browse the passage. When I quit out of this view, I return to my search results.
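
If you would rather print a passage than page through it, sed can pull out a range of lines by number. This is a different tool from the less trick above, offered only as a sketch; the twenty-line sample file and the range 8 to 12 are made up for the demonstration.

```shell
# Twenty numbered stand-in lines (invented), so the line numbers are easy to check.
sample=$(mktemp)
i=1
while [ "$i" -le 20 ]; do
  echo "entry $i"
  i=$((i + 1))
done > "$sample"

# Print lines 8 through 12: a hit on line 10 plus two lines on either side.
# -n suppresses sed's default output; the p command prints only the chosen range.
passage=$(sed -n '8,12p' "$sample")
printf '%s\n' "$passage"

rm -f "$sample"
```

On the real file, something like sed -n '2550,2560p' pepys.txt would show the neighborhood of line 2555.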

Let's say I want a little more context with each search. Then I would simply add the number three (3) to my search options:

grep -3inw ale pepys.txt | less

This will print out each line containing ale together with the three lines above and below it (for a total of seven lines of context).
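
As far as I know, the bare -3 is shorthand for -C 3 in both GNU and BSD grep, and -B and -A set the "before" and "after" counts separately. A quick sketch to check that on an invented ten-line file (the file contents and variable names are assumptions for the demo):

```shell
# Ten invented lines; only line 5 mentions ale.
sample=$(mktemp)
for i in 1 2 3 4 5 6 7 8 9 10; do
  if [ "$i" -eq 5 ]; then
    echo "line $i: a cup of ale"
  else
    echo "line $i: nothing to see"
  fi
done > "$sample"

# The short form from the post and the spelled-out form.
short=$(grep -3iw ale "$sample")
long=$(grep -C 3 -iw ale "$sample")

if [ "$short" = "$long" ]; then same=yes; else same=no; fi
echo "identical output: $same"

# Count the lines printed: the match plus three above and three below.
context_lines=$(printf '%s\n' "$short" | wc -l | tr -d ' ')
echo "lines printed: $context_lines"

# -B 1 asks for one line before the match and none after.
before_only=$(grep -B 1 -iw ale "$sample" | wc -l | tr -d ' ')
echo "lines printed with -B 1: $before_only"

rm -f "$sample"
```

One thing to expect on a real text: when two hits sit close together, grep merges their context into one block, and it separates non-adjacent blocks with a line containing "--".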

What about indexes? Let's say I want a complete index of all the words in Pepys' diary, together with their frequency of occurrence. For this I can type:

tr 'A-Z' 'a-z' < pepys.txt | tr -cs a-z '\012' | sort | uniq -c > pepysindex.txt

In a few seconds, I get a new text file with a list of all the words that appear in Pepys' diary together with their frequency. I can browse this for search ideas. Here's a very small excerpt of the results:

 101 alderman
  20 aldermen
   9 aldersgate
   7 aldgate
   2 aldrige
   1 aldworth
 109 ale
   1 alehoofe
  71 alehouse
   1 alehouses
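
For anyone curious what each stage of that pipeline does, here is the same chain run on a single invented sentence, with a comment per stage. The last step, sorting the index by count, is my own addition and not part of the original command.

```shell
# One invented sentence standing in for the whole diary.
text='Ale, ale and more ALE at the alehouse.'

index=$(printf '%s\n' "$text" |
  tr 'A-Z' 'a-z' |        # fold everything to lower case
  tr -cs 'a-z' '\012' |   # replace each run of non-letters with one newline: one word per line
  sort |                  # bring identical words next to each other
  uniq -c)                # collapse each group into "count word"

printf '%s\n' "$index"

# Optional extra: sort numerically, largest count first, to find the most frequent word.
top=$(printf '%s\n' "$index" | sort -rn | head -n 1)
echo "most frequent: $top"
```

Note that because the second tr treats everything outside a-z as a separator, apostrophes, hyphens, and digits split or vanish: "don't" would be indexed as "don" and "t".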

It's easiest to make the index command an alias (this one's in my .tcshrc file):

alias textindex "tr 'A-Z' 'a-z' | tr -cs a-z '\012' | sort | uniq -c"

Then, to create an index of Pepys' diary, I simply type:

cat pepys.txt | textindex > pepysindex.txt
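
The alias above uses tcsh syntax. For bash or zsh users, a shell function is probably the closest counterpart; this is a sketch under that assumption, reusing the name textindex, and it would go in ~/.bashrc or ~/.zshrc rather than .tcshrc.

```shell
# Hypothetical bash/zsh counterpart to the tcsh alias.
# Reads text on standard input and writes "count word" lines to standard output.
textindex() {
  tr 'A-Z' 'a-z' | tr -cs 'a-z' '\012' | sort | uniq -c
}

# Usage mirrors the alias; redirecting the file in saves the extra cat:
#   textindex < pepys.txt > pepysindex.txt

# Quick check on an invented line.
result=$(printf 'Ale and ALE\n' | textindex)
printf '%s\n' "$result"
```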

And now, for the icing on the cake, I can combine the index alias with the grep function to create subindexes. So to get an index of every word that appears within three lines of the word "ale," I could type:

grep -3iw ale pepys.txt | textindex | less

or 

grep -3iw ale pepys.txt | textindex > pepysaleindex.txt

The results suggest new boolean searches--e.g. passages in the text where the words "ale" and "headache" occur within, say, 3 lines of each other. (Don't know if they actually do - just a hypothetical suggestion.)
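
One way to sketch that kind of proximity search is to chain two greps: the first pulls every "ale" passage with its three lines of context, and the second keeps only the lines in those passages that mention "headache." The twelve-line sample below is invented for the demo, so it says nothing about what the real diary contains.

```shell
# Invented sample: "ale" on line 2, "headache" on line 4 (nearby) and line 12 (far away).
sample=$(mktemp)
printf '%s\n' \
  'Up betimes.' \
  'Drank much ale with dinner.' \
  'Thence home.' \
  'Woke with a headache.' \
  'To the office.' \
  'Busy all day.' \
  'Home to supper.' \
  'And so to bed.' \
  'Up and abroad.' \
  'To church.' \
  'Dined at home.' \
  'A headache all evening.' > "$sample"

# First grep: every "ale" line plus three lines either side, with line numbers.
# Second grep: keep only the lines of those passages that contain "headache".
near=$(grep -3inw ale "$sample" | grep -iw headache)
printf '%s\n' "$near"

rm -f "$sample"
```

Only the headache on line 4 survives, since line 12 is outside every "ale" passage. The line numbers in the output can then be fed to the less trick to read the passage in full.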

Anyway, thought the academics out there might be interested in these little plain text hacks.

2 Comments

TOPICS: Life Hacks

Submitted by terceiro on December 13, 2006 - 8:28pm.

mdl, this is great. You're becoming a real treasure-trove of CLI wisdom for us literary-types. Thanks!
