Trick for analyzing Project Gutenberg texts
mdl | Dec 13 2006
Hi everyone, I've always been intrigued by Project Gutenberg's online texts, but until now I have not known how to digest the massive hunks of plain text that each file represents. It may be great to have Emerson's writings, Thoreau's Walden (http://www.gutenberg.org/etext/205), or Samuel Pepys' Diary (http://www.gutenberg.org/etext/4200) in plain text on my hard drive, but there's no way I'm going to sit at my computer and read them. And you will never find me sitting on the subway, squinting at a Project Gutenberg text on my iPod's small screen. So these neglected texts have languished on my hard drive, gathering dust.

But recently, and to my great surprise and delight, I've discovered the ancient and elegant technology of the UNIX command line. And I've come to realize that a few basic commands (especially grep and tr) can make mincemeat of these massive bodies of text, offering all sorts of ways to search and analyze them.

The simplest is grep. Let's say I have the Diary of Samuel Pepys on my hard drive (pepys.txt) and I want to search it for all appearances of the word "ale." I would simply type:
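grep -inw ale pepys.txt | less

(a sketch of the command: -n prints line numbers, -w matches "ale" only as a whole word so that words like "alehouse" don't count, and piping to less makes the results easy to scroll through)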
-i = case insensitive

The result is a printout of all the lines in the etext in which the word "ale" appears, together with their line numbers. If line 2555 looks interesting, I can jump to it in the text (from within less) by typing:
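2555g

(in less, typing a line number followed by g jumps to that line; this assumes pepys.txt itself is open in less, e.g. via less pepys.txt)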
From there I can browse that passage of the text. When I quit out of this view, I return to my search results. Let's say I want a little more context with each result. Then I would simply add the number three (3) to my search options:
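grep -3iw ale pepys.txt | less

(the leading -3 asks grep for three lines of context before and after each match)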
This will print out each line containing ale together with the three lines above and below it (for a total of seven lines of context). What about indexes? Let's say I want a complete index of all the words in Pepys' diary, together with their frequency of occurrence. For this I can type:
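tr -cs 'A-Za-z' '\n' < pepys.txt | tr 'A-Z' 'a-z' | sort | uniq -c > pepysindex.txt

(a sketch of the classic tr/sort/uniq word-count pipeline: the first tr breaks the text into one word per line, the second lowercases everything, sort groups identical words together, and uniq -c counts them; pepysindex.txt is just an example name for the output file)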
In a few seconds, I get a new text file with a list of all the words that appear in Pepys' diary together with their frequency. I can browse this for search ideas. Here's a very small excerpt of the results:

101 alderman
20 aldermen
9 aldersgate
7 aldgate
2 aldrige
1 aldworth
109 ale
1 alehoofe
71 alehouse
1 alehouses

It's easiest to make the index command an alias (this one's in my .tcshrc file):
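alias textindex "tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c"

(a sketch of the alias: the name textindex matches the commands below, and since it reads from standard input it can be dropped into the middle of a pipeline)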
Then, to create an index of Pepys' diary, I simply type:
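cat pepys.txt | textindex > pepysindex.txt

(or pipe to less instead of redirecting to a file if I just want to browse the index on screen)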
And now, for the icing on the cake, I can combine the index alias with grep to create subindexes. So to get an index of every word that appears within three lines of the word "ale," I could type:

grep -3iw ale pepys.txt | textindex | less

or

grep -3iw ale pepys.txt | textindex > pepysaleindex.txt

The results suggest new boolean searches, e.g. passages in the text where the words "ale" and "headache" occur within, say, 3 lines of each other (one way to do that is sketched below). (Don't know if they actually do - just a hypothetical suggestion.) Anyway, thought the academics out there might be interested in these little plain text hacks.
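One way to run that kind of combined search (a sketch: the second grep filters the context blocks produced by the first, so it picks out "headache" wherever it appears within three lines of "ale"):

grep -3iw ale pepys.txt | grep -iw headache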