Python Pipes for Text Tokenization

Tokenizing text is something that I tend to do pretty often, typically as the beginning of an NLP workflow. The normal workflow goes something like this:

  1. Build a generator to stream in some sentences / documents / whatever.
  2. Perform some set of transformations on the text.
  3. Feed the result into whatever it is I'm doing.
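
In miniature, that workflow looks something like the toy sketch below. This is just an illustration of the shape: the hard-coded strings stand in for a real corpus, and the names (documents, preprocess) are placeholders rather than the functions we'll build in this post.

from collections import Counter

def documents():                          # 1. stream documents in, one at a time
    for doc in ["Call me Ishmael.", "Some years ago..."]:
        yield doc

def preprocess(doc):                      # 2. a stand-in for the transformation step
    return doc.lower().split()

counts = Counter()
for doc in documents():                   # 3. feed the result into the downstream task
    counts.update(preprocess(doc))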

Step two is the most involved of the three, since it's not always clear from the outset which transformations you want, and whenever you apply a series of transformations to anything (particularly when regex is involved), weird order-of-operations effects can creep in.

What follows is the pattern I use when building out prototypes. As I tend to like examples, we'll be doing some extremely basic word frequency evaluation over Moby Dick.

Setup

First things first, we need to get our text and make sure we've loaded any supporting libraries. Typically I'd have a host of tools available, but we'll keep things relatively simple.

import requests
import re
from collections import Counter

Next we'll need to get our text from Project Gutenberg.

response = requests.get("http://www.gutenberg.org/cache/epub/2701/pg2701.txt")

Gutenberg texts have a lot of front matter, as well as licensing information at the end. For our purposes we just want the chapter text, which we can grab with some regex. We'll also save it to disk so we don't have to worry about losing it or hitting their servers multiple times.

rex = re.compile(r'CHAPTER \d+\..*?(?:\r\n)+(.*?)(?:\r\n)+(?=(?:CHAPTER)|(?:Epilogue))', flags=re.DOTALL)

chapterText = re.findall(rex, response.text)
chapterText = [re.sub(r'(\r\n)+', ' ', chapter) for chapter in chapterText]

# Write one chapter per line so we can stream them back later
with open('moby_dick.txt', 'w', encoding='utf8') as myText:
    for chapter in chapterText:
        myText.write(chapter + '\n')

Finally let's build a quick generator that will read from our file on disk and return each chapter, one after the other.

def chapters():
    with open('moby_dick.txt', 'r', encoding='utf8') as myFile:
        for chapter in myFile:
            yield chapter

Awesome, let's get to tokenizing.

The First Sentence

The first sentence of our text goes like this:

Call me Ishmael.

Okay, that's pretty boring (and classic!). Let's do the first two sentences.

Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.

Better. There are plenty of interesting problems in this tiny bit of text. For our current purposes, all we really want to know is word frequencies; that is, how many times a given word occurs in our text. To do this we're going to need to break these strings down into words, and the easiest way to do that is to split our strings on spaces. This is the essence of a basic whitespace tokenizer. Let's make one.

testSentence = "Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."

def whitespace_tokenizer(string): return string.split(' ')

whitespace_tokenizer(testSentence)
>>> ['Call', 'me', 'Ishmael.', 'Some', 'years', 'ago--never', 'mind', 'how', 'long', 'precisely--having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse,', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore,', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world.']

Super easy. Of course we have some problems that are going to throw off our frequency counts:

  • Punctuation: "purse," != "purse", "world." != "world", "ago--never" is weird
  • Capitalization: "Call" != "call", etc.

Let's make some little functions to fix those too.

def remove_punctuation(string): return re.sub(r'\W+', ' ', string)
def lower_case(string): return string.lower()

whitespace_tokenizer(remove_punctuation(lower_case(testSentence)))
>>> ['call', 'me', 'ishmael', 'some', 'years', 'ago', 'never', 'mind', 'how', 'long', 'precisely', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', 'i', 'thought', 'i', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '']

Let's see what our word frequencies look like at this point.

frequencies = Counter()
for chapter in chapters():
    frequencies.update(whitespace_tokenizer(remove_punctuation(lower_case(chapter))))

frequencies.most_common(10)
>>> [('the', 14053), ('of', 6452), ('and', 6299), ('a', 4627), ('to', 4534), ('in', 4067), ('that', 3039), ('his', 2493), ('it', 2491), ('i', 2108)]

Hmm... that's pretty boring. Let's do two more things:

  • Remove English 'stopwords'
  • Remove extremely short and extremely long words

But first let's look at our code. Specifically, let's look at this:

frequencies.update(whitespace_tokenizer(remove_punctuation(lower_case(chapter))))

This is terrifying, and it's only going to get worse as we continue to tack things on. Of course we could refactor it into something like:

def tokenize(string):
    string = string.lower()
    string = re.sub(r'\W+', ' ', string)
    string = string.split(' ')
    return string

But there is something nice about having each operation be its own little function, and about being able to piece them together into a processing pipeline. We can do better.

Text Pipes

Let's consider what we actually want to do. We want a function to take a string and then perform a list of transformation operations on that string in order, returning the final results. Bonus points if we can easily modify the sequence of transformations applied, and extra bonus points if we can easily inspect the result at each stage for testing purposes. Python delivers.

from functools import reduce

def transform(string, operations):
    return reduce(lambda value, op: op(value), operations, string)

And that is all. What the hell is this doing? Let's have a look. Our transform function takes a string and a list of operations. It hands the operations and the original string to reduce (which, in Python 3, lives in functools, hence the import), and reduce walks through the list of operations, applying each one to the output of the previous one, then returns the final result. We can call it like so.

transform(testSentence, operations=[remove_punctuation, 
                                    lower_case, 
                                    whitespace_tokenizer])
>>> ['call', 'me', 'ishmael', 'some', 'years', 'ago', 'never', 'mind', 'how', 'long', 'precisely', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', 'i', 'thought', 'i', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '']
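
If the reduce feels opaque, it may help to see that this is just nested function application. The hand-rolled version below (transform_loop is an illustrative name, not part of the pipeline we're building) does exactly the same thing as transform:

# transform(testSentence, [remove_punctuation, lower_case, whitespace_tokenizer])
# is the same as:
# whitespace_tokenizer(lower_case(remove_punctuation(testSentence)))

def transform_loop(string, operations):
    value = string
    for op in operations:    # apply each operation to the previous result
        value = op(value)
    return value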

Now we can build some new functions and add them into our operations pipeline easily.

stopwords = {'all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can', 'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having', 'once', 'would', 'could', 'should', 'like', 'one', 'upon', 'though', 'yet'}
def remove_stopwords(tokens): return [t for t in tokens if t not in stopwords]
def remove_extremes(tokens): return [t for t in tokens if 2 < len(t) < 12]

transform(testSentence, operations=[remove_punctuation, 
                                    lower_case, 
                                    whitespace_tokenizer, 
                                    remove_stopwords, 
                                    remove_extremes])
>>> ['call', 'ishmael', 'years', 'ago', 'never', 'mind', 'long', 'precisely', 'little', 'money', 'purse', 'nothing', 'particular', 'interest', 'shore', 'thought', 'sail', 'little', 'see', 'watery', 'part', 'world']

I think that's pretty neat, pretty readable and pretty useful. It also qualifies for bonus points, since it's extremely easy to add or remove operations from our transformation pipeline.

One interesting thing to notice here is the particular use of Python's dynamic typing. Both remove_punctuation and lower_case take a string and return a transformed string. Our whitespace_tokenizer takes a string and returns a list of strings. Finally, remove_stopwords and remove_extremes both take lists of strings and return lists of strings. Dynamic typing means that we can just throw all of this into our transform reduce and it all just works, so long as each operation accepts the output of the previous one as its input. This is pretty awesome, but it's also quite a lot of responsibility, and it can lead to some issues if we're not careful. As an example:

transform(testSentence, operations=[lower_case, remove_stopwords])
>>> ['c', 'l', 'l', ' ', 'm', 'e', ' ', 'h', 'm', 'e', 'l', '.', ' ', 'o', 'm', 'e', ' ', 'y', 'e', 'r', ' ', 'g', 'o', '-', '-', 'n', 'e', 'v', 'e', 'r', ' ', 'm', 'n', 'd', ' ', 'h', 'o', 'w', ' ', 'l', 'o', 'n', 'g', ' ', 'p', 'r', 'e', 'c', 'e', 'l', 'y', '-', '-', 'h', 'v', 'n', 'g', ' ', 'l', 'l', 'e', ' ', 'o', 'r', ' ', 'n', 'o', ' ', 'm', 'o', 'n', 'e', 'y', ' ', 'n', ' ', 'm', 'y', ' ', 'p', 'u', 'r', 'e', ',', ' ', 'n', 'd', ' ', 'n', 'o', 'h', 'n', 'g', ' ', 'p', 'r', 'c', 'u', 'l', 'r', ' ', 'o', ' ', 'n', 'e', 'r', 'e', ' ', 'm', 'e', ' ', 'o', 'n', ' ', 'h', 'o', 'r', 'e', ',', ' ', ' ', 'h', 'o', 'u', 'g', 'h', ' ', ' ', 'w', 'o', 'u', 'l', 'd', ' ', 'l', ' ', 'b', 'o', 'u', ' ', ' ', 'l', 'l', 'e', ' ', 'n', 'd', ' ', 'e', 'e', ' ', 'h', 'e', ' ', 'w', 'e', 'r', 'y', ' ', 'p', 'r', ' ', 'o', 'f', ' ', 'h', 'e', ' ', 'w', 'o', 'r', 'l', 'd', '.']

What happened? First we converted the string to lowercase, then we tried to remove stopwords. Iterating over a string in Python yields its individual characters, so our remove_stopwords function happily stepped through our string one character at a time, dropping anything that appears in the stopword set (i.e. every 'a', 'i', 's' and 't'). With great power comes great responsibility, so we'll need to be careful.
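
One lightweight way to catch this kind of mistake early is to have token-level operations refuse a bare string instead of silently iterating over its characters. The variant below is just a sketch (remove_stopwords_checked is a hypothetical name, not something we'll use in the rest of this post):

# Hypothetical defensive variant: fails loudly if handed a string
# instead of a list of tokens.
def remove_stopwords_checked(tokens):
    if isinstance(tokens, str):
        raise TypeError("expected a list of tokens, got a bare string")
    return [t for t in tokens if t not in stopwords]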

Luckily we can also extend this pattern a bit to make it easy to test our work before we settle on the pipeline we want.

Testing with Decorators

Python decorators are an extremely useful and often overlooked part of the language specification. It is beyond the scope of this little tutorial to dive into all of the things you can do with decorators; however, one of the primary uses of decorators is to temporarily augment the behavior of your functions. Let's define a decorator to test our operation functions.

def test(function):
    def do(args):
        print("-- Applying: {}".format(function.__name__))
        result = function(args)
        print("Result: {}".format(result))
        return result
    return do

Let's break this down a bit. Our decorator test is a function which takes a function and returns a new function, do. When called, do first prints the name of the wrapped function, then applies that function to its argument, prints the result, and returns it.

Imagine that we pass one of our functions, like lower_case, into test. The result is a new function that does the same thing lower_case already does, but also prints a few debugging lines for us along the way. We can apply our decorator to our functions either using decorator syntax like this:

@test
def my_function(): ...

Or, if we're just doing a temporary test it's often easier to just apply it to our operations with map when we call transform, like so.

result = transform(testSentence, operations=map(test, [remove_punctuation,
                                                       lower_case,
                                                       whitespace_tokenizer]))

-- Applying: remove_punctuation
Result: Call me Ishmael Some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore I thought I would sail about a little and see the watery part of the world 
-- Applying: lower_case
Result: call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore i thought i would sail about a little and see the watery part of the world 
-- Applying: whitespace_tokenizer
Result: ['call', 'me', 'ishmael', 'some', 'years', 'ago', 'never', 'mind', 'how', 'long', 'precisely', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', 'i', 'thought', 'i', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '']

Now transform outputs our debug lines for us to inspect.

And the Top Word Is

Okay, now that we have things working, let's find out what the top words are.

operations = [remove_punctuation, lower_case, whitespace_tokenizer,
              remove_stopwords, remove_extremes]
frequencies = Counter()
for chapter in chapters():
    frequencies.update(transform(chapter, operations))

frequencies.most_common(10)
>>> [('whale', 1139), ('man', 525), ('ship', 507), ('ahab', 504), ('old', 446), ('sea', 435), ('head', 334), ('time', 332), ('long', 330), ('boat', 329)]

Something about whales, ships and ahab. That sounds about right.