Code: Python Pipes for Text Tokenization

Tokenizing text is something I tend to do pretty often, typically at the beginning of an NLP workflow. The normal workflow goes something like this:

1. Build a generator to stream in some sentences / documents / whatever.
2. Perform some set of transformations on the text.
3. Feed the result …
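As a rough illustration of the three-step workflow the excerpt describes, here is a minimal generator-pipeline sketch; the sample corpus and the particular transformations are hypothetical stand-ins, not the post's actual code.

```python
# A minimal sketch of the three-step workflow above, assuming a plain
# generator pipeline; corpus and transformations here are hypothetical.
from typing import Iterable, Iterator, List


def stream_documents(corpus: Iterable[str]) -> Iterator[str]:
    """Step 1: lazily stream in sentences / documents / whatever."""
    for doc in corpus:
        yield doc


def normalize(docs: Iterator[str]) -> Iterator[str]:
    """Step 2a: one transformation -- lowercase and strip whitespace."""
    for doc in docs:
        yield doc.lower().strip()


def tokenize(docs: Iterator[str]) -> Iterator[List[str]]:
    """Step 2b: another transformation -- naive whitespace tokenization."""
    for doc in docs:
        yield doc.split()


if __name__ == "__main__":
    corpus = ["The quick brown fox.", "  Jumped over the LAZY dog.  "]
    # Step 3: feed the result into whatever comes next (here, just print).
    for tokens in tokenize(normalize(stream_documents(corpus))):
        print(tokens)
```

Because each stage consumes and yields an iterator, the stages compose like shell pipes and nothing is held in memory beyond the current document.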
Data: The Data Lake

I spend a lot of time thinking about how to deal with incoming data. I do this partially because it's my job, and partially because I have some kind of workflow OCD. I've been reading an increasing amount of marketing garbage about Data Lakes and decided …