I spend a lot of time thinking about how to deal with incoming data. I do this partially because it's my job, and partially because I have some kind of workflow OCD. I've been reading an increasing amount of market garbage about Data Lakes and decided to throw in my thoughts.

Imagine Beer

As I find it somewhat easier to discuss data things by example rather than in the abstract, let's invent a new company. We're going to found a startup that aggregates data and ratings about varieties of beer.

We go talk to some investors and say something like "We're the Yelp of beer!" They give us $10M and we get a swanky website with an exotic country code like beer.iq. Now we just need a product. Let's get started.

What Is A Data Lake?

Fundamentally, it's just a raw data store. That's basically it. Data comes in and is deposited into the data lake in a raw or mostly raw form. As much as software vendors want you to believe otherwise, there is no specific data lake technology stack, though there are a few things you'll want to consider to get maximum value.

  • It should be expandable. One of the things that differentiates a data lake from a regular old data warehouse is that your data lake is where all of your data lives. Unless you have a very good reason to archive stuff, your data starts off in the lake and stays in the lake, so it's essential that you can cheaply expand your storage.
  • It should be schema agnostic. Data lakes hold raw data, or at least mostly raw data. The less you need to know about your data in order to load it, the better. There is a reason that file systems like HDFS or schemaless document stores like MongoDB are used for this purpose. If your data lake requires you to define table schemas, you're missing the point, and missing one of the biggest benefits of having a data lake in the first place.
  • It should be accessible. Flexibility is key here. One of the selling points of a lot of proprietary big-data technology is the ability to interact with the data using something like SQL. This is nice, but neither necessary nor sufficient. The more methods and APIs your data lake technology makes available to access and use your data, the fewer workarounds and APIs you'll be building yourself.
  • It should be scalable. A data lake is not a database. Production database requirements like transactional processing, millisecond response times and detailed indexes are not appropriate here. What is important is that it scales gracefully. You're going to be storing and processing a lot of stuff in here, so distributed processing and parallelism that don't break the bank are what matter.

In my personal experience, I've worked with data lake systems which utilize the file system approach (Hadoop/HDFS), the document store approach (MongoDB) and the more relational table approach (various proprietary solutions). The distributed file system approach is, without question, superior for this task. Period. The Apache Hadoop stack in particular, as much of a pain in the ass as it is to set up, hits every one of these points hard. It's cheaply expandable and scalable, has a wealth of accessibility options (e.g. Hive, Pig, MapReduce, Spark, etc.) and, being a file system, is about as schema agnostic as you can get. It also happens to be open source.
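
To make the accessibility point concrete, here's a minimal PySpark sketch of what "SQL over raw files" looks like. The HDFS path and field names are invented for the beer.iq example; the thing to notice is that nothing requires a table schema to be defined up front.

    # Minimal sketch: ad-hoc SQL over raw JSON files sitting in HDFS.
    # The path and field names are hypothetical; Spark infers the schema at read time.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("beeriq-adhoc").getOrCreate()

    # Point directly at the raw landing zone, no table definitions required.
    raw_prices = spark.read.json("hdfs:///data/raw/prices/")

    # Register a temporary view and explore it with plain SQL.
    raw_prices.createOrReplaceTempView("raw_prices")
    spark.sql("""
        SELECT beer_name, COUNT(*) AS records, AVG(price) AS avg_price
        FROM raw_prices
        GROUP BY beer_name
        ORDER BY records DESC
    """).show(20)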

Hadoop is the worst platform for your data, except for all the others.

Document stores, like MongoDB, are passable assuming that your data is not too big or too crazy (i.e. at least partially structured). If you're using a relational data store for this sort of thing, you're wasting a ton of money on developer time to get your data in, on server hardware to make your systems scale, and probably on a license for a proprietary big-data database. Seriously, go check out Apache Hadoop and its related projects.

Okay, let's get some data.

At our imaginary headquarters we hire a ton of engineers to start pulling in whatever they can find about beer, and dumping it to HDFS. After a few months we have a terabyte of messy, unformatted random information about beer. Welcome to the data lake.
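
The ingestion side of this can be as dumb as it sounds. Here's a sketch of the pattern, using the stock hdfs dfs CLI to land whatever a source hands us, byte for byte, in a date-partitioned directory. The endpoint, source name, and paths are all made up for beer.iq.

    # Sketch of a raw ingestion job: fetch a payload and land it in HDFS untouched.
    # The endpoint, source name, and directory layout are hypothetical.
    import datetime
    import subprocess
    import tempfile

    import requests

    SOURCE = "beer_prices_api"
    ENDPOINT = "https://api.example.com/v1/beer/prices"  # hypothetical

    def land_raw_payload():
        today = datetime.date.today().isoformat()
        response = requests.get(ENDPOINT, timeout=30)
        response.raise_for_status()

        # Write the response body to a local temp file exactly as received:
        # no parsing, no schema, no transformation.
        with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as tmp:
            tmp.write(response.content)
            local_path = tmp.name

        # Land it in a date-partitioned directory in the lake.
        hdfs_dir = f"/data/raw/{SOURCE}/dt={today}/"
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
        subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)

    if __name__ == "__main__":
        land_raw_payload()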

Come on In, The Water's Fine

If you're accustomed to a more standard ETL process, you're probably thinking that this entire thing is just insane. Why would anyone want a gigantic mess of unusable raw data? As it turns out, there are a lot of very good reasons. Let's have a look.

Quality Assurance

At beer.iq we pull price data for our beer from a 3rd party API. One day we get a complaint that our listed price for a pint of Hopocalypse IPA is incorrect.

Our QA engineers take a look and decide that the cause of the problem is either that:

  • Our 3rd party partner sent us something stupid
  • Our transformation code did something stupid (probably because our 3rd party partner sent us something stupid)

Since we keep all of the raw data transactions in our data lake we can just go and see what data was fed through the system at the time of the report and make a decision about how to permanently fix the problem.

Additionally, if the problem is critical enough, the raw data can be used to provide a fact-based analysis of the current and historical extent of the issue and the reliability of the 3rd party data vendor. You could even build an outlier detection system that uses this historical data to automatically detect and flag these kinds of problems in the first place.
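
That outlier detection doesn't have to be fancy either. Here's a sketch, assuming our raw price feed has been landed as JSON under a hypothetical /data/raw/beer_prices_api/ path: compute each beer's historical mean and standard deviation from the archive, then flag anything that strays too far from it.

    # Sketch: flag suspicious prices by comparing each record against the
    # historical mean/stddev for that beer, computed from the raw archive.
    # Paths and field names are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("beeriq-price-outliers").getOrCreate()

    prices = spark.read.json("hdfs:///data/raw/beer_prices_api/")

    # Per-beer historical statistics over everything we've ever received.
    stats = prices.groupBy("beer_id").agg(
        F.avg("price").alias("mean_price"),
        F.stddev("price").alias("std_price"),
    )

    # Join the stats back and flag anything more than 3 standard deviations out.
    flagged = (
        prices.join(stats, "beer_id")
        .withColumn(
            "is_outlier",
            F.abs(F.col("price") - F.col("mean_price")) > 3 * F.col("std_price"),
        )
        .filter("is_outlier")
    )

    flagged.select("beer_id", "price", "mean_price", "std_price").show()

Whether three standard deviations is the right threshold depends entirely on how noisy beer prices actually are; the point is that the raw history is sitting there to compute it from.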

Data Flexibility

One of our product managers wants to build a feature which shows the historical price of each variety of beer.

The original purpose of our data was to give someone the price of beer now, not to display a historical trend. However, since we've been keeping an archive of our historical data state, this is something that we can do.

You cannot predict the future use cases for your data. Assuming otherwise can cost you. If we had only kept our operational data, our new feature would either require waiting to collect enough data to build a usable feature or paying a 3rd party that was smart enough not to throw that data away.
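
To make that concrete: assuming the raw feeds were landed in date-partitioned directories (the dt=YYYY-MM-DD layout from the ingestion sketch above), the history feature is mostly just a read over partitions we were already keeping. Paths and field names are, as before, invented.

    # Sketch: derive a per-beer price history from raw, date-partitioned snapshots.
    # The path layout (dt=YYYY-MM-DD) and field names are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("beeriq-price-history").getOrCreate()

    # Reading the root directory picks up the dt= partition value as a column.
    snapshots = spark.read.json("hdfs:///data/raw/beer_prices_api/")

    history = (
        snapshots.groupBy("beer_id", "dt")
        .agg(F.avg("price").alias("price"))
        .orderBy("beer_id", "dt")
    )

    # Write the derived history out as a tidy, query-friendly table.
    history.write.mode("overwrite").parquet("hdfs:///data/derived/beer_price_history/")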

Source Quality

One of our investors notices that we spend $10,000 a month to get a daily data dump from a major beer distributor. He wants to know if it's worth it.

Is this a terrifying question? If you're a data company it shouldn't be. While on-boarding new data sources can be a bit of a crapshoot, evaluating old ones should be an integral part of your business.

At beer.iq our imaginary data scientists periodically evaluate our raw data for each of our sources to provide a health check for our systems and feedback to our data providers about how they can improve their offerings or fix issues that we notice.

They take a look at our beer distributor data and discover that while it provided a lot of valuable information six months ago, since then we've built a bunch of web-crawlers that make most of the information it provides redundant. Additionally, the day-to-day change is inconsequential, making a daily dump a waste of compute time. Maybe it's time to renegotiate our contract.
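
That kind of evaluation is pretty mechanical once every source's raw data is sitting in the same lake. A sketch, with made-up paths and a made-up join key, of asking how much of the distributor dump our crawlers already cover:

    # Sketch: estimate how redundant a paid feed is against our own crawlers
    # by checking how many of its records we already have. Paths and the
    # join key (beer_id) are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("beeriq-source-eval").getOrCreate()

    distributor = spark.read.json("hdfs:///data/raw/distributor_dump/").select("beer_id").distinct()
    crawled = spark.read.json("hdfs:///data/raw/web_crawls/").select("beer_id").distinct()

    total = distributor.count()
    overlap = distributor.join(crawled, "beer_id").count()
    pct = overlap / total if total else 0.0

    print(f"{overlap}/{total} distributor records ({pct:.1%}) "
          "are already covered by our crawlers")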

Data Agility

There's a new craze on Twitter with people talking about craft beer using the hashtag #omgbeer. It's kind of a big deal, and product wants to start building features off of it yesterday.

Twitter data! Crap. What kind of schema do we use for tweets? How do we want to use them? What are the business requirements? These are all very important questions, and at this stage our data engineers don't have to care. The product team can be dreaming up the next set of Twitter-based features, the engineering team can be putting together production SQL databases and mocking up UIs, and the data team can just focus on getting the data into the ecosystem. Transformation and schematization happen later, when we know what we've got and how we're going to use it.

By the time product has decided on what to do with the data, it's already there waiting in the data lake, hopefully with some level of automation and quality control. Most importantly, the data scientists focused on analyzing it and the engineers focused on using it don't need to worry about how it gets there. When the time comes to productionalize the data or some model built on the data, everyone has a single source of truth to work from.
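
When product does settle on a feature, the schematization step is just a read over the raw tweets that have been piling up. A sketch, assuming the tweets were landed as raw JSON under a hypothetical /data/raw/twitter/ path and follow Twitter's classic payload shape (text, entities.hashtags), pulling out only what the #omgbeer feature needs:

    # Sketch: schema-on-read over raw tweets. Select only the fields the
    # feature needs, now that we know what the feature is. The path is
    # hypothetical; field names follow Twitter's classic JSON payload.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("beeriq-omgbeer").getOrCreate()

    tweets = spark.read.json("hdfs:///data/raw/twitter/")

    omgbeer = (
        tweets.select(
            "id_str",
            "created_at",
            "text",
            F.explode("entities.hashtags.text").alias("hashtag"),
        )
        .filter(F.lower(F.col("hashtag")) == "omgbeer")
    )

    omgbeer.show(10, truncate=False)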

Schema Schmema

Our success with the Twitter data was so awesome that we've decided to crawl a bunch of beer-related blogs. After a few weeks we have thousands of pages of raw text.

Where is your relational-database deity now? Data takes on many forms, not all of which fit easily into a relational structure. Actually, I'm going to go a bit further and say that most of the valuable data out there won't fit easily into a relational structure. Even in the cases where you can load it into a set of tables, you're usually just making it more of a pain in the ass for your data scientists to use the data. Ten times out of ten, I'd rather work from the raw text files than whatever JOIN monstrosity results from trying to shoehorn unstructured data into a relational database.

Separation of Concerns

A popular beer website has changed their layout, causing a bunch of our crawlers to go crazy.

Welcome to the data business. Data ingestion is messy and prone to unexpected breakage. In a lot of data pipelines that I've seen, this issue would become apparent at about the same time that production started breaking. It would then be blamed on whatever crawler code was responsible, as if the developer was just too stupid to make the crawler sentient. This is rage inducing. Only crazy people release untested code into production, yet untested data often gets a direct line straight in.

QA your data. Seriously.

Luckily at beer.iq our data guys are, well, data guys. Their focused responsibility for data ingestion, storage and retrieval means that their automated quality systems noticed this issue before it broke production, and they're already working on updating the crawlers. Meanwhile, product and UI can focus on whatever they need to in order to deal with the temporary lack of fresh data.

You don't specifically need a data lake for quality control, but it helps. Noticing your crawler just pulled in 100 records isn't meaningful unless you know that it usually pulls in 100,000 records. Additionally, having the raw data lake means you'll be able to see all of the (terrifyingly broken) records that you retrieved, allowing you to more rapidly diagnose the problem and implement a fix. In a traditional ETL system a situation like this often results in the crawler failing, nothing being written into the database, and some poor bastard digging through log files to find out why.
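
A minimal sketch of that kind of volume check, assuming each crawl lands its output under a date-partitioned path like the ones above: compare today's record count against a trailing baseline and yell when it collapses. The paths and the ten-percent threshold are, of course, assumptions.

    # Sketch: flag a crawl whose record count collapses relative to its own history.
    # Paths, partition layout, and the alerting hook are hypothetical.
    import datetime

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("beeriq-crawl-volume-check").getOrCreate()

    SOURCE_PATH = "hdfs:///data/raw/web_crawls/popular_beer_site"
    today = datetime.date.today().isoformat()

    # Trailing seven days as a crude baseline for "normal" volume.
    baseline_days = [
        (datetime.date.today() - datetime.timedelta(days=d)).isoformat()
        for d in range(1, 8)
    ]

    def daily_count(dt):
        return spark.read.json(f"{SOURCE_PATH}/dt={dt}/").count()

    todays_count = daily_count(today)
    baseline = sum(daily_count(dt) for dt in baseline_days) / len(baseline_days)

    if todays_count < 0.1 * baseline:
        # Hook up whatever alerting you actually use here.
        print(f"ALERT: {todays_count} records today vs ~{baseline:.0f}/day baseline")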

Final Thoughts

So this has already gone on longer than I'd expected, and there is still a lot more I could say. Generally I think the idea of having a raw-data store provides a wealth of utility to a data pipeline. Maximizing that utility, however, does require some different ways of thinking about things like data validation, ETL processes, Business Intelligence, and the role and purpose of Databases, but I'll leave that for a later post.