Addressing the biggest data problem
As you may have read on the Enigma blog, we recently announced ParseKit, a platform we created to simplify data ingestion and ETL (Extract, Transform, Load). ETL systems are a key component of an organization’s ability to integrate data from multiple sources.
If you google the “biggest problems for data scientists,” you will find many discussions about how difficult it is to ingest, clean, and work with disparate sources of data. Based on my experience, these online conversations ring all too true: while data can be a powerful resource, any modeling or analysis is useless if that data is bad or, worse, inaccessible. The Enigma team built ParseKit after realizing there were huge gains to be made by formalizing the tools for data integration. We use it internally at Enigma to maintain one of the largest repositories of public data available anywhere.
Too much time is spent trying to access data, instead of using it to make data-driven decisions. I discovered this at my first internship at a 7-person startup where I worked on data analytics. Much of the data that I worked with had to be collected manually from a number of different systems including Salesforce, Google Analytics, Optimizely, etc. When data was needed for a report, blog post, sales meeting, etc., I had to manually combine data in Excel sheets to complete my analysis or create visualizations. Even worse, we didn’t have programmatic infrastructure around our data resources. Daily performance of A/B tests and site visits was not readily available without manually checking each source every day. Those experiences made me realize early on just how crucial good data ingestion processes are.
ParseKit is not the first solution I’ve seen for creating a robust data pipeline. In the past, I used a custom framework before moving to Luigi to handle daily ETL. That custom framework, while functional, had evolved over the years into a dense codebase that no one fully understood.
Luigi, while powerful, introduced other problems for the team. Luigi and its open-source compatriot Airflow are built around Directed Acyclic Graphs (DAGs) that define a network of task dependencies. While technically sound, this method of creating data pipelines can quickly become complicated, difficult to debug, and hard for less-technical users to understand. Combining disparate data sources became a bit of a chore using Luigi. When the company has app metrics in Flurry, web metrics in Google Analytics, and native logs in Hive, what is the best way to create a task graph? Is it better to have each task deal with one data point? One file? One table? One geographic region? Throwing Hive into the mix made planning especially difficult because it held too much data to fit in memory.
I don’t think about a data pipeline as a complex web of back-propagating dependencies; rather, I know that I need to download some data, clean some points, and then output standardized data to whoever needs it. ParseKit approaches ETL the same way I do, reflecting a user’s thought process. It simplifies the pipeline by making it a series of linear steps, written in relatively natural language. For each step, data goes in, and data comes out. ParseKit steps are executed in the order they are written, one record at a time. ParseKit simplifies without reducing power, introducing simple and standard ways to handle general problems.
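To make the "linear steps, one record at a time" idea concrete, here is a minimal Python sketch. It is not ParseKit syntax, which this post does not show. The step names (`strip_whitespace`, `drop_missing_id`), the `run_pipeline` helper, and the sample records are all made up for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
Step = Callable[[Iterator[Record]], Iterator[Record]]


def strip_whitespace(records: Iterator[Record]) -> Iterator[Record]:
    """Hypothetical cleaning step: trim whitespace from every value."""
    for record in records:
        yield {key: value.strip() for key, value in record.items()}


def drop_missing_id(records: Iterator[Record]) -> Iterator[Record]:
    """Hypothetical filter step: skip records whose 'id' is empty."""
    for record in records:
        if record.get("id"):
            yield record


def run_pipeline(source: Iterable[Record], steps: List[Step]) -> List[Record]:
    """Chain the steps in the order given; data goes in, data comes out."""
    stream = iter(source)
    for step in steps:
        # Generators are lazy, so each record passes through every step
        # before the next record is read from the source.
        stream = step(stream)
    return list(stream)


# Invented sample input, standing in for freshly downloaded data.
raw = [
    {"id": " 1 ", "city": " New York "},
    {"id": "", "city": "Boston"},
]
clean = run_pipeline(raw, [strip_whitespace, drop_missing_id])
print(clean)
```

Reading the list of steps from top to bottom tells you exactly what happens to a record, with no dependency graph to trace.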
Simplicity should not come at the cost of flexibility. There are going to be problems that cannot be solved with the exact tools in front of you. One of the best aspects of using ParseKit is that users have the ability to write their own steps that adapt to edge cases. Anyone who does a lot of data integration or ingestion work will realize that the 80/20 rule tends to hold: 80% of what is needed to process data can be generalized and solved by libraries or packages, but 20% of the work requires a solution completely unique to that one case. For example, if you look at the Census Time Series Indicators, the data is stored in easy-to-parse CSVs except for one annoying deviation: the code definitions and table metadata are stored in cells above where the table begins. Writing and integrating custom work to address this kind of problem is challenging when working with today’s enterprise ETL tools. With ParseKit, I can limit my custom code to address that one deviation, and the rest of my pipeline can use the same steps I use on any other CSV dataset. As an added bonus, ParseKit pipelines are written in plain text files, with no proprietary formatting. They can therefore be written in any editor or IDE, and I am not locked into a restrictive and proprietary development environment, or worse, forced to learn a GUI.
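As a rough illustration of keeping custom code limited to the one deviation, the Python sketch below skips a metadata preamble and then hands the rest of the file to a standard CSV reader. The file contents, the column names, and the `skip_preamble` helper are invented for this example. They are not the actual Census layout or a ParseKit step.

```python
import csv
import io
from typing import Iterable, Iterator


def skip_preamble(lines: Iterable[str], header_prefix: str) -> Iterator[str]:
    """Custom step: discard everything before the real header row.

    Assumes the table's header row can be recognized by how it starts.
    """
    found = False
    for line in lines:
        if not found and line.startswith(header_prefix):
            found = True
        if found:
            yield line


# Made-up file: definitions and metadata sit above the actual table.
sample = io.StringIO(
    "Code definitions\n"
    "cat_idx,1,Total\n"
    "\n"
    "per_idx,cat_idx,val\n"
    "1,1,100\n"
    "2,1,110\n"
)

# Only the preamble handling is custom; parsing is the standard library's.
rows = list(csv.DictReader(skip_preamble(sample, "per_idx,")))
print(rows)
```

Everything after `skip_preamble` is the same code I would use on any well-behaved CSV, which is the point of the 80/20 split.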
Additionally, one of the huge benefits of ParseKit is its flexibility around architecture. At previous employers, ETL jobs were all stored in one large repository, around which all development and code review would be centered. One of the reasons the custom framework failed was that the components of the system were too interlocked. The core infrastructure for the ETL framework, such as the scheduler and configurations, existed in the same repository as every single job. With ParseKit, teams have the choice of how to architect their ETL. At Enigma, we give every pipeline its own git repository to ensure modularity. Architecting our pipelines this way prevents merge conflicts from becoming a constant headache and allows for semantic versioning of each component of our system.
Quality Assurance & Standardization
While flexibility is key, I also rely on the simple and standard tools of the ParseKit platform. My past experience with bespoke ETL frameworks taught me that it’s difficult to prevent every pipeline from being readable only to the person who wrote it. With ParseKit, every custom step must conform to a common specification. This keeps every step not only readable but also self-documenting.
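The post does not spell out what ParseKit's specification looks like, so the Python sketch below is only one way such a contract could work. The `Step` base class, its `process` and `describe` methods, and the `UppercaseState` example are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional

Record = Dict[str, str]


class Step(ABC):
    # Hypothetical common interface, NOT ParseKit's actual specification.

    @abstractmethod
    def process(self, record: Record) -> Optional[Record]:
        """Return the transformed record, or None to drop it."""

    def describe(self) -> str:
        # The first line of the subclass docstring doubles as documentation.
        doc = (type(self).__doc__ or "").strip()
        first_line = doc.splitlines()[0] if doc else "(undocumented)"
        return f"{type(self).__name__}: {first_line}"


class UppercaseState(Step):
    """Normalize the 'state' field to upper case."""

    def process(self, record: Record) -> Optional[Record]:
        return {**record, "state": record["state"].upper()}


steps = [UppercaseState()]
summary = [step.describe() for step in steps]
print(summary)
```

Because every step must implement the same method and carries its own description, a reviewer can read a pipeline they did not write and still see what each step is meant to do.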
As I mentioned, keeping separate pipelines in different repositories is crucial to our workflow at Enigma. Version control at the pipeline level improves the quality-assurance process, and regular code review is a team standard. Many organizations don’t understand how to, or don’t have the tools to, implement code review. In those scenarios, jobs are written ad hoc and are considered complete when the data gets wherever it needs to be. In 2016, I should not need to email project files or notebooks to get a peer to review my work. I should not be testing jobs on production servers. I should not be manually running jobs instead of scheduling them. Yet these are all practices that I’ve encountered, and those experiences make me appreciate the features of the ParseKit platform.
No Data Left Behind
Another common practice I’ve encountered is the use of aggregations, computations meant to summarize data, directly in the pipeline. On my last team, we wrote ETL jobs to do certain computations, such as averaging a metric, during the job rather than after the data was combined. Aggregations were thought to save memory and potentially make jobs faster. However, having pipelines yield aggregations introduces new deficiencies. A prime example arose when a new aggregation was requested, such as a maximum. Since the source data had been changed during ingestion, the original data was not available in our finished database to compute the aggregation. As a result, an entirely new job had to be created and then backfilled for previous days before it could be used. Because the jobs were not modular, and therefore not independent of each other, it was difficult to alter an existing job to add one more aggregation without breaking other analytics tools.
“No data left behind” has become the mantra at Enigma. Our belief is that too much emphasis is placed on aggregation in ETL. With every transformation applied, detail about the data is potentially lost. We use ParseKit to focus on de-siloing data and standardizing it without compromising the data’s meaning. Changes to the data are appended to the source, not overwritten. Once all the data is in one place, derivatives can always be generated from transformations, but the option to have the underlying source data remains.
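A small Python sketch of why this matters, using invented page-load numbers rather than real data: if a job stores only an average, a later request for the maximum cannot be answered, while a job that keeps the records can produce either one on demand. The field names are assumptions made for the example.

```python
from statistics import mean

# Invented source records for one day.
raw_visits = [
    {"day": "2016-05-01", "page": "/home", "load_ms": 120},
    {"day": "2016-05-01", "page": "/home", "load_ms": 340},
    {"day": "2016-05-01", "page": "/home", "load_ms": 200},
]

# Approach 1: aggregate during ingestion and store only the summary.
aggregated_only = {
    "day": "2016-05-01",
    "avg_load_ms": mean(r["load_ms"] for r in raw_visits),
}
# The individual values are gone, so a maximum cannot be derived from this.

# Approach 2: store the records and derive summaries when they are needed.
avg_load_ms = mean(r["load_ms"] for r in raw_visits)
max_load_ms = max(r["load_ms"] for r in raw_visits)

print(aggregated_only, avg_load_ms, max_load_ms)
```

With the second approach, adding a new aggregation is a new query against stored data, not a new job and a backfill.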
I hope this post has given you a good sense of my day-to-day experiences using ParseKit. Feel free to get in touch if you have questions!