Data at Inkling
It’s by now a cliche mantra in tech companies: make data-driven decisions. Each data-driven decision, however, is founded on gathering clean, useful data. At Inkling, we handle millions of events every day that originate from both user-facing products and backend systems. To add to the complexity, these events are constructed by different teams with their own goals and metrics in mind. How do you develop a system that all can use at scale?
In some cases, gathering simple metrics via counters using a system like StatsD is sufficient. Many more cases, however, require context around a metric (e.g. user agent, currently active view), and this is where things get interesting: every single one of our events is decentralized. No master Thrift data structure exists describing an event type. No repo must be shared across systems and linked into applications. Everything boils down to JSON at runtime with only two requirements: at the base level there must exist an “event type” field and an “id” field. Anything else goes! If you want to add nested objects or just a single number, it all works. In order to then run analytics on that event type, we create a schema describing the data, including what fields to expect (filtering of bad or unexpected events) and how those fields should map semantically (aggregation of different events across applications.)
Thrift always seems like a good idea until it’s not
I love Thrift as a data definition language (DDL). It is incredibly easy to work with, compiles into any language, and has very few bugs. Plus, it’s modular and really enforces the D.R.Y. philosophy. Once you install and integrate it into your build system , little else is needed. You create an “event.thrift” file that contains a handful of powerful data structures to track metrics like user agent, page context, referrer , and load time. Your DDL requires all of these fields be present, so analyzing data is a breeze. So far, so good.
Then you start to grow. Soon you have two or three different applications running, each chasing different success metrics from entirely new contexts. In a perfect world, all products and services would implement similar abstractions and pieces would fit together snugly; however, things are rarely that smooth. Realistically, your DDL will grow quickly. After a transition to a systems-oriented architecture, native and web-based apps, and public facing APIs, only a few people will have enough context to understand the schema in its entirety . The downsides become apparent in this environment and quickly.
The problem with this approach is threefold:
- A single point of entry for all analytic events leads to syntactic merge conflicts and, even worse, semantic merge conflicts. Both of these slow down the speed of development in the best case and in the worst can give incorrect data.
- Generating and updating analytic events becomes extremely arduous and, likely, a shirked responsibility for non-critical code paths. An engineer who does not fully understand the global DDL and becomes intimidated by a massive centralized analytic tracking system is less likely to submit events herself.
- There’s a very steep learning curve for anyone processing this massively complex data stream.
What about NoSQL or a search engine solution?
Given that our events would fit well into a NoSQL structure, it begs the question: why not go down that route? In fact, we did create a large MongoDB cluster that we dump events into for quick processing. This allows any engineer to immediately analyze their datasets which can be quite helpful for quick turnaround on bugs or new features’ performances.
The two major downsides to this approach as a long-term solution are:
- There is no good way to create curated, up-to-date documentation about the data in the system. Although documentation can exist in parallel to the actual data, this approach very quickly becomes unmaintainable during rapid iteration cycles because documentation falls behind implementation. Users lose confidence in the documentation because it does not represent the state of the system. The most appropriate solution: centralize documentation as a global DDL (see above).
- There is no good way to map together different event types with the same semantic meaning but different representations. If two teams describe their metrics differently, they need to aggregate these concepts together offline using MongoDB map reduce or in the case of ElasticSearch, crazy online merges. Thus, aggregate information must be described in a global DDL (see above).
Dynamic offline event curation
The first approach is what I like to call centralized, online event curation because each event contains its full semantic meaning at runtime. At Inkling, we take the opposite approach: fire whatever you want, and tell the system what to do with it offline. Each event type is curated via offline schemas that configure our data pipeline’s processing for that event. Thus, the event may be fired online and then we introduce rigid requirements offline.
For instance, if the iOS team decides to track signups differently than the web team, it’s not an issue: the schema describing the “com.inkling.ios.SignupRequest” event maps the fields properly to the “com.inkling.SignUp” dataset. As mentioned above, this is done by first telling the data pipeline what an event looks like so it may filter out malformed or malicious data and then describing what each field means semantically. The final product then becomes a nice decentralized definition both syntactically and semantically. A neat side effect is that each of these schemas provide documentation to each individual event and its corresponding dataset representation.
The final product is Phase 1, or the sanitization phase, of Inkling’s data pipeline. At this point, all data flows through and is filtered and mapped using these schemas to properly sanitize the input. Once data emerges and it becomes usable, we now have confidence that the data is well-formed and that each field means what is defined in its schema.
A decentralized event generation system along with a set of well-defined schema-driven datasets allows for faster iteration and the ability to create new streams of data without cross-team planning during development sprints. This allows us to move faster. Furthermore, this system allows for analyzing itself because all we must do is have it emit events and describe them with schemas. This recursive analysis of data processing is only one example of where this system has taken us. We continue to iterate every day on the data team, coming up with new and powerful ways to inspect and curate our data streams. If this sounds like something you’re interested in, we’re hiring! Also, follow us on twitter @InklingEng.
 – If you were able to install Thrift and integrate it into your build system without needing a DDL for the 4 letter words spewing all over the place, mad respect.
 – Or referrer to not be ambiguous.
 – It’s not uncommon to hear about a 250 field Thrift object where all new events require meetings with dozens of people to determine what data goes where and what it means. All of this happens before the first event can even fire!