ticdat — The pythonium shield for your model

Mohit Mahajan
5 min read · Nov 3, 2020
source: https://images.app.goo.gl/abTxrxkipSp4EsPW9

Data products in real life are like the Infinity War. Prediction and optimization models are the Avengers of this war, fighting different scenarios and defeating the problems. But would the Avengers have won the war without Wakanda’s vibranium shield?

Hmm.. probably not!
- Dr. Strange ;)

Why?

Understand Data Integrity

Effective algorithms and models are designed to function on a predefined data model. The developers create regulations that input data needs to follow in order to satisfy model dependencies and assumptions. These are the rules that are NOT meant to be broken. For instance, imagine you’re designing a prediction engine with blood group as a feature. This feature can take only discrete values: A-, B-, A+, B+, etc. Or maybe you’re designing a production planning model which takes minimum and maximum inventory values such that minimum ≤ maximum.
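To make those two rules concrete, here’s a minimal hand-rolled sketch in plain Python (not ticdat; the field names are made up for illustration):

```python
# Illustrative only: the two rules from the text, checked by hand.
VALID_BLOOD_GROUPS = {"A-", "B-", "A+", "B+", "AB-", "AB+", "O-", "O+"}

def check_record(record):
    """Return a list of rule violations for one input record."""
    violations = []
    if record["blood_group"] not in VALID_BLOOD_GROUPS:
        violations.append(f"unknown blood group: {record['blood_group']}")
    if record["min_inventory"] > record["max_inventory"]:
        violations.append("minimum inventory exceeds maximum inventory")
    return violations

good = {"blood_group": "A+", "min_inventory": 10, "max_inventory": 50}
bad = {"blood_group": "D", "min_inventory": 90, "max_inventory": 50}

print(check_record(good))  # []
print(check_record(bad))   # two violations
```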

But sometimes things go wrong; users are inadvertently careless. They throw something into the inputs that violates the assumptions, like minimum inventory > maximum inventory, or a blood group value of D. Occurrences like these will break your algorithm’s flow, or maybe just make it spit out garbage. This is exactly the Achilles’ heel of your procedure.

A user of the product can mess up the input data in all kinds of ways, and as a good engineering practice one must resolve or handle such vulnerabilities before the input data actually touches the model. In technical terminology, such a routine is called a data integrity check. It is just like the vibranium shield: only good things (integral datasets) are allowed to enter Wakanda (the model).

Why should you care?

In an application development scenario, besides providing great documentation and a nice UI, it’s equally important to supply a proper integrity-checking piece. Users of the product or solution do not directly ask for such a module, but it’s important for them to know about the problems in their dataset. Those problems need to be highlighted in order to build greater confidence in the solution, and the module is well appreciated by users once the model is deployed in production.

Problem with “data integrity check”

Software engineers and data scientists have adopted several ways to address the data integrity layer over time: composing an ORM structure for the data model, writing SQL procedures, or figuring out their own way with pandas and NumPy. Such modules demand significant effort in development and testing. Organizations struggle to keep such a piece of code standardized and documented, and with continuous improvements and feature enhancements the code base also evolves, which in turn makes maintenance far more complicated.

“ticdat is a solution”

ticdat solves this problem! ticdat provides a modern and standard way to perform data integrity checking for the model. It abstracts the rules (I’ll talk about this later in this article) away from the data instance, meaning a dataset can be arbitrary, but for a successful execution it necessarily needs to comply with a set of predefined rules.

As I said, it is a pythonium shield, meaning ticdat is only supported on Python 3. With more and more data applications being developed in Python, there’s already a vast addressable market.

The ticdat module provides several ways for the developer to create a set of data integrity rules, which is the expectation placed on the data instance. Then there are different ways to source the data and pass it through the defined rules. In this process, ticdat’s routines find the records of the dataset which don’t comply with the defined rules (call these non-compliances violations). A developer can then choose how to handle them: impute values or raise exceptions.

You’re probably thinking… okay, this sounds very good! How does it work?
The following is a structured way to think about data integrity.

Remember 2 things — “Data Rules” and “Data Sourcing”

Data Rules
These are a set of constraints that input data is expected to follow. For example, say your model’s inputs are several pandas dataframes; then the compliance you’d probably require includes: no duplicates, a fixed domain for a column (like blood group: A, B, O, AB; percentages: 0–100), a declared data type (integer/float/string), no nulls, defaults, etc.
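Data rules can be stated declaratively, separate from any particular dataset. Here is a plain-Python sketch of that idea (ticdat expresses the same concepts through its factory classes; the rule names and fields here are invented for illustration):

```python
# Declarative rules: domain, range, type, and no-null checks,
# kept separate from the data they validate.
RULES = {
    "blood_group": {"type": str, "domain": {"A", "B", "O", "AB"}},
    "percentage":  {"type": float, "min": 0.0, "max": 100.0},
}

def find_violations(rows):
    """Yield (row_index, field, reason) for every rule the data breaks."""
    for i, row in enumerate(rows):
        for field, rule in RULES.items():
            value = row.get(field)
            if value is None:
                yield (i, field, "null not allowed")
            elif not isinstance(value, rule["type"]):
                yield (i, field, f"expected {rule['type'].__name__}")
            elif "domain" in rule and value not in rule["domain"]:
                yield (i, field, f"{value!r} outside domain")
            elif "min" in rule and not (rule["min"] <= value <= rule["max"]):
                yield (i, field, f"{value} outside [{rule['min']}, {rule['max']}]")

rows = [{"blood_group": "AB", "percentage": 42.5},
        {"blood_group": "D",  "percentage": 120.0}]
print(list(find_violations(rows)))  # two violations, both in the second row
```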

Data Sourcing
Different ways in which the data is submitted to the model. The source of the data could be an API, SQL, CSV, XLS, etc.
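Whatever the source, the goal is to land the data in one common shape so a single set of rules can run on it. A stdlib-only sketch of two such sources feeding the same shape (this is the idea behind ticdat’s connectors, not ticdat’s actual code):

```python
# Two sources, one common shape: a list of dicts.
import csv
import io
import json

def rows_from_csv(text):
    """Parse CSV text into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))

def rows_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)

csv_rows = rows_from_csv("blood_group\nA\nD\n")
json_rows = rows_from_json('[{"blood_group": "A"}, {"blood_group": "D"}]')
assert csv_rows == json_rows  # identical shape, regardless of source
```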

Good software design entails that no matter what the data source is, the inputs must always comply with the exact same rules. Therefore, abstracting the data integrity rules away from the data sourcing is the better design choice for the input data pipeline: it makes the code modular as well as easy to maintain. ticdat follows the same paradigm; there are two things: “Factory” and “Connectors”.

Factory
A factory lets users define the data rules of the model.

Connectors
Connectors get the data from a data source and collect all of it in an object. I’ll talk about these in detail in the sequel to this article.

Now, using these two entities from above, here’s how the flow is orchestrated in the integrity layer using ticdat:

  1. Define the dataset structure & rules (ticdat does this)
  2. Source the data (ticdat optionally does this)
  3. Find integrity failures (ticdat does this)
  4. Transform & Transfer the data to the model

What else?

Besides integrity checking, ticdat also powers solid GUI integrity error reporting and messaging that enables users to analyze the problems in their dataset themselves. Because of that, users are capable of owning and solving data problems on their own, which in the long run helps support and development teams focus more on the important feature-building work.

In summary, data integrity is important for a data product because a model is designed with certain assumptions and certain expectations of its inputs, and therefore needs to be protected from the impurities of the outside world. ticdat provides a great framework for developers and data scientists to address data integrity in a modern and standard way.

There’s an upcoming post which deep-dives into code and provides an end-to-end picture of how to build a data integrity layer with ticdat, using a simple case. Stay tuned!

Resources

Find ticdat here:
