ticdat — Chapter-1: Data Rules

Mohit Mahajan
11 min read · Jun 12, 2021

This article talks about how to set up a basic integrity check for a model using the ticdat library. It is a sequel to the previous article — ticdat — The pythonium shield for your model. A beginner level of Python programming and basic concepts of data modeling are the only prerequisites.

Image Source: https://www.springboard.com/blog/data-analytics/data-quality/

Now we know what “data integrity” is and why it’s important to have such a protection layer in a data science workflow. Suppose we’re provided with a survey of 100 individuals. Assume the collected data has the following fields:

SSN | Mobile No. | Sex | Date-of-Birth | Hand-preference | Income

Think about the stated/unstated rules on the values in each of the fields in that dataset.

  • SSN: Cannot be the same for two individuals (💡it is a primary key!) and it must be a 9-digit number
  • Mobile No: An integer greater than nine 9s (999,999,999) and at most ten 9s (9,999,999,999)
  • Sex: It’s either Female, Male, Other. Any value besides these 3 isn’t valid but assume nulls are allowed
  • Date of Birth: These are date-like values but detecting the format is tricky, isn't it?
  • Hand-preference: Right or Left
  • Income: should be a positive number

Now, take a minute and think: if you were to use such a dataset in your program and had to write a routine to check whether all your data points are acceptable, how would you do that?

Let me guess what you’re thinking. You want to read the dataset as a pandas.DataFrame() and then write a function that runs over each row to check whether every field holds an acceptable value. Inside that function you’d have things like df.duplicated("SSN"), df["Sex"].isin(["Male","Female","Other"]), and so on. You’d have to check dtypes, domains, duplicates, and nulls for every field. I’m glad you also thought about a try and except statement, didn’t you? Then you’d collect the indices of invalid values for each field and error type and report them in some classified manner. Now let me also ask you to allow nulls for Income and Date-of-Birth, and on top of that, if Mobile No. is null, replace it with ten 9s as a default.
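For concreteness, here’s a rough sketch of what that hand-rolled approach tends to look like (purely illustrative: the function name, error buckets, and exact checks are mine, not from any library):

import pandas as pd

def find_bad_rows(df: pd.DataFrame) -> dict:
    problems = {}
    # duplicated SSNs (primary-key violation)
    problems["duplicate SSN"] = df.index[df.duplicated("SSN")].tolist()
    # Sex must be one of three values, but nulls are tolerated
    valid_sex = df["Sex"].isin(["Male", "Female", "Other"]) | df["Sex"].isna()
    problems["bad Sex value"] = df.index[~valid_sex].tolist()
    # Income must be a non-negative number
    income = pd.to_numeric(df["Income"], errors="coerce")
    problems["bad Income"] = df.index[~(income >= 0)].tolist()
    # ... and so on for dtypes, ranges, date formats, nulls, and defaults
    return problems

And that still ignores the date formats, the default substitution, and any sensible reporting.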

Things are getting a bit tortuous now, right? Besides the functionality, did you consider the following:

— How would you document that function?

— Did you think about unit tests and maintenance?

— Is there a way to generalize this function? You don’t wanna rewrite the whole piece for every new data frame in your workflow, correct?

Urgh! I know that looks messy.

This way of verifying the validity of input data is painful and not-so-cool. What if I told you there was a super clean way to achieve this?

Some background information and concept building are needed before I show how we do it. Let’s jump in!

The input dataset is some table with certain fields, and each field has its own properties and constraints: primary key, foreign key, data type, nullable, the domain of values, etc. As we already defined formally, these are known as data integrity rules. (You totally need not know the deep concepts; I might sometimes refer to the terminology for conciseness.)

We also discussed that a structured way to think about such a protective layer is to make a distinction between “Data Rules” and “Data Source”: define the data rules in a module, then let the sourced data pass through these rules and find what’s good and what’s not.

Ticdat’s way of integrity checking is somewhat similar. The design is -

  1. Define Data Rules
  2. Source the data
  3. Let the sourced data pass through the defined rules
Fig. 1

In terms of a computer program, recall that the previous article talked about something called a “Factory”. To restate, ticdat provides a feature that enables developers to define data integrity rules on the input dataset (Fig. 1). We shall unveil how to do Step 1, defining data rules, in this article. The next article on this topic shall focus on Steps 2 and 3.

This feature is accessed through a class invoked as ticdat.PanDatFactory
(for now, ignore the “PanDat” part).

A good way to think about the factory is to draw analogies from data modeling.

Think about a data schema: there are several tables received as inputs, each of these tables has fields, and obviously each field has its own integrity rules.

Table_1
— field_1 (rules shall be defined at the field level)
— field_2 …
Table_2
— field_1
— field_2 …

In ticdat, the constructor of PanDatFactory defines the schema, and its methods define the other integrity rules that the fields of each table are expected to hold.

Integrity rules comprise data types, max/min bounds, nullable properties, foreign keys, user-defined rules, etc.

Bear two things in mind:

  1. Defining the schema
  2. Defining the integrity rules on a schema

STEP 1: Defining The Schema

Any schema is defined in the following way

# General Syntax for defining a schema >>
from ticdat import PanDatFactory

my_schema = PanDatFactory(
    my_table_1=[[<primary-key-fields>], [<other-data-fields>]],
    my_table_2=[[<primary-key-fields>], [<other-data-fields>]],
    ...)

Relating this to the running example of the survey: in the input schema we have just one table (call it survey), with SSN as its primary key field. All the rest are data fields of the survey table. Putting that into the syntax, the required schema is defined as follows:

from ticdat import PanDatFactory

input_schema = PanDatFactory(
    survey=[["SSN"], ["Mobile No.", "Sex", "Date-of-Birth",
                      "Hand-preference", "Income"]]
)

If we had more tables, we’d just add them as kwargs to the constructor.
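For instance, a hypothetical second table (the name household and its fields are made up purely to illustrate the pattern) would just be one more keyword argument:

from ticdat import PanDatFactory

# "household" is a hypothetical extra table, shown only to illustrate
# that every table is just another kwarg to the constructor
input_schema = PanDatFactory(
    survey=[["SSN"], ["Mobile No.", "Sex", "Date-of-Birth",
                      "Hand-preference", "Income"]],
    household=[["Household ID"], ["City", "Size"]])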

STEP 2: Defining the Logical Conditions

Once we’ve defined the schema we can set integrity rules and add constraints on the schema. This is achieved using the methods of PanDatFactory() There are 4 generic methods which are listed as follows:

"""
Method-1 : set_data_type()
"""
my_schema.set_data_type(
    <table_name>,         # A table in the schema
    <field_name>,         # A field in the corresponding table
    number_allowed=True,  # Is this "field" in this "table" numeric?
    min=0,                # If numeric, then what is the min value?
    inclusive_min=True,   # Is the min value inclusive?
    max=inf,              # If numeric, then what is the max value?
    inclusive_max=False,  # Is the max value inclusive?
    must_be_int=False,    # If the field is numeric, is it also an integer?
    strings_allowed=(),   # If the field is not numeric, is it a string?
                          # If yes, specify the list of allowable strings here,
                          # or simply pass "*" to allow all strings.
    nullable=False,       # Is this a nullable field?
    datetime=False,       # Is this a datetime field?
)

Note that the set_data_type() method actually does much more than setting dtypes: it also specifies the domain of the particular field. This method is very relevant to our running example. The rules identified in the data description translate to integrity rules as follows:

  • SSN: Must be a 9-digit integer
input_schema.set_data_type("survey", "SSN",
number_allowed=True,
min = 100000000,
max = 999999999,
must_be_int = True)
  • Mobile No: An integer greater than nine 9s (999,999,999) and at most ten 9s (9,999,999,999)
input_schema.set_data_type("survey", "Mobile No.",
number_allowed=True,
min = 999999999,
inclusive_min = False,
max = 9999999999,
must_be_int = True)
  • Sex: It’s either Female, Male, Other. Any value besides these 3 isn’t valid but assume nulls are allowed
input_schema.set_data_type("survey", "Sex",
number_allowed=False,
strings_allowed=["Female","Male","Other"],
nullable=True)
  • Date of Birth: These are date-like values, but detecting the format was tricky, right? No worries, ticdat’s got you covered!
input_schema.set_data_type("survey", "Date-of-Birth",
datetime=True)
  • Hand-preference: Right or Left
input_schema.set_data_type("survey", "Hand-preference",
number_allowed=False,
strings_allowed=["Right","Left"])
  • Income: should be a positive number
input_schema.set_data_type("survey", "Income",
number_allowed=True,
min=0)

This is super clean; notice that most of the dirty checking is knocked out just by using the versatile set_data_type() method. Later, as requested in our story, if we want to allow nulls for the Income and Date-of-Birth fields, we’d simply add the nullable argument to the corresponding method calls, as sketched below.
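A minimal sketch of that change, simply revisiting the two earlier calls (note: whether nullable combines with datetime=True exactly this way is worth confirming against ticdat’s docs):

# Allow nulls for Income and Date-of-Birth by adding nullable=True
input_schema.set_data_type("survey", "Income",
                           number_allowed=True,
                           min=0,
                           nullable=True)
input_schema.set_data_type("survey", "Date-of-Birth",
                           datetime=True,
                           nullable=True)  # assumption: nullable is honored alongside datetime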

That was clean, wasn’t it? I told you :D

Remember, we’re still making the rules, we’re not yet checking if any dataset confirms these rules. We’ll eventually come to that. To keep things simple and avoid confusions, I’ve decided to discuss some parts in the subsequent article. Believe me, you’ll have a better context once you know how the data is passed through the defined rules.

Now, suppose we’re in a situation where we want to be friendly with the non-confirming points and simply want to substitute faulty values with defaults. ticdat’s PanDatFactory has a method that lets the developer define default values for a particular table and field.

"""
Method-2 : set_default_value()
"""
my_schema.set_default_value(<table_name>, <field_name>, <value>)

Mind that the default value must be a valid data type for the corresponding field; it’s not legal, for example, to have a string default for an integer field. We could potentially set defaults for all the fields in the schema. Remember, in the running example we wanted to replace nulls in the Mobile No. field with a number equal to ten 9s.
Look again at the data type definition of Mobile No.:

input_schema.set_data_type("survey", "Mobile No.",
number_allowed=True,
min = 999999999,
inclusive_min = False,
max = 9999999999,
must_be_int = True)

The field Mobile No. isn’t nullable, so a null is an invalid value, while ten 9s (9999999999) is a legal one. Therefore, the following statement does exactly what’s required.

input_schema.set_default_value("survey", "Mobile No.", 9999999999)

Readers might have an obvious question: does ticdat automatically replace invalid values with the defaults? — No, ticdat simply finds these invalids; if we want to do the replacements, we need to explicitly ask ticdat (call a method) to do so. I shall talk about these methods in more detail once we have sufficient context.
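As a hedged preview (data sourcing and the exact return shapes belong to the next article, so double-check the docs), the finding and the fixing are two separate, explicit PanDatFactory calls:

import pandas as pd

# Hand-build a tiny one-row PanDat with a null Mobile No. just for illustration;
# proper data sourcing is covered in the next article.
tiny = input_schema.PanDat(survey=pd.DataFrame({
    "SSN": [123456789], "Mobile No.": [None], "Sex": ["Male"],
    "Date-of-Birth": [pd.Timestamp("1990-01-01")],
    "Hand-preference": ["Right"], "Income": [50000]}))

print(input_schema.find_data_type_failures(tiny))      # should flag the null Mobile No.
fixed = input_schema.replace_data_type_failures(tiny)  # should swap in the 9999999999 default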

Here, take a pause and review example-1 in this repository to get a feel for what we just did. You could set it up on your own system and play around with it. The instructions are available in the readme files as well as in the script docstrings.

Note: For now, do not worry about how the data is sourced. Just focus on how the schema is defined and how the integrity rules are composed for the different tables and their corresponding fields.

At this stage, we have covered the concepts required for the example case. Ticdat has much more to offer in terms of data rules, we’re just looking at the tip of the iceberg, believe me!

Foreign Keys and Data Row Predicates

Data products are complex and big; unlike our single-table example, they simply cannot operate on one table alone. This is why, in data modeling, having foreign key relationships on the schema is fundamental, and it is essential to have this type of integrity check in your data workflow to avoid unexpected failures downstream.

Image Source: https://www.littlekendra.com/2016/09/22/indexing-foreign-keys-guidelines/

For instance, suppose we only wanted to process the survey data for the SSNs registered under a “premium plan”. The registered SSNs are available in another table that goes by the name premium.

Again, take a moment and think: if you were to remove the undesired SSN entries, what routine would you follow?

You’re probably thinking to, you have got 2 data frames as the function input and you wanna make an inner_join or may be something like pd.Series.isin() etc.

Hm! Let’s not do that, how would you know what rows did you eliminate?

Ticdat’s way of thinking about this is in terms of foreign keys: if you declare a field as a foreign key referring to another table, ticdat does the job of looking for the undesired records. No joins, no dirty pd.Series.isin()‘s! Simple and sweet.

"""
Method-3 : add_foreign_key()
"""
my_schema.add_foreign_key(<native_table>, <foreign_table>,
                          [<native_field>, <foreign_field>])

If the syntax above looks esoteric, the running example will help you understand what’s what. So here’s how we compose this rule. (In case you missed it, we now have an additional table in the schema.)

from ticdat import PanDatFactory

input_schema = PanDatFactory(
    survey=[["SSN"], ["Mobile No.", "Sex", "Date-of-Birth",
                      "Hand-preference", "Income"]],
    premium=[["SSN"], ["Date Registered", "Tenure"]])

input_schema.add_foreign_key("survey","premium",["SSN","SSN"])

Let me unveil some interesting things here.

  • There is a possibility that the field holding SSN values is labeled differently in the two tables. This is very common in practice, and there is indeed a provision for it; see the sketch right after this list.
  • [For advanced users] Can ticdat handle compound/composite foreign keys? — Yes! Say we had such a case, then:
input_schema.add_foreign_key("native", "foreign",
                             [("native_field_1", "foreign_field_1"),
                              ("native_field_2", "foreign_field_2"), ...])

Does ticdat automatically remove foreign key violations? — No. Just as with data types, ticdat doesn’t remove the undesired records, it just finds them. We need to explicitly ask ticdat (call a method) to filter them out.

Now, suppose there were more complicated data rules that won’t fit into the standard forms of integrity checking. For instance, say our survey had another field, Monthly Expenditure (assuming no credit is available). We’d find a natural validity rule: Monthly Expenditure ≤ Income, period. With what we know so far, we can only constrain individual fields; we don’t yet have something that lets us define such custom rules. Well, if you’re following me here, this is exactly the purpose of data row predicates. Say hi to one of the most powerful methods of PanDatFactory.

Data row predicates can be used to check for sophisticated data integrity problems that can’t be handled with data type rules alone.

"""
Method-4 : add_data_row_predicate()
"""
my_schema.add_data_row_predicate(
    <table_name>,                    # Target table
    <predicate>,                     # A boolean-returning function that accepts
                                     # rows of the target table as an argument
    predicate_name=<predicate name>, # So someone can identify what kind of
                                     # failure occurred
)

To clarify things, if we were to write the predicate for our running example, here’s how we’d do the Monthly Expenditure ≤ Income check.

from ticdat import PanDatFactory

input_schema = PanDatFactory(
    survey=[["SSN"], ["Mobile No.", "Sex", "Date-of-Birth",
                      "Hand-preference", "Income",
                      "Monthly Expenditure"]],
    premium=[["SSN"], ["Date Registered", "Tenure"]])

input_schema.add_data_row_predicate(
    "survey",
    lambda row: row["Monthly Expenditure"] <= row["Income"],
    predicate_name="Expense cannot be greater than income")

A few interesting facts about data row predicates:

  • Can you add multiple predicates on the same table and perform different checks on its fields? — Yes!
  • Does the predicate always need to be a lambda function? Well, in Python we can treat functions as objects, so the short answer is no. If your predicate isn’t as simple as a lambda, you can simply pass a function name; as long as it returns a bool, it’s legal (see the sketch after this list).
  • [For advanced users] What if you want to supply additional items as predicate arguments? There’s something called predicate_kwargs_maker; please read through ticdat’s documentation for details.
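To illustrate the named-function point from the list above, here’s a minimal sketch (the function name is mine; anything callable that returns a bool per row will do):

# A named-function predicate instead of a lambda
def expenditure_is_sensible(row):
    # True when the row spends no more than it earns
    return row["Monthly Expenditure"] <= row["Income"]

input_schema.add_data_row_predicate(
    "survey",
    expenditure_is_sensible,
    predicate_name="Expense cannot exceed income (named function)")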

Data row predicates certainly open the door to a lot of possibilities. Sky is the limit! Now you’re in good shape to take a look at example-2 from this repository. I have added a second table along with the foreign key and predicate rules. See how ticdat blocks the integrity failures to protect your processing engine.

Again, feel free to edit the code on your own and add more weird rules, to get comfortable with this tidy style.

At this stage, we have certainly talked about most of the basics, and readers should have a certain level of comfort with composing data rules. The following short note is a good summary of this article.

The modular way to think about integrity is to draw a separation between three things:

1. Data Rules
2. Data Sourcing
3. Passing the sourced data through the data rules and finding the invalids

Data rules, once abstracted out, must be followed by data coming from any kind of source. This article focused on how to define the dataset and compose the rules it is expected to satisfy. We drew analogies and a structured thought process from data modeling, and discussed 4 gatekeeping methods:
— The versatile set_data_type()
— The tolerant set_default_value()
— The sniper add_foreign_key()
— The muscular add_data_row_predicate()

If you want to dive deeper into each of the methods discussed, and find more interesting stuff, do check out ticdat’s documentation. The upcoming article shall dive deep into data sourcing using both:

  • an “easy and automated” ticdat way
  • a “self-designed” rogue way

You could choose your own path, but on either track we’d attach a shield of integrity checking that keeps “dirtiness” away from the model.

Stay Tuned!
