Data Quality Checks

This module supports writing data quality checks on pandas DataFrames.

import pandas as pd
from genie_pkg.dqc import QualityChecker
 check_spec = """
         apply checks {
             row_count > 0
             has_columns(["name", "dob"])
         }
         """
 df = pd.DataFrame([{'name': 'foo',
                     'dob': '1970-01-01'}])

 # ignore_column_case defaults to False; pass True to match dataframe column names case-insensitively
 check_results = df.dqc.run(check_spec, ignore_column_case=False)
 failures = list(filter(lambda x: not x[2], check_results))
 assert len(failures) == 0

 # Check whether a spec is valid: returns (error, False) if it is bad, otherwise (None, True)
 QualityChecker.validate_spec(check_spec)
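
The filter on `x[2]` above relies on each result being a `(check_name, None|column_name, True|False)` tuple. As a minimal sketch of how such results could be turned into readable failure labels, the helper below works on plain tuples of that shape. `failed_checks` and the sample data are hypothetical and written by hand for illustration; they are not part of `genie_pkg` and not output from it.

```python
from typing import List, Optional, Tuple

# Assumed result shape: (check_name, column_name or None, passed)
CheckResult = Tuple[str, Optional[str], bool]


def failed_checks(results: List[CheckResult]) -> List[str]:
    """Return a readable label for every failed check."""
    labels = []
    for check_name, column_name, passed in results:
        if not passed:
            # Dataframe-level checks carry no column name
            if column_name is None:
                labels.append(check_name)
            else:
                labels.append(f"{check_name}({column_name})")
    return labels


# Hand-written sample results (hypothetical, not produced by the library)
sample_results = [
    ("row_count", None, True),
    ("is_unique", "name", False),
]
print(failed_checks(sample_results))
```

Reporting labels instead of a bare failure count makes it easier to see which check to investigate when the assertion fails.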

Available simple checks

  • row_count (> | < | ==) <rhs>
  • has_columns(["c1", "c2"…], ignore_case=False|True default is False)
  • is_unique(<column_name>)
  • is_not_null(<column_name>)
  • has_one_of(<column_name>, ["c1", "c2"…], ignore_case=False|True default is False)
  • is_positive(<column_name>)
  • quantile(<column_name>, <percentile>) (> | < | ==) <rhs>
  • is_date(<column_name>, pass_percent_threshold=<1..100> default 100, ignore_nulls=False|True default is False)
  • percent_of_values_have_length(<column_name>, pass_percent_threshold=<1..100> default 100, ignore_nulls=False|True default is False) == <rhs> (handy for data like post_code or ipv4)
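
To make the `pass_percent_threshold` and `ignore_nulls` options more concrete, here is a rough sketch of the kind of calculation a length check could perform. It is plain Python and does not show how the library implements it. The function name `percent_with_length`, the treatment of nulls as non-matching, and the comparison against a threshold of 60 are all assumptions made for illustration.

```python
import math
from typing import Any, Iterable


def _is_null(value: Any) -> bool:
    """Treat None and float NaN as null values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def percent_with_length(values: Iterable[Any], length: int,
                        ignore_nulls: bool = False) -> float:
    """Percentage of values whose string form has exactly `length` characters."""
    items = list(values)
    if ignore_nulls:
        # Drop nulls so they do not count against the percentage
        items = [v for v in items if not _is_null(v)]
    if not items:
        return 0.0
    # Assumption: nulls that remain count as non-matching
    matches = sum(1 for v in items if not _is_null(v) and len(str(v)) == length)
    return 100.0 * matches / len(items)


post_codes = ["560001", "560002", "5600", None]
pct_all = percent_with_length(post_codes, 6)
pct_non_null = percent_with_length(post_codes, 6, ignore_nulls=True)
# Hypothetical threshold comparison, mirroring pass_percent_threshold=60
passes_at_60 = pct_non_null >= 60
```

The function accepts any iterable, so a pandas column such as `df["post_code"]` can be passed in directly.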

Available complex checks

  • Apply checks to multiple columns of the rows identified by a condition (supports string values only)
# Check column values using row identification. c1, c2, c3 and c4 are column names
check_spec = """
                apply checks {
                    when row_identified_by {
                        "c1": "v1",
                        "c2": "2"
                    } then {
                        "c3" == "v3",
                        "c4" == "v4"
                    }
                }
                """

df = pd.DataFrame({"c1": "v1", "c2": "2", "c3": "v3"}, index=[0])
# Returns [(check_name, None|column_name, True|False)]
result = df.dqc.run(check_spec)
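
For readers who want to see what the `when row_identified_by { ... } then { ... }` spec expresses, the sketch below does a comparable selection and comparison in plain pandas. It is an illustration only, not the library's implementation. `check_identified_rows` is a hypothetical helper, and treating a missing column (here `c4`) as a failed check is an assumption, since this README does not say how the library reports that case.

```python
import pandas as pd


def check_identified_rows(df, identify, expect):
    """Return (column, passed) pairs for the rows matching every `identify` pair."""
    # Select rows where each identifying column equals its value (string comparison)
    mask = pd.Series(True, index=df.index)
    for column, value in identify.items():
        mask &= df[column].astype(str) == value
    selected = df[mask]

    results = []
    for column, value in expect.items():
        # Assumption: a column missing from the dataframe counts as a failure
        passed = column in selected.columns and bool(
            (selected[column].astype(str) == value).all()
        )
        results.append((column, passed))
    return results


df = pd.DataFrame({"c1": "v1", "c2": "2", "c3": "v3"}, index=[0])
outcome = check_identified_rows(
    df,
    identify={"c1": "v1", "c2": "2"},
    expect={"c3": "v3", "c4": "v4"},
)
print(outcome)
```

Values are cast to strings before comparison to mirror the "strings only" restriction stated for this check.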