Data Quality Checks
===================

This module supports writing data quality checks on pandas DataFrames.

.. code-block:: python

    import pandas as pd

    from genie_pkg.dqc import QualityChecker

    check_spec = """
    apply checks {
        row_count > 0
        has_columns(["name", "dob"])
    }
    """
    df = pd.DataFrame([{'name': 'foo', 'dob': '1970-01-01'}])

    # Pass ignore_column_case=True to match DataFrame column names
    # case-insensitively. The default is False.
    check_results = df.dqc.run(check_spec, ignore_column_case=False)

    # Each result is a tuple whose third element is the pass/fail flag.
    failures = list(filter(lambda x: not x[2], check_results))
    assert len(failures) == 0

    # Check whether the spec is valid.
    # Returns (error, False) for a bad spec, otherwise (None, True).
    QualityChecker.validate_spec(check_spec)

**Available simple checks**

- `row_count (> | < | ==) `
- `has_columns(["c1", "c2"...], ignore_case=False|True default is False)`
- `is_unique()`
- `is_not_null()`
- `has_one_of(, ["c1", "c2"...], ignore_case=False|True default is False)`
- `is_positive()`
- `quantile(, ) (> | < | ==) `
- `is_date(, pass_percent_threshold=<1..100> default 100, ignore_nulls=False|True default is False)`
- `percent_of_values_have_length(, pass_percent_threshold=<1..100> default 100, ignore_nulls=False|True default is False) == ` (handy for data such as post_code or ipv4)

**Available complex checks**

- Apply checks to multiple columns of the rows identified by a condition (string values only)

.. code-block:: python

    # Check column values using row identification.
    # c1, c2, c3 and c4 are column names.
    check_spec = """
    apply checks {
        when row_identified_by {
            "c1": "v1", "c2": "2"
        } then {
            "c3" == "v3", "c4" == "v4"
        }
    }
    """
    df = pd.DataFrame({"c1": "v1", "c2": "2", "c3": "v3"}, index=[0])

    # Returns [(check_name, None|column_name, True|False)]
    result = df.dqc.run(check_spec)
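
The result tuples can be post-processed with plain Python. The following is a minimal sketch, assuming each result has the `(check_name, None|column_name, True|False)` shape described above. The helper `summarize_failures` and the sample tuples are hypothetical, written for illustration only, and are not part of `genie_pkg`.

.. code-block:: python

    from typing import List, Optional, Tuple

    # Assumed result shape: (check_name, column_name or None, passed)
    CheckResult = Tuple[str, Optional[str], bool]


    def summarize_failures(check_results: List[CheckResult]) -> List[str]:
        """Return one readable message per failed check (hypothetical helper)."""
        messages = []
        for check_name, column_name, passed in check_results:
            if passed:
                continue
            # Mention the column only when the check is tied to one.
            target = f" on column '{column_name}'" if column_name else ""
            messages.append(f"{check_name}{target} failed")
        return messages


    # Hand-written sample results, not actual output of df.dqc.run
    sample_results = [
        ("row_count", None, True),
        ("is_unique", "name", False),
    ]
    print(summarize_failures(sample_results))

In real use, the list returned by `df.dqc.run(check_spec)` would take the place of `sample_results`. The exact check names it reports may differ from the ones assumed here.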