Robust Data Pipeline Design

Data-informed pipeline for cleaning untidy data from various sources in insurance

Data quality issues negatively affect downstream machine learning performance

Our client, a leading health insurance data science company, provides cutting-edge predictive and prescriptive analytics solutions. However, they noticed that data quality problems such as missing values, outliers, and mismatched records were significantly degrading model performance.

Historically, they had taken a reactive, manual approach to data quality issues. This took considerable effort and fixed each issue only on a one-off basis, without preventing it from recurring. As part of a data quality initiative, we partnered with them to design a new data pipeline that is robust to poor-quality data.

Optimal Imputation-powered data quality improvements

In the new pipeline, we integrated quality tools powered by Optimal Imputation in two major ways:
  • Outlier detection in data validation

    Some fields in the data are highly correlated with other characteristics. Optimal Outlier Detection, using Optimal Imputation as the core engine, learns these relationships and automatically flags observations whose values fall outside the range expected given the rest of the record.

  • Optimal Imputation on missing data

    After validation, if some fields are still missing, the Optimal Imputation module automatically fills in each value by considering all other relevant inputs, providing a reasoned estimate of what the value should be. Domain experts can then review each imputed value and accept or reject it.
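
Both uses above rest on the same idea: estimate a field from the other fields of the record, then either compare the estimate with the observed value (validation) or propose it as a fill-in (imputation). The sketch below illustrates that idea with a plain k-nearest-neighbour median, which is a simple stand-in and not the Optimal Imputation algorithm itself. The field names, sample values, tolerance, and function names are all hypothetical, and a real pipeline would need to standardize fields measured on different scales.

```python
from statistics import median

def knn_estimate(rows, target, field, k=3):
    """Estimate target[field] as the median over the k records closest to
    target on the other fields that both records have populated."""
    candidates = []
    for row in rows:
        if row is target or row.get(field) is None:
            continue
        shared = [f for f in target
                  if f != field and target[f] is not None and row.get(f) is not None]
        if not shared:
            continue
        # Root-mean-square difference over the shared fields.
        dist = (sum((target[f] - row[f]) ** 2 for f in shared) / len(shared)) ** 0.5
        candidates.append((dist, row[field]))
    if not candidates:
        return None
    candidates.sort(key=lambda c: c[0])
    return median(value for _, value in candidates[:k])

def flag_outliers(rows, field, tolerance, k=3):
    """Return (index, observed, expected) for every populated value that
    differs from its estimate by more than the tolerance."""
    flagged = []
    for i, row in enumerate(rows):
        if row.get(field) is None:
            continue
        expected = knn_estimate(rows, row, field, k)
        if expected is not None and abs(row[field] - expected) > tolerance:
            flagged.append((i, row[field], expected))
    return flagged

# Made-up records for illustration only.
rows = [
    {"height_cm": 160, "weight_kg": 55},
    {"height_cm": 162, "weight_kg": 57},
    {"height_cm": 165, "weight_kg": 60},
    {"height_cm": 178, "weight_kg": 76},
    {"height_cm": 180, "weight_kg": 78},
    {"height_cm": 182, "weight_kg": 80},
    {"height_cm": 181, "weight_kg": 790},   # probable data-entry error
    {"height_cm": 163, "weight_kg": None},  # missing value
]

flagged = flag_outliers(rows, "weight_kg", tolerance=30)
proposal = knn_estimate(rows, rows[7], "weight_kg")
print("flagged for review:", flagged)
print("proposed fill-in for record 7:", proposal)
```

The median is used rather than the mean so that a single corrupted neighbour does not distort the estimates for the clean records around it. The sketch only returns proposals and leaves the data unchanged, in keeping with the expert accept-or-reject step described above.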

Illustrative Optimal Imputation with explanations

Smart matching of records

Another key driver of their model quality is the accuracy of linked datasets. The health insurance industry is well known for inconsistency in data capture, and there is no simple rule for merging data from different sources, especially when there are no unique identifiers.

We took a data-driven approach, where we learned Optimal Decision Trees to separate the unique records from the rest, allowing the linking of records to reach an extremely high discovery rate while ensuring no individuals were matched incorrectly.
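
As a rough illustration of this kind of matching, the sketch below turns each candidate pair of records into similarity features and fits a small decision tree on labeled pairs. It uses a greedy, CART-style learner written in plain Python as a stand-in; it is not the Optimal Decision Trees method, and the features, record fields, and sample pairs are hypothetical. To reflect the goal of avoiding incorrect matches, a leaf predicts a match only when every training pair that reaches it is a match.

```python
import re

def name_tokens(name):
    return set(re.findall(r"[a-z]+", name.lower()))

def pair_features(a, b):
    """[name-token overlap (Jaccard), same date of birth, same ZIP code]."""
    ta, tb = name_tokens(a["name"]), name_tokens(b["name"])
    union = ta | tb
    overlap = len(ta & tb) / len(union) if union else 0.0
    return [overlap, float(a["dob"] == b["dob"]), float(a["zip"] == b["zip"])]

def gini(labels):
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(X, y):
    """Greedy search for the (feature, threshold) with the lowest weighted Gini."""
    best = None
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for x, yi in zip(X, y) if x[j] <= t]
            right = [yi for x, yi in zip(X, y) if x[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def fit_tree(X, y, depth=2):
    split = best_split(X, y) if depth > 0 and len(set(y)) > 1 else None
    if split is None:
        # Conservative leaf: predict a match only if all training pairs match.
        return {"match": all(y)}
    _, j, t = split
    left = [i for i, x in enumerate(X) if x[j] <= t]
    right = [i for i, x in enumerate(X) if x[j] > t]
    return {
        "feature": j, "threshold": t,
        "left": fit_tree([X[i] for i in left], [y[i] for i in left], depth - 1),
        "right": fit_tree([X[i] for i in right], [y[i] for i in right], depth - 1),
    }

def predict(tree, x):
    while "match" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["match"]

def rec(name, dob, zip_code):
    return {"name": name, "dob": dob, "zip": zip_code}

# Made-up labeled pairs: (record from source A, record from source B, same person?)
labeled_pairs = [
    (rec("John A Smith", "1980-02-11", "02139"), rec("Smith, John", "1980-02-11", "02139"), 1),
    (rec("Maria Lopez", "1975-07-30", "10001"), rec("Maria Lopez", "1975-07-30", "10027"), 1),
    (rec("John Smith", "1980-02-11", "02139"), rec("John Smith", "1991-05-02", "02139"), 0),
    (rec("Ann Lee", "1966-01-15", "94110"), rec("Dan Lee", "1966-01-15", "94110"), 0),
    (rec("Wei Chen", "1988-12-01", "60614"), rec("Omar Haddad", "1988-12-01", "60614"), 0),
    (rec("Priya Patel", "1990-03-09", "73301"), rec("Priya R Patel", "1990-03-09", "73301"), 1),
    (rec("Maria Lopez", "1975-07-30", "10001"), rec("Mario Lopez", "1982-10-10", "10001"), 0),
]
X = [pair_features(a, b) for a, b, _ in labeled_pairs]
y = [label for _, _, label in labeled_pairs]
tree = fit_tree(X, y)

moved = predict(tree, pair_features(rec("Robert Jones", "1970-04-04", "30301"),
                                    rec("Jones, Robert", "1970-04-04", "30318")))
namesake = predict(tree, pair_features(rec("Robert Jones", "1970-04-04", "30301"),
                                       rec("Robert Jones", "1994-09-21", "30301")))
print("same person after a move:", moved)
print("same name, different birth date:", namesake)
```

Because the fitted tree is a short chain of threshold checks on named features, a reviewer can read off exactly why any two records were or were not linked.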

Illustrative Optimal Decision Tree predicting whether two records from different sources are the same person

Immediate model performance gain from the new data pipeline

The client's system now produces much more reasonable reports and significantly better predictive models, purely as a result of a more data-driven ingestion pipeline and without any change to their analytics approach. The new pipeline also reduces the amount of manual review required by orders of magnitude, simply by highlighting data problems early and fixing them automatically where possible.

Unique Advantage

Why is the Interpretable AI solution unique?

  • Captures relationships in data

    Optimal Imputation and Outlier Detection can learn the logic within the data and make reasonable recommendations

  • Detects issues and fixes the root cause

    Warnings are raised at the earliest point where issues arise, allowing pinpointed fixes

