Robust Data Pipeline Design
Data-informed pipeline for cleaning untidy data from various sources in insurance

Data quality issues negatively affect downstream machine learning performance
Historically, the insurer had taken a reactive, manual approach to addressing data quality issues. This required significant effort and only rectified specific data issues on a one-off basis. As part of a data quality initiative, we partnered with them to design a new data pipeline that is robust to poor-quality data.

Optimal Imputation-powered data quality improvements
Outlier detection in data validation
Some fields in the data are highly correlated with other characteristics. Optimal Outlier Detection, which uses Optimal Imputation as its core engine, learns this logic and automatically flags observations where values fall outside the expected range.
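The core idea can be illustrated with a short sketch. Optimal Outlier Detection is a proprietary module, so the snippet below uses a scikit-learn regressor as a stand-in and hypothetical insurance column names ("annual_premium", "vehicle_value", and so on): each field is predicted from the remaining fields, and observations whose actual value deviates strongly from the prediction are flagged for review.

```python
# Minimal sketch of imputation-style outlier flagging, using a scikit-learn
# regressor as a stand-in for Optimal Outlier Detection.
# Column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def flag_outliers(df: pd.DataFrame, target: str, n_std: float = 3.0) -> pd.Series:
    """Flag rows where `target` deviates strongly from the value the
    other columns predict it should take."""
    X = df.drop(columns=[target])
    y = df[target]
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X, y)
    residuals = y - model.predict(X)
    # Flag anything more than n_std residual standard deviations away.
    return residuals.abs() > n_std * residuals.std()

# Example: flag policies whose premium is far from what age, vehicle value,
# and coverage level would suggest.
# suspect = flag_outliers(policies[["age", "vehicle_value", "coverage_level",
#                                   "annual_premium"]], target="annual_premium")
```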
Optimal Imputation on missing data
After validation, if some fields are still missing, the Optimal Imputation module automatically fills in the value by considering all other relevant inputs, providing a logical and accurate guess of what the value should be. Domain experts can then verify and choose to accept or reject the imputed values.
Illustrative Optimal Imputation with explanations
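A minimal sketch of this step is shown below, assuming numeric fields and using scikit-learn's IterativeImputer as a stand-in for the proprietary Optimal Imputation module; the mask of originally missing cells is kept so that experts can review exactly the values that were filled in.

```python
# Minimal sketch of model-based imputation, using scikit-learn's
# IterativeImputer as a stand-in for Optimal Imputation.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric fields using all other columns as predictors."""
    imputer = IterativeImputer(max_iter=10, random_state=0)
    filled = imputer.fit_transform(df)
    return pd.DataFrame(filled, columns=df.columns, index=df.index)

# Keep the mask of originally missing cells so domain experts can review,
# then accept or reject, only the values that were actually imputed.
# mask = policies.isna()
# imputed = impute_missing(policies)
# to_review = imputed[mask.any(axis=1)]
```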
Smart matching of records
We took a data-driven approach, where we learned Optimal Decision Trees to separate unique records from the rest, allowing record linking to reach an extremely high discovery rate while ensuring no individuals were matched incorrectly.
Illustrative Optimal Decision Tree predicting whether two records from different sources are the same person
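As a rough illustration, record matching of this kind can be framed as a classification problem over candidate pairs. The sketch below uses scikit-learn's DecisionTreeClassifier as a stand-in for Optimal Decision Trees, and assumes hypothetical pairwise comparison features (name similarity, date-of-birth agreement, and so on) with a labeled set of known matches.

```python
# Minimal sketch of decision-tree record matching, using scikit-learn's
# DecisionTreeClassifier as a stand-in for Optimal Decision Trees.
# Feature and column names are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ["name_similarity", "dob_match", "address_similarity", "policy_overlap"]

def train_matcher(labeled_pairs: pd.DataFrame) -> DecisionTreeClassifier:
    """Learn a shallow, interpretable tree separating true matches from
    coincidental similarities between records."""
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(labeled_pairs[FEATURES], labeled_pairs["is_same_person"])
    return tree

# The learned rules can be inspected before any records are linked:
# matcher = train_matcher(labeled_pairs)
# print(export_text(matcher, feature_names=FEATURES))
# candidate_pairs["is_match"] = matcher.predict(candidate_pairs[FEATURES])
```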
Immediate model performance gain from the new data pipeline

Unique Advantage
Why is the Interpretable AI solution unique?
Captures relationships in data
Optimal Imputation and Outlier Detection can learn the logic within the data and make reasonable recommendations.
Detects issues and fixes the root cause
Warnings are raised at the earliest point where issues arise, allowing pinpointed fixes.