Optimal Imputation

Unlock the full power of data with missing values or quality issues

Better data preprocessing leads to stronger predictive performance

The quality of predictions and insights derived from data depends critically on the quality of the data itself. Optimal Imputation is a preprocessing tool designed to address data quality issues. Experiments show that better imputation with this method leads to a 2% improvement in classification accuracy and a 0.05 improvement in R-squared for regression.

Exploit feature relationships

Traditional imputation approaches compromise the quality of the data, introducing biases and limiting predictive power. Optimal Imputation uses global optimization to find the best imputed values by exploiting the relationships across features. In the following example, two of the four rows are missing years of employment, and the three approaches handle them as follows:
Complete cases

Leaves only 50% of the observations; the result is biased and less powerful.

Mean impute

Imputes 25 years of employment, which is nonsensical given the ages (24 and 26).

Optimal Imputation

Uses age and gender to estimate the years of employment. The result is more sensible and more accurate, and leads to better final predictions.

Age    Gender    Years of Employment
60     Male      30
45     Female    20
24     Male      ?
26     Female    ?
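
The three behaviors can be reproduced on the toy table with a short Python sketch. This is only an illustration under stated assumptions: the third step uses a plain least-squares line on age as a stand-in for feature-aware imputation. It is not the Optimal Imputation algorithm, and it ignores gender because two complete rows cannot support more than one predictor. The variable names are made up for this example.

```python
from statistics import mean

# Toy table from the example; None marks a missing value.
rows = [
    {"age": 60, "gender": "Male", "years_employed": 30},
    {"age": 45, "gender": "Female", "years_employed": 20},
    {"age": 24, "gender": "Male", "years_employed": None},
    {"age": 26, "gender": "Female", "years_employed": None},
]

complete = [r for r in rows if r["years_employed"] is not None]
missing = [r for r in rows if r["years_employed"] is None]

# 1. Complete cases: drop every row with a missing value.
complete_fraction = len(complete) / len(rows)

# 2. Mean imputation: fill every gap with the column mean.
mean_value = mean(r["years_employed"] for r in complete)

# 3. Stand-in for feature-aware imputation (an assumption, not the
#    product's method): fit years_employed ~ age by least squares.
ages = [r["age"] for r in complete]
years = [r["years_employed"] for r in complete]
age_bar, years_bar = mean(ages), mean(years)
slope = (
    sum((a - age_bar) * (y - years_bar) for a, y in zip(ages, years))
    / sum((a - age_bar) ** 2 for a in ages)
)
intercept = years_bar - slope * age_bar
regression_imputed = [intercept + slope * r["age"] for r in missing]

print(f"Complete cases keep {complete_fraction:.0%} of the rows")
print(f"Mean imputation fills every gap with {mean_value} years")
for r, value in zip(missing, regression_imputed):
    print(f"Age {r['age']}: age-based estimate of {value:.1f} years")
```

Even this crude stand-in gives roughly 6 and 7 years for the two young employees instead of 25, which shows why using the relationship between features matters.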

Imputations that reflect the ground truth

The improvements in the final predictions come from the fact that the imputations recover the underlying ground truth in all missing-data scenarios, whether the data are missing at random or not. Compared against benchmark methods across 84 datasets, Optimal Imputation achieves the best imputation accuracy in the majority of datasets under all scenarios, with a significant reduction of 10-15% in imputation error.

Understandable and trustworthy imputed values

Optimal Imputation doesn’t just give you a number. It also provides an explanation for each imputation, allowing the user to verify and validate the recommended values.

A pipeline we designed for health insurance used Optimal Imputation as the key backbone for assessing data quality, identifying gaps in the data processing and enabling humans to efficiently fix issues.

Example of Optimal Tree-based Imputation with explanations

Outlier detection

With Optimal Imputation as the core engine, we also provide a context-aware outlier detection algorithm. It identifies which feature entries are significantly different from the norm, controlling for all the other features.

While most outlier detection solutions only identify the data entries that deviate from the rest, our approach pinpoints specific feature values within those entries and explains why these are outliers.
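
To make "controlling for the other features" concrete, here is a minimal Python sketch of one generic way to flag a value that looks normal on its own but is unusual in context. It is an assumption-laden illustration, not the product's algorithm: it fits a least-squares line of one feature on another and flags entries whose residual is large relative to a robust scale (median absolute deviation). The data, the function name `flag_contextual_outliers`, and the threshold of 3.0 are all hypothetical.

```python
from statistics import mean, median

# Hypothetical records: (age, years_of_employment).
# The last row claims 30 years of employment at age 24.
records = [(25, 3), (30, 8), (35, 13), (40, 18), (45, 23),
           (50, 28), (55, 33), (60, 38), (24, 30)]


def flag_contextual_outliers(records, threshold=3.0):
    """Return indices whose second value is unusual given the first value."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    x_bar, y_bar = mean(xs), mean(ys)

    # Least-squares fit of the target on the context feature.
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in records)
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in records]

    # Robust scale of the residuals: 1.4826 * median absolute deviation.
    center = median(residuals)
    scale = 1.4826 * median(abs(r - center) for r in residuals)
    if scale == 0:
        return []  # residuals are (nearly) all identical; nothing to flag

    return [
        i for i, r in enumerate(residuals)
        if abs(r - center) / scale > threshold
    ]


flagged = flag_contextual_outliers(records)
for i in flagged:
    age, years = records[i]
    print(f"Row {i}: {years} years of employment is unusual for age {age}")
```

The flagged value, 30 years, sits well inside the range of the employment column, so a check on that column alone would not catch it. It only stands out once age is taken into account.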

Want to try Optimal Imputation?
We provide free academic licenses and evaluation licenses for commercial use.
We also offer consulting services to develop interpretable solutions to your key problems.

© 2020 Interpretable AI, LLC. All rights reserved.