Optimal Imputation

Unlock the full power of data with missing values or quality issues

Better data preprocessing leads to stronger predictive performance

The quality of predictions and insights derived from data depends critically on the quality of the data itself. Optimal Imputation is a preprocessing tool designed to address data quality issues. In experiments, better imputation with this method improved classification accuracy by 2% and regression R-squared by 0.05.

Exploit feature relationships

Traditional imputation approaches compromise the quality of the data, introducing bias and limiting predictive power. Optimal Imputation uses global optimization to find the best imputed values by exploiting the relationships across features. The example below compares three ways of handling the missing years of employment:

Complete cases: Only 50% of the observations remain; the result is biased and less powerful.
Mean impute: Imputes 25 years of employment, which is nonsensical given ages of 24 and 26.
Optimal Imputation: Uses age and gender to estimate the years of employment. The values are more sensible and accurate, and lead to better final predictions.

Age	Gender	Years of Employment
60	Male	30
45	Female	20
24	Male	?
26	Female	?
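
The contrast between the three approaches can be sketched in a few lines of Python. This is only an illustration on the toy table above, not the Optimal Imputation algorithm: the feature-aware step is a stand-in that fits an ordinary least-squares line of years of employment on age (gender is ignored because only two complete rows are available), and the names `rows`, `fit_line`, and `feature_aware` are hypothetical.

```python
from statistics import mean

# Toy data from the table above; None marks a missing value.
rows = [
    {"age": 60, "gender": "Male", "years": 30},
    {"age": 45, "gender": "Female", "years": 20},
    {"age": 24, "gender": "Male", "years": None},
    {"age": 26, "gender": "Female", "years": None},
]

observed = [r for r in rows if r["years"] is not None]
missing = [r for r in rows if r["years"] is None]

# Complete cases: drop every row with a missing value.
complete_case_fraction = len(observed) / len(rows)

# Mean imputation: fill every gap with the observed average.
mean_years = mean(r["years"] for r in observed)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Stand-in for a feature-aware method (an assumption, not the product's
# algorithm): predict years of employment from age.
slope, intercept = fit_line(
    [r["age"] for r in observed], [r["years"] for r in observed]
)
feature_aware = [round(slope * r["age"] + intercept, 1) for r in missing]

print(f"complete cases keep {complete_case_fraction:.0%} of the rows")
print(f"mean imputation fills in {mean_years} years for both rows")
print(f"age-based stand-in fills in {feature_aware} years")
```

Even this crude stand-in gives the 24- and 26-year-olds single-digit years of employment instead of 25, which is the intuition behind using relationships across features.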

Imputations that reflect the ground truth

The improvements in the end predictions come from the fact that the imputations recover the underlying ground truth in all missing data scenarios, whether the data are missing at random or not. Compared against benchmark methods across 84 datasets, Optimal Imputation achieves the best imputation accuracy in the majority of datasets under every scenario, reducing imputation errors by a significant 10-15%.
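
One common way to measure imputation error is to hide values that are actually known, impute them, and compare the result with the truth. The sketch below assumes mean absolute error as the metric and reads the 10-15% figure as a relative reduction; the benchmark's exact protocol and metric are not stated here, and the data and names (`true_years`, `hidden`, `relative_reduction`) are hypothetical.

```python
from statistics import mean


def mean_absolute_error(truth, imputed):
    """Average absolute gap between true and imputed values."""
    return mean(abs(t - i) for t, i in zip(truth, imputed))


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical fully observed column (years of employment).
true_years = [30, 20, 3, 5, 12, 25]
hidden = [2, 3]  # indices we pretend are missing

# Baseline: impute the hidden entries with the mean of the visible ones.
visible = [v for i, v in enumerate(true_years) if i not in hidden]
fill_value = mean(visible)

truth = [true_years[i] for i in hidden]
baseline_mae = mean_absolute_error(truth, [fill_value] * len(hidden))

print(f"mean-imputation MAE on the hidden entries: {baseline_mae}")
```

Running the same masking procedure with two imputation methods and passing both errors to `relative_reduction` gives a number comparable in spirit to the reduction quoted above.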

Understandable and trustworthy imputed values

Optimal Imputation doesn’t just give you a number. It also provides an explanation behind each imputation, so the user can verify and validate the recommended values.

A pipeline we designed for health insurance used Optimal Imputation as the key backbone for assessing data quality, identifying gaps in the data processing and enabling humans to efficiently fix issues.

Example of Optimal Tree-based Imputation with explanations

Outlier detection

With Optimal Imputation as the core engine, we also provide a context-aware outlier detection algorithm. It identifies which feature entries differ significantly from the norm, controlling for all the other features.

While most outlier detection solutions only identify the data entries that deviate from the rest, our approach pinpoints specific feature values within those entries and explains why these are outliers.
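
The idea of flagging a specific feature value in context can be illustrated with a small sketch. It is not the product's algorithm: it swaps in a Theil-Sen robust line fit (median of pairwise slopes) to estimate the expected years of employment for a given age, then flags values that stray too far from that expectation. The records, the 10-year threshold, and the function names are all hypothetical.

```python
from itertools import combinations
from statistics import median

# Hypothetical records: (age, years_of_employment).
records = [(60, 30), (45, 20), (30, 10), (51, 24),
           (36, 14), (54, 26), (39, 16), (24, 30)]


def theil_sen(points):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2)
              if x1 != x2]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept


def flag_outliers(points, threshold):
    """Return (row index, reported value, expected value) for each value
    whose distance from the age-based expectation exceeds threshold."""
    slope, intercept = theil_sen(points)
    flagged = []
    for index, (age, years) in enumerate(points):
        expected = slope * age + intercept
        if abs(years - expected) > threshold:
            flagged.append((index, years, round(expected, 1)))
    return flagged


# Threshold in years; an arbitrary choice for this illustration.
flagged = flag_outliers(records, threshold=10)
for index, years, expected in flagged:
    print(f"row {index}: {years} years reported, "
          f"about {expected} expected for that age")
```

Note that 30 years of employment is not unusual on its own, since the 60-year-old has the same value. Only the 24-year-old's entry is flagged, because the value is implausible given the age, and the expected value serves as a simple explanation.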

Related publications

From predictive methods to missing data imputation: An optimization approach.

Dimitris Bertsimas, Colin Pawlowski, and Daisy Zhuo

The Journal of Machine Learning Research, 2017

The original publication by the co-founders, which pioneered Optimal Imputation. The paper formulates missing data imputation as a joint optimization problem and presents a scalable method to solve it to optimality, demonstrating performance superior to the state of the art.

Applied Informatics Decision Support Tool for Mortality Predictions in Patients With Cancer

Dimitris Bertsimas, Jack Dunn, Daisy Zhuo, and others

JCO Clinical Cancer Informatics, 2018

Applying Optimal Imputation to electronic medical records data, the authors were able to infer values of important features such as gene expression and lab measurements that were not available in the raw data, and built more powerful predictive models with this augmented data.

Want to try Optimal Imputation?
We provide free academic licenses and evaluation licenses for commercial use.
We also offer consulting services to develop interpretable solutions to your key problems.