Optimal Imputation
Unlock the full power of data with missing values or quality issues
Better data preprocessing leads to stronger predictive performance

Exploit feature relationships
- Complete cases: Left with only 50% of the observations; biased and less powerful.
- Mean impute: Imputes 25 years of employment for every missing entry; nonsensical for a 24- or 26-year-old.
- Optimal Imputation: Uses age and gender to estimate the years of employment. More sensible, more accurate, and leads to better final predictions.
Age | Gender | Years of Employment |
---|---|---|
60 | Male | 30 |
45 | Female | 20 |
24 | Male | ? |
26 | Female | ? |
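
The three strategies compared above can be sketched on the toy table in a few lines of Python. This is only an illustration under stated assumptions: the third strategy is a plain least-squares line of years of employment on age, used as a simple stand-in for "imputing from related features". It is not the Optimal Imputation algorithm, it ignores gender because two complete rows cannot support a second predictor, and the names `fit_line` and `impute_years` are hypothetical.

```python
from statistics import mean

# Toy table from the text; None marks a missing value.
rows = [
    {"age": 60, "gender": "Male", "years": 30},
    {"age": 45, "gender": "Female", "years": 20},
    {"age": 24, "gender": "Male", "years": None},
    {"age": 26, "gender": "Female", "years": None},
]

complete = [r for r in rows if r["years"] is not None]

# 1) Complete cases: drop every row that has a missing value.
complete_fraction = len(complete) / len(rows)

# 2) Mean imputation: one value for every missing cell, ignoring age.
mean_years = mean(r["years"] for r in complete)


# 3) Stand-in for relationship-aware imputation (an assumption, not the
#    product's method): least-squares line of years on age.
def fit_line(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(
    [r["age"] for r in complete], [r["years"] for r in complete]
)


def impute_years(age):
    return slope * age + intercept


model_imputed = [
    round(impute_years(r["age"]), 1) for r in rows if r["years"] is None
]

print("fraction of rows kept by complete cases:", complete_fraction)
print("mean-imputed years:", mean_years)
print("age-based imputed years:", model_imputed)
```

On this table the age-based line yields single-digit values for the two younger employees, whereas mean imputation assigns both of them 25 years.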
Imputations that reflect the ground truth
Understandable and trustworthy imputed values
A pipeline we designed for health insurance used Optimal Imputation as its backbone for assessing data quality, identifying gaps in data processing, and enabling humans to fix issues efficiently.
Example of Optimal Tree-based Imputation with explanations
Outlier detection
While most outlier detection solutions only identify the data entries that deviate from the rest, our approach pinpoints specific feature values within those entries and explains why these are outliers.
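
To make the idea of flagging a specific value (rather than a whole entry) concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the records are hypothetical, the robust Theil-Sen line is a stand-in for a model of how features relate, and the 10-year tolerance is arbitrary. It does not represent the method used in the product.

```python
from itertools import combinations
from statistics import median

# Hypothetical records: (age, years of employment).
records = [(60, 30), (45, 20), (50, 25), (24, 3), (35, 12), (30, 28)]


def theil_sen(points):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept


def flag_cells(points, tolerance=10.0):
    """Return (row index, feature name, reason) for each suspicious cell."""
    slope, intercept = theil_sen(points)
    flags = []
    for row, (age, years) in enumerate(points):
        expected = slope * age + intercept
        if abs(years - expected) > tolerance:
            reason = (
                f"years of employment = {years}, but age {age} "
                f"suggests about {expected:.1f}"
            )
            flags.append((row, "years_of_employment", reason))
    return flags


flags = flag_cells(records)
for row, feature, reason in flags:
    print(f"row {row}, {feature}: {reason}")
```

Each flag names the row, the feature, and a short reason, which is the shape of output a reviewer needs in order to correct a single value instead of discarding the whole record.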

Related publications
From predictive methods to missing data imputation: An optimization approach.
Dimitris Bertsimas, Colin Pawlowski, and Daisy Zhuo
The Journal of Machine Learning Research, 2017
The original publication by the co-founders pioneering Optimal Imputation. The paper formulates the missing data imputation problem as a joint optimization problem and presents a scalable method to solve it to optimality, establishing superior performance to the state of the art.
Applied Informatics Decision Support Tool for Mortality Predictions in Patients With Cancer
Dimitris Bertsimas, Jack Dunn, Daisy Zhuo, and others
JCO Clinical Cancer Informatics, 2018
Applying Optimal Imputation to electronic medical records data, the authors were able to infer values of important features such as gene expression and lab measurements that were not available in the raw data, and built more powerful predictive models with this augmented data.