Optimizing Data Acquisition

Identifying the right data to acquire using Optimal Feature Selection

Acquiring data to train a machine learning model

In real-world machine learning applications, the data already available in-house may not carry enough predictive power for the problem we want to solve. Boosting performance often requires acquiring more data.

In a 2017 survey, 92% of firms agreed that they need to increase their use of external data sources. However, acquiring data to train machine learning models is costly and involves several uncertainties, including:
  • Will the new data actually improve performance?
  • How much new data do we need?
  • How should we weigh the trade-off between the cost of acquisition and the performance boost?

Credit card spending patterns for insurance risk prediction

We partnered with an insurance specialist to solve the problem of predicting the cost associated with risky individuals. Initially, our partner only had access to household-level demographic data, which led to poor risk predictions.

We decided to evaluate the potential of enhancing these demographics with credit card spending patterns. To this end, we used Optimal Feature Selection to identify the set of credit card features that would maximize predictive performance.
  • Selecting the optimal set of features...

    Acquiring hundreds of features (such as healthcare expenditures, credit line information, etc.) can quickly become expensive, and most of these features will not carry sufficient predictive power to be used by our final machine learning model. Optimal Feature Selection can select the best set of features that our partner will need to acquire going forward.

  • ...by only acquiring data for a subset of households

    To identify this optimal set of credit card features, we only needed to purchase all available features for a small subset of households. Optimal Feature Selection can learn from limited data, which further limits research and development costs.
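
The idea of selecting a small feature set from a pilot sample can be illustrated with a minimal sketch. This is not the Optimal Feature Selection algorithm itself: it swaps in a brute-force best-subset search over ordinary least-squares fits, written in plain Python. The data is synthetic, and every name in it (best_subset, residual_ss, the household and feature counts, the choice of informative features) is a hypothetical assumption made for illustration only.

```python
import itertools
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residual_ss(columns, y):
    """Residual sum of squares after least-squares projection of y onto columns."""
    basis = []
    for col in columns:
        v = list(col)
        for b in basis:  # modified Gram-Schmidt orthogonalization
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:  # skip columns that are linearly dependent
            basis.append([vi / norm for vi in v])
    r = list(y)
    for b in basis:
        c = dot(r, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return dot(r, r)


def best_subset(feature_columns, y, k):
    """Exhaustively search all size-k feature subsets for the lowest in-sample error."""
    intercept = [1.0] * len(y)
    best_rss, best_idx = float("inf"), None
    for idx in itertools.combinations(range(len(feature_columns)), k):
        rss = residual_ss([intercept] + [feature_columns[j] for j in idx], y)
        if rss < best_rss:
            best_rss, best_idx = rss, idx
    return best_idx, best_rss


# Synthetic pilot sample (hypothetical): 200 households, 12 candidate features,
# of which only features 1, 4 and 7 actually drive the target.
rng = random.Random(0)
n_households, n_features = 200, 12
features = [[rng.gauss(0, 1) for _ in range(n_households)] for _ in range(n_features)]
target = [
    2.0 * features[1][i] - 1.5 * features[4][i] + 1.0 * features[7][i] + rng.gauss(0, 0.5)
    for i in range(n_households)
]

selected, rss = best_subset(features, target, k=3)
print("selected feature indices:", selected)
```

Exhaustive search is only feasible for a handful of candidates; with hundreds of features the number of subsets explodes, which is why a dedicated solver is needed in practice. The subset size k is fixed here for simplicity and would normally be chosen by comparing out-of-sample error across several values.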

Optimizing data acquisition cost

We found that augmenting our partner's database with credit card data indeed led to a lift in predictive performance. More importantly, Optimal Feature Selection discovered that we could achieve this lift by adding only three new features.

This finding allowed the insurance specialist to purchase only these three relevant features out of the 100+ that they initially planned on acquiring for the remaining households, thus saving 75% in data acquisition cost.

Increased transparency in handling sensitive data

When assessing the risk of insuring individuals, we want to understand which data the model uses to make predictions. Optimal Feature Selection provides transparency to the insurer, who can thus verify that the modeling is conducted in accordance with ethical and regulatory constraints.

By identifying a limited selection of relevant variables, Optimal Feature Selection eliminates the need to obtain regulatory approval for hundreds of features, and thus simplifies compliance processes.

Unique Advantage

Why is the Interpretable AI solution unique?

  • Cost reduction

    Identify the right features to acquire and thus eliminate unnecessary data acquisition costs

  • Reduced data engineering burden

    Less data to acquire also eliminates the need for complex data engineering pipelines

  • Simple and transparent

    Fewer features selected means a simpler, more transparent, and auditable model

© 2020 Interpretable AI, LLC. All rights reserved.