Transforming data is a key part of feature engineering, which involves the use of domain knowledge to create new features—a.k.a. predictors, variables, attributes, columns, or fields—in the interest of improving machine learning model quality. However, if you apply transformations incorrectly, you can actually reduce the accuracy of your models.
If you’re not a data scientist or are relatively new to machine learning, you may think “a transformation’s a transformation” no matter where or when it’s used. Let’s take a simple transformation as an example. To bin a column of numeric data, simply find the max and min values, divide into equal parts to determine bin boundaries, and you’re done. However, when we think about data transformations in preparing data for machine learning, there’s a little more we have to consider.
Let’s illustrate with an example. Say we want to build a machine learning model with a variable that represents a person’s age. In our example, the data we use to build the model—our training data—has an age range from 20 to 60. So, if we choose to bin this into 4 equal width bins, we’d get bin boundaries at 30, 40, and 50. All’s well, so we build the model.
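The boundary arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper name `equal_width_boundaries` is hypothetical and does not come from any particular library.

```python
def equal_width_boundaries(lo, hi, n_bins):
    """Return the n_bins - 1 interior boundaries that split [lo, hi] into equal-width bins."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

# Training data: ages range from 20 to 60, binned into 4 equal-width bins.
train_boundaries = equal_width_boundaries(20, 60, 4)
print(train_boundaries)  # [30.0, 40.0, 50.0]
```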
We might stop here and just inspect model details, e.g., the rules of a decision tree model, the cluster definitions from a k-means model, or the coefficients of a support vector machine model. However, many models are used for scoring, that is, to make predictions. So, when we get a new data set for scoring, how do we prepare that data?
We know that we need to use the same transforms on the new data as we used for the training data, so we’ll want to bin our “age” variable into 4 bins. But doing this blindly, by simply re-running the same equal-width binning procedure on the new data, will likely reduce the model’s accuracy. How? Suppose the age values in our new data range from 10 to 70. Binning this data into 4 equal-width bins results in boundaries at 25, 40, and 55. But our model was built using age with bins at 30, 40, and 50. So while a person with age 54 was in the highest bin in our training data set, say bin 4, that same person lands in bin 3 when we prepare the scoring data this way. Clearly, there’s a problem.
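To make the mismatch concrete, here is a minimal sketch that bins age 54 twice, once with boundaries derived from the training range and once with boundaries recomputed from the scoring range. The helper names are hypothetical, and the convention that a value equal to a boundary goes to the upper bin is an assumption made for this example.

```python
from bisect import bisect_right

def equal_width_boundaries(lo, hi, n_bins):
    """Interior boundaries that split [lo, hi] into n_bins equal-width bins."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def bin_index(value, boundaries):
    """1-based bin number; a value equal to a boundary goes to the upper bin (assumed convention)."""
    return bisect_right(boundaries, value) + 1

train_boundaries = equal_width_boundaries(20, 60, 4)  # 30, 40, 50
score_boundaries = equal_width_boundaries(10, 70, 4)  # 25, 40, 55

bin_in_training = bin_index(54, train_boundaries)  # bin 4
bin_in_scoring = bin_index(54, score_boundaries)   # bin 3: same age, different bin
```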
The answer involves recognizing that some transformations depend on statistics that must come from the training data rather than from the data set currently being prepared, and this requires maintaining statistical metadata. When we compute statistics on the training data to support transformations, such as mean, mode, min, max, and standard deviation, we need to store those statistics as metadata and reuse them when preparing data for scoring. This ensures that the new scoring data is transformed the same way as the training data, even if the range of values is different in the scoring data.
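One way to picture this is as a "fit" step that computes and stores the statistics, and an "apply" step that only reads them. The sketch below is an assumption-laden illustration, not any specific tool's mechanism: the function names, the sample ages, and the use of JSON as the storage format are all invented for the example.

```python
import json
from bisect import bisect_right

def fit_equal_width(values, n_bins):
    """Compute bin boundaries from the TRAINING data and return them as metadata."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return {"min": lo, "max": hi,
            "boundaries": [lo + i * width for i in range(1, n_bins)]}

def apply_binning(values, metadata):
    """Bin any data set using stored boundaries; nothing is recomputed here."""
    boundaries = metadata["boundaries"]
    return [bisect_right(boundaries, v) + 1 for v in values]

train_ages = [20, 33, 41, 54, 60]           # hypothetical training values
metadata = fit_equal_width(train_ages, 4)
stored = json.dumps(metadata)               # kept alongside the model

score_ages = [10, 54, 70]                   # hypothetical scoring values
score_bins = apply_binning(score_ages, json.loads(stored))
print(score_bins)  # [1, 4, 4]
```

Age 54 now lands in bin 4, as it did during training. In this sketch, values outside the training range (10 and 70) fall into the first and last bins; whether that is the right treatment for out-of-range values is a separate design decision.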
While equal width binning is one example, there are many such transformations that rely on statistics computed from the training data. For example,
- Equal frequency binning – bin boundaries
- Supervised binning – bin boundaries
- Min-max scaling – min and max values
- Standard scaling – mean and standard deviation
- Outlier treatment – max and min values
- Missing value treatment – mean, mode values
- Categorical encoding (one-hot encoding) – distinct values
- Derived values that rely on training data statistics – a wide range of possible statistics
There are more, but you get the idea.
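The same fit-then-apply pattern carries over to other items in the list. As a rough sketch, here it is for standard scaling and one-hot encoding. The function names and sample values are hypothetical, and the choices to use the population standard deviation and to encode unseen categories as all zeros are assumptions made for illustration.

```python
from statistics import mean, pstdev

def fit_standard_scaler(values):
    """Store the training mean and (population) standard deviation."""
    return {"mean": mean(values), "std": pstdev(values)}

def apply_standard_scaler(values, stats):
    # Assumes std > 0; a constant training column would need special handling.
    return [(v - stats["mean"]) / stats["std"] for v in values]

def fit_one_hot(values):
    """Store the distinct categories seen in the training data."""
    return {"categories": sorted(set(values))}

def apply_one_hot(values, stats):
    # A category not seen in training becomes an all-zero row (assumed policy).
    cats = stats["categories"]
    return [[1 if v == c else 0 for c in cats] for v in values]

scaler = fit_standard_scaler([30.0, 50.0, 70.0])         # training values
scaled = apply_standard_scaler([50.0, 90.0], scaler)     # scoring values

encoder = fit_one_hot(["red", "green", "blue"])          # training values
encoded = apply_one_hot(["green", "purple"], encoder)    # scoring values
print(encoded)  # [[0, 1, 0], [0, 0, 0]]
```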
Of course, there are transformations that don’t require statistics as metadata – functions that do not depend on the distribution of the data itself. For example, mathematical functions, such as log, are data independent – the log of 25 is the same whether applied to the value 25 in the training data or the scoring data.
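A small contrast may help here. In the sketch below, the log of 25 is identical in both contexts, while min-max scaling the same value with each data set's own range (20 to 60 for training, 10 to 70 for scoring, as in the age example) gives two different answers. The helper `min_max_scale` is a hypothetical name.

```python
import math

def min_max_scale(value, lo, hi):
    """Scale value into [0, 1] relative to the range [lo, hi]."""
    return (value - lo) / (hi - lo)

# Data independent: no statistics needed, so the result cannot drift.
log_train = math.log(25)
log_score = math.log(25)

# Data dependent: recomputing the range per data set changes the result.
mm_train = min_max_scale(25, 20, 60)  # 0.125 using the training range
mm_score = min_max_scale(25, 10, 70)  # 0.25 using the scoring range
```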
Some machine learning tools provide support for maintaining the needed metadata. For example, Oracle Machine Learning in-database models support automatic data preparation, so steps such as exploding categorical data or normalizing numeric data are not only done automatically, but the corresponding statistics and metadata are stored with the model. When scoring data, transformations are applied using the statistics from the training data. In addition, if users have explicitly prepared data, they can leverage embedded data preparation for in-database models. The user provides the transformations and their metadata, which are then stored with the model and automatically applied when scoring data. Oracle Data Miner, which provides a drag-and-drop user interface for constructing analytical workflows, allows users to copy the training data transform nodes that were already run, and paste them with the same statistics in a workflow to prepare data for scoring.
So, not all machine learning transformations are applied equally, especially when it comes to preparing data for scoring. Some require statistics gathered on the training data to be maintained and applied on the scoring data. If your model isn’t performing well on new data, there are many possible causes, one of which might be related to transformation statistics.