Handling Missing Values when Training a Machine Learning Model
Naïve Bayes models elegantly handle missing values for training and scoring by computing the likelihood based on the observed features. Because of conditional independence between the features, naïve Bayes ignores a feature only when its value is missing. Thus, you do not need to handle missing values before fitting a naïve Bayes model unless you believe that the missingness is not at random. For efficiency reasons, some implementations of naïve Bayes remove entire rows from the training process whenever a missing value is encountered. When missing is treated as a categorical level, infrequent missing values in new data can be problematic when they are not present in training data, because the missing level has had no probability associated with it during training. You can solve this problem by ignoring the offending feature in the likelihood computation when scoring.
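The following is a minimal sketch of this idea for categorical features, not the method of any particular naïve Bayes implementation. The function names, the use of `None` to represent a missing value, the Laplace smoothing parameter `alpha`, and the toy data are all assumptions made for illustration. Missing values are skipped when counts are accumulated during training, and a feature is left out of the likelihood at scoring time when its value is missing or was never observed in training.

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Count classes and per-feature values, skipping missing (None) entries."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    levels = defaultdict(set)            # feature index -> observed levels
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is None:            # missing: this feature adds nothing
                continue
            value_counts[(j, label)][value] += 1
            levels[j].add(value)
    return class_counts, value_counts, levels

def log_posteriors(model, row, alpha=1.0):
    """Unnormalized log posterior per class, using only usable features."""
    class_counts, value_counts, levels = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)      # class prior
        for j, value in enumerate(row):
            # Ignore the feature if it is missing or its level is unseen.
            if value is None or value not in levels[j]:
                continue
            counts = value_counts[(j, label)]
            numerator = counts[value] + alpha
            denominator = sum(counts.values()) + alpha * len(levels[j])
            score += math.log(numerator / denominator)
        scores[label] = score
    return scores

# Hypothetical toy data: (outlook, temperature) with one missing temperature.
rows = [("sunny", "hot"), ("sunny", None), ("rainy", "cool"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = fit_naive_bayes(rows, labels)

partial = log_posteriors(model, (None, "cool"))  # outlook is missing
empty = log_posteriors(model, (None, None))      # everything is missing
```

When every feature is missing, the scores in this sketch fall back to the class priors, which is the expected behavior when no feature contributes to the likelihood.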
In general, imputation, missing markers, binning, and special scoring considerations are not required for missing values when you use a decision tree. Decision trees allow for the elegant and direct use of missing values in two common ways:
- When a splitting rule is determined, missing can be a valid input value, and missing values can either be placed on the side of the splitting rule that makes the best training prediction or be assigned to a separate branch in a split.
- Surrogate rules can be defined to allow the tree to split on a surrogate variable when a missing value is encountered. For example, a surrogate rule could be defined that allows a decision tree to split on the state variable when the ZIP code variable is missing.
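The first approach can be sketched for a single candidate split as follows. This is an illustrative simplification under stated assumptions, not how any specific tree library works: labels are binary (0 or 1), impurity is measured with the Gini index, `None` stands for a missing value, and the function names and data are hypothetical. The sketch tries sending the missing values to each side of the split and keeps the side that gives the lower weighted impurity on the training data.

```python
def gini(labels):
    """Gini impurity for binary 0/1 labels; an empty node counts as pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_side_for_missing(values, labels, threshold):
    """Choose the branch for missing values that minimizes weighted impurity."""
    pairs = list(zip(values, labels))
    left = [y for x, y in pairs if x is not None and x <= threshold]
    right = [y for x, y in pairs if x is not None and x > threshold]
    missing = [y for x, y in pairs if x is None]
    n = len(labels)

    def weighted(l, r):
        return (len(l) * gini(l) + len(r) * gini(r)) / n

    options = {
        "left": weighted(left + missing, right),
        "right": weighted(left, right + missing),
    }
    side = min(options, key=options.get)
    return side, options[side]

# Hypothetical data: both missing rows belong to class 1, like the large values.
values = [1, 2, None, 8, 9, None]
labels = [0, 0, 1, 1, 1, 1]
side, impurity = best_side_for_missing(values, labels, threshold=5)
```

In this toy data the missing rows share a class with the rows above the threshold, so routing them to the right branch leaves both branches pure.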
Missing markers are binary variables that record whether the value of another variable is missing. They are used to preserve information about missingness so that missingness can be modeled. Missing markers can be used in a model to replace the original corresponding variable with missing values, or they can be used in a model alongside an imputed version of the original variable.
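A missing marker can be sketched in a few lines. The helper name, the `_missing` suffix, the dictionary-per-row layout, and the use of `None` for a missing value are assumptions for illustration only.

```python
def add_missing_marker(rows, column):
    """Return copies of the rows with a 0/1 marker for missingness in `column`."""
    marker = f"{column}_missing"
    return [{**row, marker: int(row.get(column) is None)} for row in rows]

# Hypothetical records with one missing income value.
records = [{"income": 52000.0}, {"income": None}, {"income": 61000.0}]
marked = add_missing_marker(records, "income")
markers = [row["income_missing"] for row in marked]
```

The original rows are left unchanged, so the marker can be used either in place of the original variable or next to an imputed copy of it.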
Imputation refers to replacing a missing value with information that is derived from nonmissing values in the training data. Simple imputation schemes include replacing a missing value in an input variable with the mean or mode of that variable’s nonmissing values. For nonnormally distributed variables or variables that have a high proportion of missing values, simple mean or mode imputation can drastically alter a variable’s distribution and negatively impact predictive accuracy. Even when variables are normally distributed and contain a low proportion of missing values, creating missing markers and using them in the model alongside the new, imputed variables is a suggested practice. Decision trees can also be used to derive imputed values. A decision tree can be trained using a variable that has missing values as its target and all the other variables in the data set as inputs. In this way, the decision tree can learn plausible replacement values for the missing values in the temporary target variable. This approach requires one decision tree for every input variable that has missing values, so it can become computationally expensive for large, dirty training sets. More sophisticated imputation approaches, including multiple imputation (MI), should be considered for small data sets (Rubin 1987).
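The effect of simple mean imputation on a variable's distribution can be illustrated with a small sketch. The function names and the toy values are hypothetical, and `None` is assumed to represent a missing value. The mean is learned from the nonmissing training values only, and the spread of the variable is compared before and after imputation.

```python
from statistics import mean, pstdev

def fit_mean(train_values):
    """Learn the replacement value from nonmissing training values only."""
    return mean(v for v in train_values if v is not None)

def impute(values, fill):
    """Replace each missing value with the learned fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical variable with a high proportion of missing values.
train = [1.0, None, None, None, 9.0]
fill = fit_mean(train)
imputed = impute(train, fill)

observed = [v for v in train if v is not None]
spread_before = pstdev(observed)  # spread of the values actually observed
spread_after = pstdev(imputed)    # spread after filling in the mean
```

Because every imputed value sits exactly at the mean, the spread shrinks, which is one way that mean imputation can distort a variable that has many missing values.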
Mean imputation assumes that missingness is random. Imputation can be a more complicated issue when missingness is nonrandom, dependent on inputs, or canonical.
Interval input variables that have missing values can be discretized into many bins according to their original numeric values to create new categorical, nominal variables. Missing values in the original variable can simply be added to an additional bin in the new variable. Categorical input variables that have missing values can be assigned to new categorical nominal variables that have the same categorical levels as the corresponding original variables plus one new level for missing values. Because binning introduces additional nonlinearity into a predictive model and can be less damaging to an input variable’s original distribution than imputation, binning is generally considered acceptable, if not beneficial, until the binning process begins to contribute to overfitting. However, you might not want to use binning if the ordering of the values in an input variable is important, because the ordering information is changed or erased by introducing a missing bin into the otherwise ordered values.
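Binning an interval variable with an extra bin for missing values might look like the sketch below. The bin edges, the label format, the helper name, and the use of `None` for a missing value are assumptions chosen for illustration.

```python
import bisect

def bin_with_missing(values, edges, missing_label="missing"):
    """Map numeric values to nominal bin labels, with a separate missing bin.

    `edges` must be sorted in ascending order.
    """
    labels = []
    for v in values:
        if v is None:
            labels.append(missing_label)
            continue
        # bisect_right counts how many edges are <= v, which is the bin index.
        labels.append(f"bin_{bisect.bisect_right(edges, v)}")
    return labels

# Hypothetical edges at 10 and 20 give three numeric bins plus a missing bin.
binned = bin_with_missing([5, 10, 15, None, 25], edges=[10, 20])
```

The resulting labels are nominal, so the model no longer sees any ordering between the numeric bins and the missing bin, which is the trade-off noted above.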
Source: Best Practices for Machine Learning Applications (Wujek, Hall, and Güneş 2016) SAS Institute Inc.