Many machine learning and predictive algorithms struggle with missing data: an entire record may be skipped if a single field value is missing. For example, in a decision tree, if a record has no value for the field on which the tree splits, the algorithm cannot determine which branch the record should follow, so the record is useless.
Most software implementations of machine learning processes get around this problem by letting the data scientist either ignore records with missing values or impute a value. Imputation is often chosen so as not to waste an otherwise good record. Most of the time, an average, median, or similar generic value stands in for the missing value. Nulls often look like missing values, too, and usually receive the same treatment from data scientists.
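As a rough illustration of the generic imputation described above, the following Python sketch fills missing entries with the mean or median of the observed values. The function name `impute_global`, the sample `order_totals` list, and the use of `None` to stand for both missing and null values are assumptions made for this example; they do not come from any particular machine learning package.

```python
from statistics import mean, median

def impute_global(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values.

    Hypothetical helper: None stands in for both missing and null values.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Illustrative data only: two of five order totals are missing.
order_totals = [120.0, None, 80.0, 100.0, None]
filled = impute_global(order_totals)
print(filled)
```

Every missing entry receives the same fill value no matter what kind of record it belongs to, which is what makes this approach generic.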
Most IBM i IT professionals are close enough to operations to know that an average taken across the entire database is unlikely to be a good substitute. Using domain knowledge and experience, they can easily define levels or classes whose values are better substitutes for the missing ones. This work is best done on IBM i, before the data reaches the data scientist.
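One way to picture class-based substitution is the minimal Python sketch below, which fills a missing value with the average of the record's own class rather than the average of the whole data set. The field names `cust_class` and `order_qty`, the class labels, and the function `impute_by_class` are all hypothetical, and Python is used only for illustration; on IBM i the same idea could be expressed in whatever tooling the shop already uses.

```python
from collections import defaultdict
from statistics import mean

def impute_by_class(records, class_key, value_key):
    """Fill missing values with the mean of the record's own class.

    Hypothetical helper: falls back to the overall mean when a class
    has no observed values at all.
    """
    groups = defaultdict(list)
    for rec in records:
        if rec[value_key] is not None:
            groups[rec[class_key]].append(rec[value_key])

    # Overall mean is only a fallback for classes with nothing observed.
    overall = mean(v for vals in groups.values() for v in vals)
    class_fill = {cls: mean(vals) for cls, vals in groups.items()}

    filled = []
    for rec in records:
        new_rec = dict(rec)  # copy so the input records are not modified
        if new_rec[value_key] is None:
            new_rec[value_key] = class_fill.get(new_rec[class_key], overall)
        filled.append(new_rec)
    return filled

# Illustrative data only: the classes are assumed to come from domain knowledge.
orders = [
    {"cust_class": "RETAIL", "order_qty": 2},
    {"cust_class": "RETAIL", "order_qty": 4},
    {"cust_class": "RETAIL", "order_qty": None},
    {"cust_class": "WHOLESALE", "order_qty": 500},
    {"cust_class": "WHOLESALE", "order_qty": 700},
    {"cust_class": "WHOLESALE", "order_qty": None},
]
filled_orders = impute_by_class(orders, "cust_class", "order_qty")
for rec in filled_orders:
    print(rec)
```

In this made-up data the overall average quantity is 301.5, which fits neither kind of customer, while the per-class averages (3 and 600) stay close to what each class actually orders.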