In machine learning, decision trees are a great algorithm family for working with business data. They are not the most accurate models, nor are they considered cutting edge, but they are a first-pass algorithm for many data scientists. In version two of a project, another algorithm family might deliver a more reliable model, but over most types of transaction or ERP data, decision trees as a class are where most data scientists start.
One of the great things for business use is that decision trees can be read and understood by people. That interpretability lends them an air of credibility: managers and executives can look at the logic of the tree and follow how the final answer is reached at each branch.
It also lets IBM i programmers code the decision tree splits in familiar programming languages. Realistically, this is the only way decision trees are going to run natively on the box without making calls out to other servers.
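A hand-coded decision tree is really just nested conditionals, which is why it translates so directly into any business programming language. Here is a minimal sketch in Python; the field names, thresholds, and risk labels are all hypothetical, standing in for whatever splits a trained model would produce:

```python
def predict_credit_risk(order_total, days_since_last_payment, years_as_customer):
    """Hand-coded decision tree splits. Thresholds are illustrative only;
    a real tree's splits come from the trained model."""
    if days_since_last_payment > 60:
        # Late payers: order size decides the branch
        if order_total > 5000:
            return "high"
        return "medium"
    # Current payers: tenure and order size decide
    if years_as_customer >= 5:
        return "low"
    if order_total > 10000:
        return "medium"
    return "low"

risk = predict_credit_risk(6000, 90, 2)
```

Straightforward to write once, but note how every retrained model means re-walking every one of these branches by hand, which is the maintenance problem described next.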
In practice, though, you'll want to make those calls out from your programs rather than hand-code the tree. The reasons go beyond the sheer work of coding hundreds or even thousands of decision points into a program. The easiest way to explain is to ask the question, "What happens when they change the model?"
It will happen. It always happens.
Many machine learning and predictive processes struggle when they encounter missing data; an entire record is bypassed if even one field value the algorithm needs is missing. In a decision tree, for example, if the field where the tree splits has no value, the record is useless because the algorithm cannot say which branch the record should follow.
Most software implementations of machine learning get around this problem by letting the data scientist either ignore records with missing values or impute a value. Imputation is often chosen so as not to waste an otherwise good record, and most of the time an average, median, or similar generic value stands in for the missing one. Null usually looks like a missing value, too, and receives the same treatment from data scientists.
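The generic approach looks something like this sketch in Python, using only the standard library. The records and field names are made up for illustration; a None stands in for a null or missing value arriving from the database:

```python
from statistics import median

# Toy records extracted from an orders file; one credit_limit is missing (None)
records = [
    {"customer": "A", "credit_limit": 1000.0},
    {"customer": "B", "credit_limit": None},
    {"customer": "C", "credit_limit": 3000.0},
    {"customer": "D", "credit_limit": 2000.0},
]

# Take the median of the values that are present
known = [r["credit_limit"] for r in records if r["credit_limit"] is not None]
fill = median(known)

# Impute it so the record isn't thrown away
for r in records:
    if r["credit_limit"] is None:
        r["credit_limit"] = fill
```

Customer B gets the database-wide median, 2000.0, regardless of whether that number makes any business sense for that customer, which is exactly the weakness discussed next.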
Most IBM i IT professionals are close enough to operations to know that averages across the entire database are unlikely to be good substitutes. Using domain knowledge, IBM i professionals can create levels or classes from experience that substitute far better for the missing values. This work is best done on IBM i before the data gets to the data scientist.
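As one hedged illustration of that idea, suppose the business knows credit limits run very differently by sales region. Filling a missing value with the median for that record's region, rather than a global figure, bakes domain knowledge into the data before it leaves the shop. The region field and values here are invented for the example:

```python
from statistics import median

# Hypothetical data: credit limits differ sharply by region,
# so a region-level fill beats a database-wide average
records = [
    {"customer": "A", "region": "EAST", "credit_limit": 1000.0},
    {"customer": "B", "region": "EAST", "credit_limit": None},
    {"customer": "C", "region": "EAST", "credit_limit": 3000.0},
    {"customer": "D", "region": "WEST", "credit_limit": 9000.0},
]

def fill_by_region(recs):
    """Replace missing credit limits with the median for that record's region."""
    by_region = {}
    for r in recs:
        if r["credit_limit"] is not None:
            by_region.setdefault(r["region"], []).append(r["credit_limit"])
    for r in recs:
        if r["credit_limit"] is None:
            r["credit_limit"] = median(by_region[r["region"]])
    return recs

filled = fill_by_region(records)
```

Customer B now gets the EAST median instead of a company-wide number skewed upward by the WEST outlier. The same pattern applies to any grouping the business knows matters: customer class, product line, plant, and so on.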
Someone messaged me to point out that my title of the "Green" Revolution last week might also refer to IBM i and its green-screen terminal heritage.
I think that is a valid and reasonable line of thought to follow this week. Machine learning and advanced analytics need real data to create meaningful and useful models. IBM i is at the heart of your real data, as in data that is really useful. IBM i contains transaction records, customer records, payment records, and other concrete data points for businesses.
While your IBM DB2 on i database probably does not store production machine sensor data, product environmental condition information, or other larger volume data flows, those same data flows almost always need to be tied to the transaction, product, and customer data from IBM DB2 on i to create useful machine learning models, advanced analytic visualizations, and so on.
IBM DB2 on i data is critical to the success of many commercial analytics projects. I wish IBM would give a nod to that heritage in its current marketing.
Good quality data is never a bad thing. For fueling analytic processes, it is a must. In order to maximize return on the investment in machine learning and predictive analytics, companies need clean data as a foundation for analysis. (My use of “green” in the title refers to making money for those outside the US.)
Let's face some real facts: no one is going to do machine learning on IBM i. It's not going to happen.
However, for many companies IBM i holds important data which is needed for creating meaningful processes based on machine learning. Getting that data to a machine learning environment seems like a no-brainer; just extract the data and send it over. In the real world, many data fields in databases on IBM i need a little massaging to use effectively in other applications.
Big-picture problems include multi-member files, which are almost impossible for non-IBM i tools to deal with. I have seen companies where the analysts didn't know a file was multi-member, so when they wrote an SQL statement to retrieve the data, only data from the first member was pulled. They wasted precious time trying to figure out the problem before throwing in the towel and talking to the IBM i people. Another common challenge is dates stored in non-date fields, or worse yet, spread across multiple fields: one field for the century and year, another for the month, and another for the day. There are a few other pointers I will elaborate on in the next few weeks.
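The split-date case is worth a quick sketch, because it trips up every tool downstream. Assuming a common convention where the century-and-year field carries a century flag plus a two-digit year (0xx meaning 19xx, 1xx meaning 20xx, as in the *CYMD style); real file layouts vary, so treat the field layout here as illustrative:

```python
from datetime import date

def cyymd_to_date(cyy: int, mm: int, dd: int) -> date:
    """Rebuild a real date from three packed fields:
    cyy = century flag + 2-digit year (0xx -> 19xx, 1xx -> 20xx),
    mm  = month, dd = day.
    Layout is illustrative; check the actual file's convention."""
    century = 1900 + (cyy // 100) * 100
    year = century + (cyy % 100)
    return date(year, mm, dd)

ship_date = cyymd_to_date(119, 7, 4)   # cyy=119 -> 2019
old_date = cyymd_to_date(98, 12, 31)   # cyy=098 -> 1998
```

Converting these into true date types on IBM i, before extraction, spares the analysts from rediscovering the convention on their own.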