By Anand S. Rao, Interested in cutting edge science, engineering, politics, and philosophy.
Figure 1: Data Scientists are from Mars and Software Developers are from Venus.
Mars and Venus are very different planets. Mars’s atmosphere is very thin and it can get very cold, while Venus’s atmosphere is very thick and it can get very hot — hot enough to melt lead! Yet, they are our closest sister planets. They have a number of similarities too. Both have a high concentration of carbon dioxide in their atmosphere and are exposed to solar radiation with no protective magnetic field.
Software engineers and data scientists come from two different worlds, one from Venus and the other from Mars. They have different backgrounds and mindsets and deal with different sets of issues, yet they also have a number of things in common. We will look at the key differences (and similarities) between them, why those differences exist, and what kind of bridge we need to build between them. In this blog, we explore the fundamental differences between software and models.
Software vs. Models
In traditional programming, one provides line-by-line instructions (often called an algorithm) that tell a computer how to process the input data to produce the desired output, as defined by a given software specification. The instructions can be written in one of many computer languages, e.g., Lisp, Prolog, C++, Java, Python, or Julia.
In data science, one provides the input data and, in some cases (e.g., supervised machine learning), a sample of the output data to build a model that can recognize the patterns in the input data. Unlike traditional programming, data science models are trained by providing input data (or input and output data) to recognize patterns or make inferences. When fully trained, validated, and tested, they perform predictions or inferences on new data.
A model is a formal mathematical representation that can be applied to or calibrated to fit data. Scott Page provides a number of examples of models and how they are used to make decisions. Models can be machine learning models, dynamic system models, agent-based models, discrete event models, or a number of other different types of mathematical representations. In this article, we will focus primarily on machine learning (ML) models.
Figure 2: Software and Model Definition.
Software and models differ along five key dimensions. These dimensions are shown in Figure 3.
Figure 3: Five key dimensions of difference between software and models.
The output of software is certain. For example, consider an algorithm (e.g., bubble sort) for sorting an array of numbers. The bubble sort algorithm takes an array of numbers as input and repeatedly passes through it, swapping adjacent elements that are out of order, to produce an output array sorted in ascending or descending order. Given any array, the bubble sort algorithm will always produce a sorted array as output. There is no uncertainty in its output. If the program has been implemented correctly and properly tested, it will always produce the correct result.
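To make the point about certainty concrete, here is a minimal sketch of bubble sort in Python. The function name `bubble_sort` and the `descending` flag are illustrative choices, not something prescribed by the article.

```python
def bubble_sort(values, descending=False):
    """Return a new list with the items of `values` sorted by bubble sort."""
    result = list(values)  # copy so the caller's list is left untouched
    n = len(result)
    for end in range(n - 1, 0, -1):
        swapped = False
        for i in range(end):
            if descending:
                out_of_order = result[i] < result[i + 1]
            else:
                out_of_order = result[i] > result[i + 1]
            if out_of_order:
                # Swap the adjacent pair that is in the wrong order.
                result[i], result[i + 1] = result[i + 1], result[i]
                swapped = True
        if not swapped:  # a full pass without swaps means we are done
            break
    return result


data = [5, 2, 9, 1, 5, 6]
first = bubble_sort(data)
second = bubble_sort(data)
print(first)            # [1, 2, 5, 5, 6, 9]
print(first == second)  # True: same input, same output, every time
```

Running the function twice on the same input gives identical results, which is exactly the kind of guarantee a trained model cannot offer.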
In contrast, take a deep learning model that has been trained on a large number of images and can recognize different breeds of cats. When the model is given a cat image as input, it predicts the breed of the cat. However, its answer is not guaranteed to be correct; in fact, more often than not, the accuracy of such a model will be less than 100%. Figure 3 illustrates the input, the deep learning network layers, and the output of a deep learning model. The model predicts that the image contains a tabby cat with 45% confidence and that it could be an Egyptian cat with 23% confidence. In other words, the predictions from models are often uncertain. This uncertainty in predictions is a challenging concept for businesses to grasp.
Figure 3: Deep learning image recognition model.
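The following sketch shows how a classifier typically turns raw scores into a set of probabilities rather than a single certain answer. It uses a plain softmax function; the class labels and raw scores are made up for illustration and are not taken from the model in the figure.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - shift) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical raw scores (logits) for one cat image; the values are invented.
raw_scores = {"tabby": 2.0, "Egyptian cat": 1.3, "tiger cat": 1.0, "lynx": 0.2}

probabilities = softmax(raw_scores)
best_label = max(probabilities, key=probabilities.get)

# Print the classes from most to least likely.
for label, p in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.0%}")
print("Top prediction:", best_label)
```

Even the top class receives well under 100% of the probability mass, so any decision built on this output has to account for the chance that it is wrong.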
The second dimension of difference between software and models is the decision space. By decision space, we mean the context in which the software or model is used for making decisions. When we build software, we typically have a specification or a user need that is coded as an algorithm. The software is tested and, when fully tested, is made available for use. The software gets executed to produce an outcome. This decision space is fixed, or static. If the needs of the user change, the algorithm has to be modified or rebuilt and tested again. There is no promise of the algorithm learning or modifying itself. Figure 4 illustrates the context around how software gets used (an adaptation of how models are used, from the book Prediction Machines).
Figure 4: Software decision space.
When it comes to models, the decision space is more dynamic. Consider a machine learning-based chatbot that has been trained to provide first-level support for queries related to smartphones. The model would have been trained on historical data about the make, model, and accessories of different smartphones. Once deployed, the chatbot will be able to answer customer queries about the smartphones that were on the market when it was trained. Let’s assume that when the chatbot cannot answer a query with at least a certain level of confidence, it passes the query to a human customer service agent. Such a chatbot will work fine for a few months, but when new models and accessories are introduced in the market, the chatbot will be unable to respond to customer queries about them and will transfer more and more calls to the human agent, eventually making the chatbot useless.
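The hand-off rule described above can be sketched in a few lines. Everything here is hypothetical: `toy_chatbot` is a stand-in for a trained model, and the threshold value of 0.7 is an arbitrary assumption.

```python
CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off, chosen only for illustration


def toy_chatbot(query):
    """Stand-in for a trained model: it only knows phones seen in training."""
    known = {"phone a": "Phone A supports wireless charging."}
    for name, answer in known.items():
        if name in query.lower():
            return answer, 0.92  # (answer, confidence)
    return "I'm not sure.", 0.30


def route_query(query, answer_fn, threshold=CONFIDENCE_THRESHOLD):
    """Answer with the bot when it is confident, otherwise escalate."""
    answer, confidence = answer_fn(query)
    if confidence >= threshold:
        return ("bot", answer)
    return ("human", None)  # low confidence: pass the query to a human agent


print(route_query("Does Phone A charge wirelessly?", toy_chatbot))
print(route_query("Tell me about Phone Z", toy_chatbot))
```

As newer phones (like the invented "Phone Z") appear, a growing share of queries falls below the threshold and ends up with the human agent.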
Models rely on historical data, and when that historical data is no longer relevant, they need to be refreshed with newer data. This makes the context in which models operate more dynamic. User needs and software specifications can also change, but such changes happen less frequently. Moreover, there is no expectation that the software will continue to function when the specification changes. In the case of models, there is a clear expectation that when the performance of the model deteriorates, we are, at the very least, alerted to this deterioration (often called model drift) or, at best, the outcomes and new data are used to continuously improve the model. Figure 5 illustrates the context around how models get used. Note the feedback loop from outcome to training (the red dotted lines have been added to the original diagram in Prediction Machines for emphasis). It is this feedback process that makes the decision space for models more dynamic.
Figure 5: Model decision space.
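One simple way to get alerted to deteriorating performance is to track accuracy over a sliding window of recent outcomes. The sketch below is only an illustration of that idea; the class name `DriftMonitor`, the window size, and the accuracy floor are all assumptions, and production systems usually use richer drift checks.

```python
from collections import deque


class DriftMonitor:
    """Track accuracy over the most recent predictions and flag a drop."""

    def __init__(self, window=50, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # oldest results fall off the end
        self.min_accuracy = min_accuracy

    def record(self, prediction, actual):
        """Store whether the latest prediction matched the observed outcome."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        """Alert only once the window is full, to avoid noisy early alarms."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


monitor = DriftMonitor(window=10, min_accuracy=0.8)
for _ in range(10):
    monitor.record("in stock", "in stock")      # model is right at first
print(monitor.accuracy(), monitor.drifted())

for _ in range(5):
    monitor.record("in stock", "discontinued")  # the world has changed
print(monitor.accuracy(), monitor.drifted())
```

The recorded outcomes are the same signal that the feedback loop in the figure would feed back into retraining.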
Traditional software typically uses a deductive inferencing mechanism, while machine learning models are based on inductive inferencing. As shown in Figure 6, a software specification acts as the theory from which the code is developed. The code can be viewed as the hypothesis that needs to be confirmed against the theory using observations. Here, the observations are simply the outputs produced by the code, which need to be repeatedly tested against the specification.
Models are patterns derived from observational data. The initial model acts like a hypothesis that needs to be iteratively refined to ensure that the model best matches the observational data. The trained model, when validated, captures the theory underlying the data. Sufficient care needs to be taken to ensure that the model does not over-fit or under-fit the data.
Moving from theory to observation as done by software is arguably easier than moving from observation to theory as done by models. This also highlights the challenge around the certainty/uncertainty of the output as discussed in the first dimension.
Figure 6: Inference in software and models.
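The contrast between the two directions of inference can be illustrated with a toy example. In the deductive case, the rule is given by the specification and simply coded. In the inductive case, the rule is estimated from observations with an ordinary least-squares line fit. The temperature readings are invented, and the function names are illustrative.

```python
def fahrenheit_from_spec(celsius):
    """Deductive: the rule F = 1.8 * C + 32 is given by the specification."""
    return 1.8 * celsius + 32


def fit_line(xs, ys):
    """Inductive: estimate slope and intercept from observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up, slightly noisy thermometer readings: (Celsius, Fahrenheit).
observations = [(0, 32.4), (10, 49.7), (20, 68.3), (30, 85.6), (40, 104.2)]
xs = [c for c, _ in observations]
ys = [f for _, f in observations]

slope, intercept = fit_line(xs, ys)
print("Rule from the specification: F = 1.8 * C + 32")
print(f"Rule learned from data:      F = {slope:.3f} * C + {intercept:.2f}")
```

The learned rule comes close to the true one but does not match it exactly, which is the uncertainty of the first dimension showing up again.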
The process by which software is developed is also fundamentally different from the way models are typically developed. Software development typically follows a waterfall approach or an agile approach. The waterfall approach goes through a series of steps from specification to design to coding to testing and deployment. In the case of an agile approach, the software development process is iterative and often embodies a set of principles centered around user needs and self-organizing, cross-functional teams. Software is typically developed in one- to four-week sprints. Each successive sprint adds functionality, leading to a minimum viable product that is released to users.
The model development process has to follow a somewhat different approach. The availability, quality, and labelling of data, together with the difficulty of estimating the achievable accuracy (or, more generally, the performance) of a model in advance, mean that model development needs to take a portfolio approach. Data scientists need to develop a number of models, of which only a subset may meet the performance criteria. As a result, the model development process needs to follow a more scientific process of experimentation, testing, and learning from these experiments to refine the next set of experiments. This hypothesis-test-learn process does not fit well with an agile software development life cycle.
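A bare-bones sketch of the portfolio idea: evaluate several candidate models on held-out data and keep only those that meet the performance target. The candidates here are hand-written rules standing in for trained models, and the validation data and the 90% target are invented for illustration.

```python
def accuracy(model, examples):
    """Fraction of (input, label) examples the model classifies correctly."""
    correct = sum(1 for x, label in examples if model(x) == label)
    return correct / len(examples)


def select_models(candidates, validation_set, target):
    """Score every candidate and keep those that meet the target."""
    scores = {name: accuracy(m, validation_set) for name, m in candidates.items()}
    kept = {name: s for name, s in scores.items() if s >= target}
    return scores, kept


# Invented held-out data: (feature value, true label).
validation_set = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]

# Stand-ins for trained models; in practice each would come from an experiment.
candidates = {
    "always_positive": lambda x: 1,
    "threshold_3": lambda x: int(x > 3),
    "threshold_5": lambda x: int(x > 5),
}

scores, kept = select_models(candidates, validation_set, target=0.9)
for name, score in scores.items():
    status = "keep" if name in kept else "drop"
    print(f"{name}: {score:.1%} ({status})")
```

Only one of the three candidates clears the bar here, and the results of the failed experiments inform which candidates to try next.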
The four dimensions we have addressed so far clearly separate the mindsets of those who build software and those who build models. Software developers typically have an engineering mindset: they work on architectural blueprints and on the connections between different components, and they are typically responsible for production software. Software engineers typically have a background or education in computer science, information technology, or computer engineering. They develop software products that create the data.
Model developers, on the other hand, have more of a scientific mindset — they work on experiments, are better at dealing with ambiguity, and are typically interested in innovation as opposed to production models. A data scientist is someone who has augmented their mathematics and statistics background with programming to analyze data and develop mathematical models. They use data to draw insights and effect outcomes.
Written with Joseph Voyles.
Original. Reposted with permission.