By Arthur Mello, Data Science Consultant at AVISIA.
Isn’t ML engineering just another buzzword?
ML engineering and data science are not the same thing, and here’s why: you know when people say that data science is a mix of business knowledge, statistics, and computer science? Well, ML engineering is a lot more about computer science and less about statistics and business knowledge.
In practice, that means that a data scientist is more talented at creating new models, analysing and interpreting data, and understanding the mathematical basis for those models. He or she will usually come from a statistics background, might have a PhD, and will be really good at math, while programming was something he or she learned in order to do math using a computer.
An ML engineer, on the other hand, will shine at building and optimising data flows, implementing models, and putting them in production. He or she will come from a computer science background, might have less formal education than data scientists, and will be good at programming and at understanding cloud infrastructures. ML engineers, however, are not the same thing as data engineers, since they also need to be very good at tuning models (especially neural networks), cross-validation, and feature engineering. Overall, they should be better than data scientists at data infrastructure and better than data engineers at machine learning.
If you are still not convinced ML engineering is a real thing, take a look at the Google Cloud Platform’s Professional Machine Learning Engineer Certification curriculum. It doesn’t have much to do with traditional data science curriculums, and it touches only the very basics of statistics, but it also doesn’t go deep into the rules for choosing databases, for instance.
Why is ML engineering the Future?
OK, and why one over the other? Well, it’s not one over the other. They both have their places in the data ecosystem. But one is starting to become saturated, while the other hasn’t even begun to get known. Not many people actually talk about ML engineering, at least compared to the number of people talking about data science, and yet I believe the demand for ML engineers might surpass the demand for data scientists.
We can see the number of data scientists soaring all over the world, within companies of all sizes, while most of these people aren’t actually doing data science at all, just analytics. And many of those who are actually doing data science probably didn’t need to.
That means many organisations are hiring people to solve basically the same types of problems, over and over again, in parallel. There is just a lot of redundancy, and the quality of the people doing it varies substantially.
At the same time, we see companies like Google and Amazon, which have some of the best data scientists in the world, working on “ready-to-use” ML systems on their cloud platforms (GCP and AWS, respectively). This means you can plug your data into their systems to benefit from all that knowledge, and all you need is someone who knows how to make that connection and do the necessary tuning: someone like an ML engineer.
To put it another way, if data science is not the core of a business, it is very unlikely that your data problem is completely novel, so it is more efficient to profit from the accumulated knowledge you can access from those cloud providers. More often than not, companies are working on the same kind of problems everywhere: sales forecasting, customer segmentation, scoring, recommendation, etc. These problems have been solved long ago by these tech titans, and you should probably profit from transfer learning instead of trying to solve them from scratch.
That is exactly where ML engineering comes in: you don’t need a PhD in statistics or mathematics to benefit from these platforms and adapt them to your needs. A very basic understanding of how algorithms work, how hyperparameters affect them, and how your data should be processed will do. You do need, however, to be able to choose which tools are right for your problem and what the most efficient settings are in terms of time and money.
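To make “how hyperparameters affect them” concrete, here is a minimal sketch in plain Python. It is only an illustration, not how any cloud platform tunes models: the toy data, the k-nearest-neighbours regressor, and the function names are all my own assumptions. It scores a few values of one hyperparameter (the number of neighbours) with 5-fold cross-validation and keeps the one with the lowest error.

```python
import random
from statistics import mean


def knn_predict(train, x, k):
    """Predict y for x as the mean y of the k nearest training points (1-D)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)


def cross_val_mse(data, k, n_folds=5):
    """Mean squared error of k-NN, averaged over n_folds held-out folds."""
    fold_errors = []
    for fold in range(n_folds):
        # Every n_folds-th point is held out; the rest is used for training.
        test = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
        fold_errors.append(mean(errors))
    return mean(fold_errors)


def make_data(n=100, seed=0):
    """Hypothetical toy data: y = 2x plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]


data = make_data()
# The hyperparameter here is k, the number of neighbours to average over.
scores = {k: cross_val_mse(data, k) for k in (1, 5, 20, 80)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

With 80 training points per fold, k=80 averages over the whole training set and ignores x entirely, so it scores far worse than a small k. Spotting that kind of behaviour is the level of understanding I am talking about.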
Is it the end of data scientists?
Of course not. Data scientists will still be needed by many companies to solve new or more complex problems. But once the hype is over, there will be fewer “data scientists” doing the work of data analysts or reinventing the wheel for problems that can be solved easily with pre-made solutions.
What should YOU do about it?
If you got to this point and are convinced that ML engineering skills will be crucial over the next few years, you are probably wondering how you can learn some of that.
I’ll list here a set of skills that will help you quickly adapt to this new reality.
You have probably used the command line for simple stuff, such as installing Python packages, but you should get a little better at it. This will come in handy when you use platforms like GCP and AWS. Codecademy has some good courses on the subject.
Although we tend to learn data science using Pandas, Spark will come in handy when you have too much data and need to run your algorithms in parallel. I think the most used language for Spark is Scala, but if you are more familiar with Python, learn PySpark instead.
The three main cloud providers are AWS, GCP, and Azure, but I advise you to choose one and stick to it. I can’t say which one is better, although my personal choice was GCP. Before choosing, try finding comparisons of them online; there are many. You can also look at the most used one in your country or the one your company uses. For now, no matter which one you choose, you will find companies that use it and that are looking for people with these specific skills.
Anyway, once you have chosen your platform, get really good at it. I mean, REALLY good. There is a lot you can do with those platforms, so focus on what you need for ML. Looking at the syllabus for their ML certifications will give you a very good overview of what to learn. You can access those here: GCP, AWS, Azure (DS or AI). If you want a study guide for the GCP one, I wrote one.
If you already work at a company that uses one of those on a regular basis, try getting the certification and then get to work on projects that involve it, so you get to practice. If your company doesn’t use one of those, the certification might help you get a position at another company that does. Either way: study, get the certificate, work on it. I can’t stress enough the importance of mastering one of these tools.
Feature engineering is at least as important in ML as it is in data science. I wrote an article going through some of the techniques you can use, so make sure you read it and then try to implement some of those in your work. To dig a bit deeper, there’s also this great book on the subject, which I highly recommend.
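As a small taste of what feature engineering looks like in code, here is a sketch of three common transformations: splitting a date into parts, log-scaling a long-tailed amount, and one-hot encoding a category. The record layout, the field names, and the `engineer_features` helper are hypothetical, made up for this example.

```python
import math
from datetime import date


def engineer_features(row, categories):
    """Turn one raw record into a flat dict of numeric features."""
    day = date.fromisoformat(row["order_date"])
    features = {
        "month": day.month,
        # weekday() is 0 for Monday, so 5 and 6 are Saturday and Sunday.
        "is_weekend": int(day.weekday() >= 5),
        # log1p compresses long-tailed amounts and is defined at zero.
        "log_amount": math.log1p(row["amount"]),
    }
    # One-hot encode the channel against a fixed, known list of categories.
    for cat in categories:
        features[f"channel_{cat}"] = int(row["channel"] == cat)
    return features


raw = {"order_date": "2021-05-15", "amount": 120.0, "channel": "web"}
features = engineer_features(raw, categories=["web", "store", "phone"])
print(features)
```

Passing the category list explicitly, instead of inferring it from each batch, keeps the feature columns identical between training and production.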
ML engineering has a lot to do with preparing data, detecting model drift, making data pipelines run smoothly and fast, etc. In order to put together everything else we have seen in this article, you have to study ML engineering itself. Coursera has a course on the subject. Although I still haven’t taken it, it was created by Andrew Ng and, given his track record, I would be surprised if this course were not great.
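To give an idea of what detecting drift can look like, here is a minimal sketch of one widely used check, the Population Stability Index (PSI). It is my choice for illustration, not something taken from the course above, and the sample data is made up. It compares how a feature is distributed in the training data with how it is distributed in recent production data.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a new sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range (fall back to 1.0 if constant).
    width = (hi - lo) / n_bins or 1.0

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [float(x) for x in range(100)]      # hypothetical training feature
production = [x + 50.0 for x in training]      # same feature, shifted upwards
print(psi(training, training), psi(training, production))
```

A PSI of zero means the two distributions fall into the bins identically. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the right threshold depends on your data.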
Original. Reposted with permission.