By Ben Rogojan, Data Science and Data Engineering Solutions Architect
Image Created By Author — PDF Source Here
Whenever I look to learn a new topic, I create some form of learning plan. There is so much content out there that it can be difficult to approach learning in the modern era.
It’s almost comical. We have so much access to knowledge that many of us struggle to learn because we don’t know where to go.
This is why I put together roadmaps and learning plans.
Below is my MLOps learning plan that I will be taking on for the next few months.
I will start with a quick refresher in ML and an advanced Kubernetes course.
From there I will be focused on Kubeflow, Azure ML, and DataRobot.
Background Of MLOps
In 2014 a group of Google researchers put out a paper titled Machine Learning: The High-Interest Credit Card of Technical Debt. This paper pointed out a growing problem that many companies might have been ignoring.
Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. [D. Sculley, Gary Holt, et al.]
Another way the researchers put this in a follow-up presentation was that launching a rocket was easy, but ongoing operations afterwards were hard. Back then, the concept of DevOps was still coming into its own, but these engineers and researchers realized that deploying a machine learning model involved many more complications than deploying code.
This is when the popularity of machine learning platforms began to rise. Eventually, many of these platforms adopted the term MLOps to explain the service they were providing.
That raises the question: what is MLOps?
What Is MLOps?
Machine Learning Operations, or MLOps, helps simplify the management, logistics, and deployment of machine learning models between operations teams and machine learning researchers.
From a naive perspective it is just DevOps applied to the field of machine learning.
But MLOps actually needs to manage a lot more than DevOps usually does.
Like DevOps, MLOps manages automated deployment, configuration, monitoring, resource management, testing, and debugging.
A possible machine learning pipeline could look like the image below.
Image Created By Author
Unlike DevOps, MLOps might also need to consider data verification, model analysis and re-verification, metadata management, feature engineering, and the ML code itself.
But at the end of the day, the goal is very similar: to create continuous development pipelines for machine learning models.
These pipelines let data scientists and machine learning engineers quickly deploy, test, and monitor their models to ensure that the models continue to behave as expected.
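To make the extra MLOps stages a little more concrete, here is a minimal sketch in plain Python of a pipeline that verifies data, trains a model, evaluates it, and only marks it as deployable if it passes a quality gate. Everything here is an assumption for illustration: the function names, the toy one-parameter model, and the `max_error` threshold are hypothetical and do not come from any particular MLOps tool.

```python
from statistics import mean


def validate_data(rows):
    """Data verification: keep only (x, y) rows whose values are all numeric."""
    clean = [r for r in rows if all(isinstance(v, (int, float)) for v in r)]
    if not clean:
        raise ValueError("no valid rows left after validation")
    return clean


def train(rows):
    """Fit a toy model y = slope * x using least squares through the origin."""
    numerator = sum(x * y for x, y in rows)
    denominator = sum(x * x for x, _ in rows)
    if denominator == 0:
        raise ValueError("cannot fit a slope when every x is zero")
    return {"slope": numerator / denominator}


def evaluate(model, rows):
    """Model analysis: mean absolute error on held-out rows."""
    return mean(abs(model["slope"] * x - y) for x, y in rows)


def run_pipeline(train_rows, test_rows, max_error=0.5):
    """Run every stage in order; max_error is a hypothetical quality gate."""
    model = train(validate_data(train_rows))
    error = evaluate(model, validate_data(test_rows))
    return {"model": model, "error": error, "deployed": error <= max_error}


if __name__ == "__main__":
    # The row containing None is dropped by the validation stage.
    result = run_pipeline(
        train_rows=[(1, 2), (2, 4), (3, 6), (None, 1)],
        test_rows=[(4, 8), (5, 10)],
    )
    print(result)
```

A real pipeline would replace each function with something far heavier, but the shape stays the same: each stage can fail on its own, and deployment is a decision made from measured results rather than a manual step.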
In order to better understand the various tools that exist in the MLOps market, let’s put together a quick study plan of what we should review.
1. Data Science Math Review
We can’t get around it.
Math is a major part of machine learning and data science. It has been a long time since I studied it myself, and the same may be true for some of my readers, so here are some great resources for refreshing your math skills for data science and machine learning.
One course that stuck out was Johns Hopkins’ statistics for data science. In particular, I am interested in the statistical inference material, mostly because those tended to be the more interesting courses.
Many of you might prefer reading about math to taking courses. That’s where Practical Statistics For Data Scientists comes in. Some day I should probably get an O’Reilly subscription, but for now I am going to purchase the practical statistics book.
O’Reilly books tend to be written by working professionals, so I am looking forward to actual use cases for applied data science mathematics.
2. Review Machine Learning And Deploying Models Manually
Before diving into MLOps, let’s review machine learning and how to deploy machine learning models with Docker.
This will help us compare some of the future courses in the next steps because we can see multiple ways to deploy machine learning models.
In particular, I found a course called “Deploy Machine Learning And NLP Models With Docker”. Since this is one way to deploy machine learning models, it will serve as a great baseline.
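As a rough idea of what "deploying a model manually" involves, below is a minimal sketch of the kind of prediction service a Docker image might run. It uses only the Python standard library. The weights, bias, port, and JSON shape (`{"features": [...]}`) are all hypothetical placeholders, not something taken from the course; a real service would load a trained model from disk instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "trained" parameters standing in for a real model artifact.
WEIGHTS = [0.5, -0.25]
BIAS = 1.0


def predict(features):
    """Score one example with a fixed linear model."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


class PredictHandler(BaseHTTPRequestHandler):
    """Answers POST requests whose JSON body looks like {"features": [..]}."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body, status = {"prediction": predict(payload["features"])}, 200
        except (ValueError, KeyError, TypeError) as exc:
            # Malformed JSON, a missing key, or the wrong feature count.
            body, status = {"error": str(exc)}, 400
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Block forever serving predictions; this is what a container would run."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server. The manual part is everything around this file: building an image, pushing it, starting containers, and restarting them when they fail, which is the work the later tools in this plan try to automate.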
Kubernetes is an open-source container orchestration platform. Using Kubernetes allows developers to automate manual processes involved in deploying, managing, and scaling out applications.
In other words, you can cluster together groups of hosts running Linux containers, and Kubernetes helps you easily and efficiently manage those clusters.
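To show what "automating deployment and scaling" looks like in practice, here is a small sketch that builds a Kubernetes Deployment manifest as a Python dictionary and prints it as JSON. The field layout follows the standard `apps/v1` Deployment schema, but the application name, image reference, replica count, and port are made-up values for illustration.

```python
import json


def deployment_manifest(name, image, replicas=3, port=8080):
    """Build an apps/v1 Deployment that runs `replicas` copies of `image`."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Kubernetes keeps this many pods running and replaces failed ones.
            "replicas": replicas,
            # The selector must match the labels on the pod template below.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


# Hypothetical image name; replace it with a real registry path.
manifest = deployment_manifest("model-api", "example.registry/model-api:0.1")
print(json.dumps(manifest, indent=2))
```

Saved to a file, the JSON output can be handed to `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML. Scaling then becomes a matter of changing `replicas` and applying the manifest again.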
So to start diving into Kubernetes I am going to dig into freeCodeCamp’s three-hour course. One, it will provide a great piece of content. Two, freeCodeCamp generally does a great job of condensing everything you need to know into a quick few-hour course.
Once you have finished freeCodeCamp’s three-hour Kubernetes course, you can move on to an advanced Kubernetes course. That course covers logging with Elasticsearch, Kibana, Fluentd, and LogTrail; packaging with Helm; deploying on Kubernetes with Spinnaker; and more.
Kubeflow is a machine learning platform that manages deployments of ML workflows on Kubernetes. The best part of Kubeflow is that it offers a scalable and portable solution.
This platform works best for data scientists who wish to build and experiment with their data pipelines. Kubeflow is also great for deploying machine learning systems to different environments in order to carry out testing, development, and production-level service.
Kubeflow was started by Google as an open source platform for running TensorFlow. It began as a way to run TensorFlow jobs via Kubernetes but has since expanded into a multi-cloud, multi-architecture framework that runs entire ML pipelines. With Kubeflow, data scientists don’t need to learn new platforms or concepts to deploy their applications, or deal with networking, certificates, and so on. They can deploy their applications as simply as they would on TensorBoard.
Learning More About Kubeflow
To start learning about Kubeflow, let’s begin with Google’s course on deploying machine learning models. This course only has one section devoted to Kubeflow, but hopefully it will be sufficient.
If Google’s course is not enough, then Amina A’s article on how to get started with Kubeflow should help round it out. The article goes over several other resources worth digging into, so I look forward to parsing through all of those pieces of content as well.
Azure Machine Learning (Azure ML) is a cloud-based service for creating and managing machine learning solutions. It combines SDKs with a web portal that offers low-code and no-code options for model training, deployment, and asset management.
Now, enough marketing. I am curious about the low-code approach that parts of Azure ML offer.
One of the first tools I learned was SSIS, which has a similar feel. It’s a drag-and-drop automation solution, though its focus is obviously more on ETLs and data pipelines. So where should I begin?
To start with I always try to find courses that are supported or developed by the actual company.
Microsoft has a course that is meant to help prepare for the certification called AI-900: Microsoft Azure AI Fundamentals.
The purpose of this course is to teach the core concepts and skills that are assessed in the AI fundamentals exam domains. They seem to cover the automated machine learning that Azure offers as well as regression, classification and clustering.
In the end, it will be interesting to see how they deploy these models.
Finally, I usually like picking a few articles to see what other people and creators are doing. In this case I am going to see how Toptal used Azure ML to predict gas prices.
DataRobot is an AI automation tool that allows data scientists to automate the end-to-end process of building, deploying, and maintaining AI at scale. This framework is powered by open source algorithms that are available not only on the cloud but also on-premise. DataRobot allows users to empower their AI applications easily and quickly in just ten steps. This platform includes enablement models that focus on delivering value.
DataRobot works not only for data scientists but also for non-technical people who wish to maintain AI without having to learn the traditional methods of data science. So, instead of having to spend loads of time developing or testing machine learning models, data scientists can now automate the process with DataRobot.
The best part of this platform is its ubiquitous nature. You can access DataRobot anywhere via any device in multiple ways according to your business needs.
MLOps Starter Quest
This course focuses on providing a foundation for using MLOps. By this point in the plan, though, I expect to already have that foundation, so I really do hope that DataRobot can provide a lot more than just some basics about MLOps.
In particular, I do hope it covers how to actually use DataRobot.
This does seem to be covered more in the second course which covers how to deploy DataRobot AutoML models with MLOps. You will learn about the different deployment options, along with how the different components in MLOps can be used to deploy, monitor, manage, and govern models.
I will let you know!
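Of the components mentioned above, monitoring is the easiest to picture in code. The sketch below is a generic illustration and not DataRobot's API: it flags a deployed model for review when the average of a feature in recent traffic moves too far from what the model saw in training. The function names and the threshold of two standard deviations are assumptions chosen for the example.

```python
from statistics import mean, stdev


def drift_score(baseline, recent):
    """How many baseline standard deviations the recent mean has shifted."""
    spread = stdev(baseline)
    shift = abs(mean(recent) - mean(baseline))
    if spread == 0:
        # A constant baseline: any shift at all counts as unbounded drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def needs_review(baseline, recent, threshold=2.0):
    """Flag the model when drift exceeds a hypothetical threshold."""
    return drift_score(baseline, recent) > threshold


if __name__ == "__main__":
    training_values = [10, 12, 11, 9, 13]
    print(needs_review(training_values, [11, 10, 12]))  # traffic like training data
    print(needs_review(training_values, [20, 21, 19]))  # traffic that has shifted
```

Production monitoring tools track many features and use more robust statistics, but the underlying question is the same one this check asks: does the data the model sees today still look like the data it was trained on?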
DataRobot’s White Paper
DataRobot has also put together a great looking PDF that seems to cover a decent amount in terms of how to best utilize DataRobot itself.
In addition, I can then condense this down to an interesting one-pager.
7. Next Steps — Machine Learning And MLOps
I foresee MLOps really picking up in the next two to three years.
Yes, if you’re in the data space, you’re probably already aware of the concept. However, I feel like there is a decent amount of time before adoption in large companies starts ramping up.
There are a few companies I have run into that are looking into these tools. Overall, though, most are still trying to manage their data pipelines, which is why you might prefer learning more about data engineering vs MLOps.
Good luck on your journey!
I have spent my career focused on all forms of data. I have focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. I have also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. I privately consult on data science and engineering problems both solo as well as with a company called Acheron Analytics. I have experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.
Connect with Me on Social Network
✅ Website: https://www.theseattledataguy.com/
✅ LinkedIn: https://www.linkedin.com/company/18129251
✅ Personal Linkedin: https://www.linkedin.com/in/benjaminrogojan/
✅ FaceBook: https://www.facebook.com/SeattleDataGuy
Bio: Ben Rogojan is a data science and data engineering solutions architect with expertise in data architecture and statistics. He focuses on developing end-to-end data solutions that help take data from raw format into data products and analytics. He has delivered on projects for clients across healthcare, finance, SaaS, and technology and worked in Big Tech. He maintains a strong online presence creating content for data engineers and those curious in entering data engineering career paths and is known as the “Seattle Data Guy” online.
Original. Reposted with permission.