The world and data are not static. But most machine learning models are. Once they are in production, they become less relevant with time. The data distributions evolve, the behavioral patterns change, and models need updates to keep up with new reality.
The usual process is to retrain the models at defined intervals. It looks quite straightforward: take the new data, the old training pipeline, and fit the model again.
But how often should we do it? Should it be done daily, weekly, or monthly? Or every time you get a new batch of data?
Too often, the answer is based on a gut feeling or convenience. Someone picks a reasonable interval and schedules a regular retraining job.
Instead, we can approach it in a more data-driven way. To be more precise when planning the model maintenance, we can run a few checks in advance.
Depending on how critical the model is, you might go through all of them or only some.
Check #1. Is the model already good enough?
Before we make our retraining plans, it helps to check whether we are done with training in the first place. Maybe the model hasn’t reached its peak performance yet?
The simple way to do that is to look at good old learning curves.
How to proceed? We can fix the test set that we use to evaluate the model performance. Then, we run a set of experiments, training the model on different parts of the training data.
We can use the random split method and iterate by changing the train size. This way, we focus on how the data volume impacts the performance.
There are two things we can learn as a result: 1) how much data we need to reach the peak performance, and 2) whether the model reaches this plateau with the available training data.
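A minimal sketch of this learning-curve check could look as follows. Everything in it is an assumption made for illustration: the data is synthetic, a plain least-squares model stands in for whatever model is actually used, and the helper names (`fit_linear`, `mae`, `learning_curve`) are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept (a stand-in for any model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def mae(coef, X, y):
    """Mean absolute error of the fitted linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean(np.abs(A @ coef - y)))


def learning_curve(X_train, y_train, X_test, y_test, train_sizes, seed=0):
    """Train on random subsets of growing size, score on one fixed test set."""
    order = np.random.default_rng(seed).permutation(len(X_train))
    curve = []
    for n in train_sizes:
        subset = order[:n]  # random sample of n training rows
        coef = fit_linear(X_train[subset], y_train[subset])
        curve.append((n, mae(coef, X_test, y_test)))
    return curve


# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(2200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=2200)
X_train, y_train = X[:2000], y[:2000]
X_test, y_test = X[2000:], y[2000:]  # the test set stays fixed

curve = learning_curve(X_train, y_train, X_test, y_test,
                       train_sizes=[10, 30, 100, 300, 1000, 2000])
for n, err in curve:
    print(f"train size {n:>5}: test MAE {err:.3f}")
```

If the error stops improving well before the largest train size, the model has likely reached its plateau. If it is still falling at the last point, more data would probably help.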
Sometimes we’d learn that the model doesn’t need all the data we have. For example, we have multiple years of sales data, but using just one year of training data gets the same quality.
We might drop the extra data to make the model more lightweight.
In other cases, we could see that the model quality keeps going up and up. The model is hungry for more data! There are more steps to be made to bring the model to its top shape.
Rather than think about model retraining to maintain its quality, we should plan for continuous improvement. As soon as we get enough new data, we can use it to reach better performance.
This first test also gives a sense of scale and “density” of signal in the data. Do we need 10, 100, or 1000 observations to see a meaningful impact on the model performance? Would it take us a day or a month to collect that amount of new data?
That’s a helpful thing to know!
Check #2. How quickly do things change?
When we create our machine learning models, we assume there is some stability in the real-world process. Otherwise, it would make little sense to learn from the past!
We also know that things change. When working with a dynamic process, we can also assume that there is a certain speed at which these new patterns accumulate.
We can then try to calculate this!
Let’s take our model and see how long it “lasts” if we simulate its application in the past.
We can train a model using some older part of the data and then “apply” it to the later periods. This works just like evaluation on a hold-out set, except that here we take several consecutive hold-out periods.
We can start with a single-point estimate and see how fast the performance degrades.
If we have enough historical data, we can repeat this check several times and then average the results. Just keep an eye on potential outliers and rare events!
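The simulation described above can be sketched roughly like this. It is an illustration under stated assumptions, not a reference implementation: the data is synthetic with an artificial drift, the `period` column is assumed to hold integer period indices starting at 0, and the helper names are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept (a stand-in for any model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def mae(coef, X, y):
    """Mean absolute error of the fitted linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean(np.abs(A @ coef - y)))


def aging_curve(X, y, period, train_periods, eval_periods):
    """Train once on old periods, then score on each later period."""
    train = np.isin(period, train_periods)
    coef = fit_linear(X[train], y[train])
    return [mae(coef, X[period == p], y[period == p]) for p in eval_periods]


def average_aging(X, y, period, train_len, horizon):
    """Repeat the check from several start points and average by model age."""
    last = int(period.max())
    curves = []
    for start in range(last - train_len - horizon + 2):
        train_periods = np.arange(start, start + train_len)
        eval_periods = np.arange(start + train_len, start + train_len + horizon)
        curves.append(aging_curve(X, y, period, train_periods, eval_periods))
    return np.mean(curves, axis=0)


# Synthetic data: 12 periods, with a relationship that drifts over time.
rng = np.random.default_rng(0)
n_periods, rows = 12, 300
period = np.repeat(np.arange(n_periods), rows)
X = rng.normal(size=(n_periods * rows, 1))
y = (2.0 + 0.3 * period) * X[:, 0] + rng.normal(scale=0.5, size=len(period))

# Single-point estimate: train on periods 0-2, apply to periods 3-11.
single = aging_curve(X, y, period, train_periods=[0, 1, 2],
                     eval_periods=range(3, 12))
# Averaged estimate: error at model age 1, 2, 3, and 4 periods.
averaged = average_aging(X, y, period, train_len=3, horizon=4)
print([round(e, 2) for e in single])
print(np.round(averaged, 2))
```

Averaging hides individual runs, so it is worth looking at the separate curves as well when outliers or rare events are a concern.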
Sometimes we’d learn that an “old” model performs as well as a new one. Some prefer to retrain the models often to keep them “fresh,” but it is not always justified.
Don’t fix what’s not broken!
If frequent retraining is not needed, you might get away with a lighter service architecture. You can also decrease the risks of technical errors that come with any change. The same goes for the organizational overhead, especially when new models require an approval process.
In other cases, you can learn that the model ages very fast! That is good to know in advance to set up proper monitoring and prepare the infrastructure.
You might decide to reconsider your training approach to make the model more stable. For example, change feature engineering or model architecture to make it a bit less performant on the test set but more stable in the long run. In other cases, you might train the model using a shorter training period but perform frequent calibration or consider active learning.
We might also face constraints in our ability to retrain the model.
That brings us to the next check.
Check #3. When do we get the new data?
Here, we look at the business process rather than the data.
Sometimes, we get real-world feedback almost immediately. For example, you recommend an article to read, and you quickly know if the user clicked on a link.
In other cases, the new data that you can use to retrain your model comes with a delay.
If you have a long prediction horizon, you have to wait to know if your prediction was correct. With other tasks, you need a separate labeling process. Sometimes, the limitations come from how the data is moved or generated. For example, manual data entry is done once per month.
We can find ourselves in one of the two situations.
In some cases, the model degrades before the new data arrives. The data delay then becomes the limiting factor.
If we do not get the data in time to retrain the model, we might need to reconsider the approach again. For example, create an ensemble of models with different retraining horizons or combine machine learning with rules or human-in-the-loop. As a last resort, we can also adjust our performance expectations and prepare to deal with a lowered model quality.
In other cases, the new data starts accumulating before the model decay. In this case, we have the luxury of choice.
We can, of course, simply initiate the retraining at any point after the data comes. If we want to be more precise, there is a way to do that.
Check #4. How much data do we need to see the improvement?
Say the new data arrives daily, but the model degrades only after 30 days. What would be the optimal action? Should we retrain daily, weekly, or once per month?
We can make a more precise judgment by checking if the new bucket of data brings the improvement we want.
The thing is, sometimes adding a small set of new data points does not change anything. There is some minimal required data size that has a visible impact on the performance.
We can evaluate this.
To do that, we choose a test set from a period of the known decay. We know when the performance goes down: we can then check if retraining on the new data improves it.
We can add data in small increments as it comes. Then, we see how it affects the test performance.
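One way to sketch this incremental check is shown below. The setup is assumed for illustration: synthetic data with an "old" and a "new" pattern, a least-squares model as a stand-in, new rows tagged with an arrival day, and hypothetical helpers `incremental_gain` and `first_useful_increment`. The test set comes from the period of known decay and is kept separate from the increments.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept (a stand-in for any model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def mae(coef, X, y):
    """Mean absolute error of the fitted linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean(np.abs(A @ coef - y)))


def incremental_gain(X_old, y_old, X_new, y_new, day_new,
                     X_test, y_test, day_counts):
    """Retrain on old data plus the first `days` days of new data."""
    results = []
    for days in day_counts:
        arrived = day_new < days  # new rows available after `days` days
        X_fit = np.vstack([X_old, X_new[arrived]])
        y_fit = np.concatenate([y_old, y_new[arrived]])
        results.append((days, mae(fit_linear(X_fit, y_fit), X_test, y_test)))
    return results


def first_useful_increment(results, min_relative_gain):
    """Smallest tested increment whose gain over results[0] is large enough."""
    baseline = results[0][1]  # results[0] is assumed to be the 0-day baseline
    for days, err in results[1:]:
        if (baseline - err) / baseline >= min_relative_gain:
            return days
    return None


# Synthetic data: the relationship changed between the old and new periods.
rng = np.random.default_rng(1)
X_old = rng.normal(size=(1000, 1))
y_old = 2.0 * X_old[:, 0] + rng.normal(scale=0.5, size=1000)
day_new = np.repeat(np.arange(30), 20)  # 30 days, 20 new rows per day
X_new = rng.normal(size=(600, 1))
y_new = 4.0 * X_new[:, 0] + rng.normal(scale=0.5, size=600)
X_test = rng.normal(size=(500, 1))  # held out from the decay period
y_test = 4.0 * X_test[:, 0] + rng.normal(scale=0.5, size=500)

results = incremental_gain(X_old, y_old, X_new, y_new, day_new,
                           X_test, y_test, day_counts=[0, 1, 3, 7, 14, 30])
for days, err in results:
    print(f"{days:>2} days of new data: test MAE {err:.3f}")
print("first useful increment:", first_useful_increment(results, 0.1))
```

The 10% threshold in the last line is arbitrary. In practice, it would come from what counts as a meaningful improvement for the use case.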
What often happens is that we have to wait a bit to collect a “useful” amount of data. For example, we might need at least a week of it to capture the relevant seasonal patterns.
As a result, our actual choice of retraining window might be narrower than it seemed! On one side, it is limited by the speed of data provision. On the other side, it is limited by the need to collect enough data for retraining to have an effect.
Within this time frame, you can pick the period based on what’s practical and makes more sense for the use case.
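To make the arithmetic concrete, here is a small sketch that combines the results of the previous checks into a feasible retraining window. The class, the function, and the way the bounds are combined are assumptions made for illustration: the earliest point is taken as the data delay plus the time needed to collect a useful amount of data, and the latest point as the moment the model is expected to degrade.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RetrainingInputs:
    days_until_decay: int  # from Check #2: how long the model "lasts"
    data_delay_days: int   # from Check #3: lag before new data is usable
    min_useful_days: int   # from Check #4: days of data needed to see a gain


def retraining_window(inputs: RetrainingInputs) -> Optional[Tuple[int, int]]:
    """Return (earliest, latest) retraining day, or None if there is no window."""
    earliest = inputs.data_delay_days + inputs.min_useful_days
    latest = inputs.days_until_decay
    if earliest > latest:
        return None  # the model degrades before retraining can help
    return earliest, latest


# Data arrives daily, a week of it is needed, and the model lasts 30 days.
daily = RetrainingInputs(days_until_decay=30, data_delay_days=1,
                         min_useful_days=7)
print(retraining_window(daily))

# Data is entered once per month, but the model degrades after two weeks.
monthly = RetrainingInputs(days_until_decay=14, data_delay_days=30,
                           min_useful_days=7)
print(retraining_window(monthly))
```

A `None` result corresponds to the situation from Check #3 where the model degrades before usable data arrives, so the approach itself needs to be reconsidered.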
Check #5. Should we keep the older data?
There is one more question that often comes up. How should we retrain the model? Should we add some new data and drop some old? Should we continuously increase the training set?
This is a bonus check we can run.
We can imitate model retraining at the chosen interval and then check how things change if we start dropping the old data.
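A rough sketch of this comparison: at every simulated retraining step, one model is fit on the full history and another only on the most recent periods, and both are scored on the next period. The data is synthetic with an artificial drift, the model is a least-squares stand-in, and `compare_windows` is a hypothetical helper.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept (a stand-in for any model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def mae(coef, X, y):
    """Mean absolute error of the fitted linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean(np.abs(A @ coef - y)))


def compare_windows(X, y, period, first_eval, window):
    """Simulate retraining before each period with full vs. recent history."""
    rows = []
    for p in range(first_eval, int(period.max()) + 1):
        test = period == p
        full = period < p                               # keep all old data
        recent = (period < p) & (period >= p - window)  # drop older data
        err_full = mae(fit_linear(X[full], y[full]), X[test], y[test])
        err_recent = mae(fit_linear(X[recent], y[recent]), X[test], y[test])
        rows.append((p, err_full, err_recent))
    return rows


# Synthetic data: 12 periods, with a relationship that drifts over time.
rng = np.random.default_rng(0)
n_periods, n_rows = 12, 300
period = np.repeat(np.arange(n_periods), n_rows)
X = rng.normal(size=(n_periods * n_rows, 1))
y = (2.0 + 0.3 * period) * X[:, 0] + rng.normal(scale=0.5, size=len(period))

rows = compare_windows(X, y, period, first_eval=4, window=3)
for p, err_full, err_recent in rows:
    print(f"period {p:>2}: full history {err_full:.3f}, "
          f"last 3 periods {err_recent:.3f}")
```

On this drifting toy data the shorter window should come out ahead. On a stable process the two columns would be expected to look about the same.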
We can often see that leaving out the old data makes no difference. In that case, dropping it is probably the sensible choice, since it keeps the training more lightweight.
What’s more, sometimes it makes the model better! Keeping the old irrelevant data might cause the performance to go down if you have a dynamic use case. That is good to know in advance.
Of course, we might also see that it makes sense to keep all the data we have for the time being. We can repeat the check later on.
With these checks, you can come prepared for the model maintenance. You can set up your retraining pipelines to get ready for the updates at a chosen interval.
Still, we cannot simply rely on our schedule. We also need a reality check. This means monitoring our models once they go live.
To get the visibility, we can build a monitoring dashboard or schedule regular checks to calculate the actual model performance (if we have the ground truth) or otherwise monitor the input data for statistical distribution drifts and outliers.
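As one example of an input drift check, the sketch below computes the Population Stability Index (PSI) for a single numeric feature. PSI is just one of several possible drift statistics, chosen here as an assumption for illustration. The function name, the bin count, and the synthetic data are hypothetical too.

```python
import numpy as np


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bins are the quantiles of the reference sample; eps avoids log(0)
    when a bin is empty.
    """
    inner_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference),
                             minlength=bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current),
                             minlength=bins)
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Synthetic feature values, for illustration only.
rng = np.random.default_rng(7)
reference = rng.normal(size=5000)          # e.g. the training data
same = rng.normal(size=5000)               # new data, same distribution
shifted = rng.normal(loc=1.0, size=5000)   # new data, shifted mean

psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
print(f"PSI, same distribution:    {psi_same:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

A common rule of thumb treats values below 0.1 as stable and values above 0.25 as a significant shift, but these thresholds are conventions rather than guarantees and are best tuned for each feature.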
Even stable models can face data and concept drift or certain rare events. In this case, we might then need to intervene earlier than planned.
On the other hand, if things look steady, we can skip an update for longer than planned.
Planning for regular retraining and complementing this with continuous monitoring is usually an optimal strategy to make the model live up to its promised performance!
You can find an extended version of this article here.
Emeli Dral is a Co-founder and CTO at Evidently AI where she creates tools to analyze and monitor ML models. Earlier she co-founded an industrial AI startup and served as the Chief Data Scientist at Yandex Data Factory. She is a co-author of the Machine Learning and Data Analysis curriculum at Coursera with over 100,000 students.
Elena Samuylova is a Co-founder and CEO at Evidently AI. Earlier she co-founded an industrial AI startup and led business development at Yandex Data Factory. Since 2014, she has worked with companies from manufacturing to retail to deliver ML-based solutions. In 2018, Elena was named 50 Women in Product Europe by Product Management Festival.