The right time to train Machine Learning models

Aug 24, 2021 | by Radicalbit

Datasets change over time, and models should adapt too
Building a well-performing Machine Learning model requires a significant amount of experimentation. A data scientist tries different algorithms and feature engineering strategies before getting the model right. Once the model is tuned to its best, it may be time to serve it in the production environment.

However, when data scientists optimise their models, experiments run against frozen datasets that do not evolve over time. The key assumption is that such a dataset is representative of the data distribution the model will encounter at prediction time. Unfortunately, this assumption is not always guaranteed once the model is moved to production. Indeed, its validity depends on the data generation process and on whether that process evolves over time. If the data comes from users interacting with a UI, for example, changes in the UI itself or in interaction patterns may drive the evolution of the data distribution.

Therefore, monitoring the performance of a model and, at the same time, identifying potential drifts in the data is of paramount importance for checking the health status of a model in production, and it is a pillar of MLOps best practices.

Drift Modes and Detection Strategies

We call drift the general evolution of the data distribution over time. It is possible to identify three root causes of drift.

  • Data drift: data drift happens when the distribution of the input data the model is fed with changes over time.
  • Concept drift: concept drift occurs when the definition of the target value the model is trying to predict changes over time.
  • Upstream changes: upstream changes are not conceptual data changes. The term refers mostly to technical issues that might occur in the upstream phases of data handling (data collection, data preprocessing). These lead to changes such as schema or type changes, or features that were expected to be collected going missing.

Setting upstream changes aside, the real challenge for an efficient Machine Learning system in production is to cope with data and concept drifts. Typically, the model at prediction time is “static”: its internal parameters, learned by means of the training process, do not change. Therefore, in case of data drift, model performance is likely to drop whenever the input data has changed so much that it appears as something previously unseen by the model. Moreover, the prediction concept a model has learnt during training is tied to the model’s parameters. When concept drift occurs, the model will keep providing predictions bound to the original concept, and therefore in disagreement with the new expectations. In real-world applications, nothing prevents data and concept drift from occurring at the same time, which makes them non-trivial to disentangle.

To tackle the drift problem, various algorithms have been devised. Each of them has its own advantages, but they generally identify variations over time in the data distribution or in some associated statistic (e.g. the mean). The idea is to look at subsequent data windows, defined either by a number of data points or by time intervals, so as to have an “old” and a “recent” data sample to compare.
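
As an illustration of this windowing idea only (not one of the platform’s algorithms), a minimal sketch might compare a summary statistic between the “old” and the “recent” window; the window size and threshold below are arbitrary choices for the example.

```python
import random
from collections import deque

def detect_mean_shift(stream, window_size=200, threshold=0.5):
    """Naive windowed drift check: compare the mean of an "old" reference
    window with the mean of the most "recent" window and flag large gaps.
    Window size and threshold are illustrative, not tuned values."""
    old = deque(maxlen=window_size)
    recent = deque(maxlen=window_size)
    for i, x in enumerate(stream):
        if len(old) < window_size:
            old.append(x)        # fill the reference ("old") window first
            continue
        recent.append(x)         # then keep a sliding "recent" window
        if len(recent) == window_size:
            old_mean = sum(old) / window_size
            recent_mean = sum(recent) / window_size
            if abs(recent_mean - old_mean) > threshold:
                yield i          # index at which a shift is flagged
                # restart monitoring from the current data
                old = deque(recent, maxlen=window_size)
                recent = deque(maxlen=window_size)

# Synthetic stream whose mean jumps from 0 to 2 halfway through.
stream = [random.gauss(0, 1) for _ in range(1000)] + [random.gauss(2, 1) for _ in range(1000)]
print(list(detect_mean_shift(stream)))   # flags one or two indices shortly after t=1000
```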

But how do such algorithms work? We can distinguish two different approaches.

1. Looking at the input data only. The idea is to directly look at the stream of input data and try to detect changes in the evolution of some of its statistics. This kind of model directly addresses data drift and comprises techniques such as the ADaptive WINdowing (ADWIN) algorithm (see the first sketch after this list).

2. Evaluating model predictions. The idea is to look for drifts in the stream of evaluations of the model’s predictions. In other words, whenever the model receives an input and makes a prediction, the observed ground truth for that prediction is provided in order to build the corresponding stream of model errors. This stream is then inspected for drifts, with the underlying expectation that a data or concept drift would likely cause a drop in model performance, and hence more errors. The Drift Detection Method (DDM) algorithm and its variants rely on this approach (see the second sketch after this list).
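
For the first approach, a minimal sketch could use the open-source river library’s implementation of ADWIN (this assumes a recent river release, whose attribute names have changed across versions; the synthetic stream is purely for illustration):

```python
import random
from river import drift  # pip install river

adwin = drift.ADWIN()

# Synthetic univariate input stream with a distribution shift halfway through.
stream = [random.gauss(0, 1) for _ in range(1000)] + [random.gauss(3, 1) for _ in range(1000)]

for i, x in enumerate(stream):
    adwin.update(x)              # feed the raw input feature; no labels are needed
    if adwin.drift_detected:     # ADWIN shrinks its window when a change is found
        print(f"Data drift detected at index {i}")
```

For the second approach, the sketch below implements the core DDM rule as described in the literature (raise a warning when the current error rate plus its standard deviation exceeds the best value observed so far by two standard deviations, and signal a drift at three); it is a simplified reading of the method, not the platform’s implementation.

```python
import math

class SimpleDDM:
    """Minimal Drift Detection Method (DDM) sketch over a stream of
    prediction outcomes (1 = wrong prediction, 0 = correct prediction)."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 1.0                 # running error rate
        self.s = 0.0                 # its standard deviation
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the prediction disagreed with the ground truth, else 0."""
        self.n += 1
        self.p += (error - self.p) / self.n          # incremental error rate
        self.s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s  # best performance seen so far
        if self.p + self.s > self.p_min + 3 * self.s_min:
            self.reset()                             # drift confirmed: restart monitoring
            return "drift"
        if self.p + self.s > self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"
```

Fed with the 0/1 stream of prediction errors, a sustained increase in the error rate first raises a warning and then a drift signal.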

Once the pipeline is built and tested (we provide a novel codeless debugging feature), users can deploy the topology on top of their favourite streaming engine. We provide out-of-the-box support for Apache Flink, Spark Streaming, and the Kafka Streams library. Don’t worry! In the end, you will get a new horizontally scaled deployment on Kubernetes. This results in two main deployment families:

1) Your streaming workload topology

2) A set of Seldon-orchestrated, trained ML models

While the deployment runs, the topology asks the models for predictions and the models respond with high throughput and low latency. The communication channel is Kafka, of course. With this solution, inference requests and responses are exchanged asynchronously as Kafka messages conveyed over Kafka topics, so we can leverage Kafka’s fault tolerance and scalability. Consequently, it is possible to scale out almost without limit; the only bound is the capacity of the Kafka infrastructure.
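
As a rough illustration of this request/response pattern over Kafka (the topic names, broker address, and payload format below are hypothetical, and kafka-python is just one possible client), a topology could publish inference requests while another process consumes the model’s responses:

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKER = "localhost:9092"  # hypothetical broker address

# The streaming topology publishes inference requests to an input topic...
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("inference-requests", {"request_id": "42", "features": [0.1, 3.2, 7.4]})
producer.flush()

# ...while a separate process consumes the predictions the model wrote back.
consumer = KafkaConsumer(
    "inference-responses",
    bootstrap_servers=BROKER,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # e.g. {"request_id": "42", "prediction": 1}
```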

Let’s talk about Continuous Training

For Radicalbit, “Continuous Training” means providing customers with a set of DevOps tools to help them establish when it is the right time to retrain a model in production. We follow one key metric only: drift. For us, drift is not only an upstream change in the data structure, but also a change in data behaviour and in the evolution of its domain.

We identify two kinds of drift in data. The first one is the so-called “Data Drift”; at this stage, we monitor the behaviour of the input data. As you can see in the figure below, we track the behaviour of the values of the feature x we are monitoring, and we do not consider the inference values at all. However, when the input data distribution changes in unprecedented ways, your model might not be good anymore; therefore, you should run a new training session. ADWIN (ADaptive WINdowing) is a well-known method for spotting these drifts by monitoring data streams, and in the figure below the vertical red lines represent ADWIN activations.

[Figure: values of the monitored feature over time, with vertical red lines marking ADWIN drift activations]

However, this approach might not work properly in some cases, for instance with stock market data, where the trend is as meaningful as the values the feature assumes. When “how the values change over time” carries information as well, data drift detection alone is no longer a good fit, so we need to move to a new monitoring strategy where the values of the inferences also matter. Thus, we built a detection model for the second kind of drift we identified, the so-called “Concept Drift”.

By monitoring the prediction trend, we can understand how well we are predicting. To do so, however, we need feedback, i.e. a piece of information stating the true class for a given, already emitted prediction. With this information, the platform is able to compute the error rate. Methods like the Drift Detection Method track sudden changes in the error rate (shown in the figure below as vertical red lines).

When a drift is detected, the platform alerts clients that it is time to retrain their ML model with fresh data. The Radicalbit MLOps platform offers a tiered set of notifications and triggers that can be used to activate actions (e.g. executing remote continuous integration pipelines aimed at retraining a model).
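
As a hypothetical example of such a trigger (the endpoint URL, token, and payload are made up for illustration; actual platform integrations may look different), a drift alert could call a remote CI webhook that starts a retraining pipeline:

```python
import requests  # pip install requests

# Hypothetical CI webhook: URL, token, and payload shape are placeholders.
CI_WEBHOOK_URL = "https://ci.example.com/api/pipelines/trigger"
CI_TOKEN = "replace-with-a-real-token"

def trigger_retraining(model_name: str, drift_kind: str) -> None:
    """Ask the remote CI system to start a retraining pipeline for a model."""
    response = requests.post(
        CI_WEBHOOK_URL,
        headers={"Authorization": f"Bearer {CI_TOKEN}"},
        json={"model": model_name, "reason": f"{drift_kind} drift detected"},
        timeout=10,
    )
    response.raise_for_status()

# Called, for instance, from the drift-monitoring loop:
# trigger_retraining("churn-classifier", "concept")
```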

Let’s talk about Online Learning
How do we train machine learning models with streaming data?

As eagerly anticipated in the preface of this article, Radicalbit aims to build the first production-ready platform for streaming machine learning models. This is a completely different approach for many reasons, but the most obvious one is that it changes the input data our models crunch while they learn (so goodbye, traditional datasets!). We do not have historical, finite data, but an unfinished, continuous flow of information that changes over time. Therefore, the chances of making valuable a priori assumptions about the input data are very limited. The objective is to make our models dynamic in behaviour, so that they can adapt to changes in the input data over time.

As far as supervised learning is concerned, we are driven by the same requirement as drift detection: feedback. Here the paradigm changes as well; in online machine learning, the prediction is first computed, then delivered downstream, and finally, in reaction to the incoming feedback, the model updates its learned function based on the measured loss.
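
To make this predict-then-learn loop concrete, here is a minimal from-scratch sketch of an online logistic regression updated one event at a time (the learning rate and the toy data are arbitrary illustration choices, not the platform’s algorithm):

```python
import math

class OnlineLogisticRegression:
    """Tiny online learner: predict first, then update once the label arrives."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """Single SGD step on the log loss, driven by the observed label y."""
        grad = self.predict_proba(x) - y   # gradient of the log loss w.r.t. z
        self.w = [wi - self.lr * grad * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * grad

model = OnlineLogisticRegression(n_features=2)
for x, y in [([0.2, 1.1], 1), ([1.5, -0.3], 0), ([0.1, 0.9], 1)]:
    y_hat = model.predict_proba(x)  # 1. the prediction is emitted downstream
    # 2. later, the feedback (true label y) arrives...
    model.learn_one(x, y)           # 3. ...and the model updates on the measured loss
```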

The main difference from traditional batch learning tasks is that online models predict and train at the same time. In our latest R&D project, we built an online neural network based on a hedge algorithm. The main feature of hedging lies in the expert-advice approach: each network layer is able to provide an output prediction; shallower layers are more valuable in the early stages of the algorithm’s lifetime, and the more events the model crunches, the more credibility is assigned to the output predictions of the deeper layers.
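
A simplified sketch of this hedge weighting over per-layer predictions is shown below; it follows the general expert-advice recipe, with a made-up discount factor, rather than reproducing our exact implementation.

```python
def hedge_combine(layer_preds, weights):
    """Combine per-layer predictions as a weighted average (expert advice)."""
    return sum(w * p for w, p in zip(weights, layer_preds)) / sum(weights)

def hedge_update(weights, layer_losses, beta=0.9, floor=1e-3):
    """Discount each layer's weight by beta raised to its loss, so layers that
    predict poorly lose credibility while better layers gain relative weight."""
    discounted = [max(w * beta ** loss, floor) for w, loss in zip(weights, layer_losses)]
    total = sum(discounted)
    return [w / total for w in discounted]

# Example: three layers; the shallowest starts with the most credibility.
weights = [0.5, 0.3, 0.2]
layer_preds = [0.55, 0.70, 0.90]        # each layer's output probability
y_hat = hedge_combine(layer_preds, weights)
y_true = 1.0
layer_losses = [(p - y_true) ** 2 for p in layer_preds]  # squared loss per layer
weights = hedge_update(weights, layer_losses)            # the deepest layer gains weight
```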

Radicalbit also proposes a first distributed implementation for it, as shown in the picture below.

[Figure: distributed architecture of the online learning implementation, starting from the data sources]

The main advantage of using streaming machine learning algorithms in our platform is that you can dramatically reduce the effort of periodically retraining models, because they adjust themselves over time. Additionally, by building the right serving layer for streaming models, we are about to prove that in many contexts streaming algorithms can outperform traditional batch learning applications.

Stay tuned and follow us to learn more!
