Data Integrity MLOps

Sep 28, 2023 | by A. Conflitti

As Jane Austen would say, it is a truth universally acknowledged that a machine learning model is only as good as the data it receives. This applies both to the data used for training it and to the data fed into it at inference time. The former case is self-evident; here we shall delve into the latter.

Say you have developed, trained, and deployed into production a model that achieves good results on the right metrics: you have only done half of the work. You still have to monitor the model's performance continuously and make sure it stays in top-notch condition, because plenty of problems can creep in. The most common example is drift, of both data and concept, which makes the model's performance degrade over time, but other hurdles might pop up as well.

What is Data Integrity?

An aspect that is often overlooked is the quality of the data sent as input to a model at inference time: if this data is corrupted or otherwise of poor quality, even the best model will output poor predictions.
Radicalbit’s MLOps platform has a specific solution for checking data integrity and always keeping a deployed model in spick-and-span condition.

Let us talk about Data Integrity: what it is exactly, and how the Radicalbit platform can help you keep it monitored at all times.

Given an array of features sent as input to the model, there are several aspects that have to be checked to ensure its integrity, including:

  • data type;
  • range and categories: no overstepping of bounds;
  • schema evolution;
  • rate of nulls, i.e. number of missing values;
  • outliers.

To better understand all these conditions, consider the following example: a chemical plant fitted with plenty of sensors sends a stream of real-time readings to a model. The features sent as input to the model are:

  • temperature in degrees Celsius (variable of type double),
  • pressure (variable of type double),
  • number of pellets of fuel fed to the boiler (variable of type integer),
  • level of risk for the chemical reaction in progress (‘green’, ‘orange’, ‘red’).

The output is how long the production process will take, in hours.

Let us consider all the aforementioned aspects of Data Integrity.

Data type

If the thermometer reads a value of 20 but for some reason it is sent to the model as ‘20’ (i.e. as a string), the data type is incorrect and the model is not able to use this information. An alert must be triggered to inform users about the data type problem in this specific record.
Likewise, if the value for the number of pellets is not an integer (e.g. 3.5), an alert with this specific information is triggered and sent to the users, who can then take appropriate action to fix the problem.
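
To make the idea concrete, here is a minimal Python sketch of a per-record type check. It is not the Radicalbit implementation: the field names, the EXPECTED_TYPES mapping and the type_violations helper are all hypothetical, chosen to mirror the chemical plant example.

```python
# Hypothetical expected types for the features of the chemical plant example.
EXPECTED_TYPES = {
    "temperature": float,
    "pressure": float,
    "pellets": int,
    "risk_level": str,
}


def has_type(value, expected) -> bool:
    """Check a value against an expected type, with two deliberate choices."""
    # bool is a subclass of int in Python, so reject it for numeric fields.
    if isinstance(value, bool):
        return expected is bool
    # Assumption: a 'double' field also accepts whole numbers such as 20.
    if expected is float:
        return isinstance(value, (int, float))
    return isinstance(value, expected)


def type_violations(record: dict) -> list:
    """Return one message per field whose value has the wrong type."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        value = record.get(field)
        if value is None:
            continue  # missing values belong to the rate-of-nulls check
        if not has_type(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(value).__name__} ({value!r})"
            )
    return problems


if __name__ == "__main__":
    reading = {"temperature": "20", "pressure": 1.2,
               "pellets": 3.5, "risk_level": "green"}
    for message in type_violations(reading):
        print("ALERT", message)
```

Each message names the offending field and value, which is the kind of record-specific information an alert would need to carry.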

Range and categories: no overstepping of bounds

Imagine the thermometer sends a value of -300: this is clearly an incorrect reading, since it is below absolute zero (-273.15 °C, i.e. 0 K). Likewise, if the level of risk has a value of ‘cherry’, this is an incorrect reading as well, since the only admissible values are ‘green’, ‘orange’ and ‘red’.

In Radicalbit’s MLOps platform one can set up rules defining ranges for numerical variables and lists of admissible values for categorical ones: if a violation occurs, an alert is triggered, with all relevant information sent to the specific alert channel.
These rules can of course be updated: for instance, ranges can be modified.
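
Rules of this kind can be sketched in a few lines of Python. The snippet below is only an illustration under stated assumptions: RangeRule, CategoryRule and rule_violations are hypothetical names, not the platform's API, and the upper temperature bound is an arbitrary value picked for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeRule:
    """Admissible closed interval for a numerical feature."""
    low: float
    high: float

    def allows(self, value) -> bool:
        return self.low <= value <= self.high


@dataclass(frozen=True)
class CategoryRule:
    """Admissible values for a categorical feature."""
    allowed: frozenset

    def allows(self, value) -> bool:
        return value in self.allowed


# Hypothetical rule set; the 1500.0 upper bound is an assumption.
RULES = {
    "temperature": RangeRule(low=-273.15, high=1500.0),
    "risk_level": CategoryRule(allowed=frozenset({"green", "orange", "red"})),
}


def rule_violations(record: dict) -> list:
    """Return one message per field that breaks its rule.

    Assumes data types were already validated, so numerical
    comparisons cannot fail on strings.
    """
    problems = []
    for field, rule in RULES.items():
        value = record.get(field)
        if value is not None and not rule.allows(value):
            problems.append(f"{field}: value {value!r} is not admissible")
    return problems


if __name__ == "__main__":
    for message in rule_violations({"temperature": -300.0, "risk_level": "cherry"}):
        print("ALERT", message)
```

Updating a rule then amounts to replacing an entry in RULES, for instance with a narrower RangeRule.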

Schema Evolution

In our MLOps platform we offer a fully fledged solution for carrying out both data pre-processing (before model inference) and post-processing (after model inference), which results in a complete AI data pipeline.

The data are encoded according to an Avro schema, which can evolve if new fields (which must be optional, i.e. nullable) are added after its creation. Now, imagine that the chemical plant adds new sensors for reading the humidity level, so that the schema evolves. It is crucial to receive an alert in order to update the pre- and post-processing pipelines and, if desired, to create a new version of the ML model trained with the new variable as well, once enough training data has been collected.
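
As an illustration, the sketch below compares two versions of a schema and reports the fields that were added, checking that each one is nullable. Avro schemas are JSON documents, so they are written here as plain Python dicts and no Avro library is used. The record name, the field names and the helper functions are assumptions made for this example.

```python
# Hypothetical schema for the chemical plant readings, before the change.
OLD_SCHEMA = {
    "type": "record",
    "name": "PlantReading",
    "fields": [
        {"name": "temperature", "type": "double"},
        {"name": "pressure", "type": "double"},
        {"name": "pellets", "type": "int"},
        {"name": "risk_level", "type": "string"},
    ],
}

# Evolved schema: humidity is added as an optional field, i.e. a union
# with "null" and a null default (None in Python, null in JSON).
NEW_SCHEMA = {
    **OLD_SCHEMA,
    "fields": OLD_SCHEMA["fields"] + [
        {"name": "humidity", "type": ["null", "double"], "default": None},
    ],
}


def is_nullable(field: dict) -> bool:
    """Treat a field as nullable when its type is a union containing "null"."""
    field_type = field["type"]
    return isinstance(field_type, list) and "null" in field_type


def added_fields(old_schema: dict, new_schema: dict) -> list:
    """Return the field definitions present only in the new schema."""
    old_names = {field["name"] for field in old_schema["fields"]}
    return [f for f in new_schema["fields"] if f["name"] not in old_names]


def evolution_alerts(old_schema: dict, new_schema: dict) -> list:
    """Build one alert message per added field."""
    alerts = []
    for field in added_fields(old_schema, new_schema):
        kind = "optional" if is_nullable(field) else "NOT nullable"
        alerts.append(f"new field {field['name']!r} ({kind})")
    return alerts


if __name__ == "__main__":
    for message in evolution_alerts(OLD_SCHEMA, NEW_SCHEMA):
        print("ALERT", message)
```

An alert flagged as not nullable would signal a change that breaks the rule stated above, while an optional field is a prompt to update the pipelines and possibly retrain the model.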

Rate of Nulls

So far we have dealt with data integrity related to a single data record, but more complex verifications can be carried out as well.
Radicalbit’s platform users can set up rules related to the rate of null values in a given data window: for instance, if there are more than 10 null values in the last 100 readings received, an alert is triggered and all information about the null values is sent to the chosen alert channel. This is especially important because an increase in the rate of null values can signal a problem with the sensors or with the communication lines, and it is therefore paramount to fix it as soon as possible.
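
A window rule like the one just described (more than 10 nulls in the last 100 readings) can be sketched with a fixed-size queue. The NullRateMonitor class below is hypothetical and only illustrates the mechanism; it says nothing about how the platform implements it.

```python
from collections import deque


class NullRateMonitor:
    """Track nulls over a sliding window of the most recent readings."""

    def __init__(self, window_size: int = 100, max_nulls: int = 10):
        # A deque with maxlen drops the oldest entry automatically.
        self.window = deque(maxlen=window_size)
        self.max_nulls = max_nulls

    def observe(self, value) -> bool:
        """Record one reading; return True if the window holds too many nulls."""
        self.window.append(value is None)
        return sum(self.window) > self.max_nulls


if __name__ == "__main__":
    monitor = NullRateMonitor(window_size=100, max_nulls=10)
    readings = [20.5] * 50 + [None] * 11
    for position, reading in enumerate(readings):
        if monitor.observe(reading):
            print(f"ALERT: too many nulls in window at reading {position}")
```

Because the window slides, the alert clears by itself once enough valid readings have pushed the nulls out.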

Outliers

An outlier is a data point that differs significantly from the others, i.e. one that is far from the regions where the training data is most concentrated.
The crucial difference between outliers and the data described in «Range and categories: no overstepping of bounds» is that the former are data points that can conceivably be received, although with a very low likelihood, whereas the latter are data points with values that are surely wrong.
Detecting outliers is therefore important because, while they might be incorrect values, they might also be novel data, i.e. unusual but correct observations, and in this case they may indicate that anomalies are creeping into the data.
Inside Helicon, sophisticated techniques and algorithms are included to detect outliers and trigger alerts when they are spotted, to help you keep full control of all data coming in.
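
To give a flavour of the idea, here is a deliberately simple sketch based on the z-score, i.e. the distance of a value from the training mean measured in standard deviations. This is a textbook technique swapped in for illustration, not the algorithm used inside Helicon. The function names, the threshold of 3 and the sample temperatures are assumptions.

```python
from statistics import mean, stdev


def fit_baseline(training_values):
    """Summarise where the training data is concentrated."""
    return mean(training_values), stdev(training_values)


def is_outlier(value, baseline, threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations from the mean."""
    center, spread = baseline
    if spread == 0:
        # Degenerate case: every training value was identical.
        return value != center
    return abs(value - center) / spread > threshold


if __name__ == "__main__":
    # Hypothetical training temperatures, in degrees Celsius.
    baseline = fit_baseline([78.0, 80.0, 82.0, 79.0, 81.0, 80.0])
    for reading in (81.0, 95.0):
        label = "outlier" if is_outlier(reading, baseline) else "ok"
        print(reading, label)
```

A reading of 95 is physically possible, so a range rule would accept it, yet it sits far from the training data; this is precisely the difference between an outlier and an out-of-bounds value.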

If you are interested in knowing more about our MLOps Platform, its top-notch monitoring solution and how it can help you with real-time monitoring of data integrity, do not hesitate to reach out.
