A Checklist for ML Production Systems
This article summarizes the production-pipeline scoring rubric from Google’s paper “What’s your ML Test Score? A rubric for ML production systems” and packages it as a CSV file, so it can easily be reused across projects to check that they meet a minimum deployment threshold.
The Google authors group the tests into four categories: Data, Model, Infrastructure, and Monitoring.
Data
Test Type | Description | Sample Test |
---|---|---|
Data | Test that the distribution of each feature matches your expectations | PSI, KS test (see sketch below) |
Data | Test the relationship between each feature and the target, and the pairwise correlations between individual signals | Correlation coefficient |
Data | Test the cost of each feature against the performance improvement it provides | Inference latency, RAM usage, upstream data dependencies |
Data | Model does not contain any features that have been manually determined as unsuitable | Spreadsheet of banned variables |
Data | Test that your system maintains privacy controls across its entire data pipeline | |
Data | Test the calendar time needed to develop and add a new feature to the production model | |
Data | Test all code that creates input features, both in training and serving | Unit tests for the feature-creation code (see sketch below) |
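As an illustration of the distribution test in the first row, here is a minimal sketch assuming the training and serving values of a single numeric feature are available as NumPy arrays; the quantile bucketing, the 0.2 PSI threshold, and the KS p-value cutoff are illustrative choices rather than part of the rubric.

```python
# Sketch of a feature-distribution check (PSI and KS), assuming the training
# and serving values of one numeric feature are available as 1-D arrays.
import numpy as np
from scipy.stats import ks_2samp


def population_stability_index(expected, actual, n_bins=10):
    """PSI between two samples, using quantile bins taken from the expected sample."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch out-of-range values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)             # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


def test_feature_distribution_matches_expectation(train_values, serving_values):
    psi = population_stability_index(train_values, serving_values)
    ks_stat, ks_pvalue = ks_2samp(train_values, serving_values)
    # 0.2 is a commonly used PSI warning threshold; tune it per feature.
    assert psi < 0.2, f"PSI={psi:.3f} suggests the feature distribution shifted"
    assert ks_pvalue > 0.01, f"KS test rejects equal distributions (p={ks_pvalue:.4f})"
```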
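And a minimal sketch of a unit test for feature-creation code (last row of the table); `myproject.features.create_features`, the expected output schema, and the transformations checked here are hypothetical stand-ins for your own feature code.

```python
# Sketch of a unit test for feature-creation code; `create_features` is a
# hypothetical function that turns a raw event dict into a feature dict.
import math

import pytest

from myproject.features import create_features  # hypothetical module


def test_create_features_handles_a_typical_event():
    raw = {"user_id": "u123", "age": 34, "country": "DE", "clicks_7d": 12}
    feats = create_features(raw)
    # The schema the serving system expects (invented for this example).
    assert set(feats) == {"age_bucket", "country_one_hot", "log_clicks_7d"}
    assert feats["log_clicks_7d"] == pytest.approx(math.log1p(12))


def test_create_features_handles_missing_values():
    feats = create_features({"user_id": "u123"})     # everything else missing
    assert all(not (isinstance(v, float) and math.isnan(v)) for v in feats.values())
```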
Model
Test Type | Description | Sample Test |
---|---|---|
Model | Test that every model specification undergoes a code review and is checked in to a repository | |
Model | Test the relationship between model performance metrics (accuracy or AUC) and the actual business impact metrics (click-through rates) | This can be measured in a small-scale A/B experiment using an intentionally degraded model, to see how a 1% change in model performance moves the business metric |
Model | Test the impact of each tunable hyperparameter | Grid search or Bayesian Optimization for hyperparameter search |
Model | Test the effect of model staleness | Compare predictions from last week/month/year |
Model | Test against a simpler model as a baseline | Cost/benefit analysis against a linear (or equivalent) model with few features (see sketch below) |
Model | Test model quality on important data slices | Important slices might be users by country or by device (see sketch below) |
Model | Test the model for implicit bias | |
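A minimal sketch of the baseline test, assuming the candidate model and the train/validation splits are supplied (for example as pytest fixtures); the 0.01 AUC margin is an arbitrary illustrative bar, not a recommendation from the paper.

```python
# Sketch of a baseline test: the candidate model should beat a simple
# linear model by a margin that justifies its extra complexity.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def test_candidate_beats_linear_baseline(candidate_model, X_train, y_train, X_val, y_val):
    baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    baseline_auc = roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1])
    candidate_auc = roc_auc_score(y_val, candidate_model.predict_proba(X_val)[:, 1])
    # Require a minimum lift over the baseline; 0.01 AUC is an illustrative bar.
    assert candidate_auc >= baseline_auc + 0.01, (
        f"candidate AUC {candidate_auc:.3f} does not justify its cost over "
        f"baseline AUC {baseline_auc:.3f}"
    )
```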
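And a sketch of the slice-quality test, assuming predictions and labels are collected in a pandas DataFrame with `label`, `score`, and `country` columns; the 0.05 allowed gap is an assumption made for this example.

```python
# Sketch of a slice-quality check: per-country AUC must not fall too far
# below the overall AUC.
import pandas as pd
from sklearn.metrics import roc_auc_score


def check_slice_quality(df: pd.DataFrame, max_gap: float = 0.05) -> None:
    overall_auc = roc_auc_score(df["label"], df["score"])
    for country, slice_df in df.groupby("country"):
        if slice_df["label"].nunique() < 2:
            continue  # AUC is undefined for single-class slices
        slice_auc = roc_auc_score(slice_df["label"], slice_df["score"])
        assert slice_auc >= overall_auc - max_gap, (
            f"AUC for country={country} is {slice_auc:.3f}, "
            f"more than {max_gap} below overall {overall_auc:.3f}"
        )
```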
Infrastructure
Test Type | Description | Sample Test |
---|---|---|
Infrastructure | Test the reproducibility of training | Train two models on the same data, and observe any differences in aggregate metrics, sliced metrics, or example-by-example predictions (see sketch below) |
Infrastructure | Unit test model specification code | Useful assertions include testing that training results in decreased loss and that a model can restore from a checkpoint after a mid-training job crash (see sketch below) |
Infrastructure | Integration test the full ML pipeline | A good integration test runs all the way from original data sources, through feature creation, to training, and to serving. An integration test should run both continuously as well as with new releases of models or servers |
Infrastructure | Test model quality before attempting to serve it | Useful tests include testing against data with known correct outputs and validating the aggregate quality, as well as comparing predictions to a previous version of the model |
Infrastructure | Test that a single example or training batch can be sent to the model, and changes to internal state can be observed from training through to prediction | |
Infrastructure | Test models via a canary process before they enter production serving environments | This includes testing that a model can be loaded into the production serving binaries and perform inference on production input data at all. It also includes a canary process, in which a new version is tested on a small trickle of live data |
Infrastructure | Test how quickly and safely a model can be rolled back to a previous serving version | |
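A minimal sketch of the reproducibility test, using scikit-learn and synthetic data as stand-ins for the real training pipeline.

```python
# Sketch of a training-reproducibility test: two runs on the same data with
# the same seed should produce (near-)identical predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def test_training_is_reproducible():
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    model_a = GradientBoostingClassifier(random_state=42).fit(X, y)
    model_b = GradientBoostingClassifier(random_state=42).fit(X, y)
    preds_a = model_a.predict_proba(X)[:, 1]
    preds_b = model_b.predict_proba(X)[:, 1]
    # Compare example-by-example, not just aggregate metrics.
    np.testing.assert_allclose(preds_a, preds_b, atol=1e-8)
```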
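And a sketch of a model-specification unit test covering the two assertions suggested above (loss decreases, checkpoint restores); a tiny NumPy logistic-regression loop stands in for the real model code, and `tmp_path` is the standard pytest fixture.

```python
# Sketch of a model-specification unit test: a few optimisation steps should
# reduce the training loss, and weights should survive a save/restore cycle.
import numpy as np


def train_step(w, X, y, lr=0.1):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - y) / len(y)
    return w - lr * grad


def log_loss(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return float(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))


def test_training_reduces_loss_and_checkpoint_restores(tmp_path):
    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 5))
    y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) > 0).astype(float)
    w = np.zeros(5)
    initial_loss = log_loss(w, X, y)
    for _ in range(50):
        w = train_step(w, X, y)
    assert log_loss(w, X, y) < initial_loss          # training makes progress
    ckpt = tmp_path / "weights.npy"                  # checkpoint to a temp file
    np.save(ckpt, w)
    np.testing.assert_array_equal(np.load(ckpt), w)  # restore is lossless
```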
Monitoring
Test Type | Description | Sample Test |
---|---|---|
Monitoring | Test for upstream instability in features, both in training and serving | As examples, what alert would fire if one datacenter stops sending data? What if an upstream signal provider did a major version upgrade? |
Monitoring | Test that data invariants hold in training and serving inputs | For example, test that Feature A and Feature B always have the same number of non-zero values in each example, that Feature C is always in the range (0, 100), or that the class distribution stays at about 10:1 (see sketch below) |
Monitoring | Test that your training and serving features compute the same values | The codepaths that generate input features may differ between training and serving, due to tradeoffs between flexibility and efficiency and other concerns; make sure there is no “training/serving skew” (see sketch below) |
Monitoring | Test for model staleness | For models that continually update, this means monitoring staleness throughout the training pipeline |
Monitoring | Test for NaNs or infinities appearing in your model during training or serving | |
Monitoring | Test for dramatic or slow-leak regressions in training speed, serving latency, throughput, or RAM usage | Dramatic regressions and slow regressions over time may require different kinds of monitoring |
Monitoring | Test for regressions in prediction quality on served data | |
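A minimal sketch of the data-invariant checks, using the example invariants from the table; the alert bounds (for instance accepting class ratios between 5:1 and 20:1) are invented for illustration.

```python
# Sketch of data-invariant checks over a batch of serving inputs.
import numpy as np


def check_batch_invariants(feature_a, feature_b, feature_c, labels):
    # Feature A and Feature B should have the same number of non-zero values
    # in each example (rows of the 2-D arrays here).
    nz_a = np.count_nonzero(feature_a, axis=1)
    nz_b = np.count_nonzero(feature_b, axis=1)
    assert np.array_equal(nz_a, nz_b), "non-zero counts of Feature A and B diverged"

    # Feature C must stay inside its documented range.
    assert np.all((feature_c > 0) & (feature_c < 100)), "Feature C out of (0, 100)"

    # Class distribution should stay roughly 10:1 (negative:positive).
    ratio = np.sum(labels == 0) / max(np.sum(labels == 1), 1)
    assert 5 <= ratio <= 20, f"class ratio drifted to {ratio:.1f}:1"
```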
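And a sketch of a training/serving skew check; `training_features` and `serving_features` are hypothetical wrappers around the two codepaths, and the zero-tolerance assertion is a deliberate simplification.

```python
# Sketch of a training/serving skew check: both codepaths should compute
# identical feature values for the same raw examples.
import numpy as np

from myproject.features import training_features, serving_features  # hypothetical


def check_training_serving_skew(raw_examples, tolerance=1e-6):
    mismatches = 0
    for raw in raw_examples:
        train_vec = np.asarray(training_features(raw), dtype=float)
        serve_vec = np.asarray(serving_features(raw), dtype=float)
        if not np.allclose(train_vec, serve_vec, atol=tolerance):
            mismatches += 1
    skew_rate = mismatches / len(raw_examples)
    assert skew_rate == 0, f"{skew_rate:.2%} of examples show training/serving skew"
```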
For each test above, one point is awarded for executing the test manually and documenting the results, and a second point is awarded if there is a system in place to run the test automatically on a repeated basis. The final ML Test Score is the minimum, across the four categories, of each category's point total (a sketch of this computation follows the score bands below):
- 0 points: More of a research project than a production system
- 1-2 points: Serious holes in reliability
- 3-4 points: There’s been a first pass at basic productionization, but additional investment may be needed.
- 5-6 points: Reasonably tested, but it’s possible that more of those tests and procedures may be automated.
- 7-10 points: Strong levels of automated testing and monitoring, appropriate for mission-critical systems.
- 12+ points: Exceptional levels of automated testing and monitoring.
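A minimal sketch of this scoring computation, assuming the seven tests per category listed above; the per-test points in the example are invented purely to show the arithmetic.

```python
# Sketch of the scoring rule described above: per test, 1 point for manual
# execution and a 2nd point for automation; the final score is the minimum
# of the four category totals.
from typing import Dict, List


def ml_test_score(scores: Dict[str, List[int]]) -> int:
    """`scores` maps each category to a list of per-test points (0, 1 or 2)."""
    assert set(scores) == {"Data", "Model", "Infrastructure", "Monitoring"}
    assert all(p in (0, 1, 2) for tests in scores.values() for p in tests)
    return min(sum(tests) for tests in scores.values())


example = {
    "Data":           [2, 2, 1, 1, 0, 1, 2],   # total 9
    "Model":          [2, 1, 1, 1, 2, 1, 0],   # total 8
    "Infrastructure": [1, 1, 1, 0, 1, 1, 1],   # total 6
    "Monitoring":     [1, 1, 0, 1, 1, 1, 0],   # total 5
}
print(ml_test_score(example))  # -> 5: "reasonably tested" on the scale above
```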
Download CSV: GitHub