Reproducibility in Machine Learning

Akash Agnihotri
6 min read · Jul 10, 2022


Wheat Field with Cypresses by Vincent van Gogh (1889)

“Any device which takes an input and produces an (exact) output is a machine”: this has long been the rough definition of a machine, but machine learning solutions have called it into question in industry when it comes to implementation. The doubt arises because many companies still struggle to deploy end-to-end solutions involving machine learning and deep learning.

A Machine with predictable and constant behavior (Source)

According to a report by VentureBeat, nearly 87% of machine learning projects never reach the production stage. This is due to a whole host of issues, including the lack of an MLOps framework, team structure, and forcing ML onto every problem. But one of the key issues that projects face when they move from a Jupyter notebook on someone’s local system to a central environment is the reproducibility of results. Results then vary drastically, and the project loses business trust and eventually funding.

In my years as a Machine Learning Engineer, I have identified 3 key variables that need to be tracked and versioned in sync to make machine learning results consistent and reproducible over time. Together they combine ideas from MLOps, DataOps, and DevOps into a framework that lets projects produce consistent, reproducible outputs.

The Model

An artificial neural network (Source)

The model is the centerpiece of this puzzle, but also the simplest part to build and implement. Yes, I believe most industry solutions implementing ML do not require a complex model, so the model is often the simplest thing to build from an engineering perspective.

The challenge of reproducibility in a model arises mainly from a training pipeline that does not account for the pseudo-random algorithms that are part of ML, for example model weight initialization and the sequence of training data batches. These algorithms produce a different output on every run if the SEED variable is not fixed across runs. SEED is a hyperparameter used during model training that allows pseudo-random algorithms to generate the same result on every run, given that everything else remains the same.

Almost all major libraries provide a function to set a global SEED that applies across the entire training run. SEED should be treated like any other hyperparameter and registered along with the model in experiment tracking systems like MLflow.
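As a minimal sketch of the idea, the snippet below derives both the weight initialization and the batch order from a single seed, using only Python's standard library. The helper names (init_weights, batch_order, training_setup) are hypothetical and exist only for this illustration; a real pipeline would additionally seed whichever libraries it uses, for example with numpy.random.seed or torch.manual_seed.

```python
import random


def init_weights(n_weights, rng):
    """Draw initial weights from a small uniform range using the given generator."""
    return [rng.uniform(-0.1, 0.1) for _ in range(n_weights)]


def batch_order(n_samples, rng):
    """Return the shuffled order in which training samples would be visited."""
    order = list(range(n_samples))
    rng.shuffle(order)
    return order


def training_setup(seed):
    """Build every random ingredient of a run from one seed."""
    rng = random.Random(seed)  # dedicated generator, no hidden global state
    return {
        "seed": seed,  # log this next to the other hyperparameters
        "weights": init_weights(4, rng),
        "order": batch_order(8, rng),
    }


run_a = training_setup(seed=42)
run_b = training_setup(seed=42)  # same seed: identical weights and batch order
run_c = training_setup(seed=7)   # different seed: a different starting point

print(run_a == run_b)
print(run_a["weights"] == run_c["weights"])
```

Returning the seed together with the values it produced makes it natural to register it with the run, just like a learning rate or batch size.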

The Data

Types of Data (Source)

Any machine learning system is only as good as the data it is trained on, and the same holds for reproducibility: it is only as good as the quality of the data versioning and how it is implemented. In real life, data is never as clean as what we see on Kaggle, and building a robust feature engineering pipeline that produces consistent results is a big challenge for engineering teams.

We have already seen how treating SEED as a crucial hyperparameter can solve the curse of pseudo-random algorithms, so we will not visit that again. Let's talk about how we can maintain data versions.

Mention data versioning to your platform or data engineering team and look at their reaction. That will tell you a lot about the biggest challenges with data versioning, but for time’s sake let me list some of them below.

  • Real data versioning means redundant copies of data, which means exploding storage costs.
  • For index-based versioning, what happens when the underlying data changes?
  • How do we deal with data retention or governance, and what about access management?

There are tools available to assist you in this effort, for example DVC and Azure ML Studio, but all of them have their limitations. The most common one is what happens when the actual data being tracked is changed in the background or, worse, deleted.

Time travel is what I have found to be a cheap and reliable solution: not the H. G. Wells kind, but data time travel. As I mentioned earlier, most ML projects are not so complex that they need model training every few hours on a new dataset. It might be daily at best. For such use cases, data versioning can be something as simple as the ASOF date on which the model was trained.

Most experiment tracking systems provide the details of the date and time when the model was trained, which provides the information one needs to find the right data version for reproducibility of results.
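A minimal sketch of such an as-of lookup is shown here. It assumes something the text above does not guarantee: that the data is append-only and every row carries the timestamp at which it was loaded. The table contents, the loaded_at field, and the snapshot_as_of helper are all hypothetical.

```python
from datetime import datetime

# Hypothetical append-only table: each row keeps the time it was loaded.
ROWS = [
    {"id": 1, "value": 10.0, "loaded_at": datetime(2022, 7, 1, 6, 0)},
    {"id": 2, "value": 12.5, "loaded_at": datetime(2022, 7, 2, 6, 0)},
    {"id": 3, "value": 9.8, "loaded_at": datetime(2022, 7, 3, 6, 0)},
]


def snapshot_as_of(rows, as_of):
    """Return only the rows that already existed at the as-of timestamp."""
    return [row for row in rows if row["loaded_at"] <= as_of]


# Training timestamp, as it would be read back from the experiment tracker.
trained_at = datetime(2022, 7, 2, 12, 0)
training_rows = snapshot_as_of(ROWS, trained_at)

print([row["id"] for row in training_rows])
```

If rows can be updated or deleted in place, a filter like this is no longer enough, and the storage layer itself has to keep history.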

The Code

Code is the engine mixing data and model. (Source)

Coding the actual solution might be the largest and most crucial part of product development, but the good news is that it is also the most mature component. Code versioning is a well-established and well-understood topic. The key here is to track the version of the code used for model training. Most experiment tracking systems can register the git metadata used to create the pipeline, but the same is not true for the code running inside the pipeline itself.

To solve this problem, we can develop our solution as a Python package that can be built into a wheel and maintained easily across versions. Training or inference pipelines can then track it like any other dependency. For more on building Python packages for machine learning products, you can review my article on the topic.
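One way a pipeline could record which version of the packaged solution it ran with is to read the installed version at run time through the standard library's importlib.metadata. This is only a sketch: the package name my_ml_solution is a placeholder, and the dependency_versions helper is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def dependency_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            # Not installed in this environment; record that fact explicitly.
            versions[name] = None
    return versions


# "my_ml_solution" is a placeholder for the wheel built from the project code.
tracked = dependency_versions(["my_ml_solution"])
print(tracked)
```

The resulting dictionary can then be stored with the run, for example as parameters or tags in the experiment tracking system.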

The Experiment Tracking System

We must follow the scientific method of reproducible results when building great products. (Source)

“It was showing good results in my system” is the ML version of “It was working on my machine”. In the age of the cloud, please don’t develop an actual ML solution on your local systems and expect the engineering team to just “pick it up and deploy it”. It does not work like that.

Building products that can be deployed into production in a large organization means collaboration across teams, and that is only possible if you publish your experiments in a central repo for all to see, just like the code. MLflow, Neptune, Weights & Biases, H2O, etc. are some of the tools available to track all your experiments under one roof.

An experiment tracking system allows you to bring all your key variables into one place and track them in sync, so that teams can review the results and reproduce them if needed. Such systems track the model versions trained and deployed, allowing you to roll back when something goes wrong. They also track the data and the code and dependencies used to train the model.

A good experiment tracking system will have at least the following capabilities.

  • Model training and inference pipeline tracking.
  • Model versioning along with the associated artifacts.
  • Tracking the metadata of the repo that builds and publishes the training or inference pipeline.
  • Tracking the data and environment used for pipeline execution.
  • Hierarchical experiment tracking for hyperparameter tuning.
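To make the list above concrete, here is a rough sketch of the information one run record could hold, with child runs nested under a parent for hyperparameter tuning. The RunRecord class and all of its field values are hypothetical placeholders; this is not the schema of MLflow or any other tool.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class RunRecord:
    """Hypothetical record of one pipeline run, keeping the key variables in sync."""
    name: str
    seed: int
    data_as_of: str    # date of the data snapshot used for training
    code_version: str  # version of the packaged solution
    git_commit: str    # commit of the repo that built the pipeline
    params: Dict[str, float] = field(default_factory=dict)
    children: List["RunRecord"] = field(default_factory=list)


# Parent run for a tuning sweep; every value here is a placeholder.
parent = RunRecord(
    name="tuning-sweep",
    seed=42,
    data_as_of="2022-07-02",
    code_version="1.3.0",
    git_commit="abc1234",
)

# One child run per candidate learning rate, sharing the parent's context.
for lr in (0.01, 0.1):
    parent.children.append(
        RunRecord(
            name=f"lr-{lr}",
            seed=parent.seed,
            data_as_of=parent.data_as_of,
            code_version=parent.code_version,
            git_commit=parent.git_commit,
            params={"learning_rate": lr},
        )
    )

print(json.dumps(asdict(parent), indent=2))
```

Keeping the seed, data date, and code version on the same record is what lets someone else rerun the experiment later and expect the same result.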


Machine Learning is one of the most revolutionary technologies of our generation, but if it is to find industry-wide acceptance, it has to work under the same principles as all other technology. One of those principles is reproducible results that generate trust among users. There are many tools out there making brave promises about how they take care of all this and allow you to “Have fun with it”. Beware of such promises. Building great products is not always fun and requires hard work, and no tool in the world can compensate for human error or a lack of structure and discipline in development.

Hope you enjoyed the article.

You can find useful code on my GitHub

You can follow me on LinkedIn

You can read my other articles on Medium