Design Ideas for Frustrated ML Engineers

Simple ideas to solve complex problems.

5 min read · May 18, 2022

The Night Café, 1888 by Vincent Van Gogh (Source)

“I like these calm little moments before the storm. It reminds me of Beethoven.” — Gary Oldman in Léon: The Professional.

This is how I feel every time one of the data science teams comes up with another brilliant idea to solve a problem with ML, because it means long hours of fixing linting issues, writing test cases, and resolving platform issues.

ML System Diagram — A generic diagram of the production ML System. (Author)

In the diagram above, you can see how small the actual model training component is. In reality, we spend even less time on model training. Most of an ML engineer’s time goes into building the package that contains the code for data pipelines, training pipelines, and prediction pipelines, plus writing test cases and fixing linting issues.

“The reality is often disappointing”, I know, but seeing the final product is worth it. Seeing data flow seamlessly through your model and reach users who solve real problems with it makes up for all of it.

So here are 3 approaches I take to solve some of the problems I have faced.

Keep it simple

The simplest solution is often the best, or at least it gets the job done. As ML went mainstream in the software industry, everyone expected software engineers to just pick up ML and start building applications. Turns out, that doesn’t work.

The problem is not that they can’t pick up ML. The problem is that they can’t drop their years of software training: they come from years of building applications where every line of code is tested and linted.

That doesn’t always work in ML applications, where one line of code can run for 3 hours using 16 GB of GPU memory and a slight change in random state can change the results. How do you test that? (You can!) On top of that, not every ML library comes with linting standards.
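
One way to test code whose output depends on the random state is to fix the seed for exact reproducibility and to assert a tolerance band instead of an exact value across seeds. The sketch below is only an illustration: `noisy_training_run` is a hypothetical stand-in for a real training job, and the target score of 0.8 and the tolerance of 0.05 are made-up numbers.

```python
import random
import statistics


def noisy_training_run(seed: int, n_samples: int = 200) -> float:
    """Hypothetical stand-in for a training job that returns a validation score.

    The score is the mean of noisy draws around 0.8, so it moves with the seed
    the way a real metric moves with the random state.
    """
    rng = random.Random(seed)  # local generator, global random state untouched
    draws = [rng.gauss(0.8, 0.05) for _ in range(n_samples)]
    return statistics.fmean(draws)


def test_same_seed_is_reproducible() -> None:
    # With a fixed seed the run should be exactly repeatable.
    assert noisy_training_run(seed=42) == noisy_training_run(seed=42)


def test_score_is_stable_across_seeds() -> None:
    # Different seeds give different numbers, so check a tolerance band
    # around the expected score rather than an exact match.
    scores = [noisy_training_run(seed=s) for s in range(10)]
    assert len(set(scores)) > 1
    assert all(abs(score - 0.8) < 0.05 for score in scores)
```

Both functions follow the `test_` naming convention, so a runner such as pytest would pick them up without extra wiring.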

Not every ML application needs that level of scrutiny. A simple document tagging system that quietly works in the background, tagging new documents using ML, does not need to pass every mypy or flake8 check or reach 100% test coverage.

Simple python project repo for ML — Start with the simplest solution. (Author)

You don’t always need to start by defining abstract classes and datasets or implementing 15 different linting hooks. A simple package that can be containerized, along with simple pipelines to train and deploy models, could solve half of your ML problems. It can also be developed very quickly, which makes it a great starting point to show your idea in action.

This design works very well in the following cases.

A single use-case problem: there is 1 dataset, 1 model, and 1 prediction.
You are using classic models like logistic regression or random forest.
You want to reach the minimum viable product stage quickly.

As you build, new components such as hyperparameter tuning or model monitoring can be added easily.
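
To make the "1 dataset, 1 model, 1 prediction" shape concrete, here is a minimal sketch of a train pipeline and a predict pipeline that share one model file. Everything in it is hypothetical: `KeywordTagger` is a deliberately trivial stand-in model so the example runs with only the standard library (a scikit-learn estimator would normally take its place), and `train_pipeline`, `predict_pipeline`, and the JSON model format are illustrative choices, not a prescribed layout.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


class KeywordTagger:
    """Trivial stand-in model: picks the label whose training documents
    share the most words with the new document."""

    def __init__(self, word_counts=None):
        self.word_counts = word_counts or {}

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            self.word_counts.setdefault(label, Counter()).update(doc.lower().split())
        return self

    def predict(self, docs):
        def score(doc, label):
            # Counter returns 0 for words never seen with this label.
            return sum(self.word_counts[label][word] for word in doc.lower().split())

        return [max(self.word_counts, key=lambda label: score(doc, label)) for doc in docs]

    def save(self, path):
        path.write_text(json.dumps(self.word_counts))

    @classmethod
    def load(cls, path):
        raw = json.loads(path.read_text())
        return cls({label: Counter(counts) for label, counts in raw.items()})


def train_pipeline(docs, labels, model_path):
    """Train on the single dataset and persist the single model."""
    KeywordTagger().fit(docs, labels).save(model_path)


def predict_pipeline(docs, model_path):
    """Load the persisted model and tag new documents."""
    return KeywordTagger.load(model_path).predict(docs)


if __name__ == "__main__":
    train_docs = ["invoice payment due", "meeting agenda notes", "payment receipt invoice"]
    train_labels = ["finance", "admin", "finance"]
    with tempfile.TemporaryDirectory() as tmp:
        model_path = Path(tmp) / "model.json"
        train_pipeline(train_docs, train_labels, model_path)
        print(predict_pipeline(["invoice for payment"], model_path))
```

Keeping training and prediction as two plain functions around one saved file means each can later become its own container entry point, and extra steps such as tuning or monitoring can be added as further functions.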

Stand on the shoulders of greats

The modern machine learning landscape is built by selfless people in our open source community. If not for NumPy or scikit-learn, we would all be banging our heads on keyboards, typing MATLAB code with loops starting from 1.

Whenever possible, use the tools available in open source. You do not have to develop and train your own BERT model. PyTorch and Hugging Face provide a whole host of models, ranging from computer vision to language models, that are more than powerful enough to solve your problem.

There is an existing tool for almost everything an ML application needs, and below are some suggestions. You don’t need to cover every component with a tool, but you also don’t need to develop every tool in-house.

Model Inventory: PyTorch, Hugging Face, TensorFlow
Model Training and Evaluation: PyTorch Lightning, fastai
Data Versioning: DVC
Dependency Management: Conda, pip, Poetry
Experiment Tracking: MLflow, Kubeflow
Model Registry: MLflow

A lot of toys out there to play with. (Source)

The pro is that you get production-grade, well-maintained code, available for plug and play. The con is that maintenance can become very difficult with so many dependencies.

Do not go gentle into that good night

Now, if neither of the 2 ideas above works for you because your use case is really unique or really big, then first, I feel sorry for you, and next, I have a diagram for you.

A diagram to help build large applications. (Author)

The diagram above has helped me when no one could. Sometimes we come across products that are not just a simple use case but a platform that will include numerous and often undefined use cases. In such cases you need to bring out the big guns. This is the kind of work that requires the baggage software engineers carry, which we discussed earlier.

This design allows you to add new models and new use cases as you scale, and there is no hiding from linting or testing now. A solution at this level requires high linting and testing standards, using tools such as mypy, flake8, pydocstyle, isort, and pytest.

In the diagram above, every component is a module of code and has a complementary test module. Some guidelines to go along with the diagram are:

Do not hard-code any configuration; pass it as an argument or environment variable.
Define your own training and prediction loops to keep control.
Define a callback mechanism that allows you to control training and logging.
Log everything, and build an extensive logging system with proper formatting. Debugging can become hell otherwise.
Build extensive exception handling and implement tests for it. To avoid long test times, you can define very small datasets that are used only to test the code modules related to training or tuning.
Build pipelines to bring models, data, and training together for each use case.
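
Several of these guidelines can be sketched together in a few dozen lines. The example below is an illustration under stated assumptions, not the design from the diagram: `TrainConfig`, the `TRAIN_EPOCHS` and `TRAIN_LR` environment variables, the `Callback` classes, and `TINY_DATA` are all hypothetical names, and the "model" is a single weight fitted by plain gradient descent so that the loop stays readable.

```python
import logging
import os
from dataclasses import dataclass

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s: %(message)s", level=logging.INFO
)
logger = logging.getLogger("trainer")


@dataclass
class TrainConfig:
    epochs: int
    learning_rate: float

    @classmethod
    def from_env(cls) -> "TrainConfig":
        # Configuration comes from the environment, with defaults as a fallback.
        # TRAIN_EPOCHS and TRAIN_LR are hypothetical variable names.
        return cls(
            epochs=int(os.environ.get("TRAIN_EPOCHS", "50")),
            learning_rate=float(os.environ.get("TRAIN_LR", "0.05")),
        )


class Callback:
    """Base callback: return True from on_epoch_end to stop training."""

    def on_epoch_end(self, epoch: int, loss: float) -> bool:
        return False


class LossLogger(Callback):
    def on_epoch_end(self, epoch: int, loss: float) -> bool:
        logger.info("epoch=%d loss=%.8f", epoch, loss)
        return False


class EarlyStopping(Callback):
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def on_epoch_end(self, epoch: int, loss: float) -> bool:
        return loss < self.threshold


def train(data, config, callbacks):
    """Own training loop: fit y = w * x by gradient descent on squared error."""
    if not data:
        raise ValueError("training data must not be empty")
    w = 0.0
    for epoch in range(config.epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= config.learning_rate * grad
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        # Build a list first so every callback runs, even if one asks to stop.
        if any([cb.on_epoch_end(epoch, loss) for cb in callbacks]):
            break
    return w


# A very small dataset keeps the tests fast: y = 2x exactly.
TINY_DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def test_train_recovers_slope() -> None:
    config = TrainConfig(epochs=30, learning_rate=0.05)
    w = train(TINY_DATA, config, [LossLogger(), EarlyStopping(1e-8)])
    assert abs(w - 2.0) < 1e-3


def test_train_rejects_empty_data() -> None:
    try:
        train([], TrainConfig(epochs=1, learning_rate=0.05), [])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty data")
```

Because the loop only talks to callbacks through `on_epoch_end`, logging, early stopping, or checkpointing can be swapped per use case without touching `train` itself, and the tiny dataset lets the test module exercise the full loop in milliseconds.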

Conclusion

The ideas here come from my practical experience and do not cover every possible situation. They have helped me build solutions that I could be proud of, and I hope you can use them to solve some of your problems.

The great thing about ML is that the landscape is still very unexplored and there are endless possibilities. Find your own way of doing things and make it work.
