What are ML Pipelines and how do they come to be?
This is Rinat Abdullin with a newsletter on the intersection of programming, ML and product development.
Ahmed asked a question about Machine Learning (ML) pipelines:
> I'm interested in the subject, nothing in particular but wanted to know if you documented your research and finding and your conclusions, what worked and what not etc, hopefully there will be a blog series or newsletter series covering your findings unless I missed them ?
Here is what I've learned on this topic in the past 4 years.
Machine Learning focuses on creating ML models that can automate specific tasks. It could be anything: detecting fraudulent transactions, identifying parking lots on satellite images, translating text from one language to another, or suggesting a price to deliver cargo between two locations.
> Humans can do some of these tasks better than machines. But machines can do these tasks fast, non-stop and without getting bored.
As a programmer, you can think of an ML model as a function that takes arguments and returns an answer. Except instead of being coded by a human, it is a black box derived from a lot of data. The ML model itself is just a lot of data wrapped by code that deserialises that data into a function (with a lot of variables inside).
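To make the analogy concrete, here is a minimal sketch of that idea: the "model" is nothing but serialised data, plus a thin bit of code that turns the blob back into a callable function. The numbers and field names are made up for illustration.

```python
import pickle

# A trained "model" is really just data: here, learned coefficients
# for pricing a delivery by distance. (Hypothetical numbers.)
weights = {"intercept": 1.5, "per_km": 0.8}

# Serialise the data into a binary blob...
blob = pickle.dumps(weights)

# ...and keep the "thin code" that deserialises it back into a function.
def load_model(blob):
    params = pickle.loads(blob)
    def predict(distance_km):
        # The function a human never wrote by hand.
        return params["intercept"] + params["per_km"] * distance_km
    return predict

price = load_model(blob)
print(price(100))  # a linear guess at the price of a 100 km delivery
```

Real models carry millions of such variables instead of two, but the shape is the same: data in, function out.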
This analogy makes ML Pipelines similar to another form of pipelines: Continuous Integration/Delivery pipelines in software. Both compile sources into some form of executable artefacts. Except:
- In software we mostly work with code, while in ML we work mostly with data
- Software can be tested well, build either passes or fails. ML models are always inaccurate to some degree.
- Data can be wrong, it can also get stale. Models that are derived from that data - can also become stale or wrong.
> Here is an example of models becoming stale. Around 2020 all models for suggesting price of cargo delivery between two locations started returning bad responses. Why? Brexit! Businesses in UK decided to start stocking up on supplies before borders went up. Unprecedented demand drove the prices up. Models that were trained on the historical data remained clueless.
So ML Pipelines are a sequence of transformations that convert data between different representations until we end up with a trained and verified ML model at the end.
Remember: an ML model is just a lot of data with a thin layer of code that describes how to load and use it.
Here is what an ML Pipeline could look like in a simple form:
1. Download data from some source, usually a set of rows or records - a dataset.
2. Convert data to a format suitable for training: select features (arguments for the model), remove noise and bad records. Some fields in that dataset will be inputs, and others - the desired outputs that we want to predict.
3. Define the model format (smell the wind and say “this big equation with a lot of variables will get the job done”) and train the model on data (tweak formula variables in semi-random way until the model starts accurately guessing results given inputs).
4. Package the model into a durable format (e.g. a Docker container with some binary blob).
5. Optionally, deploy the model as a service with an API.
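The five steps above can be sketched as a single Python script, using only the standard library. The dataset, the "training" method (a plain average) and the file name are all hypothetical stand-ins for the real thing.

```python
import pickle
import statistics

# 1. "Download" data: a toy dataset of cargo deliveries (made-up records).
raw = [
    {"distance_km": 100, "price": 80},
    {"distance_km": 200, "price": 165},
    {"distance_km": -5,  "price": 40},   # a bad record: negative distance
    {"distance_km": 400, "price": 330},
]

# 2. Clean it up and pick features: distance is the input, price the target.
clean = [r for r in raw if r["distance_km"] > 0]

# 3. "Train" the simplest possible model: the average price per km.
rate = statistics.mean(r["price"] / r["distance_km"] for r in clean)

# 4. Package the model into a durable format (a binary blob on disk).
with open("model.pkl", "wb") as f:
    pickle.dump({"rate": rate}, f)

# 5. Later, an inference step loads the blob and answers requests.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
print(model["rate"] * 300)  # predicted price for a 300 km delivery
```

Real pipelines replace each step with heavier machinery (data warehouses, feature stores, gradient descent, Docker images), but the sequence of transformations stays the same.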
Normally these transformations are codified as workflows (workflow as code), versioned and deployed. In simpler projects one can implement them with bash or Python scripts. Larger projects and teams might need something more documented and understandable than a custom script.
Long story short: ML Pipelines are codified workflows that ingest data, transform it and derive reusable models from it. Their goal is similar to CI/CD pipelines in software engineering: automate, ensure repeatability and scale processes. Implementation details differ from CI/CD because ML Pipelines work mostly with data.
How do ML Pipelines come into existence?
ML projects can start with prototypes. Researchers write scripts that gather data, clean it up and train models. Models can then be manually packaged and uploaded to some API.
> Building ML-driven products is also another area where normal software development heuristics don't really apply. There is too much R&D and uncertainty. On the bright side - product development is well suited for being data-driven.
This normally works for a team of one researcher and only on their machine. All dependencies are implicit and the pipeline process exists only as a sequence of steps to be executed manually.
As soon as the ML prototype gains traction and needs to become a product, this setup will no longer work:
- You need to introduce more team members to collaborate. Pipelines have to work on their machines as well.
- There should be a human-independent way to train and deploy models, at least in two flavours: DEV for the new stuff and PROD for the stable, customer-facing one.
- You need a way to quickly roll back to a previous stable version, if things go bad.
- Observability and telemetry to track how models are working out in real time and how much nonsense they are spitting out.
- Use as much standard tooling as possible, so that it is easier to bring in new team members.
A common way to get this done is to package everything as Python projects (most ML research is done in Python) with interactive documentation in Jupyter notebooks. Then add an additional layer of packaging in the form of Dockerfiles, to capture system dependencies (Python version, CUDA libraries, gcc, zip, standard C++ libraries etc.) and entry points.
> The Python ecosystem, being vibrant, doesn't make standard packaging an easy thing. There is too much variety. Says the person who spent the last few weeks digging through the migration from setup.py to pyproject.toml.
These containers become a platform-independent way to execute data preparation and training logic. There will also be an inference container that loads a trained model and uses it to answer requests.
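The heart of such an inference container can be sketched as a request handler: load the trained blob once at startup, then reuse it for every request. The handler name, the request shape and the hardcoded model are assumptions for illustration, not a real framework's API.

```python
import json

# Normally unpickled at container startup from the packaged model blob;
# hardcoded here to keep the sketch self-contained.
MODEL = {"rate": 0.8}

def handle_request(body: str) -> str:
    """Take a JSON request body, return a JSON response with the prediction."""
    request = json.loads(body)
    price = MODEL["rate"] * request["distance_km"]
    return json.dumps({"price": price})

print(handle_request('{"distance_km": 250}'))  # → {"price": 200.0}
```

Wrapping such a handler in an HTTP server and a Docker image is what turns "a file with weights" into the deployable service from step 5 above.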
Then the MLOps team would add a workflow orchestrator tool to run these containers in the proper order. For example, Kubeflow is one of the most widely used tools for setting up ML workflows. Google Cloud adopted it as “Vertex AI Pipelines”.
> From the programming perspective, things get really messy at this point! At some point I have discovered a Kubeflow component definition in YAML. This YAML defined how to generate a JSON object by concatenating strings. Resulting JSON was to be passed as a command-line argument to a container running in Kubernetes. There was even a weak type system on top of that!
As the name suggests, the entire stack runs on Kubernetes, which sends the overall complexity through the roof. Data scientists get to live with that.
Obviously, all of that creates friction and slows down the original research. So while introducing ML Pipelines, it is important to pick system boundaries that support team collaboration while keeping the original research agile. Do it wrong, and researchers will either flee to another company or the product will die out. Do it right, and teams will be able to deliver a lot more value.
I'm already past the 1000 words mark in this newsletter. I have a few more words on this subject, if you are interested. What do you think?
Until next week!
With best regards,