Fancy a Stateful Metaflow Service + UI on Google Colab?

Community Article Published October 14, 2024

The future of greater enterprise adoption of AI lies in many small, specialized models and in continuous training, that is, enabling a model to adapt and improve over time as new data becomes available.
Our objective is to help more machine learning engineers enjoy the journey of scalability, automation, performance monitoring, and continuous integration. Quite a mouthful, I hear you say. But it's not quite as boring as it may sound.

Out of frustration with complexity over the years, we decided to try and help democratize MLOps (Machine Learning Operations), and anything related to the industrialization of ML processes, ever more broadly. We're of the opinion that there shouldn't be barriers everywhere, especially between ML experimentation and ML productionization (note that I'm not talking about research here, though I secretly hope that someday industrialized research will become widespread, but that's a story for another time).

To get started, we thought that a good first topic would be frameworks. Those nice little things that make our lives easier. One of them is Metaflow. There are other solutions, such as Kubeflow or ZenML, both of which I also like.

Disclaimer: No, this isn’t going to be full of technical jargon or overwhelming instructions. If you're a tinkerer, bear with us, you might just enjoy the ride. Let's dig in.

What is Metaflow?

I’ll keep it brief. Metaflow is an open-source framework under the Apache 2 license for data science and machine learning workflows. It was created internally at Netflix in 2017 and was open-sourced in 2019. We can say it has strong roots.
For each workflow step, it offers environment isolation through the management of conda or PyPI dependencies, as well as options for remote tasks and scalable computing such as Ray, Kubernetes, or AWS Batch. You can also specify the resources to allocate to each step, such as memory size or the number of CPUs and/or GPUs. Our favorite part: it does all this with a user-friendly API that makes development and deployment easy.

It’s quick to adopt and use.
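To make that concrete, here's a minimal sketch of what a flow looks like. The flow itself, the pinned package version, and the resource figures are ours for illustration; the decorators are standard Metaflow:

```python
from metaflow import FlowSpec, step, pypi, resources


class RetrainFlow(FlowSpec):

    @step
    def start(self):
        # anything stashed on self becomes a versioned, tracked artifact
        self.examples = list(range(10))
        self.next(self.train)

    @pypi(packages={"scikit-learn": "1.5.0"})  # per-step isolated dependencies
    @resources(memory=8192, cpu=2)             # per-step resource request
    @step
    def train(self):
        # import inside the step, where the isolated environment is active
        from sklearn.linear_model import LinearRegression
        self.model_type = LinearRegression.__name__
        self.next(self.end)

    @step
    def end(self):
        print(f"trained a {self.model_type}")


if __name__ == "__main__":
    RetrainFlow()
```

Run it with python retrain_flow.py --environment=pypi run, and every step, artifact, and log line gets tracked.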

Why a Stateful (Yet Disposable) ML Framework Instance in the First Place?

After all, Google Colab, the second big player in our little story, is just a temporary runtime. So why bother tracking and logging anything there?

Well, the whole point of making stateful yet disposable ML framework instances available isn't really about tracking yet; it's all about the convenience of quick drafting. Developers like to whip up their drafts fast and appreciate the ability to start fresh, tossing aside all those messy, intermediate attempts to get something working. Who hasn't had to rerun a draft script because of a missing comma or an indentation problem? Clearly, storing training artifacts from those dry runs isn't desirable.
And seriously, why keep track of stuff like versioned dataset splits and trained model versions when you're just drafting your retraining pipeline? Storing all this in the cloud in your enterprise environment costs a fortune, especially when we're talking about thousands of dry runs, even for small teams.

In an enterprise setting, what developers and ML engineers love is the freedom to spin up a VM, install a local Metaflow there, draft their retraining pipelines over several days if needed, and then discard the whole host once they're done (and iterate at will). All that while using version control for the code of their retraining pipeline, and that’s really the only thing that matters at this stage.

Oh, and just a heads-up: the “stateful” part of our Metaflow/Colab story is indeed targeted at the hustlers and involves Google Drive, but we'll get into that in a bit.

Now that we've got that covered, let's dive in!

How is Metaflow Published?

AWS offers a simple managed version of Metaflow that is easy to deploy. Outerbounds, the Netflix spin-off now behind Metaflow, also provides tons of enterprise-grade services and support around it on AWS.

As a general statement, if you're aiming to self-host it, for instance on-prem, there are two GitHub repos to clone: Metaflow Service & Metaflow UI. The latter is home to a single Docker container, so that's pretty straightforward. The former, on the other hand, is home to a small constellation of Docker containers under the umbrella of a Docker Compose-managed network: a PostgreSQL database, the metadata service, a migration service, and the UI backend.

We wanted to:

  • circumvent the need for an AWS S3 bucket, which was not an easy ride
  • streamline the deployment with a single setup entrypoint
  • allow for Google Colab hosting, which meant overcoming the absence of Docker support there

Haters gonna hate!

What does the Google Colab Metaflow release we offer look like?

First off, we dropped the metaflow-migration container altogether, mostly because it's really huge (almost as large as all the other containers combined) and clearly overkill. We simply build the DB schema at initial start by running the SQL queries retrieved from the source code, and that's it.
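In spirit, that initialization boils down to something like the sketch below; psycopg2 stands in for whatever client you prefer, and the connection string and schema file name are placeholders, not our actual values:

```python
import psycopg2

# placeholder credentials and file name, for illustration only
conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/metaflow")
with conn, conn.cursor() as cur, open("metaflow_schema.sql") as f:
    # replay the schema statements retrieved from metaflow-service's source
    cur.execute(f.read())
conn.close()
```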

Then, instead of Docker, we rely on UDocker. This is the wizard of all things; everyone reading this article should go straight to GitHub and give them a star. It has some restrictions compared to Docker, of course, but nothing we can't overcome. One of these restrictions is that it can't build images from a Dockerfile. So, we built them ourselves and pushed the images to Docker Hub so UDocker can pull them. No biggie.
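To give a feel for the UDocker side, here's a sketch with a hypothetical image name; the CLI mirrors Docker's for pulling and running, just not for building:

```python
import subprocess

def udocker(*args):
    # UDocker is a CLI tool, so we simply shell out to it
    subprocess.run(["udocker", *args], check=True)

udocker("pull", "ourorg/metaflow-service:latest")     # image name is illustrative
udocker("create", "--name=mf-service", "ourorg/metaflow-service:latest")
udocker("run", "--volume=/data:/data", "mf-service")  # no daemon, no root needed
```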

From there, what we did was:

  • create a single bash script to do 100% of the setup, all the heavy lifting, all the tidying up. Worry-free.
  • mount a central volume where everything sits: the PostgreSQL data files, the logs for all containers, and of course the datastore.
  • in addition, implement a reverse proxy. For this, we took inspiration from @radames (here).

Remark: the default value for the MF_ROOT mount is /data, but it can be set to any local directory via the eponymous environment variable.
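From a notebook cell, overriding that mount could look like the following; the setup script name is a placeholder for our actual entrypoint:

```python
import os
import subprocess

os.environ["MF_ROOT"] = "/content/metaflow_root"  # default would be /data
# script name is a placeholder; run the single setup entrypoint with the override
subprocess.run(["bash", "setup_metaflow.sh"], check=True, env={**os.environ})
```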


It remains to make this setup "Google Colab friendly".
  • Setting export MF_ROOT=/content/drive/MyDrive/Metaflow would be ideal to establish statefulness, were it not for one last tiny detail. PostgreSQL (the database technology employed by Metaflow) uses Write-Ahead Logging (WAL) to write to disk, and WAL isn't supported by Google Drive. So, for statefulness, we had to get creative and add the optional PGDATA_DIR environment variable, which we set to /content/pgdata.
    Now, to maintain statefulness for the database as well as the rest, we monitor that location via inotifywait (GPL-2.0 licensed) and synchronize it via rsync (GNU GPL licensed, ships with Ubuntu); see the first sketch after this list.
  • To make this Stateful Metaflow Service+UI available to another Google Colab notebook, one with a GPU runtime for instance, all that was left was to expose that 7860 port (externally to the [CPU] Google Colab notebook hosting Metaflow). For that, we turned to Cloudflare Quick Tunnels; see the second sketch after this list.
    They're great, free to use, and don't even require an account. A dream come true, if you ask me.
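First, the synchronization loop. Here's the idea in Python form; the paths match what we described above, while the Drive target directory name and the loop itself are our illustration:

```python
import subprocess

PGDATA_DIR = "/content/pgdata"                        # local disk, WAL-friendly
DRIVE_DIR = "/content/drive/MyDrive/Metaflow/pgdata"  # snapshot target (name assumed)

while True:
    # block until anything changes below PGDATA_DIR, recursively
    subprocess.run(
        ["inotifywait", "-r", "-e", "modify,create,delete,move", PGDATA_DIR],
        check=False,
    )
    # then mirror the whole tree to Google Drive
    subprocess.run(["rsync", "-a", "--delete", f"{PGDATA_DIR}/", DRIVE_DIR], check=False)
```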
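Second, the tunnel. A sketch of starting a Quick Tunnel in front of port 7860 and fishing the public URL out of cloudflared's log output:

```python
import subprocess

proc = subprocess.Popen(
    ["cloudflared", "tunnel", "--url", "http://localhost:7860"],
    stderr=subprocess.PIPE, text=True,
)
for line in proc.stderr:                 # cloudflared logs to stderr
    if "trycloudflare.com" in line:      # quick tunnels get a *.trycloudflare.com URL
        print(line.strip())
        break
```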

The result looks like this:

We packaged it all into two Google Colab notebooks:

  • metaflow_service.ipynb, where the steps are laid out for you to execute in order to spin up your personal Stateful Metaflow Service+UI:
    • mount your Google Drive
    • initiate the inotifywait process
    • generate the URL of your Cloudflare tunnel for external access
    • download and run our install bash script
    You can start from scratch or restart a runtime from an existing Google Drive snapshot. Easy-peasy.
  • remote_local_metaflow.ipynb, where everything is set up for you to run the Metaflow "Hello World" flow, which tracks and logs everything into your personal Stateful Metaflow Service+UI.
    Ideal for starting to draft your retraining pipeline for an ML model that may need the GPU the Google Colab free tier makes available. With that second Google Colab notebook, you're in the clear; a small configuration sketch follows this list.
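For the curious, pointing a Metaflow client at a remote service essentially comes down to a couple of standard Metaflow environment variables. A minimal sketch with a placeholder URL; the notebook handles the real configuration for you:

```python
import os

# placeholder URL: use the one printed by the service notebook's Cloudflare tunnel
os.environ["METAFLOW_SERVICE_URL"] = "https://your-tunnel.trycloudflare.com"
os.environ["METAFLOW_DEFAULT_METADATA"] = "service"  # record runs in the remote metadata service
```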

Last Words

We've seen a bit about the standard Metaflow architecture and how we can install and use it without the need for an S3 bucket. How cool is that?
You can find a condensed version of this article, plus a little extra with yet another way to install and use it on your local machine, on this GitHub page.

Like all good things, building takes a little time. This is just the preamble to something really cool I've been building over the summer, which I'll post about here very soon. So send good vibes and stay tuned!