ML Model Training Is An ETL
The core activity performed by a machine learning engineer or data scientist to add value to your business is model training. This is the process of determining the parameters and data structures of an ML model by optimizing the model’s performance against a known data set and success criteria.
The inputs of model training can be any structured or unstructured data that a particular training algorithm can ingest and convert into a format suitable for this optimization. The output is a parametric or non-parametric data structure: a representation of the processed data that can be used for prediction on new data instances. A key insight is that this process is not fundamentally different from any other type of ETL. Consider a few examples from both non-ML and ML workloads: building a database table from raw event data, converting files into a new format, or training a model on a labeled data set.
In each case there is a source of data, a transformation and an output artifact that a downstream use case may depend on. It’s easy to see how a task that trains an ML model acts just like any other transformation step. The only wrinkle is that the output artifact is a data structure that corresponds to the prediction algorithm of the model instead of a more ubiquitous form like a database table or a new data format. The output in the ML model training case could be as simple as an index inside a familiar database system or as exotic as a language-specific serialization format whose values are loaded into a C++ or Python object to define custom business logic or prediction algorithms. Examples range from modern neural network frameworks and formats to pipeline-based object models such as those in scikit-learn.
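To make the analogy concrete, here is a minimal sketch of a training job written explicitly as extract, transform and load steps. The file paths, the CSV schema (feature columns plus a "label" column) and the choice of a scikit-learn estimator are illustrative assumptions, not part of any particular stack:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression


def extract(path: str) -> pd.DataFrame:
    # Extract: read the training data from some upstream source.
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> LogisticRegression:
    # Transform: fit the model; the learned parameters are the artifact.
    X, y = df.drop(columns=["label"]), df["label"]
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model


def load(model: LogisticRegression, artifact_path: str) -> None:
    # Load: serialize the fitted estimator, just like any other ETL output.
    joblib.dump(model, artifact_path)


if __name__ == "__main__":
    load(transform(extract("train.csv")), "model.joblib")
```

The shape is identical to any other batch transformation job; only the serialized artifact and the libraries involved are specific to ML.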
There are several reasons why the concept of ETL is a useful way to describe machine learning model training.
It’s an established DevOps best practice to automate ETLs and monitor uptime, resource usage, runtime and success rate. This helps to bridge the gap between traditional observability tooling (things like the SRE “four golden signals” visualized or scraped with tools like Grafana and Prometheus) and diagnostics for machine learning, which may require much richer visualizations (e.g. TensorBoard) to assess training convergence, overfitting, accuracy and model explanations. If you can fit these needs into the general framework of ETL observability, you are more likely to get help from SRE and DevOps engineers in addition to leveraging the best practices of existing in-house tools. Just as the four golden signals need to be monitored for many kinds of ETL workloads, the analogous diagnostic results that matter for machine learning need to be monitored and visualized for the same reasons. The framework of ETLs helps all parties (ML teams, SRE teams, operations engineers) get on the same page about the priority and support needed to solve these problems for ML model training tasks.
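As one hedged illustration of folding training diagnostics into the same observability plumbing used for other ETLs, a training job could push a handful of metrics to a Prometheus Pushgateway for Grafana to chart. The gateway address, job name, metric names and placeholder accuracy value below are assumptions, not a prescribed setup:

```python
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# One registry per run so only this job's metrics are pushed.
registry = CollectorRegistry()
runtime_seconds = Gauge(
    "training_runtime_seconds", "Wall-clock runtime of the training job", registry=registry
)
succeeded = Gauge(
    "training_success", "1 if the training job completed, else 0", registry=registry
)
val_accuracy = Gauge(
    "training_validation_accuracy", "Validation accuracy of the trained model", registry=registry
)

start = time.time()
ok, accuracy = 0, 0.0
try:
    # ... run the actual training task and compute a validation score here ...
    accuracy = 0.93  # placeholder value for the sketch
    ok = 1
finally:
    runtime_seconds.set(time.time() - start)
    succeeded.set(ok)
    val_accuracy.set(accuracy)
    # Hypothetical Pushgateway address; Prometheus scrapes it like any other ETL job.
    push_to_gateway("pushgateway.internal:9091", job="model_training", registry=registry)
```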
ETLs have contracts, resources and business use cases. Units that plug into ETL systems can more easily rely on versioning, compliance resources, shared responsibilities and enforced obligations on the upstream dependency steps of a pipeline. If “model training” is mistakenly depicted as some kind of prima donna special case or exotic computation, there’s a risk it won’t get funded by stakeholders, or that general resourcing and maintenance for model training won’t be a shared concern of all the teams that own ETL infrastructure. It can result in a “fend for yourself” afterthought scenario, where responsibility for creating and maintaining complex end-to-end ETL infrastructure, along with the associated observability and compliance work, gets foisted onto ML or data science teams that are not staffed to realistically support it.
Depicting ML model training as “just a special kind of ETL” solves problems of communication up the food chain. Directors and engineering executives understand ETLs and can reason about them in terms of business cases, budget justification and engineering objectives. A request to spend engineering time on ETL requirements is much less likely to draw pushback, or to require extraneous salesmanship, than a pitch full of technical jargon about model training or overfitting. In a perfect world this wouldn’t matter: support and business justification for improving model training tasks ought to be an uncontroversial, everyday thing, no different than enhancing a database sharding system or adding better dashboards to a user clickstream pipeline. But such is life, especially in organizations with poor machine learning culture. If framing your model training task in the tools and lingo of ETLs will win you more support, that’s a no-brainer.
Many ML technologies, both tried and true older systems and new paradigms making waves, can be understood in terms of the way they map ML concepts to ETL concepts. Being able to fluidly shift between the two ways of conceptualizing model training tasks is helpful for communication as well as architecture or strategic decision making. A few examples are discussed below.
Kubeflow: schedule containerized training programs on compute resources managed by Kubernetes. Any way of mounting or fetching data to a Kubernetes pod can serve as the “extract” step. The executed container program is the “transform,” and storing the resulting artifact (along with any evaluation or visualization artifacts) to a cloud storage bucket or volume mount is the “load” step.
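As a rough sketch of what this mapping looks like in practice, a Kubeflow Pipelines (kfp v2) component can wrap the training step; the base image, package list, data URI and artifact handling below are illustrative assumptions rather than anything Kubeflow prescribes:

```python
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11", packages_to_install=["pandas", "scikit-learn", "joblib"])
def train_model(data: dsl.Input[dsl.Dataset], model: dsl.Output[dsl.Model]):
    # Everything inside the container is the "transform" of the ETL.
    import joblib
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv(data.path)                 # extract: data fetched into the pod
    clf = LogisticRegression(max_iter=1000)
    clf.fit(df.drop(columns=["label"]), df["label"])
    joblib.dump(clf, model.path)                # load: artifact written to pipeline storage


@dsl.pipeline(name="training-as-etl")
def training_pipeline(data_uri: str):
    # Importer registers an existing object-store file as the input dataset artifact.
    raw = dsl.importer(artifact_uri=data_uri, artifact_class=dsl.Dataset, reimport=False)
    train_model(data=raw.output)


if __name__ == "__main__":
    compiler.Compiler().compile(training_pipeline, package_path="training_pipeline.yaml")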
Airflow or Luigi: virtually the same as Kubeflow, except they don’t make assumptions about containerization or the compute resource orchestrator they run on top of, which could be Kubernetes or various other tools for mapping tasks to compute resources. These tools treat the directed acyclic graph (DAG) as a first-class concept for defining and manipulating the dependency steps of an ETL pipeline. Model training would just be one step in the process, no different than anything else apart from possibly more exotic resource needs (e.g. GPUs or large amounts of RAM) and runtime dependencies (various ML libraries). In many ways, Kubeflow is just an opinionated configuration layer on top of more traditional DAG concepts like these (with the added details that come from assuming Kubernetes).
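For instance, a minimal Airflow DAG (using the TaskFlow API from Airflow 2.x) can express training as just another task between an extract and a load; the schedule, paths and task bodies here are illustrative assumptions:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def model_training_etl():

    @task
    def extract() -> str:
        # Pull training data from an upstream source; return its location.
        return "/tmp/train.csv"

    @task
    def train(data_path: str) -> str:
        # The training step: just another transform, apart from its
        # library dependencies and hardware needs.
        import joblib
        import pandas as pd
        from sklearn.linear_model import LogisticRegression

        df = pd.read_csv(data_path)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(df.drop(columns=["label"]), df["label"])
        model_path = "/tmp/model.joblib"
        joblib.dump(clf, model_path)
        return model_path

    @task
    def load(model_path: str) -> None:
        # Publish the serialized model wherever downstream consumers expect it.
        print(f"publishing {model_path} to artifact storage")

    load(train(extract()))


model_training_etl()
```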
Slurm: yes, even a distributed scientific computing cluster scheduler can be viewed as an ETL system. Slurm manages a cluster of compute resources (in principle this could run via Kubernetes but usually does not). Once scheduled, a task must perform (serially or in parallel) an extraction that loads data to cluster nodes followed by the logic of the distributed program as a transform. Some type of gather operation consolidates and loads the result artifacts for storage and further processing. Because Slurm provides capabilities for distributed programs with interprocess communication needs, it is a good choice for complex distributed training tasks that don’t easily fit into more structured execution modes like embarrassingly parallel or MapReduce.
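One rough way to picture this, assuming hypothetical batch scripts for each stage, is submitting chained Slurm jobs so the extract, train and gather steps run in dependency order:

```python
import subprocess


def sbatch(*args: str) -> str:
    # sbatch prints "Submitted batch job <id>"; return the id so later jobs can depend on it.
    out = subprocess.run(["sbatch", *args], check=True, capture_output=True, text=True)
    return out.stdout.strip().split()[-1]


# Hypothetical scripts: stage data onto the cluster, run distributed training, gather results.
extract_id = sbatch("--job-name=extract", "extract_data.sh")
train_id = sbatch(
    "--job-name=train", "--gres=gpu:1", f"--dependency=afterok:{extract_id}", "train_model.sh"
)
sbatch("--job-name=gather", f"--dependency=afterok:{train_id}", "gather_results.sh")
```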
MapReduce: similar to Slurm but with a more limited, predefined model of parallelism via the mapper and reducer paradigms.
There are many more examples, as well as complementary tools (e.g. MLflow, Weights & Biases, Neptune.ai) that provide APIs or libraries for tracking, model standardization and object storage that can be interlaced with any of the compute orchestration and scheduling systems discussed above. One beneficial exercise for ML managers is to write down an inventory of your current model training tech stack along with diagrams of the sequence of dependency steps involved in model training. How can you map these components to concepts from ETL systems? If you are still relying on manually executed training tasks (e.g. on a developer laptop or in a Jupyter notebook), how can you connect the dots between that setup and an automated ETL version that takes away the burden of manual execution and adds better observability and reliability to your model training workflows?
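Before moving on, here is a small example of how one of the tracking libraries mentioned above (MLflow) might interlace with the training step regardless of which orchestrator runs it; the tracking URI, experiment name and logged values are assumptions for the sketch:

```python
import mlflow

# Hypothetical tracking server and experiment name.
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("training-as-etl")

with mlflow.start_run(run_name="daily-train"):
    mlflow.log_param("model_type", "logistic_regression")
    # ... run the training task here, however it is orchestrated ...
    mlflow.log_metric("validation_accuracy", 0.93)  # placeholder value
    # mlflow.sklearn.log_model(clf, "model")  # optionally store the artifact with the run
```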
The nice thing about this essay is that the title says it all. You can walk away with nothing besides “ML model training is an ETL” and it will serve you well. By using the framework of ETLs you can better lobby for SRE and DevOps support and more efficiently connect ML model training concerns to directors, executives and engineering objectives. Finally, as you evaluate open-source and vendor solutions in the increasingly crowded space of model tracking and experiment management, you can map their concepts to more standard ETL concerns. This enables more effective choices given the ETL solutions available to you and facilitates better collaboration with the other parties that own and maintain ETL tools in your organization.