Machine Learning: MLflow Concepts

MLflow: A Platform for Managing the Machine Learning Lifecycle

Introduction

MLflow is an open-source platform that aims to simplify the ML lifecycle and streamline the collaboration between data scientists and engineers. MLflow provides four main components:

MLflow Tracking:

A service that records and queries experiments, including code, data, configuration, and results.

MLflow Tracking lets you record and query experiments, including code, data, configuration, and results. With it you can reproduce experiments, compare models, tune hyperparameters, and collaborate with your team, all from a centralized location for managing experiments.

MLflow stores and logs all of an experiment's activity in a concept called a run. By default, MLflow runs are recorded to files under a local ./mlruns directory, created in the current working directory. This location can be changed by setting the MLFLOW_TRACKING_URI environment variable. The recorded data includes the parameters, metrics, and artifacts that were logged for each run.

Runs can also be recorded in SQLAlchemy-compatible databases such as SQLite, PostgreSQL, MySQL, and Microsoft SQL Server.

Instead of logging runs locally, you can also set up a remote tracking server.

MLflow remote tracking server is a service that allows you to store and manage MLflow runs in a centralized location. This can be useful for organizations that need to track the progress of ML experiments across multiple teams or projects.

To log to a remote tracking server, set the MLFLOW_TRACKING_URI environment variable to the server's URI, or call mlflow.set_tracking_uri() at the start of your program.


In MLflow you can log various kinds of data (parameters, metrics, artifacts, models) with methods like:

mlflow.log_param: Logs a parameter under the given key to the currently active run. Parameter values are stored as strings, so you can pass strings, numbers, or other simple values. Logged parameters let you reproduce experiments, debug models, and understand how different settings affect model performance.

import mlflow

with mlflow.start_run() as run:
    mlflow.log_param("param_name", "param_value")

mlflow.log_metric: Logs a numeric metric under the given key to the currently active run. Metric values must be numbers (floats or integers), and the same metric can be logged repeatedly over the course of a run to track progress over time. Metrics let you compare different models, identify the best model for your application, and follow how your models improve.

import mlflow

with mlflow.start_run() as run:
    mlflow.log_metric("metric_name", "metric_value")

mlflow.log_artifact: Logs a local file (for example a plot, a data file, or a serialized model) as an artifact of the currently active run. Logged artifacts help you reproduce experiments and inspect the outputs your runs produce.

import mlflow

with mlflow.start_run() as run:
    mlflow.log_artifact("my_file.txt")

mlflow.log_artifacts: Logs all the files in a local directory as artifacts of the currently active run. To do this, pass the directory path as the argument to the function.

import mlflow

with mlflow.start_run() as run:
    mlflow.log_artifacts("my_directory")

In addition, MLflow provides autologging (mlflow.autolog()) for many machine learning frameworks, including TensorFlow, scikit-learn, Keras, XGBoost, Spark, fastai, and more.

After logging all this information you can visualize it in the MLflow UI, which lets you see all your experiments, compare them, and manage your models and projects.

MLflow Projects:

MLflow Projects is a simple and flexible way to package your data science code in a format that can be easily run on different platforms and shared with others. With MLflow Projects, you can package your code, data, and environment in a portable and reproducible way. You can also define parameters and dependencies and run your code in different environments, such as a local machine, a remote cluster, or a cloud service. MLflow Projects makes it easy to organize and manage your code, collaborate with others, and reproduce your results.

MLflow understands and runs projects based on a few conventions:

  • The project name is the name of the project's root directory

  • If a conda.yaml file is present, MLflow uses it as the environment configuration; otherwise MLflow uses the latest Python environment available in Conda

  • Any Python or bash file can be used as an entry point, which you specify when running the project

You can take more control by adding an MLproject file.

The MLproject file is written in YAML. Its properties are:

  • name: The name of the project

  • entry_points: The commands that can be run in your project

  • environment: The description of the environment to use. MLflow supports several types of environments:

    • Conda

    • Virtualenv

    • Docker containers

    • The system environment
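For illustration, a minimal MLproject file following these conventions might look like the following (the project name, parameter, and script name are hypothetical):

```yaml
name: my_project

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.1}
    command: "python train.py --alpha {alpha}"
```

With such a file in place, the project can be launched with `mlflow run .` and the `alpha` parameter overridden from the command line.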

MLflow Models:

A format for packaging models to make them easy to deploy across diverse environments.

The storage format for an MLflow model is a directory, created when you save a model (for example with mlflow.sklearn.save_model(model, "model_path")). This directory contains arbitrary files along with an MLmodel file that can define multiple flavors.

The flavors concept makes MLflow Models a powerful tool for packaging and deploying machine learning models. By using flavors, MLflow can work with models from a variety of different libraries, making it possible to write tools that can work with models from any library.

Here are some of the benefits of using flavors:

  • Portability: Flavors make it possible to move models between different libraries and tools. This makes it easier to develop and deploy models.

  • Reusability: Flavors make it possible to reuse models in different applications. This can save time and effort.

  • Interoperability: Flavors make it possible for different tools to communicate with each other. This can improve the overall ML workflow.

    Here are some of the flavors that MLflow supports:

    • Python function: This flavor defines how to run the model as a Python function.

    • PyTorch: This flavor defines how to run the model as a PyTorch model.

    • TensorFlow: This flavor defines how to run the model as a TensorFlow model.

    • XGBoost: This flavor defines how to run the model as an XGBoost model.

    • CatBoost: This flavor defines how to run the model as a CatBoost model.

    • LightGBM: This flavor defines how to run the model as a LightGBM model.

    • ONNX: This flavor defines how to run the model as an ONNX model.

Apart from flavors, the MLmodel file also contains:

  • time_created: The date and time when the model was created.

  • run_id: The ID of the run that created the model.

  • signature: The model signature is in JSON format.

  • input_example: A reference to an artifact with an input example.

  • databricks_runtime: The Databricks runtime version and type, if the model was trained in a Databricks notebook or job.

  • mlflow_version: The version of MLflow that was used to log the model.

    Here is an example of an MLmodel YAML file:

      name: my_model
      flavors:
        python_function:
          implementation: my_model.predict
      time_created: 2023-05-20T06:23:59.999999Z
      run_id: 1234567890
      signature: '{ "inputs": [ { "name": "x", "type": "float" } ], "outputs": [ { "name": "y", "type": "float" } ] }'
      input_example: '{"x": 1.0}'
      databricks_runtime: '10.4-ML'
      mlflow_version: 1.14.0
    
When working with ML models, it's essential to know their basic functional properties, such as what inputs they expect and what outputs they produce. MLflow models can include additional metadata about their inputs and outputs, which can be used by downstream tooling.

    The MLflow model signature is a JSON object that defines the schema of a model's inputs and outputs. The signature can be used to validate inputs and outputs and to ensure that the model can be used in different applications.

    The signature object has two properties:

    • inputs: A list of input objects. Each input object has a name and a type. The name is a string that identifies the input. The type is a string that specifies the data type of the input.

    • outputs: A list of output objects. Each output object has a name and a type. The name is a string that identifies the output. The type is a string that specifies the data type of the output.

Here is an example of an MLflow model signature:

    {
      "inputs": [
        {
          "name": "x",
          "type": "float"
        }
      ],
      "outputs": [
        {
          "name": "y",
          "type": "float"
        }
      ]
    }

The MLflow model signature can be used to validate inputs and outputs. For example, if you try to predict using a model with a signature that does not match the input data, MLflow will raise an error.

The MLflow model signature can also be used to ensure that the model can be used in different applications. For example, if you want to use a model in a REST API, you can use the signature to validate the input data and to generate the output data.

MLflow Registry:

A central repository for storing and managing model versions and metadata.

    The MLflow Model Registry is a centralized repository for storing, managing, and serving ML models. It provides a variety of features that can help you manage the full lifecycle of your ML models, including:

    • Model versioning: The Model Registry allows you to track different versions of your models, including the date and time they were created, the data they were trained on, and the metrics they were evaluated on.

    • Model lineage: The Model Registry allows you to track the lineage of your models, which is the history of how they were created and trained. This information can be helpful for debugging and understanding the performance of your models.

    • Model governance: The Model Registry allows you to control who can access and use your models. This can help you protect your models from unauthorized access and ensure that they are only used for authorized purposes.

    • Model serving: The Model Registry allows you to serve your models in production. This means that you can make your models available to other users and applications without having to worry about the underlying infrastructure.