Notebook-centric ML Workflow

Table of Contents

1. Notebook-centric ML Workflow

1.1. Why (and how) to put notebooks in production

We should also look for libraries that can be used both in notebooks and on a server.
dash/plotly, for example, is fine in that respect.

1.1.1. Side effects

  1. File size. If the notebook includes images (base64), the file size increases considerably, quickly blowing up git repositories.
  2. Versioning. .ipynb files are JSON files, so git diff outputs an illegible comparison between versions, making code reviews difficult. → use paired notebooks to keep the output in a separate file (or nbdime if you really want to work with plain notebooks)
  3. Hidden state / nonlinearity. Since users can execute code in an arbitrary order, this may yield broken code whose recorded output doesn’t match what you get when running the cells sequentially.
  4. Testing/Debugging. Given notebooks’ interactive nature, lines of code grow fast. Soon enough, you end up with a notebook of a hundred cells that is hard to test and debug.
    Arbitrary cell execution eases data exploration, but its overuse often produces broken, irreproducible code.
    1. Notebooks evolve organically, and once they grow large enough, there are so many variables involved that it’s hard to reason about the execution flow.
    2. Functions defined inside the notebook cannot be unit tested (although this is changing) because we cannot easily import functions defined in a .ipynb file into a testing module. We may decide to define functions in a .py file and import them into the notebook to fix this (sketched below).
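    A minimal sketch of that last point (module and function names are hypothetical): the function lives in a plain .py file, so both the notebook and a pytest module can import it.
      # clean.py: hypothetical module, importable from the notebook and from the test suite
      import pandas as pd

      def fill_missing_ages(df: pd.DataFrame) -> pd.DataFrame:
          """Replace missing ages with the column median."""
          return df.assign(age=df["age"].fillna(df["age"].median()))

      # test_clean.py: runnable with pytest, no notebook involved
      import pandas as pd
      from clean import fill_missing_ages

      def test_fill_missing_ages():
          df = pd.DataFrame({"age": [20.0, None, 40.0]})
          assert fill_missing_ages(df)["age"].isna().sum() == 0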

1.1.2. The problem with “prototype in notebooks, then refactor for production”

Given such problems, it’s natural that teams disallow the use of notebooks in production code. A common practice is to use notebooks for prototyping and refactor the code into modules, functions, and scripts for deployment. Often, data scientists are in charge of prototyping models; then, engineers take over, clean up the code and deploy. Refactoring notebooks is a painful, slow, and error-prone process that creates friction and frustration for data scientists and engineers.

1.1.2.1. Lack of tests dramatically increases the difficulty of the refactoring process
1.1.2.2. Lack of sync between the notebook version and the refactored .py version ⇒ copy-paste between them

After the refactoring process occurs, even the data scientist who authored the original code will have difficulty navigating through the refactored version that engineers deployed. Even worse, since the code is not in a notebook anymore, they can no longer execute it interactively.
What ends up happening is a lot of copy-pasting between production code and a new development notebook.

1.1.3. Workflow summary

  1. Use scripts as notebooks.
  2. Smoke test notebooks with a data sample on each git push.
  3. Break down the analysis into multiple small notebooks.
  4. Package projects (i.e., add a setup.py file).
  5. Abstract pieces of logic into functions outside the notebooks for code re-use and unit testing.

1.1.3.1. Switching the underlying format: Scripts as notebooks

As described earlier, the .ipynb format doesn’t play nicely with git. However, Jupyter’s architecture allows the use of alternative file formats. Jupytext enables users to open .py files as notebooks.
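For instance, with Jupytext's percent format, a notebook saved as a .py file is just plain Python with cell markers, which diffs cleanly (the contents below are a hypothetical example):

  # %% [markdown]
  # # Load and clean the data

  # %%
  import pandas as pd

  df = pd.read_csv("data/users.csv")  # hypothetical path

  # %%
  df = df.dropna(subset=["user_id"])
  df.describe()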

1.1.3.2. Hidden state, testing, and debugging

We want to achieve the following:

  1. Test notebooks on every code change.
  2. Split logic into multiple, small notebooks.
  3. Modularize data transformations and test them separately.
  1. (Smoke) Testing notebooks

    Testing data processing code is challenging on its own because of runtime (a full run may take hours).
    We can detect the most common errors with small amounts of data: missing columns, wrong array shapes, incompatible data types, etc.
    Hence, an effective strategy is to run all notebooks with a data sample on every push to eliminate the chance of having broken notebooks.
    Nevertheless, since we are not testing the notebooks’ output, this isn’t fully robust, but we can incorporate more complete tests as the project matures.
    To automate notebook execution, we can use papermill or nbclient. Note that since we switched the underlying format, we’ll have to use jupytext to convert .py files back to .ipynb and then execute them.
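    A minimal sketch of that conversion-plus-execution step (file names are hypothetical; assumes jupytext and papermill are installed and that the notebook has a cell tagged “parameters”):
      import jupytext
      import papermill as pm

      # convert the paired .py script back into a .ipynb file
      nb = jupytext.read("tasks/clean.py")
      jupytext.write(nb, "tasks/clean.ipynb")

      # execute it on a small sample by injecting a parameter the notebook reads
      pm.execute_notebook(
          "tasks/clean.ipynb",
          "output/clean.ipynb",
          parameters={"sample": True},
      )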

  2. Split logic into multiple, small notebooks

    Notebook modularization is where most tools fall short. While many workflow orchestrators support notebook execution, they do not make the process enjoyable.

    1. Ploomber

      Ploomber allows data scientists to create a data pipeline from multiple notebooks in two steps:

      1. List the notebooks in a pipeline.yaml file.
      2. Declare notebook dependencies by referencing other notebooks by their name.

      To establish the execution order, users only have to declare which notebooks must execute before the one they are working on:
      upstream = ['clean_users', 'clean_activity']
      By referencing other tasks through the upstream variable, we also determine the execution order, allowing Ploomber to assemble the notebooks into a pipeline (a minimal sketch follows the list of benefits below).
      Notebook modularization has many benefits:

      • run parts in isolation for debugging
      • add integration tests to check the integrity of each output
      • parametrize the pipeline to run with different configurations (development, staging, production)
      • parallelize independent tasks
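
      A minimal sketch of the second step (task, file, and column names are hypothetical): the downstream script declares its dependencies in a cell tagged “parameters”, and at runtime Ploomber replaces the list with a dictionary pointing to each upstream task's products.
        # %% tags=["parameters"]
        # tasks that must run before this one
        upstream = ["clean_users", "clean_activity"]
        product = None

        # %%
        import pandas as pd

        # assumes each upstream task declares a "data" product in pipeline.yaml
        users = pd.read_csv(upstream["clean_users"]["data"])
        activity = pd.read_csv(upstream["clean_activity"]["data"])
        df = users.merge(activity, on="user_id")
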
  3. Code modularization (to avoid .ipynb copy-pasting) and unit testing

    Code declared in a notebook cannot be easily imported from other notebooks, which causes a lot of copy-pasting among .ipynb files.
    Refactoring a project like that is a real nightmare. Instead, we should aim to keep a minimum level of code quality at all stages of the project.

    1. The use of scripts as notebooks (as mentioned earlier) facilitates code reviews since we no longer have to deal with the complexities of diffing .ipynb files.

    2. By providing data scientists with a pre-configured project layout, we can help them better organize their work.
    1. Project structure
      # data scientists can open these files as notebooks
      tasks/
        get.py
        clean.py
        features.py
        train.py
      
      # function definitions
      src/
        get.py
        clean.py
        features.py
        train.py
      
      # unit tests for code in src/
      tests/
        clean.py
        features.py
        train.py
      
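      As a sketch of how the two layers relate (function and file names are hypothetical), a script in tasks/ stays thin and calls into src/ (the import works once the project is packaged, as described next):
        # tasks/clean.py, opened as a notebook; orchestration only
        # %%
        import pandas as pd

        from src.clean import fill_missing_ages  # hypothetical function defined in src/clean.py

        # %%
        df = pd.read_csv("data/users.csv")  # hypothetical input path
        df = fill_missing_ages(df)
        df.to_csv("output/users_clean.csv", index=False)
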
    2. Use a setup.py to import code from multiple folders → Empty setup.py? Or is some more stuff needed?

      To ensure this layout works, the code defined in src/ must be importable from tasks/ and tests/. Unfortunately, this won’t work by default. If we open any of the “notebooks” in tasks/, we’ll be unable to import anything from src/ unless we modify sys.path or PYTHONPATH. Still, if a data scientist cannot get around the finickiness of the Python import system, they’ll be tempted to copy-paste code.
      Fortunately, this is a simple problem to solve. Add a setup.py file for Python to recognize your project as a package, and you’ll be able to import functions from src/ anywhere in your project (even in Python interactive sessions). Whenever I share this trick with fellow data scientists, they start writing more reusable code.
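      A minimal sketch of such a setup.py, assuming the layout above and an __init__.py inside src/ so setuptools discovers it (the package name is hypothetical):
        # setup.py, at the project root
        from setuptools import find_packages, setup

        setup(
            name="my-project",         # hypothetical package name
            version="0.1.0",
            packages=find_packages(),  # picks up src/ as long as it has an __init__.py
        )
      After a one-time pip install --editable ., imports such as from src import clean work from tasks/, tests/, or an interactive session.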

1.1.3.3. Reflecting on the future of notebooks in production

Part of that work is developing better tools that go beyond improving the notebook experience: there is undoubtedly a lot of value in providing features such as live collaboration, better integration with SQL, or managed Jupyter Lab in the cloud.
Most people take it as an undeniable truth that notebooks are just for prototyping.
Notebooks will still be perceived as a prototyping tool if we don’t innovate in areas that matter more when deploying code, such as orchestration, modularization, and testing.

1.6. Jupytext - Jupyter Notebooks as Markdown Documents, Julia, Python or R Scripts — Jupytext documentation

It pairs different kinds of files (ipynb with md, ipynb with Python, with or without markdown cells)

1.6.1. Installation — Jupytext documentation

pip install jupytext --upgrade
conda install jupytext -c conda-forge
# These two commands only install a command-line tool to keep the paired files in sync
# With just that, you can already add a pre-commit hook that syncs them for you (e.g., if you don't want to work with notebooks at all)
jupyter serverextension enable jupytext # So that the server extension keeps the files in sync
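A minimal sketch of driving the same pairing and syncing from Python instead of the shell (notebook.ipynb and notebook.py are hypothetical file names):

  import jupytext
  from jupytext.cli import jupytext as jupytext_cli

  # pair notebook.ipynb with a py:percent script, then keep both files in sync
  jupytext_cli(["--set-formats", "ipynb,py:percent", "notebook.ipynb"])
  jupytext_cli(["--sync", "notebook.ipynb"])

  # one-off conversion with the library API
  nb = jupytext.read("notebook.py")     # load the script as a notebook object
  jupytext.write(nb, "notebook.ipynb")  # write it out in .ipynb format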

1.6.4. Advantages

  • You can keep your notebooks under git version control
  • Refactor a .py file and sync the results back to the .ipynb

1.7. bndr/pipreqs: Generate pip requirements.txt file based on imports of any project

Pitfall: if you are using git submodules, it resolves every import inside the submodules, which may cause problems and list more packages than expected
Pitfall: inspect the output manually; some entries may actually be local imports rather than installable packages

Author: Julian Lopez Carballal

Created: 2024-11-06 mié 12:56