Best practices for turning Jupyter notebooks into Python scripts

误落风尘 2021-01-29 19:19

Jupyter (iPython) notebook is deservedly known as a good tool for prototyping the code and doing all kinds of machine learning stuff interactively. But when I use it, I inevitab

4 Answers
  • 2021-01-29 19:59

    You should break the logic down into small steps; that way your pipeline will be easier to maintain. Since you already have a working codebase, you want to keep it running, so make small changes, test, and repeat.

    I'd go this way:

    1. Add some tests to your pipeline. For ML pipelines this is a bit hard, but if your notebook trains a model, you can use performance metrics to check that the pipeline still works (the test can be accuracy = 0.8, but define a tolerable range, since the number will hardly ever be exactly the same on each run).
    2. Break your single notebook apart into smaller ones, so that the output of one is the input of the next. As soon as you create a split, add a few tests for each notebook individually. To manage this sequential execution, you can use papermill to execute your notebooks, or a workflow management tool such as ploomber, which integrates with papermill, can resolve complex dependencies, and has a hook to run tests upon notebook execution. (Disclaimer: I'm ploomber's author.)
    3. Once you have a pipeline composed of several notebooks that passes all your tests, you can decide whether you want to keep using the ipynb format or not. My recommendation would be to keep as notebooks only the tasks that have rich output (such as tables or plots); the rest can be refactored into Python functions, which are more maintainable.
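
Steps 1 and 2 could be sketched roughly as follows. This is a minimal sketch, not a workflow prescribed by papermill or ploomber: the notebook file names, the `TOLERANCE` value, and the helper names `accuracy_within_range` and `run_notebooks` are hypothetical. The only library call assumed is `papermill.execute_notebook(input_path, output_path, parameters=...)`, which is imported lazily so that a stand-in executor can be injected in tests.

```python
import math

EXPECTED_ACCURACY = 0.8  # target metric taken from the answer's example
TOLERANCE = 0.05         # assumed acceptable deviation; tune for your model


def accuracy_within_range(accuracy, expected=EXPECTED_ACCURACY, tolerance=TOLERANCE):
    """Return True if the metric is close enough to the expected value."""
    return math.isclose(accuracy, expected, rel_tol=0.0, abs_tol=tolerance)


def run_notebooks(steps, execute=None):
    """Execute notebooks in order.

    Each step is an (input_path, output_path, parameters) tuple. The output
    notebooks are executed copies; data is handed from one step to the next
    through whatever files the notebooks read and write.
    """
    if execute is None:
        # Third-party dependency, imported only when actually needed.
        import papermill
        execute = papermill.execute_notebook
    for input_path, output_path, parameters in steps:
        execute(input_path, output_path, parameters=parameters)


# Hypothetical two-step pipeline: clean the data, then train on the result.
STEPS = [
    ("clean.ipynb", "out/clean.ipynb", {"raw_path": "data/raw.csv"}),
    ("train.ipynb", "out/train.ipynb", {"clean_path": "data/clean.csv"}),
]
```

Calling `run_notebooks(STEPS)` would run both notebooks through papermill; a test can then read the accuracy the training notebook reports and assert `accuracy_within_range(accuracy)`.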
  • 2021-01-29 20:07

    I recently made a module (NotebookScripter) to help address this issue. It allows you to invoke a Jupyter notebook via a function call. It's as simple to use as:

    from NotebookScripter import run_notebook
    run_notebook("./path/to/Notebook.ipynb", some_param="Provided Externally")
    

    Keyword parameters can be passed to the function call. It's easy to adapt a notebook to be parameterizable externally.

    Within a .ipynb cell:

    from NotebookScripter import receive_parameter
    
    some_param = receive_parameter(some_param="Returns this value by default when matching keyword not provided by external caller")
    
    print("some_param={0} within the invocation".format(some_param))
    

    run_notebook() supports .ipynb files or .py files -- allowing one to easily use .py files such as those generated by nbconvert or VS Code's IPython support. You can keep your code organized in a way that makes sense for interactive use, and also reuse/customize it externally when needed.

  • 2021-01-29 20:08

    We are having a similar issue. However, we use several notebooks for prototyping, and their outcomes should eventually become several Python scripts as well.

    Our approach is to set aside the code that seems to repeat across those notebooks. We put it into a Python module, which is imported by each notebook and also used in production. We improve this module iteratively and add tests for what we find during prototyping.

    The notebooks then become more like configuration scripts (which we simply copy into the resulting Python files), plus several prototyping checks and validations that we do not need in production.

    Most of all we are not afraid of the refactoring :)
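
A rough illustration of that layout, with everything in one listing for brevity. The module name `shared_utils.py` and the function `normalize_columns` are made up for this example; in practice the function would live in the module file, and both the notebooks and the production scripts would import it.

```python
# shared_utils.py (hypothetical module imported by notebooks and production code)


def normalize_columns(names):
    """Return column names stripped, lower-cased, with spaces as underscores."""
    return [name.strip().lower().replace(" ", "_") for name in names]


# test_shared_utils.py (hypothetical): a check discovered while prototyping,
# kept as a regression test once the logic moved out of the notebook.
def test_normalize_columns():
    assert normalize_columns([" User ID", "Signup Date "]) == ["user_id", "signup_date"]


# In a notebook cell or a production script, only the call remains:
#     from shared_utils import normalize_columns
columns = normalize_columns(["Order Total", " Country"])
```

Because the notebook and the script call the same function, a fix made while prototyping reaches production without copying code by hand.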

  • 2021-01-29 20:14

    Life saver: as you're writing your notebooks, incrementally refactor your code into functions, writing some minimal assert tests and docstrings.

    After that, refactoring from notebook to script is natural. Not only that, but it makes your life easier when writing long notebooks, even if you have no plans to turn them into anything else.

    Basic example of a cell's content with "minimal" tests and docstrings:

    import os
    import zipfile
    
    def zip_count(f):
        """Given zip filename, returns number of files inside.
    
        str -> int"""
        # ZipFile is itself a context manager, so the archive is closed on exit
        with zipfile.ZipFile(f) as archive:
            num_files = len(archive.infolist())
        return num_files
    
    zip_filename = 'data/myfile.zip'
    
    # Make sure `myfile` always has three files
    assert zip_count(zip_filename) == 3
    # And total zip size is under 2 MB
    assert os.path.getsize(zip_filename) / 1024**2 < 2
    
    print(zip_count(zip_filename))
    

    Once you've exported it to bare .py files, your code will probably not be structured into classes yet. But it is worth the effort to have refactored your notebook to the point where it has a set of documented functions, each with a set of simple assert statements that can easily be moved into tests.py for testing with pytest, unittest, or what have you. If it makes sense, bundling these functions into methods for your classes is dead-easy after that.

    If all goes well, all you need to do after that is to write your if __name__ == '__main__': and its "hooks": if you're writing a script to be called from the terminal, you'll want to handle command-line arguments; if you're writing a module, you'll want to think about its API with the __init__.py file, etc.
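
For the terminal case, the entry point might look something like this. It is only a sketch: the `main` and `parse_args` names and the choice to accept any number of archive paths are assumptions for illustration, and `zip_count` is the function from the example cell above.

```python
import argparse
import zipfile


def zip_count(f):
    """Given zip filename, returns number of files inside."""
    with zipfile.ZipFile(f) as archive:
        return len(archive.infolist())


def parse_args(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Count the files in zip archives.")
    parser.add_argument("zip_filenames", nargs="*", help="paths to zip files")
    return parser.parse_args(argv)


def main(argv=None):
    """Print one 'path: count' line per archive given on the command line."""
    args = parse_args(argv)
    for path in args.zip_filenames:
        print(f"{path}: {zip_count(path)}")


if __name__ == "__main__":
    main()
```

Keeping the work in `main(argv)` rather than directly under the guard means the same code can be imported and called from a test or from another module without touching `sys.argv`.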

    It all depends on what the intended use case is, of course: there's quite a difference between converting a notebook to a small script vs. turning it into a full-fledged module or package.

    Here are a few ideas for a notebook-to-script workflow:

    1. Export the Jupyter Notebook to a Python file (.py) through the GUI.
    2. Remove the "helper" lines that don't do the actual work: print statements, plots, etc.
    3. If need be, bundle your logic into classes. The only extra refactoring work required should be to write your class docstrings and attributes.
    4. Write your script's entry points with if __name__ == '__main__'.
    5. Separate your assert statements for each of your functions/methods, and flesh out a minimal test suite in tests.py.
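
Step 5 might end up looking like the sketch below. It assumes pytest (whose built-in `tmp_path` fixture supplies a temporary directory as a `pathlib.Path`) and builds a throwaway archive instead of relying on `data/myfile.zip`. The `make_zip` helper is hypothetical, and `zip_count` is repeated here only so the listing runs on its own; in a real project it would be imported from your script.

```python
# tests.py
import zipfile


def zip_count(f):
    """Given zip filename, returns number of files inside."""
    with zipfile.ZipFile(f) as archive:
        return len(archive.infolist())


def make_zip(path, names):
    """Hypothetical helper: write an archive with one small member per name."""
    with zipfile.ZipFile(path, "w") as archive:
        for name in names:
            archive.writestr(name, "placeholder")
    return path


def test_zip_count_three_files(tmp_path):
    archive = make_zip(tmp_path / "myfile.zip", ["a.txt", "b.txt", "c.txt"])
    assert zip_count(archive) == 3


def test_zip_count_empty_archive(tmp_path):
    archive = make_zip(tmp_path / "empty.zip", [])
    assert zip_count(archive) == 0
```

Generating the archive inside the test keeps the suite independent of whatever data files happened to sit next to the notebook.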