Quick start#
Installation#
SumatraTask requires Python >= 3.7. It can be installed with pip:
pip install smttask
Configuration#
After installing smttask in your virtual environment, change to your project directory and run
smttask project init
Follow the prompts to configure a new Sumatra project with output and input directories. The current implementation of smttask requires that the output directory be a subdirectory of the input directory.[3]
Hint: It’s a good idea to keep the VC repository tracked by Sumatra as lean as possible. Things like project reports and documentation are best kept in a different repository.
Alternatively, if you would prefer to keep project reports in the same repository, there is now
support for “dirty directories”: uncommitted changes in these directories will not prevent SumatraTask
from executing a task. Take care not to mark a code directory as dirty,
but reports or labnotes directories could make sense.
Support for this is currently experimental. The list of dirty directories currently needs to be
specified by manually editing the .smt/project file.
SumatraTask workflows#
Workflows constructed with SumatraTask have a number of benefits over the common “all-in-one” scripts:[1]
Lazy execution of expensive computations;
Automatic on-disk caching of expensive computations;
Optional, in-memory caching of intermediate computations;
Fully reproducible workflows: every required parameter, and every package version, is recorded;
Composability: Tasks can be used as inputs to other Tasks;
Portability: Any Task can be serialized to a JSON file, and then executed from that file. This is great for running batches of jobs, either on a local or a remote machine.
All this with minimal markup. How minimal? Suppose you have an analysis function called analyze, taking a NumPy array and some parameters dt and nbins, and returning three values:
def analyze(arr, dt, nbins):
    ...
    return (μ, σ, p)
To turn this into a Task, you would do the following:
@RecordedTask
def analyze(arr: Array, dt: float, nbins: int) -> Tuple[float, float, float]:
    ...
    return (μ, σ, p)
and add the following imports to the top of your file:
from typing import Tuple
from smttask import RecordedTask
from scityping.numpy import Array
That’s it! This is still 100% valid Python, so you can run it directly within your notebook or editor. It requires only two things:
That each task be a pure function.
That all the inputs be serializable to JSON.
Note that there is no way SumatraTask can check that a function is pure, so it relies on you to do so. Be especially careful with functions that depend on objects which conserve state via private attributes, for example random number generators.
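As an illustration of the purity requirement, here is a minimal sketch using only the standard library random module. The function names are hypothetical and not part of SumatraTask. The first function draws from a module-level generator, so its result depends on how often that generator was used before; the second rebuilds the generator from a seed argument, so equal inputs always give equal outputs.

```python
import random

# Module-level generator: its internal state persists between calls.
_shared_rng = random.Random(0)

def jitter_impure(values):
    """NOT pure: the output depends on the hidden state of _shared_rng."""
    return [v + _shared_rng.gauss(0.0, 1.0) for v in values]

def jitter_pure(values, seed):
    """Pure: the generator is rebuilt from the seed argument on every call."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, 1.0) for v in values]

data = [1.0, 2.0, 3.0]

# Same inputs, two calls: the impure version gives two different results.
impure_differs = jitter_impure(data) != jitter_impure(data)

# Same inputs, two calls: the pure version repeats itself exactly.
pure_repeats = jitter_pure(data, seed=42) == jitter_pure(data, seed=42)
```

Passing an integer seed rather than a generator object also keeps the input trivially serializable to JSON.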
The requirement for serializability means that we need to provide for each data type a pair of functions to serialize and deserialize values to and from JSON. Under the hood, SumatraTask uses Pydantic for serialization, so most built-in types are already supported. Additional types geared for scientific computing (such as NumPy arrays and dtypes) are also defined in scityping.
Ensuring all our input data are serializable is not always trivial, but it is the only thing required to unlock all the benefits mentioned above.
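To make the idea of a serialize/deserialize pair concrete, the following sketch round-trips a standard library Fraction through JSON. It only illustrates the concept: the function names are hypothetical, and this is not how Pydantic or scityping register their types.

```python
import json
from fractions import Fraction

def fraction_to_json(x: Fraction) -> str:
    """Serialize a Fraction as a JSON object holding two integers."""
    return json.dumps({"numerator": x.numerator, "denominator": x.denominator})

def fraction_from_json(s: str) -> Fraction:
    """Rebuild the Fraction from the JSON produced by fraction_to_json."""
    data = json.loads(s)
    return Fraction(data["numerator"], data["denominator"])

original = Fraction(3, 8)
encoded = fraction_to_json(original)
restored = fraction_from_json(encoded)
```

The key property is that the round trip is lossless: the restored value compares equal to the original, so a task rebuilt from its JSON description sees the same input.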
Running tasks#
As part of a script.
One could define, for example, the following file named run.py:
import numpy as np
from project.tasks import analyze

tasks = []
for dt in [0.1, 0.3, 0.5]:
    # Use the loop variable so that each task gets a different dt
    tasks.append(analyze(arr=np.array([1, 2, 3]), dt=dt, nbins=2))

for task in tasks:
    task.run()
Typically such a run.py file would be excluded from version control. Especially convenient is using a Jupyter notebook for such a run file, to allow easy in-line documentation.
From a task description file.
In the example above, we could change
task.run()
to
task.save("taskdir")
Now, instead of executing the task, the script generates a complete, self-contained task description file (basically a JSON file) and places it within the directory taskdir with a unique, automatically generated file name.[2] Task description files can be executed from the command line:
smttask run taskdir/task_name
This approach is especially convenient for generating task files locally and running them on a more powerful computation cluster. Although SumatraTask is not a scheduler, the smttask run command does provide basic multiprocessing and queueing capabilities. For example, the following would run all task files under taskdir, four at a time:
smttask run -n4 taskdir/*
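The "describe now, execute later" pattern behind task description files can be sketched with the standard library alone. Everything below is a hypothetical stand-in: the analyze function, the REGISTRY lookup, the JSON layout and the counter-based file names are assumptions made for illustration, and do not reflect the actual format or naming scheme used by smttask.

```python
import json
import tempfile
from pathlib import Path

def analyze(dt, nbins):
    """Hypothetical stand-in for a real analysis function."""
    return dt * nbins

# Maps the task name stored in a description file back to a callable.
REGISTRY = {"analyze": analyze}

def save_description(taskdir, task_name, params):
    """Write a JSON description of the task instead of running it."""
    index = len(list(Path(taskdir).glob("*.json")))  # simple counter for a distinct name
    path = Path(taskdir) / f"{task_name}_{index}.json"
    path.write_text(json.dumps({"task": task_name, "params": params}), encoding="utf-8")
    return path

def run_description(path):
    """Load a description file and execute the task it describes."""
    desc = json.loads(Path(path).read_text(encoding="utf-8"))
    return REGISTRY[desc["task"]](**desc["params"])

with tempfile.TemporaryDirectory() as taskdir:
    # Step 1: generate the description files (could happen on a laptop).
    paths = [save_description(taskdir, "analyze", {"dt": dt, "nbins": 2})
             for dt in [0.1, 0.3, 0.5]]
    # Step 2: execute them later (could happen on another machine).
    results = [run_description(p) for p in sorted(paths)]
```

Because each file holds everything needed to rebuild the call, the two steps do not need to share a Python session.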
Exploring recorded tasks#
Within an IPython console or Jupyter notebook, you can use the RSView object to explore the list of previous records. It provides functionality for filtering the list based on a variety of criteria and for recreating the Task objects which produced the record. See the In-depth documentation and the API reference for details.
The CLI also provides a few commands for manipulating the record store; type smttask store --help for a list, or check its online documentation.
SumatraTask is built upon Sumatra and exposes a subset of its CLI; type smttask smt --help for the list of exposed commands. For example, smttask smt list can be used to print a list of record labels.
Usage recommendations#
Keep extra project files (such as notes, PDFs or analysis notebooks – anything that does not serve to reproduce a run) in a different repository. Every time you run a task, Sumatra requires you to commit any uncommitted changes, which will quickly become a burden if your repository includes non-code files. Jupyter notebooks are especially problematic, because every time they are opened, the file metadata is changed. (In this case it is strongly recommended to pair the notebook to a Python script with Jupytext, and only add the script to version control.)
This comment about separating the code repository is even more important if you use the ‘store-diff’ option. Otherwise you will end up with very big diffs, and each recorded task may occupy many megabytes.
It will happen that you run a task only to realize that you forgot to make a small change in your last commit. It’s tempting then to use git commit --amend, so as to not create a new unnecessary commit – do not do this. This will change the commit hash, and any Sumatra records pointing to the old one will be invalidated. And no matter how careful you are to “only do this when there are no records pointing to the old commit”, it will happen, and you will hate yourself.
Recording changes compared to Sumatra#
SumatraTask sets the “main file” to the module where the Task is defined. This may not be the file passed on the command line.
The file passed on the command line is logged as “script arguments”.
Limitations#
stdout and stderr are currently not tracked.
Features#
SumatraTask will
Manage saving and loading paths, so you can concentrate on what your code should do rather than where it should save its results. All tasks are saved to a unique location, no previous location is ever overwritten, and load paths are resolved transparently as needed.
Automatically load previous computation results from disk when available.
Record code version and parameters in a Sumatra project database.
Allow you to insert breakpoints anywhere in your code.
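The idea of automatically reloading previous results from disk can be sketched as follows. This is a simplified illustration under stated assumptions, not SumatraTask's implementation: run_cached and slow_square are hypothetical names, results are stored as plain JSON, and the cache file name is a SHA-1 digest of the task name and parameters.

```python
import hashlib
import json
import tempfile
from pathlib import Path

calls = []  # records every real (non-cached) evaluation

def slow_square(x):
    """Stand-in for an expensive computation."""
    calls.append(x)
    return x * x

def run_cached(func, cache_dir, **params):
    """Return func(**params), reusing a result stored on disk if one exists."""
    # The storage location is derived from the task name and its parameters.
    key = json.dumps({"task": func.__name__, "params": params}, sort_keys=True)
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{digest}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = func(**params)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result

with tempfile.TemporaryDirectory() as cache_dir:
    first = run_cached(slow_square, cache_dir, x=4)   # computed, then written to disk
    second = run_cached(slow_square, cache_dir, x=4)  # read back, slow_square not called
    third = run_cached(slow_square, cache_dir, x=5)   # new parameters, computed again
```

Since the location depends only on the task and its parameters, the caller never has to name an output file, and distinct parameter sets never overwrite each other.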
SumatraTask will not
Schedule tasks: tasks are executed sequentially, using plain Python recursion to resolve the dependency tree. To automatically set up sets of tasks to run in parallel with resource management, use a proper scheduling package such as Snakemake, Luigi, NextFlow or DoIt. smttask provides a helper function to generate snakemake workflows; a similar bridge to other managers should also be possible.
Compared to Luigi/Snakemake#
The result of tasks can be kept in memory instead of, or in addition to, writing to disk.
~ This allows for further separation of workflows into many small tasks. A good example where this is useful is a task creating an iterator which returns data samples. This is a typical way of feeding data to deep learning libraries, but since an iterator cannot be reliably reloaded from a disk file, such a task does not fit well within a Luigi/Snakemake workflow.
Entire workflows can be executed within the main Python session.
~ This is especially useful during development: the alternative, which is to spawn new processes for each task (perhaps not even Python processes), can make it easy to lose information from the stack trace, or prevent the usage of breakpoint().
Allows for different parent tasks.
~ Luigi/Snakemake make it easy to use the same task as parent for multiple child tasks, but using different parents for the same child is cumbersome and leads to repeated code. (I think?)
Manages output/input file paths.
~ Luigi/Snakemake require you to write task-specific code to determine the output and input file paths; Luigi’s file path resolution in particular is somewhat cumbersome. With smttask, file paths are automatically determined from the task name and parameters, and you never need to see them.
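To illustrate the earlier point about iterators: in CPython, a generator cannot be serialized with the standard pickle module, which is why a task returning one cannot simply hand its result to the next task through a file. The sample_stream function below is a hypothetical example, not part of smttask.

```python
import pickle

def sample_stream(n):
    """Yield n data samples one at a time."""
    for i in range(n):
        yield i * 0.5

stream = sample_stream(3)

# Attempting to write the generator to disk fails with a TypeError.
try:
    pickle.dumps(stream)
    pickled_ok = True
except TypeError:
    pickled_ok = False

# Kept in memory, the same object can be consumed directly by a downstream step.
samples = list(stream)
```

Keeping such a result in memory, as described above, sidesteps the problem because the object is passed on without ever being serialized.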
Compared to Sumatra#
Both input and output filenames can be derived from parameters.
~ (Sumatra requires inputs to be specified on the command line.)