Otter — Open Targets Task ExecutoR

Otter is a the task execution framework used in the Open Targets data Pipeline.

It provides an easy to use API to implement generic tasks that are then used by describing the flow in a YAML configuration file.

Take a look at a Simple example.

Overview

The hierarchy is as follows:

A run of the Pipeline is composed of a series of Steps, and those are a bunch of Tasks.

Otter is meant to be launched once per step, and it will execute the tasks in a given step. Orchestrating the execution of steps is outside of the scope of Otter.

Features

This is a list of what you get for free by using Otter:

Parallel execution: Tasks are run in parallel, and Otter will take care of the dependency planning.
Declarative configuration: Steps are described in a YAML file, as list of tasks with different specifications. The task themselves are implemented in Python enabling a lot of flexibility.
Logging: Otter uses the loguru library for logging. It handles all the logging related the task flow, and also logs into the manifest (see next item).
Manifest: Otter manages a manifest file that describes a pipeline run. It is used to both for debugging and for tracking the provenance of the data. A series of simple JQ queries can be used to extract information from it (see Useful JQ queries).
Error management: Otter will stop the execution of the pipeline if a task fails, and will log the error in the manifest.
Scratchpad: A place to store variables that can be overwritten into the configuration file (something like a very simple templating engine), enabling easy parametrization of runs, and passing of data between tasks.
Utilities: Otter provides interfaces to use Google Cloud Storage and other remote storage services, and a bunch of utilities to help you write tasks.

Of course, “for free” means there is not an extreme degree of flexibility some of these are limited in scope. The aim is ease of use and simplicity. You can jump down to read more about the philosophy behind Otter.

The model

The main elements used when writing an application using Otter are:

otter.core.Runner — Handles the application lifecycle.
otter.task.model.Spec — Holds the task specification.
otter.task.model.Task — A Task itself.
otter.task.model.TaskContext — Holds the context of a task.
otter.validators.v() — The method used to run validators for a task.
otter.scratchpad.model.Scratchpad — A place to store variables to overwrite in the config file.
otter.util — A bunch of utilities to help you write tasks.
otter.storage — Remote storage interfaces to use Google Cloud Storage and similar services.

Simple example

Here is an example of a configuration file for a really simple pipeline:

---
work_path: ./work
log_level: DEBUG
scratchpad:
steps:
  my_step:
    - name: hello_world task one
      who: World
    - name: hello_world task two
      requires:
        - hello_world task one
      who: Universe

It defines a my_step step that has two tasks, out of which the second one will only run once the first has finished. An application to run this step would be:

from otter import Runner

def main() -> None:
   runner = Runner()
   runner.run()

Tip

Some other examples:

HelloWorld — Example Task.

Philosophy

One of the reasons for implementing another task execution framework instead of using something like Celery is, we wanted have a very basic set of features baked into all tasks (traceability, logging, error management, etc). Although most of those are already there in many frameworks, we would not be using them to their full extent, effectively increasing the complexity of the system unnecessarily.

Using something like Apache Airflow alone to handle the pipeline was another option. But then we would be completely unable to execute any part of it outside of the enourmous Airflow ecosystem, which —although amazingly powerful— is also unwieldy.

Otter provides us with a way to wrap steps in an executable, independent unit that can be run in any environment, and that can be orchestrated by any other system (even a simple bash for loop). It also encapsulates many tools and interfaces that are common to all tasks, making it easier to write and maintain them.