User Guide

This user guide provides a complete reference for using the Tapis Pipelines software.

Overview and Prerequisites

The Tapis Pipelines software is designed to assist you in running recurring computational jobs against multiple data sets. The design is based on the idea of Extract-Transform-Load (ETL). It has been designed with TACC resources in mind, but should be broadly applicable to other resources as well. The general idea is:

  • Download input data from a remote source (the Remote Outbox) to TACC (the Local Inbox).

  • Compute data products from the input using whatever software you like (the pipeline job software).

  • Send resulting output from TACC (the Local Outbox) back to a remote storage site (the Remote Inbox).

Before you begin building your first Tapis Pipeline, there are a few decisions to make and some prerequisites your project should meet. We collect these here; think of it as a checklist.

TACC Storage and Computing Allocation

Your project will need a storage allocation on some TACC storage resource (for example, Corral, Stockyard, or one of our Cloud-based storage resources) to serve as the project’s Local Inbox and Outbox. The size of the required allocation greatly depends on the size of the files that will be processed in the pipeline.

Your project will also need one or more allocations on a computing system at TACC, such as Frontera, Stampede2, Lonestar5, or one of our cloud computing systems. These allocations will be used to run pipeline jobs.

Packaging and Installation of Analysis (Pipeline Job) Software

The ultimate goal of a pipeline is to process data via one or more programs, defined by the project. This software, referred to as the pipeline job software, must be packaged and accessible to the Tapis Pipelines software.

There are a few packaging options available:

  1. Create a container image with the job software. We recommend Singularity containers for running jobs on HPC systems, and Docker (or any container runtime that supports the Kubernetes Container Runtime Interface, CRI) for running jobs on cloud resources, such as the Kubernetes Freetail system.

  2. Package the software using conventional methods, such as RPMs, Python packages, Ruby gems, git repositories, etc.

These lead to the following installation options:

  1. If the job software is packaged as a container as in option 1, the software can be registered as a Tapis App (Singularity or Docker) or a Tapis Actor/function (Docker). This is the preferred approach, as it does not require maintaining a separate installation of the job software on each execution system to be used. It also simplifies management of the file permissions that must be maintained so that the Tapis Pipelines software can execute the job software.

  2. Install the job software on the head node (login node) of a TACC execution system. Choose the system matching the allocation for the project. If there are multiple systems, the software must be installed on each one. This method is not recommended.

Remote Outbox and Inbox

Each pipeline must configure a Remote Outbox and a Remote Inbox where files requiring processing (respectively, output files resulting from processing) will be stored. Conceptually, the Remote Outbox and Inbox are storage resources independent of TACC, but they must provide programmatic access. Options for the Remote Outbox and Inbox include:

  1. A path on a Tapis System, including POSIX (SSH/SFTP) and Object storage (S3-compatible).

  2. A Globus endpoint.

With Option 1, the Tapis Pipelines software will be able to utilize Tapis transfers to move data between the Remote Outbox and Inbox and any TACC resource. This is the recommended option.

With Option 2, the Tapis Pipelines software utilizes Globus Connect Personal to move data between the Remote Outbox and Inbox and the Local Inbox and Outbox. From there, Tapis transfers will be utilized, as needed.
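For orientation, the following sketch shows how each kind of Remote Box might be expressed in the pipeline configuration described later in this guide (see remote_box_definition in the configuration schema). All system ids, client ids, endpoint names, and paths here are hypothetical placeholders:

  {
    "kind": "tapis",
    "box_definition": {
      "system_id": "my-project.remote.storage",
      "path": "/pipelines/remote-outbox"
    }
  }

  {
    "kind": "globus",
    "box_definition": {
      "client_id": "0f0cdd4e-0000-0000-0000-123456789abc",
      "endpoint_name": "my-project#outbox",
      "directory": "/pipelines/remote-outbox"
    }
  }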

Manifest Files

A key to the Tapis Pipeline architecture is the manifest file.

Pipeline jobs process files that get transferred to the Remote Outbox. Each job can process any number of files; the set of files processed by a single job is determined by a manifest file. The manifest file is a simple JSON file that describes one or more files to be processed by a job. It can include basic validation information (such as an MD5 checksum of each file), project-generated identifiers for the job, and some limited support for overriding the default job runner behavior (for example, to specify that the job can run at a lower priority than other jobs).

Critically, the presence of a manifest file in the Remote Outbox signals to the Tapis Pipelines software that the files referenced within it are ready to be processed as a job; in particular, that all transfer of those files to the Remote Outbox has completed. No files in the Remote Outbox will be processed until they are included in some manifest file.

The manifest file must adhere to a required format described by a JSON Schema. Invalid manifest files are never processed.

The following describes the format of the manifest file, as defined by the JSON Schema at:

http://github.com/tapis-project/tapis-pipelines/core/manifest_schema.json

A valid Tapis Pipelines manifest file is a JSON object with the following properties:

  • files (files_list): the list of files to be processed by this job.

  • job_config (job_config): special configuration overrides for this job.

files_list

List of files to be processed by this job. An array whose items are file objects.

file

A file to be processed as part of a job. An object with the following properties:

  • file_path (string): Path to the file in the Remote Outbox.

  • md5_checksum (string): The MD5 checksum of the file. Used for validation purposes.

job_config

Special configuration overrides to apply to this specific job. An object with the following properties:

  • priority (string): Specify a different priority for this job.
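For example, a manifest describing two files to be processed as a single job at a reduced priority might look like the following sketch; the file paths, checksums, and priority value are hypothetical:

  {
    "files": [
      {
        "file_path": "/data/outbox/sample_001.dat",
        "md5_checksum": "5d41402abc4b2a76b9719d911017c592"
      },
      {
        "file_path": "/data/outbox/sample_002.dat",
        "md5_checksum": "7d793037a0760186574b0282f2f435e7"
      }
    ],
    "job_config": {
      "priority": "low"
    }
  }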

Installing Tapis Pipelines Software

The Tapis Pipelines software is available as a Python package. To install it, simply type:

$ pip install tapis-pipelines

Installing Tapis Pipelines in a virtualenv is recommended.

Alternatively, you can install Tapis Pipelines from source by checking out the repository from GitHub.

Configuration of Tapis Pipelines

An instance of the Tapis Pipelines software must be configured for a specific pipeline. The configuration is provided as a JSON file that conforms to the Tapis Pipeline config JSON Schema definition.

http://github.com/tapis-project/tapis-pipelines/core/configschema.json

The configuration file is a JSON object with the following properties:

  • remote_inbox (remote_box_definition): configuration of the Remote Inbox.

  • remote_outbox (remote_box_definition): configuration of the Remote Outbox.

  • pipeline_job (pipeline_job_definition): description of the pipeline job to run on new input files.

  • tapis_config (tapis_config_definition): general configuration for Tapis usage.

remote_box_definition

Configuration of a Remote Inbox or Outbox. An object with the following properties:

  • kind (string; one of: tapis, globus): The type of Remote Box being configured.

  • box_definition (box_definition): The definition of the box itself, given as a tapis_box_definition or a globus_box_definition matching the kind.

tapis_box_definition

A pipeline box defined using a Tapis system and path. An object with the following properties:

  • system_id (string): The id of the Tapis system to use for the box definition.

  • path (string): Path on the Tapis system to use for the box definition.

globus_box_definition

A pipeline box defined using a Globus endpoint. An object with the following properties:

  • client_id (string): The id of the Globus client to use when issuing transfers.

  • endpoint_name (string): The name of the Globus endpoint.

  • directory (string): The directory within the Globus endpoint to use for the box definition.

pipeline_job_definition

Description of the pipeline job to run on new input files. Must be one of: tapis_app_job, tapis_actor_job.

tapis_app_job

A pipeline job described using a Tapis app. An object with the following properties:

  • app_id (string): The app id to use when submitting the job.

  • manifest_input_name (string; default: manifest_file): The name of the input on the Tapis app for the manifest file.

  • raw_files_input_name (string; default: empty): The name of the input on the Tapis app to be used for sending the raw input files. If empty, no input name will be specified.

tapis_actor_job

A pipeline job described using a Tapis actor. An object with the following properties:

  • actor_id (string): The id of the actor. The Tapis Pipelines software will send a JSON message to the actor with details about the job (see documentation).

local_script_job

A pipeline job described using a local script. An object (no properties are listed in the schema).

tapis_config_definition

General configuration for Tapis usage. An object with the following properties:

  • base_url (string): The base URL for the Tapis tenant to interact with.

  • username (string): The Tapis username to use when accessing Tapis services.
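To illustrate, a complete configuration using Tapis-backed boxes and a Tapis app job might look like the following sketch; all system ids, app ids, URLs, usernames, and paths are hypothetical placeholders:

  {
    "remote_outbox": {
      "kind": "tapis",
      "box_definition": {
        "system_id": "my-project.remote.storage",
        "path": "/pipelines/outbox"
      }
    },
    "remote_inbox": {
      "kind": "tapis",
      "box_definition": {
        "system_id": "my-project.remote.storage",
        "path": "/pipelines/inbox"
      }
    },
    "pipeline_job": {
      "app_id": "my-analysis-app-1.0",
      "manifest_input_name": "manifest_file",
      "raw_files_input_name": "input_files"
    },
    "tapis_config": {
      "base_url": "https://tacc.tapis.io",
      "username": "my_tapis_user"
    }
  }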

Testing a Pipeline

A number of measures can be taken to validate that a pipeline will run correctly before using real cycles.

Validate Configuration

The Tapis Pipelines software includes a config validator that can be run to ensure that all required configurations are present and valid. The validator does basic type checking of all fields. Run the config validator first before moving on to subsequent validation.

Package tests

The Tapis Pipelines software includes a package of tests that can be run once the software is configured. These tests exercise some of the primary functions of the software, such as interacting with the Tapis APIs using the configured authentication. If any of these functions fails, some installation or configuration step is likely missing or incorrect and the pipeline jobs are unlikely to run correctly.

Test Pipeline Runs

In some cases, it may be possible to issue end-to-end test runs of a pipeline using sample data.

More on this coming soon…

Production Pipelines and Dashboard

The Pipelines software makes use of the Tapis Metadata service to track the status of jobs as they progress. We include a simple dashboard for displaying this information. The dashboard code can be deployed relatively quickly to most modern web servers.

Troubleshooting and FAQ

Coming soon…