User Guide

This user guide provides a complete reference for using the Tapis Pipelines software.

Overview and Prerequisites

The Tapis Pipelines software is designed to assist you in running recurring computational jobs against multiple data sets. The design is based on the idea of Extract-Transform-Load (ETL). It has been designed with TACC resources in mind, but should be broadly applicable to other resources as well. The general idea is:

  • Download input data from a remote source (the Remote Outbox) to TACC (the Local Inbox).

  • Compute data products from the input using whatever software you like (the pipeline job software).

  • Send resulting output from TACC (the Local Outbox) back to a remote storage site (the Remote Inbox).

Before you begin building your first Tapis Pipeline, there are a few decisions to make and some prerequisites your project should meet. We collect these here; think of it as a checklist.

TACC Storage and Computing Allocation

Your project will need a storage allocation on some TACC storage resource (for example, Corral, Stockyard, or one of our Cloud-based storage resources) to serve as the project’s Local Inbox and Outbox. The size of the required allocation greatly depends on the size of the files that will be processed in the pipeline.

Your project will also need one or more allocations on a computing system at TACC, such as Frontera, Stampede2, Lonestar5, or one of our cloud computing systems. These allocations will be used to run pipeline jobs.

Packaging and Installation of Analysis (Pipeline Job) Software

The ultimate goal of a pipeline is to process data via one or more programs, defined by the project. This software, referred to as the pipeline job software, must be packaged and accessible to the Tapis Pipelines software.

There are a few packaging options available:

  1. Create a container image with the job software. We recommend Singularity containers for running jobs on HPC systems, and Docker (or any container runtime that supports the Kubernetes Container Runtime Interface, CRI) for running jobs on cloud resources, such as the Kubernetes Freetail system.

  2. Package the software using conventional methods, such as RPMs, Python packages, Ruby gems, git repositories, etc.

These lead to the following installation options:

  1. If the job software is packaged as a container as in option 1, the software can be registered as a Tapis App (Singularity or Docker) or a Tapis Actor/function (Docker). This is the preferred approach, as it does not require maintaining a separate installation of the job software on each execution system to be used. It also simplifies management of the file permissions that must be maintained so that the Tapis Pipelines software can execute the job software.

  2. Install the job software on the head node (login node) of a TACC execution system. Choose the system matching the allocation for the project. If there are multiple systems, the software must be installed on each one. This method is not recommended.

Remote Outbox and Inbox

Each pipeline must configure a Remote Outbox and a Remote Inbox where files requiring processing (respectively, output files resulting from processing) will be stored. Conceptually, the Remote Outbox and Inbox are storage resources independent of TACC, but they must provide programmatic access. Options for the Remote Outbox and Inbox include:

  1. A path on a Tapis System, including POSIX (SSH/SFTP) and Object storage (S3-compatible).

  2. A Globus endpoint.

With Option 1, the Tapis Pipelines software will be able to utilize Tapis transfers to move data between the Remote Outbox and Inbox and any TACC resource. This is the recommended option.

With Option 2, the Tapis Pipelines software utilizes Globus Connect Personal to move data between the Remote Outbox and Inbox and the Local Inbox and Outbox. From there, Tapis transfers will be utilized, as needed.
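For orientation, the following sketch shows how each kind of Remote Box might be expressed in the pipeline configuration described later in this guide (see remote_box_definition in the configuration schema). All system ids, client ids, endpoint names, and paths here are hypothetical placeholders:

  {
    "kind": "tapis",
    "box_definition": {
      "system_id": "my-project.remote.storage",
      "path": "/pipelines/remote-outbox"
    }
  }

  {
    "kind": "globus",
    "box_definition": {
      "client_id": "0f0cdd4e-0000-0000-0000-123456789abc",
      "endpoint_name": "my-project#outbox",
      "directory": "/pipelines/remote-outbox"
    }
  }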

Manifest Files

A key to the Tapis Pipeline architecture is the manifest file.

Pipeline jobs process files that get transferred to the Remote Outbox. Each job can process any number of files; the set of files processed by a single job is determined by a manifest file. The manifest file is a simple JSON file that describes one or more files to be processed by a job. It can include basic validation information (such as an MD5 checksum of each file), project-generated identifiers for the job, and some limited support for overriding the default job runner behavior (for example, to specify that the job can run at a lower priority than other jobs).

Critically, the presence of a manifest file in the Remote Outbox signals to the Tapis Pipelines software that the files referenced within it are ready to be processed as a job; in particular, that all transfer of those files to the Remote Outbox has completed. No files in the Remote Outbox will be processed until they are included in some manifest file.

The manifest file must adhere to a required format described by a JSON Schema. Invalid manifest files are never processed.

The following describes the format of the manifest file, as defined by the JSON Schema at:

http://github.com/tapis-project/tapis-pipelines/core/manifest_schema.json

A valid Tapis Pipelines manifest file is a JSON object with the following properties:

  • files (files_list): the list of files to be processed by this job.

  • job_config (job_config): special configuration overrides for this job.

files_list

List of files to be processed by this job. An array whose items are file objects.

file

A file to be processed as part of a job. An object with the following properties:

  • file_path (string): Path to the file in the Remote Outbox.

  • md5_checksum (string): The MD5 checksum of the file. Used for validation purposes.

job_config

Special configuration overrides to apply to this specific job. An object with the following properties:

  • priority (string): Specify a different priority for this job.
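For example, a manifest describing two files to be processed as a single job at a reduced priority might look like the following sketch; the file paths, checksums, and priority value are hypothetical:

  {
    "files": [
      {
        "file_path": "/data/outbox/sample_001.dat",
        "md5_checksum": "5d41402abc4b2a76b9719d911017c592"
      },
      {
        "file_path": "/data/outbox/sample_002.dat",
        "md5_checksum": "7d793037a0760186574b0282f2f435e7"
      }
    ],
    "job_config": {
      "priority": "low"
    }
  }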

Installing Tapis Pipelines Software

The Tapis Pipelines software is available as a Python package. To install it, simply type:

$ pip install tapis-pipelines

Installing Tapis Pipelines in a virtualenv is recommended.

Alternatively, you can install Tapis Pipelines from source by checking out the repository from GitHub.

Configuration of Tapis Pipelines

An instance of the Tapis Pipelines software must be configured for a specific pipeline. The configuration is provided as a JSON file that conforms to the Tapis Pipeline config JSON Schema definition.

http://github.com/tapis-project/tapis-pipelines/core/configschema.json

The configuration file is a JSON object with the following properties:

  • remote_inbox (remote_box_definition): configuration of the Remote Inbox.

  • remote_outbox (remote_box_definition): configuration of the Remote Outbox.

  • pipeline_job (pipeline_job_definition): description of the pipeline job to run on new input files.

  • tapis_config (tapis_config_definition): general configuration for Tapis usage.

remote_box_definition

Configuration of a Remote Inbox or Outbox. An object with the following properties:

  • kind (string; one of: tapis, globus): The type of Remote Box being configured.

  • box_definition (box_definition): The definition of the box itself, given as a tapis_box_definition or a globus_box_definition matching the kind.

tapis_box_definition

A pipeline box defined using a Tapis system and path. An object with the following properties:

  • system_id (string): The id of the Tapis system to use for the box definition.

  • path (string): Path on the Tapis system to use for the box definition.

globus_box_definition

A pipeline box defined using a Globus endpoint. An object with the following properties:

  • client_id (string): The id of the Globus client to use when issuing transfers.

  • endpoint_name (string): The name of the Globus endpoint.

  • directory (string): The directory within the Globus endpoint to use for the box definition.

pipeline_job_definition

Description of the pipeline job to run on new input files. Must be one of: tapis_app_job, tapis_actor_job.

tapis_app_job

A pipeline job described using a Tapis app. An object with the following properties:

  • app_id (string): The app id to use when submitting the job.

  • manifest_input_name (string; default: manifest_file): The name of the input on the Tapis app for the manifest file.

  • raw_files_input_name (string; default: empty): The name of the input on the Tapis app to be used for sending the raw input files. If empty, no input name will be specified.

tapis_actor_job

A pipeline job described using a Tapis actor. An object with the following properties:

  • actor_id (string): The id of the actor. The Tapis Pipelines software will send a JSON message to the actor with details about the job (see documentation).

local_script_job

A pipeline job described using a local script. An object (no properties are listed in the schema).

tapis_config_definition

General configuration for Tapis usage. An object with the following properties:

  • base_url (string): The base URL for the Tapis tenant to interact with.

  • username (string): The Tapis username to use when accessing Tapis services.
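To illustrate, a complete configuration using Tapis-backed boxes and a Tapis app job might look like the following sketch; all system ids, app ids, URLs, usernames, and paths are hypothetical placeholders:

  {
    "remote_outbox": {
      "kind": "tapis",
      "box_definition": {
        "system_id": "my-project.remote.storage",
        "path": "/pipelines/outbox"
      }
    },
    "remote_inbox": {
      "kind": "tapis",
      "box_definition": {
        "system_id": "my-project.remote.storage",
        "path": "/pipelines/inbox"
      }
    },
    "pipeline_job": {
      "app_id": "my-analysis-app-1.0",
      "manifest_input_name": "manifest_file",
      "raw_files_input_name": "input_files"
    },
    "tapis_config": {
      "base_url": "https://tacc.tapis.io",
      "username": "my_tapis_user"
    }
  }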

Testing a Pipeline

A number of measures can be taken to validate that a pipeline will run correctly before using real cycles.

Validate Configuration

The Tapis Pipelines software includes a config validator that can be run to ensure that all required configurations are present and valid. The validator does basic type checking of all fields. Run the config validator first before moving on to subsequent validation.

Package tests

The Tapis Pipelines software includes a package of tests that can be run once the software is configured. These tests exercise some of the primary functions of the software, such as interacting with the Tapis APIs using the configured authentication. If any of these functions fails, some installation or configuration step is likely missing or incorrect and the pipeline jobs are unlikely to run correctly.

Test Pipeline Runs

In some cases, it may be possible to issue end-to-end test runs of a pipeline using sample data.

More on this coming soon…

Production Pipelines and Dashboard

The Pipelines software makes use of the Tapis Metadata service to track the status of jobs as they progress. We include a simple dashboard for displaying this information. The dashboard code can be deployed relatively quickly to most modern web servers.

Troubleshooting and FAQ

Coming soon…