Glossary

File formats

yaml

Language to serialize objects. Used in the CGAT testing framework. (YAML).

bam

Compressed binary format for storing genomic alignments (BAM).

bed

File containing genomic intervals. (BED).
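As a minimal sketch of what a genomic interval looks like in practice, the snippet below parses the first three columns of a tab-separated BED line. The names `Interval` and `parse_bed_line` are hypothetical and not part of CGAT. The sketch relies on the BED convention that start coordinates are 0-based and end coordinates are exclusive, and it does not handle track or browser header lines.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval using BED coordinates (0-based, end-exclusive)."""
    chrom: str
    start: int
    end: int

    @property
    def length(self):
        # With half-open coordinates the length is simply end - start.
        return self.end - self.start


def parse_bed_line(line):
    """Parse the three mandatory BED columns; extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    return Interval(fields[0], int(fields[1]), int(fields[2]))


interval = parse_bed_line("chr1\t100\t250\tfeature1\n")
print(interval.chrom, interval.start, interval.end, interval.length)
```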

vcf

Variant call format.

gtf

General transfer format. Format to store genes and transcripts.

gff

General feature format.

bigwig

Compressed format for displaying numerical values across genomic ranges (BIGWIG).

fasta

Sequence format.

wiggle

Format for displaying numerical values across genomic ranges (Wiggle).

psl

Genomic alignment format, described in detail in the PSL specification (PSL).

sam

Format to store genomic alignments (SAM).

gdl

gdl

tsv

Tab-separated values. In these tables, records are separated by newline characters and fields by tab characters. Comment lines start with the # character and are ignored. The first uncommented line should contain the column headers. For example:

# This is a comment
gene_id	length
gene1	1000
gene2	2000
# Another comment
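The conventions above (tab-separated fields, # comment lines, header on the first uncommented line) can be read with the Python standard library. This is a sketch only: `read_tsv` is a hypothetical helper, not a CGAT function, and it assumes comment lines start with # in the first column.

```python
import csv
import io


def read_tsv(stream):
    """Yield one dict per record, skipping '#' comment lines."""
    # Drop comment lines and blank lines before handing rows to csv.
    data_lines = (line for line in stream
                  if line.strip() and not line.startswith("#"))
    # The first remaining line is taken as the column header.
    yield from csv.DictReader(data_lines, delimiter="\t")


example = (
    "# This is a comment\n"
    "gene_id\tlength\n"
    "gene1\t1000\n"
    "gene2\t2000\n"
    "# Another comment\n"
)

records = list(read_tsv(io.StringIO(example)))
for record in records:
    print(record["gene_id"], record["length"])
```

All values are returned as strings; converting `length` to an integer is left to the caller.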

svg

pass

edge list

pass

fastq

Sequence format containing quality scores.

sra

sra

axt

axt

agp

AGP format

rdf

Resource description framework

Other terms

test directory

Directory that contains the test.yaml file and the input and reference files for testing scripts.

experiment

experiment

replicate

replicate

graph

graph

track

track

submit host

pass

execution host

pass

task

pass

sphinxreport

sphinxreport

query

pass

target

pass

code directory

pass

go

pass

goslim

pass

tss

Transcription start site

production pipeline

A pipeline that performs common tasks on a certain type of data. The idea of a production pipeline is to provide common preprocessing of data and a first look. A project pipeline might then take data from one or more production pipelines to glean biological insight.

project pipeline

A pipeline that is project specific. Usually code is developed first inside a project pipeline. When it becomes generally useful, it may be refactored into a production pipeline.

stdin

Unix standard input. Most CGAT tools read data from stdin.

stdout

Unix standard output. Most CGAT tools output data to stdout.

stderr

Unix standard error. This is where errors go.
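The division of labour between the three streams can be sketched as a small filter. `filter_records` is a hypothetical function written for illustration and is not taken from CGAT. It passes well-formed two-column lines through to the output stream and reports malformed ones on the error stream, so that data and diagnostics never mix.

```python
import io
import sys


def filter_records(infile, outfile, errfile):
    """Copy two-column, tab-separated lines; report the rest as errors."""
    n_bad = 0
    for line in infile:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            # Diagnostics go to the error stream, never to the data stream.
            errfile.write(f"skipping malformed line: {line!r}\n")
            n_bad += 1
            continue
        outfile.write(line)
    return n_bad


# Demonstration with an in-memory input; data goes to stdout and the
# complaint about the second line goes to stderr.
demo_input = io.StringIO("gene1\t1000\nbroken line\ngene2\t2000\n")
filter_records(demo_input, sys.stdout, sys.stderr)
```

In a command-line script the call would be `filter_records(sys.stdin, sys.stdout, sys.stderr)`, which lets the tool be chained with other tools through pipes.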

loglevel

Verbosity of logging information. The logging level is set with the --verbose option. Level 0 produces no logging output, level 1 produces information messages only, and level 2 additionally outputs debugging information.
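One way to picture the three levels is as a mapping onto Python's standard logging module. The sketch below is an illustration under assumptions, not the CGAT implementation: `configure_logging`, the logger name and the default level of 1 are all hypothetical.

```python
import argparse
import logging


def configure_logging(argv=None):
    """Return a logger whose verbosity follows a --verbose option."""
    parser = argparse.ArgumentParser()
    # Default of 1 is an assumption made for this example.
    parser.add_argument("--verbose", type=int, default=1, choices=[0, 1, 2])
    args = parser.parse_args(argv)

    logger = logging.getLogger("glossary_example")
    logger.handlers.clear()
    logger.propagate = False

    if args.verbose == 0:
        # Level 0: suppress everything, including critical messages.
        logger.addHandler(logging.NullHandler())
        logger.setLevel(logging.CRITICAL + 1)
    else:
        # Level 1: information messages; level 2: debugging output as well.
        logger.addHandler(logging.StreamHandler())  # writes to stderr
        logger.setLevel(logging.INFO if args.verbose == 1 else logging.DEBUG)
    return logger


log = configure_logging(["--verbose", "2"])
log.info("shown at levels 1 and 2")
log.debug("shown at level 2 only")
```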