bam2geneprofile.py - build meta-gene profile for a set of transcripts/genes¶
- Tags
Genomics NGS Genesets Intervals GTF BAM Summary
Purpose¶
This script takes a gtf formatted file, a short reads bam formatted file and computes meta-gene profiles over various annotations derived from the gtf file.
A meta-gene profile is an abstract genomic entity over which reads stored in a bam formatted file have been counted. A meta-gene might be an idealized eukaryotic gene (upstream, exonic sequence, downstream) or any other genomic landmark of interest such as transcription start sites.
The script can be used to visualize binding profiles of a chromatin mark in gene bodies, binding of transcription factors in promotors or sequencing bias (e.g. 3’ bias) in RNA-Seq data.
This script is designed with a slight emphasis on RNA-Seq datasets. For example, it takes care of spliced reads, by using the CIGAR string in the BAM file to accurately define aligned bases (if the –base-accurate is specified, currently this feature is turned off by default for speed reasons).
Alternatively, for the purpose of visualizing binding profiles of transcription factor ChIP-Seq without the need to use any genomic annotations (ENSEMBL, or refseq), you may also consider using bam2peakshape.py - compute peak shape features from a bam-file, which is designed with a slight emphasis on Chip-Seq datasets. For example, bam2peakshape.py - compute peak shape features from a bam-file is able to center the counting window to the summit of every individual peak. bam2peakshape.py - compute peak shape features from a bam-file is also able to: (1) plot the control ChIP-Seq library to enable side-by-side comparison; (2) randomize the given regions to provide a semi-control.
Usage¶
Quick start examples¶
The following command will generate the gene profile plot similar to Fig 1(a) in the published cgat paper, but using a test dataset that is much smaller and simpler than the dataset used for publishing the cgat paper.
python ./scripts/bam2geneprofile.py
--bam-file=./tests/bam2geneprofile.py/multipleReadsSplicedOutAllIntronsAndSecondExon.bam
--gtf-file=./tests/bam2geneprofile.py/onegeneWithoutAnyCDS.gtf.gz
--method=geneprofile
--reporter=gene
In the following, a slightly more involved example will use more features of this script. The following command generate the gene profile showing base accuracy of upstream (500bp), exons, introns and downstream(500bp) of a gene model from some user supplied RNA-Seq data and geneset.
python ./scripts/bam2geneprofile.py
--bam-file=./rnaseq.bam
--gtf-file=./geneset.gtf.gz
--method=geneprofilewithintrons
--reporter=gene
--extension-upstream=500
--resolution-upstream=500
--extension-downstream=500
--resolution-downstream=500
The output will contain read coverage over genes. The profile will contain four separate segments:
the upstream region of a gene ( set to be 500bp ), (
--extension-upstream=500
).the transcribed region of a gene. The transcribed region of every gene will be scaled to 1000 bp (default), shrinking longer transcripts and expanding shorter transcripts.
the intronic regions of a gene. These will be scaled to 1000b (default).
the downstream region of a gene (set to be 500bp), (
--extension-downstream=500
).
Detailed explaination¶
The bam2geneprofile.py
script reads in a set of transcripts
from a gtf formatted file. For each transcript, overlapping
reads from the provided bam file are collected. The counts
within the transcript are then mapped onto the meta-gene structure and
counts are aggregated over all transcripts in the gtf file.
Bam files need to be sorted by coordinate and indexed.
A meta-gene structure has two components - regions of variable size, such as exons, introns, etc, which nevertheless have a fixed start and end coordinate in a transcript. The other component are regions of fixed width, such a regions of a certain size upstream or downstream of a landmark such as a transcription start site.
The size of the former class, regions of variable size, can be varied
with --resolution
options. For example, the option
--resolution-upstream-utr=1000
will create a meta-gene with a
1000bp upstream UTR region. UTRs that are larger will be compressed,
and UTRs that are smaller, will be stretched to fit the 1000bp
meta-gene UTR region.
The size of fixed-width regions can be set with --extension
options. For example, the options --extension-upstream
will set
the size of the uptsream extension region to 1000bp. Note that no
scaling is required when counting reads towards the fixed-width
meta-gene profile.
Type:
python bam2geneprofile.py --help
for command line help.
Options¶
The script provides a variety of different meta-gene structures i.e.
geneprofiles, selectable via using the option: (--method
).
Profiles¶
Different profiles are accessible through the --method
option. Multiple
methods can be applied at the same time. While upstream
and downstream
typically have a fixed size, the other regions such as CDS
, UTR
will be
scaled to a common size.
- utrprofile
UPSTREAM - UTR5 - CDS - UTR3 - DOWNSTREAM gene models with UTR. Separate the coding section from the non-coding part.
- geneprofile
UPSTREAM - EXON - DOWNSTREAM simple exonic gene models
- geneprofilewithintrons
UPSTREAM - EXON - INTRON - DOWNSTREAM
gene models containing also intronic sequence, only correct if used with
--use-base-accuracy
option.- separateexonprofile
UPSTREAM - FIRST EXON - EXON - LAST EXON - DOWNSTREAM
gene models with the first and last exons separated out from all other exons. Only applicable to gene models with > 1 exons.
- separateexonprofilewithintrons
UPSTREAM - FIRST EXON - EXON - INTRON - LAST EXON - DOWNSTREAM
gene models with first and last exons separated out, and includes all introns together. Excludes genes with < 2 exons and no introns.
geneprofileabsolutedistancefromthreeprimeend
UPSTREAM - EXON (absolute distance, see below) - INTRON (absolute distance, see below) - DOWNSTREAM (the downstream of the exons) region, the script counts over the mRNA transcript only, skipping introns. Designed to visualize the 3 prime bias in RNASeq data, only correct if used together with
--use-base-accuracy
option.absolute distance: In order to to visualize the 3 prime bias, genes are not supposed to be streched to equal length as it did in all other counting methods. In this counting method, we first set a fix length using
--extension-exons-absolute-distance-topolya
, the script will discard genes shorter than this fixed length. For genes (when all the exons stitched together) longer than this fixed length, the script will only count over this fixed length ( a absolute distance ) from three prime end, instead of compress the longer genes. Same goes for absolute distance intron counting.
- tssprofile
UPSTREAM - DOWNSTREAM transcription start/stop sites
- intervalprofile
UPSTREAM - INTERVAL - DOWNSTREAM Similar to geneprofile, but count over the complete span of the gene (including introns).
- midpointprofile
UPSTREAM - DOWNSTREAM aggregate over midpoint of gene model
Normalization¶
Normalization can be applied in two stages of the computation.
Count vector normalization¶
Before adding counts to the meta-gene profile, the profile for the individual transcript can be normalized. Without normalization, highly expressed genes will contribute more to the meta-gene profile than lowly expressed genes. Normalization can assure that each gene contributes an equal amount.
Normalization is applied to the vector of read counts that is computed
for each transcript. Normalization can be applied for the whole
transcript (total
) or on a per segment basis depending on the
counter. For example, in the gene counter, exons, upstream and
downstream segments can be normalized independently.
Counts can be normalized either by the maximum or the sum of all
counts in a segment or across the whole transcript. Normalization is
controlled with the command line option --normalize-trancript
. Its
arguments are:
none
: no normalizationsum
: sum of counts within a region (exons, upstream, …). The area under the curve will sum to 1 for each region.max
: maximum count within a region (exons,upstream, …).total-sum
: sum of counts across all regions. The area under the curve will sum to 1 for the complete transcript.total-max
: maximum count across all regions.
The options above control the contribution of individual transcripts to a meta-gene profile and are thus suited for example for RNA-Seq data.
The options above do not control for different read-depths or any local biases. To compare meta-gene profiles between samples, additional normalization is required.
Meta-gene profile normalization¶
To enable comparison between experiments, the meta-gene profile itself can be normalized. Normalization a profile can help comparing the shapes of profiles between different experiments independent of the number of reads or transcripts used in the construction of the meta-gene profile.
Meta-gene profile normalization is controlled via the
--normalize-profile
option. Possible normalization are:
none: no normalization
area: normalize such that the area under the meta-gene profile is 1.
counts: normalize by number of features (genes,tss) that have been counted.
background: normalize with background (see below).
A special normalization is activated with the background
option.
Here, the counts at the left and right most regions are used to
estimate a background level for each transcript. The counts are then
divided by this background-level. The assumption is that the meta-gene
model is computed over a large enough area to include genomic
background.
Genes versus transcripts¶
The default is to collect reads on a per-transcript
level. Alternatively, the script can merge all transcripts of a gene
into a single virtual transcript. Note that this virtual transcript
might not be a biologically plausible transcript. It is usually better
to provide bam2geneprofile.py
with a set of representative
transcripts per gene in order to avoid up-weighting genes with
multiple transcripts.
Control¶
If control files (chip-seq input tracks) are supplied, counts in the control file can be used to compute a fold-change.
Bed and wiggle files¶
The densities can be computed from bed or wiggle
formatted files. If a bed formatted file is supplied, it must
be compressed with and indexed with tabix
.
Note
Paired-endedness is ignored. Both ends of a paired-ended read are treated individually.
Command line options¶
usage: bam2geneprofile [-h] [--version]
[-m {geneprofile,tssprofile,utrprofile,intervalprofile,midpointprofile,geneprofilewithintrons,geneprofileabsolutedistancefromthreeprimeend,separateexonprofile,separateexonprofilewithintrons}]
[-b BAM] [-c BAM] [-g GTF]
[--normalize-transcript {none,max,sum,total-max,total-sum}]
[--normalize-profile {all,none,area,counts,background}]
[-r {gene,transcript}] [-i SHIFTS] [-a] [-u]
[-e EXTENDS]
[--resolution-upstream RESOLUTION_UPSTREAM]
[--resolution-downstream RESOLUTION_DOWNSTREAM]
[--resolution-upstream-utr RESOLUTION_UPSTREAM_UTR]
[--resolution-downstream-utr RESOLUTION_DOWNSTREAM_UTR]
[--resolution-cds RESOLUTION_CDS]
[--resolution-first-exon RESOLUTION_FIRST]
[--resolution-last-exon RESOLUTION_LAST]
[--resolution-introns RESOLUTION_INTRONS]
[--resolution-exons-absolute-distance-topolya RESOLUTION_EXONS_ABSOLUTE_DISTANCE_TOPOLYA]
[--resolution-introns-absolute-distance-topolya RESOLUTION_INTRONS_ABSOLUTE_DISTANCE_TOPOLYA]
[--extension-exons-absolute-distance-topolya EXTENSION_EXONS_ABSOLUTE_DISTANCE_TOPOLYA]
[--extension-introns-absolute-distance-topolya EXTENSION_INTRONS_ABSOLUTE_DISTANCE_TOPOLYA]
[--extension-upstream EXTENSION_UPSTREAM]
[--extension-downstream EXTENSION_DOWNSTREAM]
[--extension-inward EXTENSION_INWARD]
[--extension-outward EXTENSION_OUTWARD]
[--scale-flank-length SCALE_FLANKS]
[--control-factor CONTROL_FACTOR]
[--output-all-profiles]
[--counts-tsv-file INPUT_FILENAME_COUNTS]
[--background-region-bins BACKGROUND_REGION_BINS]
[--output-res RESOLUTION_IMAGES]
[--image-format IMAGE_FORMAT] [--timeit TIMEIT_FILE]
[--timeit-name TIMEIT_NAME] [--timeit-header]
[--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?]
[-P OUTPUT_FILENAME_PATTERN] [-F] [-I STDIN]
[-L STDLOG] [-E STDERR] [-S STDOUT]
bam2geneprofile: error: argument -?: expected one argument