compute stats from a bam-file

Purpose¶

This script takes a bam file as input and computes a few metrics by iterating over the file. The metrics output are:

Category	Content
total	total number of alignments in bam file
alignments_mapped	alignments mapped to a chromosome (bam flag)
alignments_unmapped	alignments unmapped (bam flag)
qc_fail	alignments failing QC (bam flag)
mate_unmapped	alignments in which the mate is unmapped (bam flag)
reverse	alignments in which read maps to reverse strand (bam flag)
mate_reverse	alignments in which mate maps to reverse strand (bam flag)
proper_pair	alignments in which both pairs have been mapped properly (according to the mapper) (bam flag)
read1	alignments for 1st read of pair (bam flag)
paired	alignments of reads that are paired (bam flag)
duplicate	read is PCR or optical duplicate (bam flag)
read2	alignment is for 2nd read of pair (bam flag)
secondary	alignment is not primary alignment
alignments_duplicates	number of alignments mapping to the same location
alignments_unique	number of alignments mapping to unique locations
reads_total	number of reads in file. Either given via –num-reads or deduc ed as the sum of mappend and unmapped reads
reads_mapped	number of reads mapping in file. Derived from the total number o f alignments and removing counts for multiple matches. Requires the NH flag to be set correctly.
reads_unmapped	number of reads unmapped in file. Assumes that there is only one entry per unmapped read.
reads_missing	number of reads missing, if number of reads given by –input-rea ds. Otherwise 0.
pairs_total	number of total pairs - this is the number of reads_total divided by two. If there were no pairs, pairs_total will be 0.
pairs_mapped	number of mapped pairs - this is the same as the number of proper pairs.

Additionally, the script outputs histograms for the following tags and scores.

NM: number of mismatches in alignments.
NH: number of hits of reads.
mapq: mapping quality of alignments.

Supplying a fastq file¶

If a fastq file is supplied (--fastq-file), the script will compute some additional summary statistics. However, as it builds a dictionary of all sequences, it will also require a good amount of memory. The additional metrics output are:

Category	Content
pairs_total	total number of pairs in input data
pairs_mapped	pairs in which both reads map
pairs_unmapped	pairs in which neither read maps
pairs_proper_unique	pairs which are proper and map uniquely.
pairs_incomplete_unique	pairs in which one of the reads maps uniquely, but the other does not map.
pairs_incomplete_multimapping	pairs in which one of the reads maps uniquely, but the other maps to multiple locations.
pairs_proper_duplicate	pairs which are proper and unique, but marked as duplicates.
pairs_proper_multimapping	pairs which are proper, but map to multiple locations.
pairs_not_proper_unique	pairs mapping uniquely, but not flagged as proper
pairs_other	pairs not in any of the above categories

Note that for paired-end data, any or /2 suffixes will be removed from the read name in the assumption that these have been removed in the bam file as well.

Usage¶

Example:

python bam2stats.py in.bam

This command will generate various statistics based on the supplied BAM file, such as percentage reads mapped and percentage reads mapped in pairs. The output looks like this:

category	counts	percent	of
alignments_total	32018	100.00	alignments_total
alignments_mapped	32018	100.00	alignments_total
alignments_unmapped	0	0.00	alignments_total
alignments_qc_fail	0	0.00	alignments_mapped
alignments_mate_unmapped	241	0.75	alignments_mapped
alignments_reverse	16016	50.02	alignments_mapped
alignments_mate_reverse	15893	49.64	alignments_mapped
alignments_proper_pair	30865	96.40	alignments_mapped
alignments_read1	16057	50.15	alignments_mapped
alignments_paired	32018	100.00	alignments_mapped
alignments_duplicate	0	0.00	alignments_mapped
alignments_read2	15961	49.85	alignments_mapped
alignments_secondary	0	0.00	alignments_mapped
alignments_filtered	31950	99.79	alignments_mapped
reads_total	34250	100.00	reads_total
reads_unmapped	0	0.00	reads_total
reads_mapped	32018	93.48	reads_total
reads_missing	2232	6.52	reads_total
reads_mapped_unique	32018	100.00	reads_mapped
reads_multimapping	0	0.00	reads_mapped
pairs_total	17125	100.00	pairs_total
pairs_mapped	17125	100.00	pairs_total
pairs_unmapped	0	0.00	pairs_total
pairs_proper_unique	14880	86.89	pairs_total
pairs_incomplete_unique	2232	13.03	pairs_total
pairs_incomplete_multimapping	0	0.00	pairs_total
pairs_proper_duplicate	0	0.00	pairs_total
pairs_proper_multimapping	0	0.00	pairs_total
pairs_not_proper_unique	13	0.08	pairs_total
pairs_other	0	0.00	pairs_total
read1_total	17125	100.00	read1_total
read1_unmapped	0	0.00	read1_total
read1_mapped	16057	93.76	read1_total
read1_mapped_unique	16057	100.00	read1_mapped
reads_multimapping	0	0.00	read1_mapped
read1_missing	1068	6.65	read1_total
read2_total	17125	100.00	read2_total
read2_unmapped	0	0.00	read2_total
read2_mapped	15961	93.20	read2_total
read2_mapped_unique	15961	100.00	read2_mapped
reads_multimapping	0	0.00	read2_mapped
read2_missing	1164	7.29	read2_total

The first column contains the caterogy, the second the number of counts and the third a percentage. The fourth column denotes the denomiminator that was used to compute the percentage. In the table above, wee see that 16,057 first reads in a pair map and 15,961 second reads in pair map, resulting in 14,880 proper uniquely mapped pairs.

Type:

cgat bam2stats --help

for command line help.

Bam2stats can read from standard input:

cat in.bam | python bam2stats.py -

Documentation¶

Reads are not counted via read name, but making use of NH and HI flags when present. To recap, NH is the number of reported alignments that contain the query in the current record, while HI is the hit index and ranges from 0 to NH-1.

Unfortunately, not all aligners follow this convention. For example, gsnap seems to set NH to the number of reportable alignments, while the actual number of reported alignments in the file is less. Thus, if the HI flag is present, the maximum HI is used to correct the NH flag. The assumption is, that the same reporting threshold has been used for all alignments.

If no NH flag is present, it is assumed that all reads have only been reported once.

Multi-matching counts after filtering are really guesswork. Basically, the assumption is that filtering is consistent and will tend to remove all alignments of a query.

The error rates are computed using the following key:

substitution_rate: Number of mismatches divided by number of aligned bases. This is the same as the samtools stats error rate.
insertion_rate: Number of deletions in the read/insertions in the reference divided by the number of aligned bases.
deletion_rate: Number of insertions in the read/deletions in the reference divided by the number of aligned bases.
error_rate: Number of mismatches and deletions in the read divided by the number of aligned bases.
coverage: Percentage of bases aligned divided by read length.

The following graphic illustrates the computation. A . signifies a position that is included in the metric with X being an error:

AAAAACAAAA AAAAAAAA   Reference
 AAAAAAAAAAAA AAA     Read
 ....X.... ......     substitution_rate NM / (CMATCH + CINS)
 .........X.. ...     insertion_rate CINS / (CMATCH + CINS)
 ......... ..X...     deletion_rate CDEL / (CMATCH + CDEL)
 ....X.... ..X...     error_rate NM / (CMATCH + CINS) (corresponds to samtools stats)
 .........X.. ...     match_rate CMATCH / (CMATCH + CINS)
 ....X.... .. ...     mismatch_rate NM / (CMATCH) (1 - percent_identity/100)

With CINS: Insertion into the reference (consumes read, but not reference) and CDEL=Deletion from the reference (consumes reference, but not read).

Command line options¶

usage: bam2stats [-h] [--version] [-r GFF] [-f] [-i INPUT_READS] [-d]
                 [--output-readmap] [--add-alignment-details]
                 [-q FILENAME_FASTQ] [--basic-counts] [--timeit TIMEIT_FILE]
                 [--timeit-name TIMEIT_NAME] [--timeit-header]
                 [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                 [--log-config-filename LOG_CONFIG_FILENAME]
                 [--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
                 [-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]
bam2stats: error: argument -?: expected one argument