compute stats from a bam-file
Purpose¶
This script takes a bam file as input and computes a few metrics by iterating over the file. The metrics output are:
Category |
Content |
total |
total number of alignments in bam file |
alignments_mapped |
alignments mapped to a chromosome (bam flag) |
alignments_unmapped |
alignments unmapped (bam flag) |
qc_fail |
alignments failing QC (bam flag) |
mate_unmapped |
alignments in which the mate is unmapped (bam flag) |
reverse |
alignments in which read maps to reverse strand (bam flag) |
mate_reverse |
alignments in which mate maps to reverse strand (bam flag) |
proper_pair |
alignments in which both pairs have been mapped properly (according to the mapper) (bam flag) |
read1 |
alignments for 1st read of pair (bam flag) |
paired |
alignments of reads that are paired (bam flag) |
duplicate |
read is PCR or optical duplicate (bam flag) |
read2 |
alignment is for 2nd read of pair (bam flag) |
secondary |
alignment is not primary alignment |
alignments_duplicates |
number of alignments mapping to the same location |
alignments_unique |
number of alignments mapping to unique locations |
reads_total |
number of reads in file. Either given via –num-reads or deduc ed as the sum of mappend and unmapped reads |
reads_mapped |
number of reads mapping in file. Derived from the total number o f alignments and removing counts for multiple matches. Requires the NH flag to be set correctly. |
reads_unmapped |
number of reads unmapped in file. Assumes that there is only one entry per unmapped read. |
reads_missing |
number of reads missing, if number of reads given by –input-rea ds. Otherwise 0. |
pairs_total |
number of total pairs - this is the number of reads_total divided by two. If there were no pairs, pairs_total will be 0. |
pairs_mapped |
number of mapped pairs - this is the same as the number of proper pairs. |
Additionally, the script outputs histograms for the following tags and scores.
NM: number of mismatches in alignments.
NH: number of hits of reads.
mapq: mapping quality of alignments.
Supplying a fastq file¶
If a fastq file is supplied (--fastq-file
), the script will
compute some additional summary statistics. However, as it builds a dictionary
of all sequences, it will also require a good amount of memory. The additional
metrics output are:
Category |
Content |
pairs_total |
total number of pairs in input data |
pairs_mapped |
pairs in which both reads map |
pairs_unmapped |
pairs in which neither read maps |
pairs_proper_unique |
pairs which are proper and map uniquely. |
pairs_incomplete_unique |
pairs in which one of the reads maps uniquely, but the other does not map. |
pairs_incomplete_multimapping |
pairs in which one of the reads maps uniquely, but the other maps to multiple locations. |
pairs_proper_duplicate |
pairs which are proper and unique, but marked as duplicates. |
pairs_proper_multimapping |
pairs which are proper, but map to multiple locations. |
pairs_not_proper_unique |
pairs mapping uniquely, but not flagged as proper |
pairs_other |
pairs not in any of the above categories |
Note that for paired-end data, any or
/2
suffixes will be
removed from the read name in the assumption that these have been removed
in the bam file as well.
Usage¶
Example:
python bam2stats.py in.bam
This command will generate various statistics based on the supplied BAM file, such as percentage reads mapped and percentage reads mapped in pairs. The output looks like this:
category |
counts |
percent |
of |
alignments_total |
32018 |
100.00 |
alignments_total |
alignments_mapped |
32018 |
100.00 |
alignments_total |
alignments_unmapped |
0 |
0.00 |
alignments_total |
alignments_qc_fail |
0 |
0.00 |
alignments_mapped |
alignments_mate_unmapped |
241 |
0.75 |
alignments_mapped |
alignments_reverse |
16016 |
50.02 |
alignments_mapped |
alignments_mate_reverse |
15893 |
49.64 |
alignments_mapped |
alignments_proper_pair |
30865 |
96.40 |
alignments_mapped |
alignments_read1 |
16057 |
50.15 |
alignments_mapped |
alignments_paired |
32018 |
100.00 |
alignments_mapped |
alignments_duplicate |
0 |
0.00 |
alignments_mapped |
alignments_read2 |
15961 |
49.85 |
alignments_mapped |
alignments_secondary |
0 |
0.00 |
alignments_mapped |
alignments_filtered |
31950 |
99.79 |
alignments_mapped |
reads_total |
34250 |
100.00 |
reads_total |
reads_unmapped |
0 |
0.00 |
reads_total |
reads_mapped |
32018 |
93.48 |
reads_total |
reads_missing |
2232 |
6.52 |
reads_total |
reads_mapped_unique |
32018 |
100.00 |
reads_mapped |
reads_multimapping |
0 |
0.00 |
reads_mapped |
pairs_total |
17125 |
100.00 |
pairs_total |
pairs_mapped |
17125 |
100.00 |
pairs_total |
pairs_unmapped |
0 |
0.00 |
pairs_total |
pairs_proper_unique |
14880 |
86.89 |
pairs_total |
pairs_incomplete_unique |
2232 |
13.03 |
pairs_total |
pairs_incomplete_multimapping |
0 |
0.00 |
pairs_total |
pairs_proper_duplicate |
0 |
0.00 |
pairs_total |
pairs_proper_multimapping |
0 |
0.00 |
pairs_total |
pairs_not_proper_unique |
13 |
0.08 |
pairs_total |
pairs_other |
0 |
0.00 |
pairs_total |
read1_total |
17125 |
100.00 |
read1_total |
read1_unmapped |
0 |
0.00 |
read1_total |
read1_mapped |
16057 |
93.76 |
read1_total |
read1_mapped_unique |
16057 |
100.00 |
read1_mapped |
reads_multimapping |
0 |
0.00 |
read1_mapped |
read1_missing |
1068 |
6.65 |
read1_total |
read2_total |
17125 |
100.00 |
read2_total |
read2_unmapped |
0 |
0.00 |
read2_total |
read2_mapped |
15961 |
93.20 |
read2_total |
read2_mapped_unique |
15961 |
100.00 |
read2_mapped |
reads_multimapping |
0 |
0.00 |
read2_mapped |
read2_missing |
1164 |
7.29 |
read2_total |
The first column contains the caterogy, the second the number of counts and the third a percentage. The fourth column denotes the denomiminator that was used to compute the percentage. In the table above, wee see that 16,057 first reads in a pair map and 15,961 second reads in pair map, resulting in 14,880 proper uniquely mapped pairs.
Type:
cgat bam2stats --help
for command line help.
Bam2stats can read from standard input:
cat in.bam | python bam2stats.py -
Documentation¶
Reads are not counted via read name, but making use of NH and HI flags when present. To recap, NH is the number of reported alignments that contain the query in the current record, while HI is the hit index and ranges from 0 to NH-1.
Unfortunately, not all aligners follow this convention. For example, gsnap seems to set NH to the number of reportable alignments, while the actual number of reported alignments in the file is less. Thus, if the HI flag is present, the maximum HI is used to correct the NH flag. The assumption is, that the same reporting threshold has been used for all alignments.
If no NH flag is present, it is assumed that all reads have only been reported once.
Multi-matching counts after filtering are really guesswork. Basically, the assumption is that filtering is consistent and will tend to remove all alignments of a query.
The error rates are computed using the following key:
- substitution_rate
Number of mismatches divided by number of aligned bases. This is the same as the samtools stats error rate.
- insertion_rate
Number of deletions in the read/insertions in the reference divided by the number of aligned bases.
- deletion_rate
Number of insertions in the read/deletions in the reference divided by the number of aligned bases.
- error_rate
Number of mismatches and deletions in the read divided by the number of aligned bases.
- coverage
Percentage of bases aligned divided by read length.
The following graphic illustrates the computation. A . signifies a position that is included in the metric with X being an error:
AAAAACAAAA AAAAAAAA Reference
AAAAAAAAAAAA AAA Read
....X.... ...... substitution_rate NM / (CMATCH + CINS)
.........X.. ... insertion_rate CINS / (CMATCH + CINS)
......... ..X... deletion_rate CDEL / (CMATCH + CDEL)
....X.... ..X... error_rate NM / (CMATCH + CINS) (corresponds to samtools stats)
.........X.. ... match_rate CMATCH / (CMATCH + CINS)
....X.... .. ... mismatch_rate NM / (CMATCH) (1 - percent_identity/100)
With CINS: Insertion into the reference (consumes read, but not reference) and CDEL=Deletion from the reference (consumes reference, but not read).
Command line options¶
usage: bam2stats [-h] [--version] [-r GFF] [-f] [-i INPUT_READS] [-d]
[--output-readmap] [--add-alignment-details]
[-q FILENAME_FASTQ] [--basic-counts] [--timeit TIMEIT_FILE]
[--timeit-name TIMEIT_NAME] [--timeit-header]
[--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
[-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]
bam2stats: error: argument -?: expected one argument