bam2bam.py - modify bam files

Purpose

This script reads a bam formatted file from stdin, performs an action (see methods below) then outputs a modified bam formatted file on stdout.

Note

You need to redirect logging information to a file (via -L) or turn it off via -v 0 in order to get a valid sam/bam file.

Documentation

The script implements the following methods:

set-nh

set the NH flag. Some tools (bowtie, bwa) do not set the NH flag. If set, this option will set the NH flag (for mapped reads). This option requires the bam/sam file to be sorted by read name.

unset-unmapped_mapq

some tools set the mapping quality of unmapped reads. This causes a violation in the Picard tools.

filter

remove alignments based on a variety of flags. The filtering method is determined by the --filter-method option. These may be unique, non-unique, mapped, NM or CM. If unique is set, only uniquely mapping reads will be output. If non-unique is set then only multi-mapping reads will be output. This method first checks for the NH flag - if set, a unique match should have at most NH=1 hits. If not set, the method checks for BWA flags. Currently it checks if X0 is set (X0=Number of best hits found by BWA). If mapped is given, unmapped reads will be removed. If NM or CM is set, the alignment of reads in two sam files (input and reference) is compared and only reads with a lower number of mismatches in the input compared to the reference sam file will be kept. If CM is set, the colourspace mismatch tag (for ABI Solid reads) will be used to count differences to the reference sam file. By default, the NM (number of mismatches) tag is used. The tag that is used needs to present in both input sam file and the reference sam file. If unique is given this wil NOT remove any unmapped reads. This can be achieved by providing the filter option twice, once each with mapped and unique.

Note

The filter methods can’t currently combined with any of the other methods - this is work in progress.

strip-sequence

remove the sequence from all reads in a bam-file. Note that stripping the sequence will also remove the quality scores. Stripping is not reversible if the read names are not unique.

strip-quality

remove the quality scores from all reads in a bam-file. Stripping is not reversible if the read names are not unique.

set-sequence

set the sequence and quality scores in the bam file to some dummy values (‘A’ for sequence, ‘F’ for quality which is a valid score in most fastq encodings. Necessary for some tools that can not work with bam-files without sequence.

unstrip

add sequence and quality scores back to a bam file. Requires a fastq formatted file with the sequences and quality scores to insert.

unset-unmapped-mapq

sets the mapping quality of unmapped reads to 0.

keep-first-base

keep only the first base of reads so that read counting tools will only consider the first base in the counts

downsample-single

generates a downsampled bam file by randomly subsampling reads from a single ended bam file. The downsmpling retains multimapping reads. The use of this requires downsampling parameter to be set and optionally randomseed.

downsample-paired

generates a downsampled bam file by randomly subsampling reads from a paired ended bam file. The downsampling retains multimapping reads. The use of this requires downsampling parameter to be set and optionally randomseed.

add-sequence-error

add a certain amount of random error to read sequences. This method picks a certain proportion of positions within a read’s sequence and alters the nucleotide to a randomly chosen alternative. The model is naive and applies uniform probabilities for positions and nucleotides. The method does not update base qualities, the alignment and the NM flag. As a result, error rates that are computed via the NM flag will be unaffected. The error rate is set by –error-rate.

By default, the script works from stdin and outputs to stdout.

Usage

For example:

cgat bam2bam --method=filter --filter-method=mapped < in.bam > out.bam

will remove all unmapped reads from the bam-file.

Example for running downsample:

cgat bam2bam –method=downsample-paired –downsample=30000 –randomseed=1 -L out.log < Paired.bam > out.bam

Type:

cgat bam2bam --help

for command line help.

Command line options

usage: bam2bam [-h] [--version]
               [-m {filter,keep-first-base,set-nh,set-sequence,strip-sequence,strip-quality,unstrip,unset-unmapped-mapq,downsample-single,downsample-paired,add-sequence-error}]
               [--strip-method {all,match}]
               [--filter-method {NM,CM,mapped,unique,non-unique,remove-list,keep-list,error-rate,min-read-length,min-average-base-quality}]
               [--reference-bam-file REFERENCE_BAM] [--force-output]
               [--output-sam] [--first-fastq-file FASTQ_PAIR1]
               [--second-fastq-file FASTQ_PAIR2] [--downsample DOWNSAMPLE]
               [--filename-read-list FILENAME_READ_LIST]
               [--error-rate ERROR_RATE]
               [--minimum-read-length MINIMUM_READ_LENGTH]
               [--minimum-average-base-quality MINIMUM_AVERAGE_BASE_QUALITY]
               [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
               [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
               [--log-config-filename LOG_CONFIG_FILENAME]
               [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
               [-E STDERR] [-S STDOUT]
bam2bam: error: argument -?: expected one argument