fastq2table.py - compute stats on reads in fastq files

Tags

Genomics NGS Sequences FASTQ Annotation

Purpose

This script iterates over a fastq file and outputs summary statistics for each read.

The output is a tab-delimited text file with the following columns:

Column    Content
read      read identifier present in input fastq file
nfailed   number of bases in the read that fall below Q10
nN        number of ambiguous base calls (N)
nval      number of bases in the read
min       minimum base quality score for the read
max       maximum base quality score for the read
mean      mean base quality score for the read
median    median base quality score for the read
stddev    standard deviation of quality scores for the read
sum       sum of quality scores for the read
q1        25th percentile of quality scores for the read
q3        75th percentile of quality scores for the read
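To make the column definitions concrete, the following is a minimal sketch of how these per-read statistics could be computed with the Python standard library. It is not the script's own implementation: the function name `read_stats`, the Phred+33 offset, the Q10 threshold parameter, the use of the population standard deviation and the default quantile method of `statistics.quantiles` are all assumptions made for illustration.

```python
import statistics


def read_stats(identifier, sequence, quality, offset=33, min_quality=10):
    """Summarise one fastq record (hypothetical helper, not part of cgat).

    Assumes Phred+offset encoded qualities (offset=33 for sanger).
    """
    # Decode each quality character into an integer quality score.
    scores = [ord(ch) - offset for ch in quality]
    # Quartiles; the interpolation method used here is an assumption.
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return {
        "read": identifier,
        "nfailed": sum(1 for s in scores if s < min_quality),
        "nN": sequence.upper().count("N"),
        "nval": len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # Population standard deviation (assumed, not sample).
        "stddev": statistics.pstdev(scores),
        "sum": sum(scores),
        "q1": q1,
        "q3": q3,
    }


if __name__ == "__main__":
    row = read_stats(
        "DHKW5DQ1:308:D28FGACXX:5:2211:8051:4398",
        "ACAATGTCCTGATGTGAATGCCCCTACTATTCAGATCGCTTAGGGCATGC",
        "B1=?DFDDHHFFHIJJIJGGIJGFIEE9CHIIFEGGIIJGIGIGIIDGHI",
    )
    # Print one tab-delimited line, floats with four decimals.
    print("\t".join(
        f"{v:.4f}" if isinstance(v, float) else str(v) for v in row.values()
    ))
```

`statistics.quantiles` requires Python 3.8 or later and at least two quality values per read.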

Usage

Example:

cgat fastq2table --guess-format=sanger < in.fastq > out.tsv

In this example we know that the quality scores in our data are in sanger format, so we pass --guess-format=sanger. Because illumina-1.8 quality scores overlap almost completely with sanger scores, this option defaults to sanger qualities. In default mode the script may not be able to distinguish between formats whose quality score ranges overlap heavily.
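The overlap problem can be illustrated with a short sketch that lists every encoding whose ASCII range is compatible with the observed quality characters. The ranges below are commonly cited values for each encoding and are assumptions for illustration; they are not taken from the script, and `candidate_formats` is a hypothetical helper.

```python
# Illustrative ASCII code ranges per encoding (assumed values,
# not the table used by fastq2table itself).
ASCII_RANGES = {
    "sanger": (33, 126),        # Phred+33, scores 0..93
    "illumina-1.8": (33, 74),   # Phred+33, scores 0..41
    "solexa": (59, 126),        # Solexa+64, scores -5..62
    "phred64": (64, 126),       # Phred+64, scores 0..62
}


def candidate_formats(quality_strings):
    """Return all formats whose range contains every observed character."""
    codes = [ord(ch) for q in quality_strings for ch in q]
    lowest, highest = min(codes), max(codes)
    return [
        name
        for name, (start, end) in ASCII_RANGES.items()
        if start <= lowest and highest <= end
    ]


if __name__ == "__main__":
    quals = ["B1=?DFDDHHFFHIJJIJGGIJGFIEE9CHIIFEGGIIJGIGIGIIDGHI"]
    # Both Phred+33 formats remain candidates, so the data alone
    # cannot decide between them.
    print(candidate_formats(quals))
```

When more than one candidate remains, as happens for sanger and illumina-1.8 data, the format has to be supplied explicitly.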

If we provide two reads to the script:

@DHKW5DQ1:308:D28FGACXX:5:2211:8051:4398
ACAATGTCCTGATGTGAATGCCCCTACTATTCAGATCGCTTAGGGCATGC
+
B1=?DFDDHHFFHIJJIJGGIJGFIEE9CHIIFEGGIIJGIGIGIIDGHI
@DHKW5DQ1:308:D28FGACXX:5:1315:15039:83265
GAATGCCCCTACTATTCAGATCGCTTAGGGCATGCGTCGCATGTGAGTAA
+
@@@FDFFFHGHHHJIIIJIGHIJJIGHGHC9FBFBGHIIEGHIGC>F@FA

we get the following table as output:

read                                        nfailed  nN  nval  min      max      mean     median   stddev  sum        q1       q3
DHKW5DQ1:308:D28FGACXX:5:2211:8051:4398     0        0   50    16.0000  41.0000  37.2000  38.0000  4.4900  1860.0000  36.0000  40.0000
DHKW5DQ1:308:D28FGACXX:5:1315:15039:83265   0        0   50    24.0000  41.0000  37.0200  38.0000  3.5916  1851.0000  36.0000  40.0000
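Because the output is tab-delimited with a header row, it can be post-processed with standard tools. The sketch below selects reads whose mean quality falls under a threshold. The helper name `low_quality_reads`, the threshold of 30 and the skipping of lines that start with "#" (in case log lines are written to the same stream) are assumptions for illustration.

```python
import csv
import io


def low_quality_reads(handle, min_mean=30.0):
    """Return identifiers of reads whose mean quality is below min_mean.

    Hypothetical helper; expects the column names described above.
    """
    # Skip comment lines (assumed to start with '#') before parsing.
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    return [row["read"] for row in reader if float(row["mean"]) < min_mean]


if __name__ == "__main__":
    # Small in-memory stand-in for out.tsv with two made-up rows.
    example = io.StringIO(
        "read\tnfailed\tnN\tnval\tmin\tmax\tmean\n"
        "readA\t0\t0\t50\t16.0000\t41.0000\t37.2000\n"
        "readB\t12\t1\t50\t2.0000\t38.0000\t21.5000\n"
    )
    print(low_quality_reads(example))
```

For a real run, open the file written by the command above (for example `open("out.tsv")`) and pass the handle to the function.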

Type:

cgat fastq2table --help

for command line help.

Command line options

usage: fastq2table [-h] [--version]
                   [--guess-format {sanger,solexa,phred64,illumina-1.8,integer}]
                   [--target-format {sanger,solexa,phred64,illumina-1.8,integer}]
                   [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                   [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                   [--log-config-filename LOG_CONFIG_FILENAME]
                   [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                   [-E STDERR] [-S STDOUT]