fastq2fastq.py - manipulate fastq files
- Tags
Genomics NGS Sequences FASTQ Manipulation
Purpose
This script manipulates fastq-formatted files. For example, it can change the quality score format or sample a subset of reads.
The script is predominantly used to manipulate single fastq files. However, some of its functionality accepts paired data through the --pair-fastq-file and --output-filename-pattern options. This applies to the sample and sort methods.
Usage
- Example::
In this example we randomly sample 50% of reads from paired data provided in two fastq files.
head in.fastq.1
@SRR111956.1 HWUSI-EAS618:7:1:27:1582 length=36
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR111956.1 HWUSI-EAS618:7:1:27:1582 length=36
=@A@9@BAB@;@BABA?=;@@BB<A@9@;@2>@;??
@SRR111956.2 HWUSI-EAS618:7:1:29:1664 length=36
CCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCC
+SRR111956.2 HWUSI-EAS618:7:1:29:1664 length=36
=B@9@0>A<B=B=AAA?;*(@A>(@<=*9=9@BA>7
@SRR111956.3 HWUSI-EAS618:7:1:38:878 length=36
AGTGAGCAGGGAAACAATGTCTGTCTAAGAATTTGA

head in.fastq.2
+SRR111956.3 HWUSI-EAS618:7:1:38:878 length=36
<?@BA?;A=@BA>;@@7###################
@SRR111956.4 HWUSI-EAS618:7:1:38:1783 length=36
ATTAGTATTATCCATTTATATAATCAATAAAAATGT
+SRR111956.4 HWUSI-EAS618:7:1:38:1783 length=36
?ABBA2CCBBB2?=BB@C>=AAC@A=CBB#######
@SRR111956.5 HWUSI-EAS618:7:1:39:1305 length=36
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR111956.5 HWUSI-EAS618:7:1:39:1305 length=36
AA>5;A>*91?=AAA@@BBA<B=?ABA>2>?A<BB@
- command-line::
cat in.fastq.1 | python fastq2fastq.py --method=sample --sample-size 0.5 --pair-fastq-file in.fastq.2 --output-filename-pattern out.fastq.2 > out.fastq.1

head out.fastq.1
@SRR111956.1 HWUSI-EAS618:7:1:27:1582 length=36
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+
=@A@9@BAB@;@BABA?=;@@BB<A@9@;@2>@;??
@SRR111956.2 HWUSI-EAS618:7:1:29:1664 length=36
CCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCC
+
=B@9@0>A<B=B=AAA?;*(@A>(@<=*9=9@BA>7
@SRR111956.3 HWUSI-EAS618:7:1:38:878 length=36
AGTGAGCAGGGAAACAATGTCTGTCTAAGAATTTGA
+
<?@BA?;A=@BA>;@@7###################
@SRR111956.4 HWUSI-EAS618:7:1:38:1783 length=36
ATTAGTATTATCCATTTATATAATCAATAAAAATGT
+
?ABBA2CCBBB2?=BB@C>=AAC@A=CBB#######
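The idea behind paired sampling in the example above can be sketched in a few lines of Python. This is only an illustration and not the script's own implementation: the helper names `read_fastq` and `sample_pairs` are hypothetical, and the sketch assumes that the sample size is a proportion applied independently to each read pair, that both files list mates in the same order, and that every record spans exactly four lines.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, separator, quality) tuples from a fastq stream.

    Assumes four lines per record and a file-like iterator as input.
    """
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated fastq record")
        yield tuple(record)


def sample_pairs(reads1, reads2, proportion, seed=None):
    """Keep each read pair with probability `proportion`.

    A single random draw decides the fate of both mates, so the two
    output streams stay synchronised.
    """
    rng = random.Random(seed)
    for first, second in zip(reads1, reads2):
        if rng.random() < proportion:
            yield first, second
```

Because one draw is made per pair, the number of pairs kept varies around the requested proportion rather than matching it exactly. Passing a fixed `seed` makes the selection reproducible.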
Options
The following methods are implemented (--method).
change-format
Change the quality format to the new format given by --target-format. Options are sanger, solexa, phred64, integer and illumina-1.8.
sample
Sub-sample a fastq file. The size of the sample is set by --sample-size.
unique
Remove duplicate reads based on read name
trim3
Trim a fixed number of nucleotides from the 3’ end of reads (see --num-bases). Note that there are better tools for trimming.
trim5
Trim a fixed number of nucleotides from the 5’ end of reads (see --num-bases). Note that there are better tools for trimming.
sort
Sort the fastq file by read name.
renumber-reads
Rename the reads based on the pattern given in --pattern-identifier, e.g. --pattern-identifier="read_%010i".
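To make the per-read methods listed above more concrete, here is a rough Python sketch of what change-format, trim3, trim5, renumber-reads and unique amount to. None of this is taken from the script itself: all function names are hypothetical, the quality conversion assumes that phred64 means Phred scores at ASCII offset 64 and sanger means offset 33, and the renumbering counter is assumed to start at 1.

```python
def phred64_to_sanger(quality):
    """Re-encode a Phred+64 quality string as Phred+33 (assumed offsets)."""
    # The Phred score stays the same; only the ASCII offset changes.
    return "".join(chr(ord(char) - 64 + 33) for char in quality)


def trim3(sequence, quality, num_bases):
    """Drop `num_bases` positions from the 3' end of a read."""
    keep = max(len(sequence) - num_bases, 0)
    return sequence[:keep], quality[:keep]


def trim5(sequence, quality, num_bases):
    """Drop `num_bases` positions from the 5' end of a read."""
    return sequence[num_bases:], quality[num_bases:]


def renumber(names, pattern="read_%010i"):
    """Yield new names by filling a printf-style pattern with a counter."""
    # Assumption: numbering starts at 1.
    for index, _ in enumerate(names, start=1):
        yield pattern % index


def unique_by_name(records):
    """Keep the first (name, sequence, quality) record seen for each name."""
    seen = set()
    for record in records:
        if record[0] not in seen:
            seen.add(record[0])
            yield record
```

With the pattern from the example, `"read_%010i" % 1` evaluates to `read_0000000001`, so read names are zero-padded to ten digits. Sequence and quality are always trimmed together so that they keep the same length.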
Type:
python fastq2fastq.py --help
for command line help.
Command line options
usage: fastq2fastq [-h] [--version] [-i INPUT_FASTQ_FILE]
[--output-removed-tsv OUTPUT_REMOVED_TSV]
[--output-stats-tsv OUTPUT_STATS_TSV]
[--output-removed-fastq OUTPUT_REMOVED_FASTQ]
[-m {filter-N,filter-identifier,filter-ONT,offset-quality,apply,change-format,renumber-reads,sample,sort,trim3,trim5,unique,reverse-complement,grep}]
[--set-prefix SET_PREFIX]
[--input-filter-tsv INPUT_FILTER_TSV]
[--min-average-quality MIN_AVERAGE_QUALITY]
[--min-sequence-length MIN_SEQUENCE_LENGTH]
[--quality-offset QUALITY_OFFSET]
[--target-format {sanger,solexa,phred64,integer,illumina-1.8}]
[--guess-format {sanger,solexa,phred64,integer,illumina-1.8}]
[--sample-size SAMPLE_SIZE] [--pair-fastq-file PAIR]
[--map-tsv-file MAP_TSV_FILE] [--num-bases NBASES]
[--seed SEED] [--pattern-identifier RENUMBER_PATTERN]
[--grep-pattern GREP_PATTERN] [--timeit TIMEIT_FILE]
[--timeit-name TIMEIT_NAME] [--timeit-header]
[--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
[-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]