Sra.py - Methods for dealing with Short Read Archive files

Utility functions for dealing with SRA formatted files from the Short Read Archive.

Requirements:

* fastq-dump >= 2.1.7

Code

Sra.peek(sra, outdir=None)

Return the full names of all files that will be extracted from the archive.

Parameters

outdir (path) – Directory in which to perform the extraction. If outdir is None, the extraction takes place in a temporary directory, which is deleted afterwards.

Returns

files (list) – A list of fastq formatted files that are contained in the archive.
format (string) – The quality score format in the fastq formatted files.
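
The quality score format reported by peek can be illustrated with a small sketch. This is not the module's own detection code: the function name guess_quality_format and the labels it returns are hypothetical, and the logic assumes the common heuristic of inferring the ASCII offset from the lowest quality character observed.

```python
def guess_quality_format(quality_lines):
    """Guess a FASTQ quality encoding from the range of quality characters.

    Hypothetical helper: the labels returned here are illustrative and
    may differ from the strings that Sra.peek actually reports.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("characters outside the printable FASTQ range")
    if lowest < 59:
        # Characters below ';' (ASCII 59) only occur with an offset of 33.
        return "phred33"
    if lowest >= 64:
        # No character below '@' (ASCII 64): consistent with an offset
        # of 64, although high-quality offset-33 data looks the same.
        return "phred64"
    # The range 59-63 fits both Solexa (offset 64) and offset-33 data.
    return "ambiguous"
```

A guess of this kind becomes more reliable as more reads are inspected, since a single low-quality base is enough to rule out an offset of 64.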

Sra.extract(sra, outdir, tool='fastq-dump'): Return the statement for extracting the SRA file into outdir. Possible tools are fastq-dump and abi-dump. Use abi-dump for colorspace data.
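
As a rough sketch of what "returning a statement" means here, the function below builds a shell command string rather than running the tool. It is an illustration only: build_extract_statement is a hypothetical name, and the choice of the --split-files and --gzip options for fastq-dump is an assumption, not necessarily what Sra.extract emits.

```python
import shlex


def build_extract_statement(sra, outdir, tool="fastq-dump"):
    """Return a shell command string that extracts `sra` into `outdir`.

    Hypothetical helper: the options chosen are assumptions made for
    illustration and may differ from those used by Sra.extract.
    """
    if tool == "fastq-dump":
        # Write one file per mate and compress the output.
        options = "--split-files --gzip"
    elif tool == "abi-dump":
        # Colorspace data: no extra options in this sketch.
        options = ""
    else:
        raise ValueError("unsupported tool: %s" % tool)

    parts = [tool, options, "--outdir", shlex.quote(outdir), shlex.quote(sra)]
    # Drop the empty options entry so the statement has no double spaces.
    return " ".join(part for part in parts if part)


statement = build_extract_statement("reads.sra", "out")
```

Returning a string keeps the function free of side effects, so the caller decides when and where the command is executed, for example as one step of a pipeline.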

Sra.prefetch(sra): Use prefetch from the SRA toolkit to download the SRA file to the local cache.

Sra.clean_cache(sra): Remove the specified SRA file from the cache.

Sra.fetch_ENA(dl_path, outdir, protocol='ascp'): Fetch FASTQ files from ENA given an accession.

Sra.fetch_ENA_files(accession): Get the names of the files matching the ENA accession.

Sra.fetch_TCGA_fastq(acc, filename, token=None, outdir='.')

Get a FASTQ file from the TCGA repository. Because of the nature of the TCGA repository, it assumes that:

* the data is paired-end FASTQ
* the files end in _1.fastq or _2.fastq
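
The two naming assumptions above can be checked with a short sketch. The helper pair_fastq_files is hypothetical and not part of Sra.py; it groups file names by the text before the _1.fastq or _2.fastq suffix and rejects any sample that lacks one of its mates.

```python
import os


def pair_fastq_files(filenames):
    """Group paired-end FASTQ files by sample name.

    Hypothetical helper illustrating the _1.fastq/_2.fastq convention.
    Files that match neither suffix are ignored.
    """
    pairs = {}
    for name in filenames:
        base = os.path.basename(name)
        for index, suffix in enumerate(("_1.fastq", "_2.fastq")):
            if base.endswith(suffix):
                sample = base[:-len(suffix)]
                # Slot 0 holds the first mate, slot 1 the second.
                pairs.setdefault(sample, [None, None])[index] = name

    incomplete = sorted(s for s, mates in pairs.items() if None in mates)
    if incomplete:
        raise ValueError("missing mate for: %s" % ", ".join(incomplete))
    return {sample: tuple(mates) for sample, mates in pairs.items()}


paired = pair_fastq_files(["a/s1_2.fastq", "a/s1_1.fastq", "notes.txt"])
```

Failing early on an unpaired file makes a violation of the paired-end assumption visible before any downstream tool is run.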

Sra.fetch_TCGA_BAM(acc, token, outdir='.', filter_bed=None): Get a BAM file from the TCGA repository based on its UUID. Returns the statement and the path/filename of the downloaded file. A BED file may be provided to filter out contigs that are not present in the reference genome.

Sra.process_remote_BAM(infile, token=None, outdir='.', filter_bed=None): Generate the statement from a .remote file.