Genomics.py - Tools for working with genomic data

Tags

Python

Reference

Genomics.parse_region_string(s)

parse a genomic region string.

Returns tuple of contig, start, end. Missing values are None.

Genomics.reverse_complement(s)

reverse complement a sequence.

>>> complement("ACATACATACTA")
'TAGTATGTATGT'
Returns

Return type

string

Genomics.GetHID(sequence)

returns a hash value for a sequence.

The hash value is computed using md5 and converted into printable characters.

>>> GetHID("ACATACATACTA")
'trcPGx9VNT36XMlG0XvUBQ'
Returns

Return type

A hash value

Genomics.String2Location(s)

convert a string to location information.

>>> String2Location("chr1:12:15")
('chr1', '+', 12, 15)
Returns

  • contig (string)

  • strand (string)

  • start (int)

  • end (int)

Genomics.readContigSizes(infile)

read sizes of contigs from file.

Parameters

infile (string) – Filename of tsv separated file.

Returns

Return type

dict

Genomics.forceForwardCoordinates(start, end, strand, length)

return forward coordinates.

If strand is negative, the coordinates in a and b will be converted. If they are on the positive strand, they will be returned as is.

Parameters
  • start (int) – Start coordinate

  • end (int) – End coordinate

  • strand (string) – Strand of interval. The values of “-”, “0”, “-1” indicate a negative strand.

  • length (int) – Length of chromosome.

Genomics.CountGeneFeatures(first_position, alignment, genomic_sequence=None, border_stop_codon=0, stop_codons=('TAG', 'TAA', 'TGA'))

calculate number of genomic features in a peptide to genome alignment.

Note that codons can be split, for example:

S 0 2 5 0 2 I 0 17541 3 0 2 S 1 2 5 0 2 I 0 27979 3 0 2 S 1 2
Parameters
  • first_position (int) – Start of alignment on genome.

  • alignment (string) – Alignment in CIGAR format, for example from exonerate_.

  • genomic_sequence (string) – Genomic sequence for alignment

  • border_stop_codon (int) – Number of codons that are ignored at the edges of match regions. border_stop_codon should be divisible by three.

  • stop_codons (list) – List of stop codons

Returns

  • nintrons (int) – Number of introns

  • nframeshifts (int) – Number of frameshifts in aligment.

  • ngaps (int) – Number of gaps in aligment.

  • nsplit (int) – Number of codons split by introns in alignment.

  • nstopcodons (int) – Number of stop codons in alignment.

  • disruptions (list) – List of disruptions in the prediction. Each disruption is a tuple of ( “stop|frameshift”, position in protein, position in cds, position on genomic sequence).

Genomics.Alignment2String(alignment)

convert a tuple alignment to an alignment string.

Genomics.String2Alignment(source)

convert an alignment string to a tuple alignment.

Genomics.GetAlignmentLength(alignment)

return Alignment length

Genomics.Alignment2ExonBoundaries(alignment, query_from=0, sbjct_from=0, add_stop_codon=1)

extract exon coordinates from a peptide2genome alignment.

Parameters
  • aligment (list) – List of tuples of the alignment in CIGAR format.

  • query_from (int) – Start position of alignment on peptide sequence.

  • sbjct_from (int) – Start position of alignment on nucleotide sequence.

  • add_stop_codon (int) – Add final stop codon to exon boundaries.

Returns

exons – A list of exons. Each exon is a tuple of (query_from, query_pos, frame, sbjct_from, sbjct_pos, ali)

Return type

list

Genomics.RemoveFrameShiftsFromAlignment(row_ali, col_ali, gap_char='-')

remove frame shifts in an alignment.

Frameshifts are gaps are 1, 2, 4, or 5 residues long.

>>> RemoveFrameShiftsFromAlignment("ABC-EFG", "AB-DEFG")
('ABEFG', 'ABEFG')
Parameters
  • row_ali (string) – Alignment string of row.

  • col_ali (string) – Alignment string of column.

  • gap_char (string) – Gap character to identify aligments.

Returns

  • new_row_ali (string) – New alignment string for row

  • new_col_ali (string) – New aligment string for column

Genomics.MaskStopCodons(sequence, stop_codons=('TAG', 'TAA', 'TGA'))

mask stop codons in a sequence.

Stop codons are masked with NNN.

Parameters
  • sequence (string) – Nucleotide sequence to mask.

  • stop_codons (string) – List of known stop codons.

Returns

masked_sequence

Return type

string

Genomics.Alignment2DNA(alignment, query_from=0, sbjct_from=0)

convert a peptide2genome alignment to a nucleotide2nucleotide alignment.

Instead of peptide coordinates, the alignment will be in codon coordinates.

Parameters
  • aligment (list) – List of tuples of the alignment in CIGAR format.

  • query_from (int) – Start position of alignment on peptide sequence.

  • sbjct_from (int) – Start position of alignment on nucleotide sequence.

Returns

alignment – The alignment as an alignlib.AlignmentVector object.

Return type

object

Genomics.encodeGenotype(code)

encode genotypes like GG, GA into a one-letter code. The returned code is lower case if code[0] < code[1], otherwise it is uppercase.

Genomics.decodeGenotype(code)

decode single letter genotypes like m, M into two letters. This is the reverse operation to encodeGenotype().

Genomics.resolveAmbiguousNA(code)

resolve ambiguous nucleic acid letters.

Genomics.resolveReverseAmbiguousNA(genotype)

map a genotype to a single letter amino acid amiguous code, for example, CT -> Y.

Genomics.GetMapAA2Codons()

returns a map of amino acids to codons

No stop codons. .

Genomics.MapCodon2AA(codon, is_seleno=False, ignore_n=True)

map a codon to an amino acid using the standard translation tables

The mapping returns gaps as gaps and will return an amino acid for incomplete codons if there is unambiguous mapping.

If is_seleno is set, the codon is translated for a selenoprotein.

If ignore_n is set, codons with n are returned as ? in order to distinguish them from stop codons.

Amino acids are returned as upper-case letters.

Genomics.Alignment2PeptideAlignment(alignment, query_from=0, sbjct_from=0, genomic_sequence=None)

convert a Peptide2DNA aligment to a Peptide2Peptide alignment.

How to handle frameshifts?

Genomics.translate(sequence, is_seleno=False, prefer_lowercase=True, ignore_n=False)

convert DNA sequence to a peptide sequence

If is_seleno is set, “TGA” codons are treated as encoding for selenocysteine.

If ignore_n is set, codons with n are returned as ? in order to distinguish them from stop codons.

Genomics.TranslateDNA2Protein(*args, **kwargs)

convert a DNA sequence to a peptide sequence. keep case.

deprecated - use translate() instead.

Genomics.Alignment2CDNA(alignment, query_from=0, sbjct_from=0, genome=None, remove_frameshifts=0)

build cDNA sequence from genomic fragment and return alignment of query to it.

Genomics.Exons2Alignment(exons)

build an cigar alignment string from a list of exons.

Genomics.AlignmentProtein2CDNA(src, exons1=None, exons2=None)

convert a peptide alignment to a nucleotide alignment.

multiplies coordinates with 3. Insert introns.

Note: alignment starts at 1

Genomics.GetDegenerateSites(seq1, seq2, degeneracy=4, position=3)

returns two new sequenes containing only degenerate sites.

Only unmutated positions are counted.

class Genomics.SequencePairInfo

Bases: object

the first characters are ACGT.

getGCContent()

return GC content.

class Genomics.SequencePairInfoCodons

Bases: Genomics.SequencePairInfo

the first characters are ACGT.

Genomics.AlignedPair2SubstitutionMatrix(seq1, seq2, alphabet)

given a pair of sequences, calculate a substitution matrix for the given alphabet.

Genomics.CalculatePairIndices(seq1, seq2, gap_char='-', with_codons=False)

returns number of idential and transitions/transversions substitutions in the alignment.

If with-codons = True, synonymous and nonsynonymous changes will be recorded as well. The routine assumes no frame-shifts and will count more than one change as non-synonymous.

Genomics.makeSubstitutionMatrix(type='EMBOSS')

make alignator with DNA substitution matrix.

EMBOSS style matrix: identity = 5 mismatch = -4 gop = -16 gep = -4

ClustalW style matrix: match = 1 mismatch = 0 gop = -10 gep = -0.1

Genomics.CalculateRCSUValuesFromCounts(counts, pseudo_counts=0)

calculate RCSU values for codons.

RCSU = relative frequency / uniform frequency

Genomics.CalculateCodonFrequenciesFromCounts(counts, pseudo_counts=0)

calculate codon frequencies from codon counts per amino acid. pseudo_counts are added if desired.

Genomics.CalculateCAIWeightsFromCounts(counts, pseudo_counts=0)

calculate CAI weights from codon counts. pseudo_counts are added if desired.

Genomics.IsJunk(contig)

returns true, if contigs is likely to be junk.

This is done by name matching. Junk contigs contain either one of the following:

random, unknown, chrU, chU.

Genomics.CountCodons(sequence)

count the codons in a sequence.

Genomics.GetUniformCodonUsage()

get list of frequencies for codons expected for uniform codon usage.

Genomics.GetBiasedCodonUsage(bias=1.0)

get list of frequencies for codons according to some bias.

The first codon for each aa is the most biased, all others are less biased.

The ratio determines the relative bias between the first and all other codons. 0.0 is no bias, 1.0 is complete bias.

Genomics.convertStrand(strand)

convert various strand notations into [+-.].

Genomics.GetIntronType(sequence, both_strands=False)

return intron type for an intronic sequence.

If both_strands is True, both strands are checked.

Genomics.printPrettyAlignment(seq1, *args)

print a pretty alignment.

Genomics.ReadPeptideSequences(infile, filter=None, as_array=False, regex_identifier=None)

read peptide sequence from fasta infile.

Genomics.ParseFasta2Hash(infile, filter=None, regex_identifier=None)

read fasta formatted sequences file and build a hash.

Keys are all characters before the first whitespace in the description line.

Previously, if the key contained a “:”, everything before the “:” was removed. This is not true any more.

Use array for higher space efficiency.

If regex_identifier is given, this is used to extract the identifier from the fasta description line.