diff_fasta.py - compare contents of two fasta files¶
- Tags
Genomics Sequences FASTA Comparison
Purpose¶
This script takes two sets of FASTA sequences and matches them by identifier. It then compares the sequences that share an identifier and, depending on the output options selected, reports:
- which sequences are missing
- which sequences are identical
- which sequences are prefixes/suffixes of each other
An explanatory field is appended to output sequence identifiers. An explanation of the different field values is provided in the log.
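The matching step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual implementation: the helper names `read_fasta` and `compare_sets` are hypothetical, and the identifier is assumed to be the first whitespace-delimited token of each header line.

```python
import io


def read_fasta(handle):
    """Parse FASTA records from a file-like object into {identifier: sequence}."""
    sequences = {}
    identifier = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the identifier is the first token after '>'.
            identifier = line[1:].split()[0]
            sequences[identifier] = []
        elif identifier is not None:
            sequences[identifier].append(line)
    # Join multi-line sequences into a single string per identifier.
    return {key: "".join(parts) for key, parts in sequences.items()}


def compare_sets(seqs1, seqs2):
    """Split identifiers into missing, identical and differing groups."""
    ids1, ids2 = set(seqs1), set(seqs2)
    shared = ids1 & ids2
    return {
        "missed1": sorted(ids1 - ids2),  # in set 1, not found in set 2
        "missed2": sorted(ids2 - ids1),  # in set 2, not found in set 1
        "same": sorted(i for i in shared if seqs1[i] == seqs2[i]),
        "diff": sorted(i for i in shared if seqs1[i] != seqs2[i]),
    }


# Hypothetical toy input standing in for a.fasta and b.fasta.
set1 = read_fasta(io.StringIO(">a\nMRSR\nNQGG\n>b\nEEEE\n>c\nLIRS\n"))
set2 = read_fasta(io.StringIO(">a\nMRSRNQGG\n>b\nEEE\n>d\nLIRS\n"))
result = compare_sets(set1, set2)
print(result)
```

Keeping the sequences in dictionaries keyed by identifier makes the comparison independent of the order in which records appear in the two files.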
Options¶
- -s, --correct-gap-shift
This option will correct shifts in alignment gaps between two sequences being compared
- -1, --pattern1
regular expression pattern used to extract the identifier from the sequence headers in set 1
- -2, --pattern2
regular expression pattern used to extract the identifier from the sequence headers in set 2
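To illustrate why the two pattern options are useful, the sketch below extracts a shared identifier from two differently formatted header lines. It is an assumption that the identifier is taken from the first capture group and that the full header is used when nothing matches; the function `extract_identifier` and both example headers are hypothetical.

```python
import re


def extract_identifier(header, pattern):
    """Return the first capture group of pattern in header, or the header itself."""
    match = re.search(pattern, header)
    # Assumption: fall back to the unmodified header when nothing matches.
    return match.group(1) if match else header


# Hypothetical headers for the same protein in two differently formatted files.
header1 = "ENSACAP00000004922 pep:known"
header2 = "db|ENSACAP00000004922|v1"

id1 = extract_identifier(header1, r"^(\S+)")        # candidate for --pattern1
id2 = extract_identifier(header2, r"\|([^|]+)\|")   # candidate for --pattern2
print(id1, id2, id1 == id2)
```

Using one pattern per input file lets each set keep its own header convention while still being matched on a common identifier.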
Depending on the value of --output-section, the following are output:
- diff
identifiers of sequences that are different
- seqdiff
identifiers of sequences that are different plus sequence
- missed
identifiers of sequences that are missing from one set or the other
This script is of specialized interest and has been used in the past to check if ENSEMBL gene models had been correctly mapped into a database schema.
Usage¶
Example:
cat a.fasta | head
>ENSACAP00000004922
MRSRNQGGESSSSGKFSKSKPIINTGENQNLQEDAKKKNKSSRKEE ...
>ENSACAP00000005213
EEEEDESNNSYLYQPLNQDPDQGPAAVEETAPSTEPALDINERLQA ...
>ENSACAP00000018122
LIRSSSMFHIMKHGHYISRFGSKPGLKCIGMHENGIIFNNNPALWK ...
python diff_fasta.py --output-section=missed --output-section=seqdiff a.fasta b.fasta
cat diff.out
# Legend:
# seqs1: number of sequences in set 1
# seqs2: number of sequences in set 2
# same: number of identical sequences
# diff: number of sequences with differences
# nmissed1: sequences in set 1 that are not found in set 2
# nmissed2: sequences in set 2 that are not found in set 1
# Type of sequence differences
# first: only the first residue is different
# last: only the last residue is different
# prefix: one sequence is prefix of the other
# selenocysteine: difference due to selenocysteines
# masked: difference due to masked residues
# fixed: fixed differences
# other: other differences
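The difference types listed in the legend could be assigned along the following lines. This is only a sketch: the function name `classify_difference`, the order of the checks, and the conventions that selenocysteine is written as "U" and masked residues as "X" are all assumptions, and the "fixed" category is left out because its definition is not given here.

```python
def classify_difference(seq1, seq2):
    """Assign one legend-style category to a pair of sequences."""
    if seq1 == seq2:
        return "same"
    if len(seq1) != len(seq2):
        # Different lengths: check whether the shorter one starts the longer one.
        shorter, longer = sorted((seq1, seq2), key=len)
        return "prefix" if longer.startswith(shorter) else "other"
    # Equal lengths: collect the positions at which the residues disagree.
    mismatches = [i for i, (a, b) in enumerate(zip(seq1, seq2)) if a != b]
    if mismatches == [0]:
        return "first"
    if mismatches == [len(seq1) - 1]:
        return "last"
    pairs = [(seq1[i], seq2[i]) for i in mismatches]
    # Assumption: selenocysteine is written as "U" in one of the two sequences.
    if all("U" in pair for pair in pairs):
        return "selenocysteine"
    # Assumption: masked residues are written as "X".
    if all("X" in pair for pair in pairs):
        return "masked"
    return "other"


# Hypothetical sequence pairs, one per category.
examples = [
    ("MKT", "AKT"),
    ("MKT", "MKA"),
    ("MKT", "MKTLL"),
    ("MUKT", "MCKT"),
    ("MXKT", "MAKT"),
    ("MAKTA", "MCLTA"),
]
for seq1, seq2 in examples:
    print(seq1, seq2, classify_difference(seq1, seq2))
```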
Type:
python diff_fasta.py --help
for command line help.
Command line options¶
usage: diff-fasta [-h] [--version] [-s] [-1 PATTERN1] [-2 PATTERN2]
[-o {diff,missed,seqdiff}] [--timeit TIMEIT_FILE]
[--timeit-name TIMEIT_NAME] [--timeit-header]
[--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]