bed2bed - manipulate bed files
Purpose
This script provides various methods for merging (by position, by name or by score), filtering and moving bed-formatted intervals, and outputs the results as a bed file.

The script provides several methods, each with a set of options to control behaviour:
merge
+++++
Merge together overlapping or adjacent intervals. The basic functionality is similar to bedtools merge, but with some additions:

* Merging by name: specifying the --merge-by-name option means that only overlapping (or adjacent) intervals with the same value in the 4th column of the bed file will be merged.
* Removing overlapping intervals with inconsistent names: set the --remove-inconsistent-names option.

.. caution::
   Intervals of the same name will only be merged if they are consecutive in the bed file.

* Only output merged intervals: by specifying the --merge-min-intervals=n option, only those intervals that were created by merging at least n intervals together will be output.

Intervals that are close but not overlapping can be merged by setting --merge-distance to a non-zero value.
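The merging rules above can be sketched in Python. This is a minimal illustration and not the cgat implementation: the function name `merge_intervals` and its parameters are hypothetical stand-ins for --merge-distance, --merge-by-name and --merge-min-intervals, intervals are (contig, start, end, name) tuples with half-open coordinates, and the input is assumed to be sorted by contig and start.

```python
def merge_intervals(intervals, merge_distance=0, by_name=False, min_intervals=1):
    """Merge sorted (contig, start, end, name) tuples.

    Hypothetical helper for illustration only; not part of cgat.
    """
    merged = []  # each entry: [contig, start, end, name, count]
    for contig, start, end, name in intervals:
        last = merged[-1] if merged else None
        if (last is not None
                and last[0] == contig
                and start - last[2] <= merge_distance
                and (not by_name or last[3] == name)):
            # Overlapping, adjacent, or within merge_distance: extend.
            last[2] = max(last[2], end)
            last[4] += 1
        else:
            merged.append([contig, start, end, name, 1])
    # Keep only intervals built from at least min_intervals inputs.
    return [tuple(m[:4]) for m in merged if m[4] >= min_intervals]


peaks = [
    ("chr1", 100, 200, "a"),
    ("chr1", 200, 300, "a"),    # adjacent to the first interval
    ("chr1", 250, 400, "b"),    # overlaps, but has a different name
    ("chr1", 1000, 1100, "b"),
]
print(merge_intervals(peaks))
print(merge_intervals(peaks, by_name=True))
print(merge_intervals(peaks, min_intervals=2))
```

Because the sketch only compares each interval with the most recent merged entry, same-name intervals are joined only when they are consecutive in the input, which mirrors the caution above.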
bins
++++
Merges together overlapping or adjacent intervals only if they have "similar" scores. Score similarity is assessed by creating a number of score bins and assigning each interval to a bin. If two adjacent intervals are in the same bin, the intervals are merged. Note that in contrast to merge-by-name above, two intervals do not need to be overlapping or within a certain distance to be merged.

There are several methods to create the bins:

* equal-bases: Bins are created so that they contain the same number of bases. Specified by passing "equal-bases" to --binning-method. This is the default.
* equal-intervals: Score bins are created so that each bin contains the same number of intervals. Specified by passing "equal-intervals" to --binning-method.
* equal-range: Score bins are created so that each bin covers the same fraction of the total range of scores. Specified by passing "equal-range" to --binning-method.
* bin-edges: Score bins can be specified manually by passing a comma-separated list of bin edges to --bin-edges.

The number of bins is specified by the --num-bins option, and the default is 5.
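As a rough illustration of score binning, the sketch below computes equal-range bin edges and merges consecutive intervals that fall into the same bin. The function names are hypothetical, the intervals are (contig, start, end, score) tuples assumed to be sorted by position, and reporting the bin index in the fourth output field is an arbitrary choice made for this sketch, not a statement about what the tool writes.

```python
import bisect


def equal_range_edges(scores, num_bins=5):
    """Interior bin edges that split the score range into equal widths."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins
    return [lo + width * i for i in range(1, num_bins)]


def merge_by_bin(intervals, edges):
    """Merge consecutive (contig, start, end, score) tuples sharing a bin."""
    merged = []  # each entry: [contig, start, end, bin_index]
    for contig, start, end, score in intervals:
        # Index of the bin this score falls into.
        bin_index = bisect.bisect_right(edges, score)
        last = merged[-1] if merged else None
        if last is not None and last[0] == contig and last[3] == bin_index:
            # Same bin as the previous interval: merge regardless of distance.
            last[2] = max(last[2], end)
        else:
            merged.append([contig, start, end, bin_index])
    return [tuple(m) for m in merged]


peaks = [
    ("chr1", 0, 100, 1.0),
    ("chr1", 500, 600, 2.0),
    ("chr1", 900, 950, 9.0),
    ("chr1", 2000, 2100, 10.0),
]
edges = equal_range_edges([p[3] for p in peaks], num_bins=3)
print(edges)
print(merge_by_bin(peaks, edges))
```

A manually supplied list of edges, as with --bin-edges, could be passed to `merge_by_bin` in place of the computed ones.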
block
+++++
Creates blocked bed12 output from a bed6 file, where intervals with the same name are merged together to create a single bed12 entry.

.. caution::
   Input must be sorted so that entries of the same name are together.
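The conversion from bed6 rows to bed12 blocks can be sketched as follows. The function `bed6_to_bed12` is hypothetical and not the cgat code; it assumes rows are (contig, start, end, name, score, strand) tuples, and it fills score, thickStart, thickEnd and itemRgb with placeholder values, since the source does not say how the tool sets them.

```python
from itertools import groupby


def bed6_to_bed12(rows):
    """Collapse consecutive bed6 rows with the same name into bed12 rows."""
    # groupby only groups consecutive rows, hence the sorting requirement.
    for name, group in groupby(rows, key=lambda r: r[3]):
        exons = sorted(group, key=lambda r: r[1])
        contig, strand = exons[0][0], exons[0][5]
        start = exons[0][1]
        end = max(e[2] for e in exons)
        sizes = ",".join(str(e[2] - e[1]) for e in exons)
        # Block starts are relative to the start of the whole entry.
        starts = ",".join(str(e[1] - start) for e in exons)
        # Placeholders: score 0, thick region spans the entry, itemRgb "0".
        yield (contig, start, end, name, 0, strand,
               start, end, "0", len(exons), sizes, starts)


exons = [
    ("chr1", 100, 200, "tx1", 0, "+"),
    ("chr1", 300, 450, "tx1", 0, "+"),
    ("chr2", 50, 80, "tx2", 0, "-"),
]
for entry in bed6_to_bed12(exons):
    print("\t".join(str(field) for field in entry))
```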
filter-genome
+++++++++++++
Removes intervals that are on unknown contigs or that extend off the 3' or 5' end of the contig. Requires a tab-separated input file passed to -g, which lists the contigs in the genome plus their lengths.
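A minimal sketch of this filter, assuming the contig table is a two-column "contig, length" listing and that intervals are (contig, start, end) tuples with half-open coordinates. Both helper names are hypothetical.

```python
def read_contig_sizes(lines):
    """Parse 'contig<TAB>length' lines into a dict (assumed file layout)."""
    sizes = {}
    for line in lines:
        if line.strip():
            contig, length = line.rstrip("\n").split("\t")[:2]
            sizes[contig] = int(length)
    return sizes


def filter_genome(intervals, contig_sizes):
    """Keep intervals on known contigs that lie fully within the contig."""
    kept = []
    for contig, start, end in intervals:
        size = contig_sizes.get(contig)
        if size is None:
            continue  # unknown contig
        if start < 0 or end > size:
            continue  # extends off either end of the contig
        kept.append((contig, start, end))
    return kept


sizes = read_contig_sizes(["chr1\t1000\n", "chr2\t500\n"])
intervals = [
    ("chr1", 0, 100),
    ("chr1", 950, 1050),   # runs past the end of chr1
    ("chrUn", 10, 20),     # contig not in the table
    ("chr2", 400, 500),
]
print(filter_genome(intervals, sizes))
```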
sanitize-genome
+++++++++++++++
Like filter-genome, but instead of removing intervals that overlap the ends of contigs, truncates them. Also removes empty intervals.
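The truncating variant might look like the sketch below. The function name is hypothetical, intervals are (contig, start, end) tuples, and the sketch assumes that intervals on unknown contigs are dropped in the same way as in filter-genome.

```python
def sanitize_genome(intervals, contig_sizes):
    """Clip intervals to contig bounds; drop unknown contigs and empties."""
    cleaned = []
    for contig, start, end in intervals:
        size = contig_sizes.get(contig)
        if size is None:
            continue  # assumption: unknown contigs are dropped
        # Truncate instead of discarding.
        start, end = max(start, 0), min(end, size)
        if start >= end:
            continue  # empty after truncation (or empty to begin with)
        cleaned.append((contig, start, end))
    return cleaned


contig_sizes = {"chr1": 1000}
intervals = [
    ("chr1", -20, 100),     # clipped at the start
    ("chr1", 950, 1050),    # clipped at the end
    ("chr1", 1200, 1300),   # entirely off the contig, becomes empty
    ("chrUn", 10, 20),      # unknown contig
]
print(sanitize_genome(intervals, contig_sizes))
```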
filter-names
++++++++++++
Outputs intervals whose names are in a list of desired names. Names are supplied as a file with one name on each line.
shift
+++++
Moves intervals by the specified amount, but will not allow them to be shifted off the end of contigs. Thus, if a shift would move the start or end of an interval past the end of the contig, the interval is only moved as far as is possible without doing this.
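The clamping behaviour can be illustrated with a short sketch. `shift_interval` is a hypothetical helper, not the cgat function; it assumes half-open coordinates and an interval that already lies within the contig.

```python
def shift_interval(start, end, offset, contig_size):
    """Shift [start, end) by offset, clamped so it stays on the contig."""
    if offset < 0:
        # Cannot move further left than position 0.
        offset = max(offset, -start)
    else:
        # Cannot move further right than the contig end.
        offset = min(offset, contig_size - end)
    return start + offset, end + offset


print(shift_interval(100, 200, 50, 1000))    # shifted by the full offset
print(shift_interval(100, 200, -500, 1000))  # stops at the contig start
print(shift_interval(900, 950, 100, 1000))   # stops at the contig end
```

Note that the interval length is preserved in every case; only the distance moved is reduced.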
rename-chr
++++++++++
Renames chromosomes. Source and target names are supplied as a file with two columns. Examples are available at:

https://github.com/dpryan79/ChromosomeMappings

Note that unmapped chromosomes are dropped from the output file.
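A minimal sketch of the renaming step, assuming a tab-separated two-column mapping (source name, target name). Both helper names are hypothetical, and treating a row with an empty target column as "unmapped" is an assumption of this sketch.

```python
def read_mapping(lines):
    """Parse 'source<TAB>target' lines into a dict, skipping incomplete rows."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2 and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_contigs(intervals, mapping):
    """Rename contigs via mapping; drop intervals whose contig is unmapped."""
    return [(mapping[contig],) + tuple(rest)
            for contig, *rest in intervals
            if contig in mapping]


mapping = read_mapping(["chr1\t1\n", "chrM\tMT\n", "chrUn_example\t\n"])
intervals = [
    ("chr1", 0, 10),
    ("chrUn_example", 5, 9),   # no target name, so it is dropped
    ("chrM", 1, 2),
]
print(rename_contigs(intervals, mapping))
```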
Other options
+++++++++++++
-g/--genome-file, -b/--bam-file:
   The filter-genome, sanitize-genome and shift methods require a genome in order to ensure they are not placing intervals outside the limits of contigs. This genome can be supplied either as a samtools or cgat indexed genome, or extracted from the header of a bam file.
Examples
Merge overlapping or adjacent peaks from a ChIP-seq experiment where the intervals have the same name:
    cat chip-peaks.bed | cgat bed2bed --method=merge --merge-by-name > chip-peaks-merged.bed
Merge adjacent ChIP-seq peaks if their scores are in the same quartile of all scores:
    cat chip-peaks.bed | cgat bed2bed --method=bins --binning-method=equal-intervals --num-bins=4
Remove intervals that overlap the ends of a contig and those that are on a non-standard contig. Take the input intervals from a file rather than stdin. Note that hg19.fasta has been indexed with index_genome:
    cgat bed2bed --method=filter-genome --genome-file=hg19.fasta -I chip-peaks.bed -O chip-peaks-sanitized.bed
Convert a bed file containing gene structures with one line per exon to a bed12 with linked blocks representing the gene structure. Note the transparent use of compressed input and output files:
    cgat bed2bed --method=block -I transcripts.bed.gz -O transcripts.blocked.bed.gz
Rename UCSC chromosomes to ENSEMBL:
    cat ucsc.bed | cgat bed2bed --method=rename-chr --rename-chr-file=ucsc2ensembl.txt > ensembl.bed
Usage
    cgat bed2bed --method=[METHOD] [OPTIONS]
Reads a bed file from stdin and applies the specified method.
Command line options
usage: bed2bed [-h]
               [-m {merge,filter-genome,bins,block,sanitize-genome,shift,extend,filter-names,rename-chr}]
               [--num-bins NUM_BINS] [--bin-edges BIN_EDGES]
               [--binning-method {equal-bases,equal-intervals,equal-range}]
               [--merge-distance MERGE_DISTANCE]
               [--merge-min-intervals MERGE_MIN_INTERVALS] [--merge-by-name]
               [--merge-and-resolve-blocks] [--merge-stranded]
               [--remove-inconsistent-names] [--offset OFFSET]
               [-g GENOME_FILE] [-b BAM_FILE] [--filter-names-file NAMES]
               [--rename-chr-file RENAME_CHR_FILE] [--timeit TIMEIT_FILE]
               [--timeit-name TIMEIT_NAME] [--timeit-header]
               [--random-seed RANDOM_SEED] [-v LOGLEVEL]
               [--log-config-filename LOG_CONFIG_FILENAME]
               [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
               [-E STDERR] [-S STDOUT]