bed2bed - manipulate bed files

Purpose

This script provides various methods for merging (by position, by name or by score), filtering and moving bed-formatted intervals, and outputs the results as a bed file.

This script provides several methods, each with a set of options
to control behaviour:
merge
+++++
Merge together overlapping or adjacent intervals. The basic
functionality is similar to bedtools merge, but with some additions:
* Merging by name: with the --merge-by-name option, only overlapping
  (or adjacent) intervals with the same value in the 4th column of the
  bed file are merged.

* Removing overlapping intervals with inconsistent names: set the
  --remove-inconsistent-names option.

.. caution::

   Intervals of the same name will only be merged if they are
   consecutive in the bed file.

* Only output merged intervals: with the --merge-min-intervals=n
  option, only those intervals that were created by merging at least
  n intervals together are output.

Intervals that are close but not overlapping can be merged by setting
--merge-distance to a non-zero value.
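
The merge behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the cgat implementation: the function name `merge_intervals`, the tuple layout, and the choice to treat a gap of exactly `merge_distance` bases as mergeable are all assumptions made for the example. Input is assumed to be sorted by contig and start.

```python
from typing import List, Tuple

# Hypothetical layout: (contig, start, end, name), 0-based half-open coordinates.
Interval = Tuple[str, int, int, str]


def merge_intervals(intervals: List[Interval],
                    merge_distance: int = 0,
                    by_name: bool = False,
                    min_intervals: int = 1) -> List[Interval]:
    """Merge position-sorted intervals that overlap, touch, or lie within
    merge_distance bases of each other (illustrative sketch only)."""
    merged = []  # each entry: [contig, start, end, name, number_of_inputs]
    for contig, start, end, name in intervals:
        last = merged[-1] if merged else None
        if (last is not None
                and last[0] == contig
                and start - last[2] <= merge_distance
                and (not by_name or last[3] == name)):
            # Extend the previous interval and count the merged input.
            last[2] = max(last[2], end)
            last[4] += 1
        else:
            merged.append([contig, start, end, name, 1])
    # Keep only intervals built from at least min_intervals inputs.
    return [(c, s, e, n) for c, s, e, n, k in merged if k >= min_intervals]
```

Because only the most recently emitted interval is compared, same-name intervals are merged only when they are consecutive in the input, which mirrors the caution above.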
bins
++++
Merges together overlapping or adjacent intervals only if they have
"similar" scores. Score similarity is assessed by creating a number of
score bins and assigning each interval to a bin. If two adjacent
intervals are in the same bin, the intervals are merged. Note that in
contrast to merge-by-name above, two intervals do not need to be
overlapping or within a certain distance to be merged.
There are several methods to create the bins:
* equal-bases: Score bins are created so that they contain the same
  number of bases. Specified by passing "equal-bases" to
  --binning-method. This is the default.

* equal-intervals: Score bins are created so that each bin contains
  the same number of intervals. Specified by passing
  "equal-intervals" to --binning-method.

* equal-range: Score bins are created so that each bin covers the
  same fraction of the total range of scores. Specified by passing
  "equal-range" to --binning-method.

* bin-edges: Score bins can be specified manually by passing a
  comma-separated list of bin edges to --bin-edges.

The number of bins is specified by the --num-bins option, and the
default is 5.
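
As a rough illustration of score binning, the sketch below builds equal-range bin edges and merges consecutive intervals that fall in the same bin. It is not the cgat code: the function names, the tuple layout, and the decision to report the bin index in place of the score are assumptions made for the example. Only the equal-range method is shown.

```python
import bisect
from typing import List, Sequence, Tuple


def equal_range_edges(scores: Sequence[float], num_bins: int) -> List[float]:
    """Return the interior bin edges that split [min, max] into num_bins
    equally wide score ranges."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(1, num_bins)]


def assign_bin(score: float, edges: Sequence[float]) -> int:
    """Index of the bin a score falls into (0 .. len(edges))."""
    return bisect.bisect_right(edges, score)


def merge_by_bin(intervals: List[Tuple[str, int, int, float]],
                 edges: Sequence[float]) -> List[Tuple[str, int, int, int]]:
    """Merge consecutive intervals on the same contig that share a score bin.
    No overlap or distance check is applied, as described above."""
    merged = []  # each entry: [contig, start, end, bin_index]
    for contig, start, end, score in intervals:
        b = assign_bin(score, edges)
        if merged and merged[-1][0] == contig and merged[-1][3] == b:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([contig, start, end, b])
    return [tuple(m) for m in merged]
```

With four bins over scores ranging from 0 to 100, the interior edges are 25, 50 and 75, so two neighbouring intervals scoring 0 and 10 end up in the same bin and are merged even if they are far apart.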
block
+++++
Creates blocked bed12 outputs from a bed6, where intervals with the
same name are merged together to create a single bed12 entry.
.. caution::

   Input must be sorted so that entries of the same name are together.
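
The conversion from bed6 entries to a blocked bed12 line can be sketched as follows. This is an illustrative example rather than the cgat implementation: the function name `to_bed12` is hypothetical, and taking the score and strand from the first block and setting thickStart/thickEnd to the full extent are assumptions. The bed12 column order itself follows the standard BED format.

```python
from itertools import groupby
from typing import List, Tuple

Bed6 = Tuple[str, int, int, str, int, str]  # contig, start, end, name, score, strand


def to_bed12(entries: List[Bed6]) -> List[tuple]:
    """Collapse consecutive bed6 entries sharing a name into bed12 tuples."""
    out = []
    # groupby only groups neighbours, hence the sorting requirement above.
    for name, group in groupby(entries, key=lambda e: e[3]):
        blocks = sorted(group, key=lambda e: e[1])
        contig, score, strand = blocks[0][0], blocks[0][4], blocks[0][5]
        start = blocks[0][1]
        end = max(b[2] for b in blocks)
        sizes = ",".join(str(b[2] - b[1]) for b in blocks)
        starts = ",".join(str(b[1] - start) for b in blocks)  # relative to start
        out.append((contig, start, end, name, score, strand,
                    start, end, "0", len(blocks), sizes, starts))
    return out
```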
filter-genome
+++++++++++++
Removes intervals that are on unknown contigs or extend off the 3' or
5' end of the contig. Requires a tab-separated input file passed to -g,
which lists the contigs in the genome plus their lengths.
sanitize-genome
+++++++++++++++
As above, but instead of removing intervals overlapping the ends of
contigs, truncates them. Also removes empty intervals.
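
The difference between filter-genome and sanitize-genome can be shown with a small sketch. The function names and the `contig_sizes` dictionary (contig name to length) are hypothetical stand-ins for the genome information; this is not the cgat code.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # contig, start, end (0-based, half-open)


def filter_genome(intervals: List[Interval],
                  contig_sizes: Dict[str, int]) -> List[Interval]:
    """Drop intervals on unknown contigs or extending past a contig end."""
    return [(c, s, e) for c, s, e in intervals
            if c in contig_sizes and s >= 0 and e <= contig_sizes[c]]


def sanitize_genome(intervals: List[Interval],
                    contig_sizes: Dict[str, int]) -> List[Interval]:
    """Drop intervals on unknown contigs, truncate the rest to the contig,
    and discard intervals that become empty."""
    out = []
    for c, s, e in intervals:
        if c not in contig_sizes:
            continue
        s, e = max(s, 0), min(e, contig_sizes[c])
        if s < e:
            out.append((c, s, e))
    return out
```

For a 1000-base contig, an interval spanning 990 to 1010 is removed by the first function but truncated to 990 to 1000 by the second.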
filter-names
++++++++++++
Outputs intervals whose names are in a list of desired names. Names are
supplied as a file (--filter-names-file) with one name on each line.
shift
+++++
Moves intervals by the specified amount, but will not allow them to be
shifted off the end of contigs. Thus, if a shift would move the start
or end of an interval past the end of the contig, the interval is only
moved as far as is possible without doing this.
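
The clamping rule can be expressed compactly. The sketch below is an assumption about how such a shift could be computed, not the cgat implementation; `shift_interval` and its parameters are hypothetical names, and the interval is assumed to lie within the contig to begin with.

```python
from typing import Tuple


def shift_interval(start: int, end: int, offset: int,
                   contig_size: int) -> Tuple[int, int]:
    """Shift an interval by offset, reducing the offset where needed so the
    interval stays within [0, contig_size] and keeps its length."""
    offset = max(offset, -start)              # do not move past position 0
    offset = min(offset, contig_size - end)   # do not move past the contig end
    return start + offset, end + offset
```

For example, on a 100-base contig an interval at 80 to 95 shifted by 20 ends up at 85 to 100, because only 5 bases of room remain.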
rename-chr
++++++++++
Renames chromosome names. Source and target names are supplied as a file
with two columns. Examples are available at:
https://github.com/dpryan79/ChromosomeMappings
Note that unmapped chromosomes are dropped from the output file.
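
A minimal sketch of the renaming step, assuming a tab-separated two-column mapping file as described above. The helper names `load_mapping` and `rename_contigs` are hypothetical and this is not the cgat code.

```python
from typing import Dict, Iterable, Iterator


def load_mapping(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'source<TAB>target' lines into a dictionary, skipping
    entries without a target name."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_contigs(bed_lines: Iterable[str],
                   mapping: Dict[str, str]) -> Iterator[str]:
    """Yield bed lines with the first column renamed; lines whose contig
    has no mapping are dropped."""
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        new_name = mapping.get(fields[0])
        if new_name is None:
            continue
        fields[0] = new_name
        yield "\t".join(fields)
```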
Other options
+++++++++++++
-g/--genome-file, -b/--bam-file:

The filter-genome, sanitize-genome and shift methods require a genome to ensure that they do not place intervals outside the limits of contigs. This genome can be supplied either as a samtools or cgat indexed genome, or extracted from the header of a bam file.

Examples

Merge overlapping or adjacent peaks from a ChIP-seq experiment where the intervals have the same name:

cat chip-peaks.bed | cgat bed2bed --method=merge --merge-by-name > chip-peaks-merged.bed

Merge adjacent ChIP-seq peaks if their scores are in the same quartile of all scores:

cat chip-peaks.bed | cgat bed2bed --method=bins --binning-method=equal-intervals --num-bins=4

Remove intervals that overlap the ends of a contig and those that are on a non-standard contig. Take the input intervals from a file rather than stdin. Note that hg19.fasta has been indexed with index_genome:

cgat bed2bed --method=filter-genome --genome-file=hg19.fasta -I chip-peaks.bed -O chip-peaks-sanitized.bed

Convert a bed file containing gene structures with one line per exon to a bed12 with linked blocks representing the gene structure. Note the transparent use of compressed input and output files:

cgat bed2bed --method=block -I transcripts.bed.gz -O transcripts.blocked.bed.gz

Rename UCSC chromosomes to ENSEMBL:

cat ucsc.bed | cgat bed2bed --method=rename-chr --rename-chr-file=ucsc2ensembl.txt > ensembl.bed

Usage

cgat bed2bed --method=[METHOD] [OPTIONS]

Reads a bed file from stdin and applies the specified method.

Command line options

usage: bed2bed [-h]
               [-m {merge,filter-genome,bins,block,sanitize-genome,shift,extend,filter-names,rename-chr}]
               [--num-bins NUM_BINS] [--bin-edges BIN_EDGES]
               [--binning-method {equal-bases,equal-intervals,equal-range}]
               [--merge-distance MERGE_DISTANCE]
               [--merge-min-intervals MERGE_MIN_INTERVALS] [--merge-by-name]
               [--merge-and-resolve-blocks] [--merge-stranded]
               [--remove-inconsistent-names] [--offset OFFSET]
               [-g GENOME_FILE] [-b BAM_FILE] [--filter-names-file NAMES]
               [--rename-chr-file RENAME_CHR_FILE] [--timeit TIMEIT_FILE]
               [--timeit-name TIMEIT_NAME] [--timeit-header]
               [--random-seed RANDOM_SEED] [-v LOGLEVEL]
               [--log-config-filename LOG_CONFIG_FILENAME]
               [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
               [-E STDERR] [-S STDOUT]