These tracks display evidence of open chromatin in multiple cell types
from the Duke/UNC/UT-Austin/EBI
ENCODE group. Open chromatin was identified using two independent and
complementary methods: DNaseI hypersensitivity (HS) and Formaldehyde-Assisted
Isolation of Regulatory Elements (FAIRE), combined with chromatin
immunoprecipitation (ChIP) for select regulatory factors. Each method was
verified by two detection platforms: Illumina (formerly Solexa) sequencing by
synthesis, and high-resolution 1% ENCODE tiled microarrays supplied by
NimbleGen.
DNaseI HS data: DNaseI is an enzyme that has long been used to map
general chromatin accessibility, and DNaseI "hyperaccessibility" or
"hypersensitivity" is a feature of active cis-regulatory sequences. The use of
this method has led to the discovery of functional regulatory elements that
include enhancers, silencers, insulators, promoters, locus control regions and
novel elements. DNaseI hypersensitivity signifies chromatin accessibility
following binding of trans-acting factors in place of a canonical nucleosome.
FAIRE data: FAIRE (Formaldehyde-Assisted Isolation of Regulatory
Elements) is a method to isolate and identify nucleosome-depleted regions of
the genome. FAIRE was initially discovered in yeast and subsequently shown to
identify active regulatory elements in human cells (Giresi et al.,
2007). Although less well-characterized than DNase, FAIRE also appears to
identify functional regulatory elements that include enhancers, silencers,
insulators, promoters, locus control regions and novel elements. DNA fragments
isolated by FAIRE are 100-200 bp in length, with the average length being 140
bp.
ChIP data: ChIP (Chromatin Immunoprecipitation) is a method to identify
the specific location of proteins that are directly or indirectly bound to
genomic DNA. By identifying the binding location of sequence-specific
transcription factors, general transcription machinery components, and chromatin
factors, ChIP can help in the functional annotation of the open chromatin
regions identified by DNaseI HS mapping and FAIRE.
Display Conventions and Configuration
This track is a multi-view composite track that contains multiple data types
(views). For each view, there are multiple subtracks that display
individually on the browser. Instructions for configuring multi-view tracks
are here.
Chromatin data displayed here represents a continuum of signal intensities.
The Crawford lab recommends setting the "Data view scaling:
auto-scale" option when viewing signal data in full
mode. In general, for each experiment in each of the cell types, the Open
Chromatin tracks contain the following views:
Peaks
Regions of enriched signal in either
DNaseI HS, FAIRE, or ChIP experiments. Peaks were called based on signals
created using F-Seq, a software program developed at Duke (Boyle et al.,
2008b). Significant regions were determined by performing ROC analysis of
sequence data using data from the 1% ENCODE arrays, and determining a cut-off
value at approximately the 95% sensitivity level. The solid vertical line in
the peak represents the point with highest signal. ENCODE Peaks tables contain
a p-value for statistical significance. For these data, this was determined by
fitting the data to a gamma distribution.
Peaks (Zinba)
Enriched regions for FAIRE data were called
using ZINBA (Zero Inflated Negative Binomial Algorithm). ZINBA is a flexible
statistical method that uses a generalized linear model to select genomic
windows with enriched sequence counts after adjusting for relevant confounding
factors such as mappability, GC content, and copy number alterations.
Significant regions are selected using the set of standardized residuals
below a false discovery rate (qvalue) threshold. Peaks were further refined
using a shape detection algorithm to identify local maxima and boundaries of
the Signal (Base Overlap) data within each significant region.
Signal (F-Seq Density)
Density graph (wiggle) of signal
enrichment calculated using F-Seq for the combined set of sequences from all
replicates. F-Seq employs Parzen kernel density estimation to create base pair
scores (Boyle et al., 2008b). This method does not look at fixed-length
windows but rather weights the contribution of each nearby sequence according to
its distance from that base, with closer sequences contributing more. It only
considers sequences aligned 4 or fewer
times in the genome, and uses an alignability background model to try to correct
for regions where sequences cannot be aligned. For the K562, HepG2 and HeLa-S3
cell types, where there is an abnormal karyotype, a model to try to correct
for amplifications and deletions was also used. No control data were used in the
creation of these annotations.
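
The kernel density idea behind this view can be illustrated with a minimal sketch. It is not the F-Seq implementation: the Gaussian kernel, the bandwidth value, and the function names are assumptions chosen for illustration, and the alignability and copy number corrections described above are omitted.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_density(read_positions, start, end, bandwidth=200.0):
    """Kernel density estimate at every base in the half-open range [start, end).

    read_positions: genomic coordinates of aligned reads on one chromosome.
    bandwidth: smoothing width in bp (hypothetical value, not taken from F-Seq).
    """
    n = len(read_positions)
    densities = []
    for base in range(start, end):
        # Every read contributes; reads closer to this base contribute more.
        total = sum(gaussian_kernel((base - pos) / bandwidth)
                    for pos in read_positions)
        densities.append(total / (n * bandwidth) if n else 0.0)
    return densities
```

Because every read contributes to every base with a weight that falls off smoothly with distance, the result does not depend on where fixed window boundaries would have been placed.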
Signal (Base Overlap)
An alternative version of the
Signal (F-Seq Density) track annotation that provides a higher resolution
view of the raw sequence data. This track also includes the combined set of
sequences from all replicates. For each sequence, the aligned read is extended
in the following way: for DNase, the read is extended 5 bp in both directions
from its 5' aligned end where DNase cut the DNA; for FAIRE and ChIP, the
sequence is extended to a fragment length of 134 bp from the 5' aligned end
representing the approximate average fragment length. The score at each base
pair represents the number of extended fragments that overlap the base pair.
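
The extension and counting rules above can be sketched as follows. This is an illustrative sketch only: the read representation (a 0-based 5' coordinate plus a strand), the function names, and the choice to count the cut base itself within the DNase window are assumptions, not details taken from the production pipeline.

```python
from collections import Counter


def extended_interval(five_prime, strand, assay, fragment_len=134, dnase_flank=5):
    """Return the half-open interval [start, end) covered by one extended read.

    five_prime: 0-based coordinate of the read's 5' aligned end.
    strand: "+" or "-".
    assay: "DNase", "FAIRE", or "ChIP".
    """
    if assay == "DNase":
        # Extend 5 bp in both directions around the DNase cut site
        # (including the cut base itself is an assumption).
        return five_prime - dnase_flank, five_prime + dnase_flank + 1
    # FAIRE and ChIP: extend from the 5' end to the average fragment length,
    # in the direction the read points.
    if strand == "+":
        return five_prime, five_prime + fragment_len
    return five_prime - fragment_len + 1, five_prime + 1


def base_overlap(reads, assay):
    """Count, for each base, how many extended fragments overlap it.

    reads: iterable of (five_prime, strand) tuples from all replicates.
    """
    counts = Counter()
    for five_prime, strand in reads:
        start, end = extended_interval(five_prime, strand, assay)
        for pos in range(max(start, 0), end):
            counts[pos] += 1
    return counts
```

For example, `base_overlap([(100, "+"), (150, "-")], "FAIRE")` scores every base from 100 through 150 as 2, because both extended fragments cover that stretch.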
Alignments
Mappings of short reads to the genome (currently
only available for
download).
Additional data that were used to generate these tracks
are located in the ENCODE Mappability track:
Uniqueness
The Duke uniqueness tracks were used to
identify regions of unique sequence for different tag lengths. The tracks
also identify regions where high-throughput sequence tags cannot be mapped.
Excluded Regions
The Duke excluded regions track
was used to identify problematic regions for short sequence tag signal
detection (such as satellites and rRNA genes). These regions of the genome were
excluded from the Open Chromatin tracks.
DNaseI hypersensitive sites were isolated using methods called DNase-seq or
DNase-chip (Boyle et al., 2008a, Crawford et al., 2006).
Briefly, cells were lysed with NP40, and intact nuclei were digested with optimal
levels of DNaseI enzyme. DNaseI digested ends were captured from three different
DNase concentrations, and material was sequenced using Illumina (Solexa)
sequencing. DNase-seq data were verified using material that was hybridized to
NimbleGen Human ENCODE tiling arrays (1% of the genome). Multiple independent
growths (replicates) were compared to verify the reproducibility of the data.
A more detailed protocol is available
here.
FAIRE was performed (Giresi et al., 2007) by cross-linking proteins
to DNA using 1% formaldehyde solution, and the complex was sheared using
sonication. Phenol/chloroform extractions were performed to remove DNA
fragments cross-linked to protein. The DNA recovered in the aqueous phase was
hybridized to NimbleGen Human ENCODE tiling arrays (1% of the genome) and sequenced
using a Solexa sequencing system. The ENCODE array data were used to verify
the accuracy of the sequencing data, and multiple independent growths
(replicates) were compared to assess the reproducibility of the data.
A more detailed protocol is available
here.
Also see Giresi et al., 2009.
To perform ChIP, proteins were cross-linked to DNA in vivo
using 1% formaldehyde solution (Bhinge et al., 2007, ENCODE Project
Consortium, 2007). Cross-linked chromatin was sheared by sonication and
immunoprecipitated using a specific antibody against the protein of interest.
After reversal of the cross-links, the immunoprecipitated DNA was used to
identify the genomic location of transcription factor binding. This was
accomplished by Solexa sequencing of the ends of the immunoprecipitated DNA
(ChIP-seq), as well as labeling and hybridization of the immunoprecipitated
DNA to NimbleGen Human ENCODE tiling arrays (1% of the genome) along with the
input DNA as reference (ChIP-chip). The ENCODE array data were used to verify
the accuracy of the sequencing data, and multiple independent growths
(replicates) were compared to assess the reproducibility of the data. A more
detailed protocol is available
here.
ENCODE Array data were normalized using the Tukey biweight normalization, and
peaks were called using ChIPOTle (Buck et al., 2005) at multiple
levels of significance. Regions matched on size to these peaks that were
devoid of any significant signal were also created to allow for ROC analysis.
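
As a rough illustration of the Tukey biweight, the sketch below computes a one-step biweight estimate of the center of a set of array log ratios and subtracts it. How the estimate was applied to the ENCODE array data is not described here, so the centering step, the tuning constant `c`, and the function names are assumptions.

```python
import statistics


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight estimate of the center of `values`.

    Points far from the median (relative to the median absolute deviation)
    receive weight zero, so outliers do not pull the estimate.
    """
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    weights = []
    for v in values:
        u = (v - center) / (c * mad + epsilon)
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def center_log_ratios(log_ratios):
    """Subtract the biweight center from every probe value (assumed usage)."""
    center = tukey_biweight(log_ratios)
    return [v - center for v in log_ratios]
```

Unlike a plain mean, the biweight center is barely affected by the small number of strongly enriched probes that the experiment is designed to find.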
Sequences from each experiment were aligned to the genome using Maq (Li
et al., 2008) and those that aligned to 4 or fewer locations were retained.
Other sequences were also filtered based on their alignment to problematic
regions (such as satellites and rRNA genes). The resulting digital signal was
converted to a continuous wiggle track using F-Seq that employs Parzen kernel
density estimation to create base pair scores (Boyle et al., 2008b).
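
The two read filters described above can be sketched as shown below. The record fields (`chrom`, `start`, `end`, `n_locations`) and the function names are hypothetical; they stand in for whatever the aligner output and the excluded regions file actually provide.

```python
def overlaps(start, end, regions):
    """True if the half-open interval [start, end) overlaps any region."""
    return any(start < r_end and r_start < end for r_start, r_end in regions)


def filter_reads(reads, excluded, max_locations=4):
    """Keep reads that align to few locations and avoid problematic regions.

    reads: list of dicts with hypothetical keys
           "chrom", "start", "end", "n_locations".
    excluded: dict mapping chromosome name to a list of (start, end) regions,
              for example satellites and rRNA genes.
    """
    kept = []
    for read in reads:
        if read["n_locations"] > max_locations:
            continue  # aligned to more than 4 locations
        if overlaps(read["start"], read["end"], excluded.get(read["chrom"], [])):
            continue  # falls in an excluded region
        kept.append(read)
    return kept
```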
Discrete DNase HS, FAIRE, and ChIP sites (peaks) were identified from
DNase/FAIRE/ChIP-seq using F-Seq by setting a Parzen cutoff based on ROC curve
analysis using peaks and non-peaks identified from DNase/FAIRE/ChIP-chip using
NimbleGen Human ENCODE tiling arrays (1% of the genome).
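
A minimal sketch of choosing a cutoff from ROC-style analysis follows. It assumes that each array-defined peak and each matched non-peak region has already been assigned a single sequencing-derived score; the function names and that scoring scheme are illustrative assumptions, not the pipeline's actual code.

```python
import math


def cutoff_at_sensitivity(positive_scores, target=0.95):
    """Largest cutoff at which at least `target` of positives score >= cutoff.

    positive_scores: scores of regions called as peaks on the arrays.
    """
    ranked = sorted(positive_scores, reverse=True)
    # Small tolerance guards against floating-point error in target * n.
    needed = max(1, math.ceil(target * len(ranked) - 1e-9))
    return ranked[needed - 1]


def sensitivity_specificity(positive_scores, negative_scores, cutoff):
    """Fractions of array peaks recovered and of non-peaks rejected at `cutoff`."""
    true_pos = sum(s >= cutoff for s in positive_scores)
    true_neg = sum(s < cutoff for s in negative_scores)
    return true_pos / len(positive_scores), true_neg / len(negative_scores)
```

Sweeping the cutoff over all observed scores and plotting sensitivity against one minus specificity would trace the full ROC curve; the sketch picks only the single operating point near 95% sensitivity.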
Input data were generated for GM12878, K562, HeLa-S3, HepG2, and HUVEC.
These were used directly to create a control/background model used for
F-Seq when generating signal annotations and subsequently peaks for these
cell lines. These models are meant to correct for sequencing biases,
alignment artifacts, and copy number changes in these cell lines. Input
data are not being generated directly for other cell lines. Instead, a
general background model was derived from the five Input data sets. This
should provide corrections for sequencing biases and alignment artifacts,
but not for cell-type-specific copy number changes.
Release Notes
This is Release 3 (Mar 2010) of this track, which includes 18 new cell line
or cell/treatment experiments. In addition, a number of new experiments
were added to existing cell lines. Almost all Peaks have been called anew
using improved cut-offs and p-values. Finally, a second type of peak called
using a ZINBA algorithm has been provided for several of the FAIRE-seq
experiments. For all new versions of previously-released data, the affected
database tables and files include 'V2' or 'V3' in the name, and metadata is
marked with "submittedDataVersion=V", followed by a number and reason for
replacement. Previous versions of these files are available for download from the
FTP site.
Credits
These data and annotations were created by a collaboration of multiple
institutions (contact:
Terry Furey):
Data users may freely use ENCODE data, but may not, without prior consent,
submit publications that use an unpublished ENCODE dataset until nine months
following the release of the dataset. This date is listed in the Restricted
Until column on the track configuration page and the download page. The
full data release policy for ENCODE is available here.