This track was produced as part of the ENCODE Transcriptome Project
and shows the starts and ends of full-length mRNA transcripts determined
by Gene Identification Signature (GIS) paired-end ditag (PET) sequencing using
RNA extracts from different sub-cellular localizations in different cell lines.
Short tags used in GIS-PET sequencing provide signatures
of the 5' start and the 3' end of individual mRNA transcripts, thus
demarcating the first and last exon, and contain enough coding information
to map the tags uniquely to the genome, in turn making it possible to
identify unconventional fusion transcripts. The 5' and 3' tags are extracted
by restriction enzyme digestion and ligated together to form a ditag for sequencing.
The 3' tag includes two adenine bases from the polyA tail, which reduces
the relative amount of unique sequence.
The RNA-PET information provided in this track is composed of two different
PET length versions based on how the PETs were extracted using different
restriction enzymes. The cloning-based PET method (18 bp and 16 bp for the
5' and 3' ends, respectively) is the earlier version (Ng et al., 2006). The
cloning-free PET approach (27 bp and 25 bp for the 5' and 3' ends, respectively) is a
more recent modification that uses the Type III restriction enzyme EcoP15I
to generate longer PETs (Ruan and Ruan, 2012),
which significantly improves
both library construction and mapping efficiency. PET templates from both
versions were sequenced on the Illumina platform using 2 x 36 bp paired-end
sequencing.
See the Methods and References sections below for more details.
Display Conventions and Configuration
This track is a multi-view composite track that contains multiple data types
(views). For each view, there are multiple subtracks that display
individually on the browser. Instructions for configuring multi-view tracks
are here.
Color differences among the views are arbitrary. They provide a visual cue for
distinguishing between the different cell types and compartments.
Clusters
The Clusters view shows clusters built from the alignments.
In the graphical display, the ends are represented by blocks connected by a
horizontal line. In full and packed display modes, the arrowheads on the
horizontal line represent the direction of transcription. Although some of
the subtracks have score information, most do not, and score filtering
has been disabled.
Plus Raw Signal
The Plus Raw Signal view graphs the base-by-base density of tags on the forward strand.
Minus Raw Signal
The Minus Raw Signal view graphs the base-by-base density of tags on the reverse strand.
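As a rough illustration of what a base-by-base tag density means, the sketch below counts how many tags cover each position on one strand. The tuple layout, the 0-based half-open coordinates, and the function name strand_density are assumptions made for this example. They are not taken from the track's actual signal-generation pipeline.

```python
from collections import Counter

def strand_density(tags, strand):
    """Count how many tags cover each base on the requested strand.

    tags: iterable of (start, end, strand) tuples, assumed to use
    0-based, half-open coordinates on a single chromosome.
    """
    depth = Counter()
    for start, end, tag_strand in tags:
        if tag_strand != strand:
            continue  # tags on the other strand belong to the other view
        for pos in range(start, end):
            depth[pos] += 1
    return depth

# Hypothetical tags: two overlapping forward tags and one reverse tag.
tags = [(100, 105, "+"), (103, 108, "+"), (200, 205, "-")]
plus = strand_density(tags, "+")
minus = strand_density(tags, "-")
print(sorted(plus.items()))
print(sorted(minus.items()))
```

Positions covered by both forward tags get a depth of 2, which corresponds to a taller bar in a signal graph.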
Alignments
The Alignments view shows alignment of individual PET sequences.
The alignment file follows the standard SAM/BAM format indicated in the
SAM Format Specification.
Some files also use the tag XA, generated by Bowtie, to represent the total
number of mismatches in the tag.
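For readers who work with the alignment files directly, the following minimal sketch shows how optional fields such as XA can be read from a single SAM text line. It relies only on the SAM layout of eleven mandatory tab-separated fields followed by TAG:TYPE:VALUE fields. The example record is invented for illustration, and the interpretation of XA as a mismatch count follows the description above.

```python
def parse_sam_tags(sam_line):
    """Split one SAM alignment line into mandatory fields and optional tags."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:
        # Optional fields have the form TAG:TYPE:VALUE; the value may contain ':'.
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return fields[:11], tags

# Hypothetical 27 bp alignment record (not taken from the track data).
example = "\t".join([
    "pet_0001", "0", "chr1", "1000", "255", "27M", "*", "0", "0",
    "ACGT" * 6 + "ACG", "*", "XA:i:1",
])
mandatory, optional = parse_sam_tags(example)
print(mandatory[2], mandatory[3], optional.get("XA"))
```

Using optional.get("XA") rather than direct indexing accounts for the fact that only some files carry this tag.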
Metadata for a particular subtrack can be found by clicking the down arrow
in the list of subtracks.
Methods
Cells were grown according to the approved
ENCODE cell culture protocols.
Two different GIS RNA-PET protocols were used to generate the full-length transcriptome
PETs: one is based on a cloning-free RNA-PET library construction and sequencing strategy (Ruan and Ruan, 2012),
and the other is based on cloning-based library construction (Ng et al., 2005)
followed by Illumina paired-end sequencing.
Cloning-free RNA-PET (52 bp reads, 27 bp and 25 bp tag for each of the 5' and 3' ends)
Method:
The cloning-free RNA-PET libraries were generated from polyA mRNA
samples and constructed using a recently modified GIS protocol (Ruan and Ruan, 2012).
High-quality total RNA was used as starting material and purified with a
MACs polyT column to obtain full-length polyA mRNAs. Approximately 5 µg
of enriched polyA mRNA was reverse transcribed
into full-length cDNA. Specific linker sequences
were ligated to the full-length cDNA. The modified cDNA was circularized
by ligation, generating circular cDNA molecules. The 27 bp tag from each
end of the full-length cDNA was extracted by type III enzyme EcoP15I
digestion. The resulting PETs were ligated with sequencing adaptors at
both ends, amplified by PCR, and further purified as complex templates
for paired-end sequencing using Illumina platforms.
Data:
The sequenced RNA-PETs resulted in reads of 27 bp and 25 bp corresponding
to the 5' and 3' end of each cDNA, respectively.
Redundant and noisy reads were excluded from downstream
analysis. Strand-specific orientation of each PET was determined using
the barcode built into the sequencing template. The oriented RNA-PET was
mapped onto the reference genome allowing up to two mismatches. The majority
of the PETs mapped to known transcripts. A small portion of PETs,
defined as discordant PETs, had tags that mapped too far from each other, in the wrong
orientation, or to different chromosomes. These discordant PETs indicate
the existence of transcript variants that could be caused by
genomic structural variants (such as fusions, deletions, insertions,
inversions, tandem repeats, or translocations) or by RNA trans-splicing.
Cloning-based RNA-PET (34 bp reads, 18 bp and 16 bp tag for each of the 5' and 3' ends)
Method:
The cloning-based RNA-PET (GIS-PET) libraries were generated from polyA RNA samples
and constructed using the protocol described by Ng et al. (2005). Good-quality
total RNA was used as starting material and further purified with a MACs polyT column
to enrich polyA mRNA. Approximately 10 µg of polyA-enriched mRNA was
reverse transcribed, resulting in full-length cDNA. The obtained full-length
cDNA was modified with specific linker sequences and ligated to a GIS-developed (pGIS4)
vector. The resulting plasmids form a complex full-length cDNA library, which
was cloned into E. coli. The plasmid DNA was then isolated from the library,
followed by MmeI (a type II enzyme) digestion to generate 18 bp/16 bp
ditags from the ends of the full-length cDNA. Single ditags (PETs) were then
ligated to form a diPET structure (a concatemer of two unrelated PETs joined by
a linker sequence) to facilitate Illumina paired-end sequencing.
Data:
Sequencing of cloning-based RNA-PETs resulted in paired reads of 18 bp
and 16 bp corresponding to the 5' and 3' end of each cDNA, respectively. Redundant
reads were filtered out and unique reads were retained for analysis. PET sequences
were then mapped to the reference genome (GRCh37/hg19, excluding mitochondrion,
haplotypes, randoms and chromosome Y) using the following criteria
(Ruan et al., 2007):
A minimal continuous 16 bp match must exist for the 5' signature; the 3' signature
must have a minimal continuous 14 bp match
Both 5' and 3' signatures must be present on the same chromosome
Their 5' to 3' orientation must be correct (5' signature followed by 3' signature)
The maximal genomic span of a PET genomic alignment must be less than one million bp
PETs mapping to 2-10 locations are also included and may represent duplicated genes
or pseudogenes in the genome.
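The mapping criteria above can be sketched as a simple check on a pair of mapped tags. This is an illustration only, not the pipeline used to produce the data. The Tag class, the failure labels, and the reading of "correct orientation" as both tags on the same strand with the 5' tag upstream of the 3' tag in the direction of transcription are assumptions. The rule for PETs mapping to 2-10 locations is not modeled.

```python
from dataclasses import dataclass

MAX_SPAN = 1_000_000   # genomic span must be less than one million bp
MIN_MATCH_5P = 16      # minimal continuous match for the 5' signature
MIN_MATCH_3P = 14      # minimal continuous match for the 3' signature

@dataclass
class Tag:
    chrom: str
    start: int      # 0-based start (assumed convention)
    end: int        # half-open end
    strand: str     # "+" or "-"
    match_len: int  # longest continuous match, in bp

def pet_failures(five, three):
    """Return the list of criteria that a 5'/3' tag pair fails (empty if none)."""
    failures = []
    if five.match_len < MIN_MATCH_5P or three.match_len < MIN_MATCH_3P:
        failures.append("match_too_short")
    if five.chrom != three.chrom:
        failures.append("different_chromosomes")
        return failures  # order and span are undefined across chromosomes
    if five.strand != three.strand:
        failures.append("wrong_orientation")
        return failures
    # Assumption: the 5' tag lies upstream of the 3' tag along the transcript.
    if five.strand == "+":
        ordered = five.start <= three.start
    else:
        ordered = three.start <= five.start
    if not ordered:
        failures.append("wrong_orientation")
    span = max(five.end, three.end) - min(five.start, three.start)
    if span >= MAX_SPAN:
        failures.append("span_too_large")
    return failures

# Hypothetical tag pair that satisfies every modeled criterion.
five = Tag("chr1", 1000, 1018, "+", 18)
three = Tag("chr1", 5000, 5016, "+", 16)
print(pet_failures(five, three))
```

Returning the failed criteria instead of a single boolean makes it easier to see why a given PET would be set aside.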
A majority of the PETs mapped to known transcripts or splice variants. A small
portion of PETs, defined as discordant PETs, had tags that mapped either too far
from each other, in the wrong orientation, or
to different chromosomes. The presence of discordant PETs indicates that some
transcript variants exist. These variants could be caused by genomic
structural variants (such as fusions, deletions, insertions, inversions, tandem
repeats, or translocations) or by RNA trans-splicing.
Clusters
PETs were clustered using the following procedure. The mapping location of
the 5' and 3' tag of a given PET was extended by 100 bp in both directions
creating 5' and 3' search windows. If the 5' and 3' tags of a second PET mapped
within the 5' and 3' search window of the first PET then the two PETs were clustered
and the search windows were adjusted so that they contained the tag
extensions of the second PET. PETs which subsequently
mapped with their 5' and 3' tags within the adjusted 5' and 3' search
window, respectively, were also assigned to this cluster
and the search window was readjusted. This iterative process continued
until no new PETs fell within the search window. The whole procedure was
then repeated with the remaining PETs until all PETs were assigned to a cluster.
The total count of PET sequences mapped to the same locus but with
slight nucleotide differences may reflect the expression level of the
transcripts. PETs that mapped to multiple locations may represent low
complexity or repetitive sequences.
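The clustering procedure described above can be sketched as a greedy, seed-and-grow loop. This is a minimal sketch under stated assumptions, not the code used to build the Clusters files. The dictionary layout, the requirement that clustered PETs share chromosome and strand, the reading of "within" as full containment of a tag in the window, and seeding in input order are all choices made for this example.

```python
EXTENSION = 100  # bp added on both sides of each tag

def extend(interval):
    """Extend a (start, end) interval by EXTENSION bp in both directions."""
    start, end = interval
    return (max(0, start - EXTENSION), end + EXTENSION)

def contains(window, interval):
    """True if the interval lies entirely inside the window."""
    return window[0] <= interval[0] and interval[1] <= window[1]

def merge(window, interval):
    """Smallest window covering both the old window and the interval."""
    return (min(window[0], interval[0]), max(window[1], interval[1]))

def cluster_pets(pets):
    """Group PETs into clusters; returns lists of indices into pets."""
    unassigned = list(range(len(pets)))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        locus = (pets[seed]["chrom"], pets[seed]["strand"])
        members = [seed]
        win5 = extend(pets[seed]["five"])
        win3 = extend(pets[seed]["three"])
        grew = True
        while grew:  # keep scanning until no new PET falls within the windows
            grew = False
            for idx in list(unassigned):
                pet = pets[idx]
                if ((pet["chrom"], pet["strand"]) == locus
                        and contains(win5, pet["five"])
                        and contains(win3, pet["three"])):
                    members.append(idx)
                    unassigned.remove(idx)
                    # Adjust the windows to include the new PET's tag extensions.
                    win5 = merge(win5, extend(pet["five"]))
                    win3 = merge(win3, extend(pet["three"]))
                    grew = True
        clusters.append(members)
    return clusters

# Hypothetical PETs: the third joins only after the second widens the windows.
pets = [
    {"chrom": "chr1", "strand": "+", "five": (1000, 1027), "three": (5000, 5025)},
    {"chrom": "chr1", "strand": "+", "five": (1050, 1077), "three": (5040, 5065)},
    {"chrom": "chr1", "strand": "+", "five": (1150, 1177), "three": (5120, 5145)},
    {"chrom": "chr1", "strand": "+", "five": (9000, 9027), "three": (12000, 12025)},
]
clusters = cluster_pets(pets)
print(clusters)
```

The size of each resulting cluster is the PET count for that locus, which the text above relates to the expression level of the transcript.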
Verification
To assess overall PET quality and mapping specificity, the top ten
most abundant PET clusters that mapped to well-characterized
known genes were examined. Over 99% of the PETs represented full-length
transcripts, and the majority fell within 10 bp of the
known 5' and 3' boundaries of these transcripts. The PET mapping was
further verified by confirming the existence of physical
cDNA clones represented by the ditags. PCR primers were designed
based on the PET sequences and amplified the corresponding cDNA
inserts either from full-length cDNA library (cloning-based PET)
or from isolated total RNA (cloning-free PET) for sequencing confirmation.
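The boundary comparison used in this assessment can be expressed as a small check. The sketch below computes the fraction of PETs whose ends fall within a tolerance of a known transcript's 5' and 3' boundaries. The function names, the coordinate pairs, and the example values are hypothetical and are not taken from the verification data.

```python
def within_tolerance(pet_start, pet_end, known_start, known_end, tolerance=10):
    """True if both PET ends lie within `tolerance` bp of the known boundaries."""
    return (abs(pet_start - known_start) <= tolerance
            and abs(pet_end - known_end) <= tolerance)

def fraction_within(pets, known, tolerance=10):
    """Fraction of (start, end) PETs that match the known (start, end) boundaries."""
    hits = sum(within_tolerance(s, e, known[0], known[1], tolerance)
               for s, e in pets)
    return hits / len(pets)

# Hypothetical PET spans compared against one known transcript.
example_pets = [(1000, 5000), (1008, 4995), (1030, 5000)]
known_transcript = (1002, 5001)
fraction = fraction_within(example_pets, known_transcript)
print(round(fraction, 3))
```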
Release Notes
This is Release 2 (Aug 2012) of this track. It adds data for tier 2
cell lines (A549, SK-N-SH, IMR90, and MCF-7). This newer data has no
scores in the Clusters files.
Note: As mentioned above, this track mixes two different methodologies.
The cloning-based data has functioning score fields in
the Clusters files, which could be used for filtering or shading.
However, the cloning-free data either has scores that are not scaled
well or scores that are set to zero for all items. Therefore, the scores
are useful for some tables and not for others.
Credits
The GIS RNA-PET libraries and sequence data for transcriptome
analysis were generated and analyzed by
scientists Xiaoan Ruan, Atif Shahab, Chialin Wei, and Yijun Ruan at the
Genome Institute of Singapore.
Data users may freely use ENCODE data, but may not, without prior
consent, submit publications that use an unpublished ENCODE dataset until
nine months following the release of the dataset. This date is listed in
the Restricted Until column, above. The full data release policy
for ENCODE is available
here.