GENCODE Versions GENCODE Genes V14 Track Settings

Gene Annotations from ENCODE/GENCODE Version 14

Track collection: Container of all new and previous GENCODE releases

Name	View	Track Name
Basic	Genes	Basic Gene Annotation Set from ENCODE/GENCODE Version 14
Comprehensive	Genes	Comprehensive Gene Annotation Set from ENCODE/GENCODE Version 14
Pseudogenes	Genes	Pseudogene Annotation Set from ENCODE/GENCODE Version 14
2-way Pseudogenes	2-way	2-way Pseudogene Annotation Set from ENCODE/GENCODE Version 14
PolyA	PolyA	PolyA Transcript Annotation Set from ENCODE/GENCODE Version 14

Assembly: Human Feb. 2009 (GRCh37/hg19)

Description

The GENCODE Genes track (version 14, October 2012) shows high-quality manual annotations merged with evidence-based automated annotations across the entire human genome, generated by the GENCODE project. The GENCODE gene set presents a full merge between the HAVANA manual annotation process and the Ensembl automatic annotation pipeline. Priority is given to the manually curated HAVANA annotation; predicted Ensembl annotations are used where there are no corresponding manual annotations. The annotation was carried out on genome assembly GRCh37 (hg19).

As of GENCODE Version 11, Ensembl and GENCODE have converged. The gene annotations in the GENCODE comprehensive set are the same as the corresponding Ensembl release. UCSC will continue to provide a separate Ensembl track on Human in the same format as the Ensembl tracks on other organisms.

NOTE: Due to the UCSC Genome Browser using the NC_001807 mitochondrial genome sequence (chrM) and GENCODE annotating the NC_012920 mitochondrial sequence, the GENCODE mitochondrial annotations have been lifted to NC_001807 coordinates in the UCSC Genome Browser. The original annotations with NC_012920 coordinates are available for download in the GENCODE GTF files.

Display Conventions and Configuration

This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are available in the Genome Browser documentation. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide.

Views available on this track are:

Genes: The gene annotations in this view are divided into three subtracks:

GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section.
GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set.
GENCODE Pseudogenes includes all pseudogene annotations except polymorphic pseudogenes.

2-way

GENCODE 2-way Pseudogenes contains pseudogenes predicted by both the Yale Pseudopipe and UCSC Retrofinder pipelines. The set was derived by looking for 50 base pairs of overlap between pseudogenes derived from both sets based on their chromosomal coordinates. When multiple Pseudopipe predictions map to a single Retrofinder prediction, only one match is kept for the 2-way consensus set.
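The consensus step described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the `Interval` type and function names are hypothetical, 50 bp is treated as a minimum overlap, and the rule for choosing which Pseudopipe match to keep (largest overlap) is an assumption, since the text only says that one match is kept.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical prediction record; coordinates are 0-based, half-open."""
    chrom: str
    start: int
    end: int
    name: str


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals (0 if chromosomes differ)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def two_way_consensus(pseudopipe, retrofinder, min_overlap=50):
    """Pair each Retrofinder prediction with at most one Pseudopipe prediction.

    Assumption: when several Pseudopipe predictions qualify, the one with
    the largest overlap is kept.
    """
    pairs = []
    for retro in retrofinder:
        best, best_bp = None, 0
        for pp in pseudopipe:
            bp = overlap_bp(pp, retro)
            if bp >= min_overlap and bp > best_bp:
                best, best_bp = pp, bp
        if best is not None:
            pairs.append((retro.name, best.name, best_bp))
    return pairs
```

The nested loop is quadratic in the number of predictions; a real implementation would more likely sort by coordinate or use an interval index.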

PolyA

GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of the 3' ends of transcripts containing at least 3 A's not matching the genome.

Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria:

Transcript class: filter by the basic biological function of a transcript annotation
- All - don't filter by transcript class
- coding - display protein coding transcripts, including polymorphic pseudogenes
- nonCoding - display non-protein coding transcripts
- pseudo - display pseudogene transcript annotations
- problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain)
Transcript Annotation Method: filter by the method used to create the annotation
- All - don't filter by transcript annotation method
- manual - display manually created annotations, including those that are also created automatically
- automatic - display automatically created annotations, including those that are also created manually
- manual_only - display manually created annotations that were not annotated by the automatic method
- automatic_only - display automatically created annotations that were not annotated by the manual method
Transcript Biotype: filter transcripts by biotype
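The filter semantics listed above can be illustrated with a short sketch. The `Transcript` record and its `is_manual`/`is_automatic` flags are hypothetical stand-ins for whatever the browser stores internally; only the category names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record for one transcript annotation."""
    transcript_id: str
    transcript_class: str  # "coding", "nonCoding", "pseudo", or "problem"
    is_manual: bool        # annotated by the manual method
    is_automatic: bool     # annotated by the automatic method


# One predicate per Transcript Annotation Method choice.
METHOD_TESTS = {
    "All": lambda t: True,
    "manual": lambda t: t.is_manual,
    "automatic": lambda t: t.is_automatic,
    "manual_only": lambda t: t.is_manual and not t.is_automatic,
    "automatic_only": lambda t: t.is_automatic and not t.is_manual,
}


def filter_transcripts(transcripts, transcript_class="All", method="All"):
    """Keep transcripts matching both the class and the method selection."""
    keep_method = METHOD_TESTS[method]
    return [
        t for t in transcripts
        if (transcript_class == "All" or t.transcript_class == transcript_class)
        and keep_method(t)
    ]
```

Note that "manual" and "automatic" overlap: a transcript produced by both methods passes either filter, while the "_only" variants exclude it.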

Coloring for the gene annotations is based on the annotation type:

coding
non-coding
pseudogene
problem
all 2-way pseudogenes
all polyA annotations

Methods

The GENCODE project aims to annotate all evidence-based gene features on the human reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006).

GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users. The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus.

Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus:
- All full-length coding transcripts (except problem transcripts or transcripts that are nonsense-mediated decay) were included in the basic set.
- If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts).
Criteria for selection of non-coding transcripts at a given locus:
- All full-length non-coding transcripts (except problem transcripts) with a well characterized biotype (see below) were included in the basic set.
- If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts).
If no transcripts were included by either of the above criteria, the longest problem transcript was included.

Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria:

well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA
poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping
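The Basic Set selection rules and the biotype categories above can be summarized in a short sketch. It is one reading of the criteria, not the GENCODE implementation: the `Transcript` record is hypothetical, "largest" is taken to mean transcript length for non-coding and problem transcripts, and nonsense-mediated decay is detected through an assumed `nonsense_mediated_decay` biotype value.

```python
from dataclasses import dataclass

# Well-characterized non-coding biotypes, as listed in the text above.
WELL_CHARACTERIZED = {
    "antisense", "Mt_rRNA", "Mt_tRNA", "miRNA", "rRNA", "snRNA", "snoRNA",
}


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record for one transcript at a locus."""
    transcript_id: str
    kind: str          # "coding" or "nonCoding"
    biotype: str
    full_length: bool
    problem: bool
    length: int        # transcript length in bases
    cds_length: int = 0


def select_basic(locus):
    """Return the transcripts of one locus chosen for the basic set."""
    coding = [t for t in locus if t.kind == "coding" and not t.problem]
    noncoding = [t for t in locus if t.kind == "nonCoding" and not t.problem]

    # Coding: all full-length, non-NMD transcripts; else largest partial CDS.
    chosen = [
        t for t in coding
        if t.full_length and t.biotype != "nonsense_mediated_decay"
    ]
    if not chosen:
        partial = [t for t in coding if not t.full_length]
        if partial:
            chosen = [max(partial, key=lambda t: t.cds_length)]

    # Non-coding: all full-length well-characterized; else the largest one.
    nc = [
        t for t in noncoding
        if t.full_length and t.biotype in WELL_CHARACTERIZED
    ]
    if not nc and noncoding:
        nc = [max(noncoding, key=lambda t: t.length)]
    chosen = chosen + nc

    # Last resort: the longest problem transcript.
    if not chosen:
        problems = [t for t in locus if t.problem]
        if problems:
            chosen = [max(problems, key=lambda t: t.length)]
    return chosen
```

Coding and non-coding transcripts are handled independently, so a locus can contribute transcripts from both branches; the problem-transcript fallback only applies when both branches come up empty.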

Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl.

The mRNA and EST alignments are compared to the GENCODE transcripts, and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support for a GENCODE transcript annotation actually being expressed in humans. Human transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, and BLAT RNA and EST alignments from the UCSC Genome Browser Database, are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments.

Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ.
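The multi-exon matching rule can be illustrated by comparing intron chains. This is a simplified sketch under stated assumptions: exons are given as hypothetical `(start, end)` half-open blocks on one strand of one chromosome, small insertions and deletions are assumed to have been merged into the blocks already, and a single alignment is required to cover the whole intron chain.

```python
def introns(exons):
    """Return the intron chain implied by a list of (start, end) exon blocks."""
    exons = sorted(exons)
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def supports(annotation_exons, alignment_exons):
    """True if the alignment supports the annotated multi-exon structure.

    Every intron boundary must match exactly, and the alignment must not
    imply additional exons. Transcript start and end may differ, because
    only intron coordinates are compared.
    """
    annotated = introns(annotation_exons)
    if not annotated:
        return False  # single-exon models are not evaluated
    return introns(alignment_exons) == annotated
```

For example, an alignment whose first exon starts later and whose last exon ends later than the annotation still counts as support, while one with a splice site shifted by a single base does not.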

The following categories are assigned to each of the evaluated annotations:

tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA
tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs
tsl3 - the only support is from a single EST
tsl4 - the best supporting EST is flagged as suspect
tsl5 - no single transcript supports the model structure
tslNA - the transcript was not analyzed for one of the following reasons:
- pseudogene annotation, including transcribed pseudogenes
- human leukocyte antigen (HLA) transcript
- immunoglobulin gene transcript
- T-cell receptor transcript
- single-exon transcript (will be included in a future version)
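One way to read the category definitions above is as an ordered decision, sketched below. The `Evidence` record and the precedence between the rules are assumptions made for illustration; the list above does not spell out how ties or mixed evidence are resolved, so this should not be taken as the GENCODE scoring code.

```python
from typing import NamedTuple


class Evidence(NamedTuple):
    """Hypothetical record for one alignment supporting the full structure."""
    kind: str      # "mRNA" or "EST"
    suspect: bool  # flagged as suspect by one of the curated lists


def assign_tsl(supporting, evaluated=True):
    """Assign a support level from the alignments that support a transcript.

    `supporting` holds only alignments that support the whole model;
    `evaluated` is False for transcripts excluded from the analysis.
    """
    if not evaluated:
        return "tslNA"
    mrnas = [e for e in supporting if e.kind == "mRNA"]
    ests = [e for e in supporting if e.kind == "EST"]
    good_ests = [e for e in ests if not e.suspect]
    if any(not e.suspect for e in mrnas):
        return "tsl1"   # at least one non-suspect mRNA
    if mrnas or len(good_ests) >= 2:
        return "tsl2"   # suspect mRNA only, or multiple ESTs
    if len(good_ests) == 1:
        return "tsl3"   # a single EST
    if ests:
        return "tsl4"   # only suspect ESTs
    return "tsl5"       # nothing supports the model structure
```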

Downloads

GENCODE GTF files are available from the GENCODE release 14 site.
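For readers working with the downloaded files, the sketch below parses one GTF record into a dictionary. It relies only on the general GTF layout (nine tab-separated columns, 1-based inclusive coordinates, and a final column of `key value;` pairs); the attribute names and identifiers in the usage example are made up for illustration and are not taken from the release 14 files.

```python
import re

# Matches `key "value";` as well as unquoted values such as `level 2;`.
ATTR_RE = re.compile(r'(\S+) "?([^";]+)"?;')


def parse_gtf_line(line):
    """Parse one GTF data line into a dict; raises ValueError if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    chrom, source, feature, start, end, score, strand, frame, attrs = fields
    attributes = {}
    for key, value in ATTR_RE.findall(attrs):
        # Keys such as tags may repeat, so every value is kept in a list.
        attributes.setdefault(key, []).append(value)
    return {
        "chrom": chrom,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": attributes,
    }


# Usage with a made-up record (identifiers are placeholders):
example = "\t".join([
    "chr1", "HAVANA", "transcript", "1000", "2000", ".", "+", ".",
    'gene_id "GENE0001.1"; transcript_id "TX0001.1"; level 2; tag "basic";',
])
record = parse_gtf_line(example)
print(record["attributes"]["transcript_id"])
```

Header lines beginning with `#` are not handled here and would need to be skipped before calling the parser.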

Verification

Selected transcript models are verified experimentally by RT-PCR amplification followed by sequencing. Those experiments can be found at GEO:

GSE34797:[E-MTAB-684] - Batch IV is based on chromosome 3, 4 and 5 annotations from GENCODE 4 (January 2010).
GSE34820:[E-MTAB-737] - Batch V is based on annotations from GENCODE 6 (November 2010).
GSE34821:[E-MTAB-831] - Batch VI is based on annotations from GENCODE 6 (November 2010) as well as transcript models predicted by the Ensembl Genebuild group based on the Illumina Human BodyMap 2.0 data.

See Harrow et al. (2006) for information on verification techniques.

Release Notes

This GENCODE version 14 corresponds to Ensembl 69 from October 2012 and Vega 49 from September 2012.

Credits

This GENCODE release is the result of a collaborative effort among the following laboratories: (contact: GENCODE at the Sanger Institute)

Lab/Institution	Contributors
GENCODE Principal Investigator	Tim Hubbard
HAVANA manual annotation group, Wellcome Trust Sanger Institute (WTSI), Hinxton, UK	Adam Frankish, Jose Manuel Gonzalez, Mike Kay, Alexandra Bignell, Gloria Despacio-Reyes, Garaub Mukherjee, Gary Sanders, Veronika Boychenko, Jennifer Harrow
Genome Bioinformatics Lab (CRG), Barcelona, Spain	Thomas Derrien, Tyler Alioto, Andrea Tanzer, Roderic Guigó
Genome Bioinformatics, University of California Santa Cruz (UCSC), USA	Rachel Harte, Mark Diekhans, Robert Baertsch, David Haussler
Computational Genomics Lab, Washington University St. Louis (WUSTL), USA	Jeltje van Baren, Charlie Comstock, David Lu, Michael Brent
Computer Science and Artificial Intelligence Lab, Broad Institute of MIT and Harvard, USA	Mike Lin, Manolis Kellis
Computational Biology and Bioinformatics, Yale University (Yale), USA	Philip Cayting, Suganthi Balasubramanian, Baikang Pei, Cristina Sisu, Mark Gerstein
Center for Integrative Genomics, University of Lausanne, Switzerland	Cedric Howald, Alexandre Reymond
Ensembl Genebuild group, Wellcome Trust Sanger Institute (WTSI), Hinxton, UK	Steve Searle, Bronwen Aken, Amonida Zadissa, Daniel Barrell
Structural Computational Biology Group, Centro Nacional de Investigaciones Oncologicas (CNIO), Madrid, Spain	José Manuel Rodríguez, Michael Tress, Alfonso Valencia

References

Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S et al. Ensembl 2011. Nucleic Acids Res. 2011 Jan;39(Database issue):D800-6.

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.

Data Release Policy

GENCODE data are available for use without restrictions.