Note: Released Mar. 31, 2025
Description
These tracks contain pseudogene predictions and their parents as identified by PseudoPipe.
PseudoPipe is a
homology-based computational pipeline that can search a mammalian genome and identify pseudogene
sequences in a comprehensive and consistent manner.
Pseudogenes are genomic sequences that bear similarity to specific protein-coding genes, but are
unable to produce functional proteins due to the existence of frameshifts, premature stop codons, or
other deleterious mutations. They arise from gene duplication or retrotransposition events and are
important resources in understanding the evolutionary history of genes and genomes.
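As a rough illustration of the disabling mutations described above, the following Python sketch scans a coding sequence for premature stop codons and flags a length that is not a multiple of three as a possible frameshift. The function name find_disablements and the example sequences are hypothetical, and this simple scan is only a teaching aid, not the way PseudoPipe detects disablements.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_disablements(cds):
    """Return (premature_stop_offsets, possible_frameshift) for a DNA coding sequence."""
    cds = cds.upper()
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # A stop codon anywhere before the last complete codon is treated as premature.
    premature = [i * 3 for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    # A length not divisible by three suggests an insertion or deletion shifted the frame.
    possible_frameshift = len(cds) % 3 != 0
    return premature, possible_frameshift

intact = "ATGGCCAAATAA"      # ATG GCC AAA TAA
decayed = "ATGTAGGCCAAATA"   # ATG TAG GCC AAA + trailing TA
print(find_disablements(intact))
print(find_disablements(decayed))
```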
Display Conventions
This composite track consists of two subtracks: the Pseudogenes track and the Pseudogene
Parents track. The Pseudogene Parents track displays parent genes labeled with their
HUGO IDs,
which were derived from Ensembl gene IDs provided by the
Gerstein lab after dataset creation. It
includes indicators for pseudogenes, each linked to its corresponding entry in the Pseudogenes
track. The Pseudogenes track shows pseudogenes labeled with their parent HUGO ID and colored
according to pseudogene type. The authors assigned PGOHUMG IDs to genes and PGOHUMT IDs
to transcripts. Note: Not all PseudoPipe IDs could be mapped back to their original Ensembl
IDs. In these cases, the gene ID is listed as NA.
Pseudogene types:
- Unspecified pseudogenes include pseudogenic fragments and protein/chromosome homologies
that have high sequence similarity but are too decayed to be reliably classified as processed or
duplicated.
- Processed pseudogenes (retrotransposed pseudogenes) result from the reverse
transcription of mRNA into DNA, which is then inserted into the genome. These pseudogenes
lack introns, often have small flanking direct repeats, and may retain a 3' polyadenine
tail. PseudoPipe distinguishes them from duplicated pseudogenes by a combination of these
features, with emphasis on the evidence of ancient introns.
- Unprocessed pseudogenes (duplicated pseudogenes) arise from genomic DNA duplication or
unequal crossing-over. They often retain the original exon-intron structures of the
functional genes, although sometimes incompletely.
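The three categories above can be pictured as a small decision rule. The sketch below is a toy rule set written for illustration only: the function name, its boolean inputs, and the 0.5 coverage cutoff are all assumptions, and it does not reproduce PseudoPipe's actual classification logic.

```python
def classify_pseudogene(has_introns, has_polya_tail, has_direct_repeats, coverage):
    """Toy classification from structural features; NOT PseudoPipe's decision logic.

    coverage is the fraction of the parent protein covered by the alignment.
    """
    # Hypothetical cutoff: too little of the parent aligns to judge the structure.
    if coverage < 0.5:
        return "unspecified"
    # Retained exon-intron structure points to a genomic duplication.
    if has_introns:
        return "unprocessed"
    # No introns plus a retrotransposition hallmark points to a processed copy.
    if has_polya_tail or has_direct_repeats:
        return "processed"
    return "unspecified"

print(classify_pseudogene(False, True, False, 0.9))
print(classify_pseudogene(True, False, False, 0.8))
print(classify_pseudogene(False, False, False, 0.3))
```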
Pseudogene Parents track
Each parent gene is shown with its pseudogenes represented as grey blocks.
- purple - parent gene
- grey - pseudogene indicators
If a parent gene has four grey blocks beneath it, this indicates the presence of four pseudogenes
elsewhere in the genome. Hovering over a grey block displays the pseudogene type and its PGOHUMT
ID, along with a link to its corresponding entry in the Pseudogenes track and its genomic position.
Clicking the PGOHUMT ID redirects the genome browser to the pseudogene's locus.
Pseudogenes track
Pseudogenes are colored by type.
- orange - unspecified pseudogene
- blue - unprocessed pseudogene
- olive green - processed pseudogene
Mousing over an item displays the PseudoPipe ID (PGOHUMG), the parent Ensembl gene ID
with a link to the corresponding parent gene location in the Pseudogenes track, and the pseudogene
type.
Methods
The PseudoPipe pipeline identifies pseudogenes through a series of steps. It first uses BLAST to
rapidly cross-reference potential parent proteins against the intergenic regions of the genome. The
resulting raw hits are then processed by removing redundancies, clustering neighboring sequences,
and aligning each cluster with a unique parent gene. Finally, pseudogenes are classified based on a
combination of criteria, including homology, intron-exon structure, and the presence of stop codons
or frameshifts. This method is designed to detect pseudogenes that are unable to be translated into
proteins.
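The redundancy-removal and clustering steps described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the Hit record, the cluster_hits function, and the 60-base max_gap default are hypothetical, and the sketch groups hits only by chromosome, strand, and distance, whereas the real pipeline also aligns each cluster with a unique parent gene.

```python
from collections import namedtuple

# Hypothetical record for one raw homology hit.
Hit = namedtuple("Hit", "chrom strand start end parent")

def cluster_hits(hits, max_gap=60):
    """Drop exact duplicate hits, then group neighbors on the same chromosome and strand."""
    clusters = []
    # set() removes exact duplicates; sorting puts neighboring hits next to each other.
    for hit in sorted(set(hits), key=lambda h: (h.chrom, h.strand, h.start)):
        last = clusters[-1] if clusters else None
        if (last
                and last[-1].chrom == hit.chrom
                and last[-1].strand == hit.strand
                and hit.start - max(h.end for h in last) <= max_gap):
            last.append(hit)          # close enough: extend the current cluster
        else:
            clusters.append([hit])    # otherwise start a new cluster
    return clusters

hits = [
    Hit("chr1", "+", 100, 200, "P1"),
    Hit("chr1", "+", 100, 200, "P1"),   # exact duplicate, removed
    Hit("chr1", "+", 230, 300, "P1"),   # 30 bases after the first hit
    Hit("chr1", "+", 1000, 1100, "P2"),
    Hit("chr1", "-", 150, 250, "P3"),   # opposite strand, kept separate
]
for cluster in cluster_hits(hits):
    print([(h.start, h.end, h.strand) for h in cluster])
```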
These tracks were generated using a Bash script that processes a GTF file with pseudogene
annotations by removing duplicates, correcting overlapping exons, and converting the data to BED
format with pseudoPipeToBed.py. This script extracts gene and transcript IDs, merges overlapping
exons, assigns colors based on pseudogene type, and outputs a BED file with gene and parent
annotations. PseudoPipeParents.py then links pseudogenes to their functional genes by determining
parent gene coordinates, updating pseudogene entries with interactive browser links and generating a
parent BED file. The final data are formatted into pseudoPipePgenes.bb and pseudoPipeParents.bb bigBed
files. The detailed documentation (makeDoc) and
Python scripts are available in our GitHub repository.
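To make the exon-merging and BED conversion steps more concrete, here is a minimal Python sketch. It is not the code from pseudoPipeToBed.py: the function names, the placeholder item name, and the default "0,0,0" color are assumptions made for illustration. It assumes exon coordinates have already been converted from GTF (1-based, inclusive) to BED (0-based, half-open) by subtracting 1 from each start.

```python
def merge_exons(exons):
    """Merge overlapping or touching (start, end) intervals in 0-based, half-open coordinates."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def to_bed12(chrom, name, strand, exons, rgb="0,0,0"):
    """Build one tab-separated BED12 line from a list of exon intervals."""
    blocks = merge_exons(exons)
    chrom_start, chrom_end = blocks[0][0], blocks[-1][1]
    block_sizes = ",".join(str(end - start) for start, end in blocks)
    block_starts = ",".join(str(start - chrom_start) for start, _ in blocks)
    fields = [chrom, chrom_start, chrom_end, name, 0, strand,
              chrom_start, chrom_end, rgb, len(blocks), block_sizes, block_starts]
    return "\t".join(str(field) for field in fields)

# Hypothetical transcript with two overlapping exons and one separate exon.
print(to_bed12("chr1", "PGOHUMT_example", "+", [(100, 200), (150, 300), (400, 500)]))
```

In this sketch the first two exons collapse into a single block, so the record has two blocks rather than three. A per-type color would be passed through the rgb argument.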
Data Access
The raw data can be explored interactively with the
Table Browser or the
Data Integrator.
The data may also be accessed programmatically using our
REST API.
For automated download and analysis, the genome annotation is stored at UCSC in bigBed files
that can be downloaded from the
download server.
Individual regions or the whole genome annotation can be obtained using our tool
bigBedToBed, which can be compiled from the source code or downloaded as a precompiled
binary for your system.
Instructions for downloading source code and binaries can be found
here.
The tool can also be used to obtain only features within a given range, e.g.
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/hg38/pseudogenes/pseudoPipePgenes.bb -chrom=chr21 -start=0 -end=10000000 stdout
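For scripted use, the command above can be wrapped in Python. The sketch below assumes that the bigBedToBed binary is installed and on your PATH; the helper names build_command, parse_bed_line, and fetch_region are hypothetical. Only the first four BED columns are given names, and any remaining columns are kept as raw strings because their layout is not described here.

```python
import subprocess

# URL copied from the command-line example above.
BB_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/hg38/pseudogenes/pseudoPipePgenes.bb"

def build_command(chrom, start, end, url=BB_URL):
    """Assemble the bigBedToBed argument list for one genomic range."""
    return ["bigBedToBed", url, f"-chrom={chrom}", f"-start={start}", f"-end={end}", "stdout"]

def parse_bed_line(line):
    """Split one BED line; columns beyond the name are left unparsed."""
    fields = line.rstrip("\n").split("\t")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "name": fields[3],
        "extra": fields[4:],
    }

def fetch_region(chrom, start, end):
    """Run bigBedToBed (must be installed) and return the parsed records."""
    result = subprocess.run(build_command(chrom, start, end),
                            capture_output=True, text=True, check=True)
    return [parse_bed_line(line) for line in result.stdout.splitlines() if line]

# Example call (requires the binary and network access):
# records = fetch_region("chr21", 0, 10000000)
```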
Credits
Thanks to the Gerstein lab at Yale University for making this data available, and to Cristina
Sisu for providing data in GTF format with parent annotations.
References
Zhang Z, Carriero N, Zheng D, Karro J, Harrison PM, Gerstein M.
PseudoPipe: an automated pseudogene identification pipeline.
Bioinformatics. 2006 Jun 15;22(12):1437-9.
PMID: 16574694