Description
This container track flags sections of the genome that often cause problems or
confusion in analyses. The hg19 genome has a track with the same name, but with
many more subtracks, as the GeT-RM and Genome-in-a-Bottle artifact variants do not exist yet
for hg38, to our knowledge. If you are missing a track here that you know from
hg19 and have an idea how to add it to hg38, do not hesitate to contact us.
Problematic Regions
The Problematic Regions track contains the following subtracks:
- The UCSC Unusual Regions subtrack contains annotations collected at UCSC,
put together from other tracks, our experiences and support email list
requests over the years. For example, it contains the most well-known gene
clusters (IGH, IGL, PAR1/2, TCRA, TCRB, etc.) and annotations for the GRC
fixed sequences, alternate haplotypes, unplaced
contigs, pseudo-autosomal regions, and mitochondria. These loci can yield alignments with
low-quality mapping scores and discordant read pairs, especially for short-read sequencing data.
This data set was manually curated, based on the Genome Browser's
assembly description, the FAQs about assembly, and the
NCBI RefSeq "other" annotations
track data.
- The ENCODE Blacklist subtrack contains a comprehensive set of regions which are troublesome
for high-throughput Next-Generation Sequencing (NGS) aligners. These regions tend to have a very
high ratio of multi-mapping to unique mapping reads and high variance in mappability due to
repetitive elements such as satellite, centromeric and telomeric repeats.
- The GRC Exclusions subtrack contains a set of regions that the GRC has flagged as
containing false duplications or contamination sequences. The GRC has now removed these sequences from
the files that it uses to generate the reference assembly; however, removing the sequences from the
GRCh38/hg38 assembly would trigger the next major release of the human assembly. To
help users recognize these regions and avoid them in their analyses, the GRC has produced a masking
file to be used as a companion to GRCh38, and the BED file is available from the
GenBank FTP site.
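As an illustration of how such regions might be avoided in an analysis, the following is a minimal Python sketch that flags intervals overlapping a set of problematic regions. It assumes the regions have already been exported as BED-style (chrom, start, end) tuples with 0-based, half-open coordinates; the function names build_index and overlaps are hypothetical and not part of any UCSC tool.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(regions):
    """Group BED-style (chrom, start, end) regions by chromosome and merge overlaps.

    Returns a dict: chrom -> (sorted starts, matching ends) of disjoint regions.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlapping or touching: extend the previous region.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps(index, chrom, start, end):
    """Return True if the half-open interval [start, end) overlaps an indexed region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged region that begins before the query ends.
    i = bisect_left(starts, end) - 1
    return i >= 0 and ends[i] > start


# Made-up coordinates for illustration only, not real track data.
example_regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
example_index = build_index(example_regions)
```

Merging the regions first keeps each lookup to a single binary search per query interval, which matters when many reads or variants are checked against a large subtrack.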
Highly Reproducible Regions
The Highly Reproducible Regions track highlights regions and variants
from eight samples that can be used to assess variant detection pipelines. The
"Highly Reproducible Regions" subtrack comprises the intersection of the reproducible
regions across all eight samples, while the "Variants" subtracks contain the reproducible
variants from each assayed sample. Both tracks contain data from the following samples:
- a Chinese Quartet, samples CQ-5, CQ-6, CQ-7, CQ-8
- a HapMap Trio, samples NA10385, NA12248, NA12249
- a Genome in a Bottle sample, NA12878
Please refer to the Pan et al. reference for more information on how
these regions were defined.
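To make the idea of an intersection across samples concrete, here is a minimal Python sketch that intersects per-sample interval lists for a single chromosome. It is not the procedure used by Pan et al.; it only assumes each sample's regions are given as sorted, non-overlapping, half-open (start, end) tuples, and the function names are hypothetical.

```python
from functools import reduce


def intersect_two(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(per_sample):
    """Intersect a non-empty list of per-sample interval lists."""
    return reduce(intersect_two, per_sample)


# Made-up intervals for three hypothetical samples on one chromosome.
example_samples = [
    [(0, 100), (200, 300)],
    [(50, 250)],
    [(60, 80), (90, 220)],
]
shared = intersect_all(example_samples)
```

Only positions covered in every sample survive, so adding a sample can shrink the result but never grow it.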
GIAB Problematic Regions
The Genome in a Bottle (GIAB) Problematic Regions tracks provide stratifications of the
genome to evaluate variant calls in complex regions. They are designed for use with Global Alliance
for Genomics and Health (GA4GH) benchmarking tools like
hap.py
and include regions with low complexity, segmental duplications, functional regions,
and difficult-to-sequence areas. Developed in collaboration with GA4GH, the
Genome in a Bottle (GIAB) consortium, and the
Telomere-to-Telomere Consortium (T2T), the dataset aims to standardize the
analysis of genetic variation by offering pre-defined BED files for stratifying true and false
positives in genomic studies, facilitating accurate assessments in complex areas of the genome.
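The stratification idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how hap.py works internally: calls are assumed to be already labelled "TP" or "FP" by a benchmarking tool, strata are assumed to be lists of BED-style (chrom, start, end) tuples with 0-based, half-open coordinates, and the stratum names and function names are hypothetical.

```python
from collections import Counter


def in_stratum(regions, chrom, pos):
    """Return True if the 0-based position lies inside any half-open region."""
    return any(c == chrom and s <= pos < e for c, s, e in regions)


def stratified_precision(calls, strata):
    """Compute precision (TP / (TP + FP)) separately for each stratum.

    calls:  list of (chrom, pos, label) with label "TP" or "FP"
    strata: dict mapping stratum name -> list of (chrom, start, end)
    Returns a dict mapping stratum name -> precision, or None if no calls fall in it.
    """
    result = {}
    for name, regions in strata.items():
        counts = Counter(
            label for chrom, pos, label in calls if in_stratum(regions, chrom, pos)
        )
        total = counts["TP"] + counts["FP"]
        result[name] = counts["TP"] / total if total else None
    return result


# Made-up calls and strata for illustration only.
example_calls = [
    ("chr1", 10, "TP"),
    ("chr1", 20, "FP"),
    ("chr1", 500, "TP"),
    ("chr2", 5, "TP"),
]
example_strata = {
    "low_complexity": [("chr1", 0, 100)],
    "segmental_duplications": [("chr2", 0, 10), ("chr1", 400, 600)],
    "no_calls_here": [("chr3", 0, 10)],
}
precision_by_stratum = stratified_precision(example_calls, example_strata)
```

The linear scan over regions is deliberately simple; for genome-scale BED files an indexed lookup or a dedicated benchmarking tool would be the more realistic choice.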
The GIAB Problematic Regions tracks were created with a pipeline and configuration that
generate stratification BED files categorizing genomic regions by specific challenges,
such as low complexity or difficult mapping, to facilitate accurate benchmarking of variant calls.
For more information on the pipeline and configuration used, please visit the following webpage:
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.5/README.md.
If you have questions or comments, please write to Justin Zook (jzook@nist.gov).
Display Conventions and Configuration
Each track contains a set of regions of varying length with no special configuration options.
The UCSC Unusual Regions track has a mouse-over description; all other tracks have at most
a name field, which can be shown in pack mode. The tracks are usually kept in dense mode.
The Hide empty subtracks control hides subtracks with no data in the browser window.
Changing the browser window by zooming or scrolling may result in the display of a different
selection of tracks.
Data access
The raw data can be explored interactively with the Table Browser
or the Data Integrator.
For automated download and analysis, the genome annotation is stored in bigBed files that
can be downloaded from
our download server.
Individual
regions or the whole genome annotation can be obtained using our tool bigBedToBed,
which can be compiled from the source code or downloaded as a precompiled
binary for your system. Instructions for downloading source code and binaries can be found
here.
The tool
can also be used to obtain only features within a given range, e.g.:
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/problematic/comments.bb -chrom=chr21 -start=0 -end=100000000 stdout
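For scripted access, the same command can be wrapped in Python. The sketch below assumes that bigBedToBed is installed and on the PATH and that the machine has network access; the helper names build_command, parse_bed and fetch_regions are hypothetical, and only the first three BED columns are interpreted because the remaining columns differ between subtracks.

```python
import subprocess

# URL taken from the example command above.
PROBLEMATIC_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/problematic/comments.bb"


def build_command(url, chrom, start, end):
    """Build the bigBedToBed argument list for one genomic range."""
    return [
        "bigBedToBed",
        url,
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        "stdout",
    ]


def parse_bed(text):
    """Parse tab-separated BED text into (chrom, start, end, extra_fields) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        records.append((fields[0], int(fields[1]), int(fields[2]), fields[3:]))
    return records


def fetch_regions(url, chrom, start, end):
    """Run bigBedToBed and return the parsed records.

    Requires the bigBedToBed binary on the PATH; raises CalledProcessError on failure.
    """
    proc = subprocess.run(
        build_command(url, chrom, start, end),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_bed(proc.stdout)
```

A call such as fetch_regions(PROBLEMATIC_URL, "chr21", 0, 100000000) would then return the same features as the command line above, as Python tuples.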
Methods
Files were downloaded from the respective databases and converted to bigBed format.
The procedure is documented in our
hg38 makeDoc file.
Credits
Thanks to Anna Benet-Pagès, Max Haeussler, Angie Hinrichs, Daniel Schmelter, and Jairo
Navarro at the UCSC Genome Browser for planning, building, and testing these tracks. The
underlying data comes from the
ENCODE Blacklist, and some parts were copied manually from the HGNC and NCBI
RefSeq tracks.
References
Amemiya HM, Kundaje A, Boyle AP.
The ENCODE Blacklist: Identification of Problematic Regions of the Genome.
Sci Rep. 2019 Jun 27;9(1):9354.
PMID: 31249361; PMC: PMC6597582
Dwarshuis N, Kalra D, McDaniel J, Sanio P, Alvarez Jerez P, Jadhav B, Huang WE, Mondal R, Busby B,
Olson ND et al.
The GIAB genomic stratifications resource for human reference genomes.
Nat Commun. 2024 Oct 19;15(1):9029.
PMID: 39424793; PMC: PMC11489684
Krusche P, Trigg L, Boutros PC, Mason CE, De La Vega FM, Moore BL, Gonzalez-Porta M, Eberle MA,
Tezak Z, Lababidi S et al.
Best practices for benchmarking germline small-variant calls in human genomes.
Nat Biotechnol. 2019 May;37(5):555-560.
PMID: 30858580; PMC: PMC6699627
Pan B, Ren L, Onuchic V, Guan M, Kusko R, Bruinsma S, Trigg L, Scherer A, Ning B, Zhang C et
al.
Assessing reproducibility of inherited variants detected with short-read whole genome
sequencing.
Genome Biol. 2022 Jan 3;23(1):2.
PMID: 34980216; PMC: PMC8722114