Description
These tracks show high-confidence "Platinum Genome" variant calls for two individuals,
NA12877 and NA12878, members of a sequenced 17-member pedigree for family number
1463, from the Centre d'Etude du Polymorphisme Humain (CEPH). The hybrid
track displays a merging of the NA12878 results with variant calls produced by Genome in a
Bottle, discussed further below. CEPH is an international genetic research center that provides
a resource of immortalized cell cultures used to map genetic markers. Pedigree 1463
represents a family lineage from Utah of four grandparents, two parents, and 11 children.
The whole pedigree was sequenced to 50x depth on an Illumina HiSeq 2000 system. The result is
considered a platinum standard, where "platinum" refers to the quality and completeness of
the resulting assembly, such as full chromosome scaffolds with phasing and
haplotypes resolved across the entire genome.
This figure depicts the pedigree of the family sequenced for this study, where the ID for each
sample is defined by adding the prefix NA128 to each numbered individual, so that 77 = NA12877
and 78 = NA12878, corresponding to the VCF tracks available in this track set. Individuals shown
in dark orange are those whose sequences were used in the analysis methods, whereas those in blue
are the founder generation (grandparents), who were also sequenced and used in validation steps.
The genomes of the parent-child trio on the top right side, 91-92-78, were also sequenced
during Phase I of the 1000 Genomes Project.
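The naming rule described above can be sketched in a few lines of Python. This is only an illustration: the function name sample_id and its prefix parameter are hypothetical and are not part of any track or browser tooling.

```python
def sample_id(member: int, prefix: str = "NA128") -> str:
    """Build a sample ID from the two-digit number shown in the pedigree figure.

    Hypothetical helper for illustration; the prefix rule comes from the
    figure description (77 -> NA12877, 78 -> NA12878).
    """
    if not 0 <= member <= 99:
        raise ValueError("expected a two-digit pedigree number")
    return f"{prefix}{member:02d}"


# The two individuals shown in the VCF tracks.
track_samples = [sample_id(n) for n in (77, 78)]

# The parent-child trio labeled 91-92-78 in the figure.
trio = [sample_id(n) for n in (91, 92, 78)]
```

Under this rule, track_samples holds NA12877 and NA12878, and trio holds the three IDs for the trio on the top right side of the figure.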
These tracks represent a comprehensive genome-wide set of phased small variants that have been
validated to high confidence. Sequencing and phasing a larger pedigree, beyond the two parents
and one child, increases the ability to detect errors and assess the accuracy of more of the
variants compared to a standard trio analysis. The genetic inheritance data make it possible to build a more
comprehensive catalog of "platinum variants" that reflects both high accuracy and
completeness. These results are significant because a comprehensive set of valid
single-nucleotide variants (SNVs) and insertions and deletions (indels),
in both the easy and difficult parts of the genome, provides a vital resource for software
developers creating the next generation of variant callers; these are the areas where
current methods most need training data to improve. Since every one of the
variants in this catalog is phased, this data set provides a resource to better assess emerging
technologies designed to generate valid phasing information. To generate the calls, six analysis
pipelines were used to call SNVs and indels, and their results were merged into one catalog. The
sensitivity of the genetic inheritance analysis helped detect genotyping errors and maximize the
chance of including true variants that might otherwise be removed by suboptimal filtering. The
referenced paper describes the methods in detail, along with this variant catalog
of 4.7 million SNVs plus 0.7 million small (1-50 bp) indels, all of which are consistent with
the pattern of inheritance in the parents and 11 children of this pedigree.
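To illustrate what "consistent with the pattern of inheritance" means at a single site, the following is a minimal sketch of a per-site Mendelian consistency check. It is a simplification under stated assumptions, not the method used to build the catalog, which is described in the referenced paper. The function names and the genotype encoding (a pair of allele indices, 0 for reference and 1 for alternate) are assumptions made for this example.

```python
from itertools import product


def mendelian_consistent(mother, father, child):
    """Return True if the child's genotype can be formed by taking
    one allele from the mother and one from the father.

    Genotypes are unphased pairs of allele indices, e.g. (0, 1).
    """
    return any(
        sorted((m, f)) == sorted(child)
        for m, f in product(mother, father)
    )


def all_children_consistent(mother, father, children):
    """Check a site against every child; with many children a genotyping
    error in any one sample is more likely to produce a conflict."""
    return all(mendelian_consistent(mother, father, c) for c in children)


# Made-up genotypes for one site: a heterozygous mother and a
# homozygous-reference father can only have 0/0 or 0/1 children.
site_ok = all_children_consistent((0, 1), (0, 0), [(0, 0), (0, 1), (1, 0)])
site_bad = all_children_consistent((0, 1), (0, 0), [(0, 0), (1, 1)])
```

In this made-up example, site_ok is True, while site_bad is False because a 1/1 child cannot inherit an alternate allele from a 0/0 father. Checking each site against many children rather than one is the reason a larger pedigree exposes more errors than a single trio.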
The hybrid track in this set extends the characterization of NA12878
by incorporating high-confidence calls produced by the Genome in a Bottle analysis.
The resulting merged files contain more comprehensive coverage of variation than either
set independently. For instance, the hg19 version contains over 80,000 more indels than
either input set. Read more about the hybrid methods at the following link:
https://github.com/Illumina/PlatinumGenomes/wiki/Hybrid-truthset
Data Access
The VCF files for this track can be obtained from the download server:
https://hgdownload.soe.ucsc.edu/gbdb/hg19/platinumGenomes/.
These files were obtained from the Platinum Genomes source archive:
https://s3.eu-central-1.amazonaws.com/platinum-genomes/2017-1.0/ReleaseNotes.txt.
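As a rough guide to working with the downloaded files, the sketch below parses one single-sample VCF data line using only the Python standard library and labels the variant as an SNV or a small (1-50 bp) indel. The example line is made up for illustration and is not taken from the Platinum Genomes files. The function names are hypothetical, and a dedicated VCF library would be a better choice for real analyses.

```python
BASES = set("ACGTN")


def parse_vcf_line(line):
    """Parse one tab-separated VCF data line with a single sample column."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # Column 9 names the per-sample keys; column 10 holds their values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    gt = sample["GT"]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alts": alt.split(","),
        "gt": gt,
        # In VCF, "|" separates phased alleles and "/" unphased ones.
        "phased": "|" in gt,
    }


def classify(ref, alt):
    """Label a REF/ALT pair as 'SNV', 'indel' (1-50 bp), or 'other'."""
    if not (set(ref.upper()) <= BASES and set(alt.upper()) <= BASES):
        return "other"  # symbolic or otherwise non-sequence alleles
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if 1 <= abs(len(ref) - len(alt)) <= 50:
        return "indel"
    return "other"


# Made-up record: a phased heterozygous 1 bp insertion.
example = "chr1\t1000\t.\tA\tAT\t.\tPASS\t.\tGT\t0|1"
record = parse_vcf_line(example)
kinds = [classify(record["ref"], alt) for alt in record["alts"]]
```

Header lines (those starting with "#") would need to be skipped before calling parse_vcf_line on a real file.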
Reference
Eberle MA, Fritzilas E, Krusche P, Källberg M, Moore BL, Bekritsky MA, Iqbal Z, Chuang HY,
Humphray SJ, Halpern AL et al.
A reference data set of 5.4 million phased human variants validated by genetic inheritance from
sequencing a three-generation 17-member pedigree.
Genome Res. 2017 Jan;27(1):157-164.
PMID: 27903644; PMC: PMC5204340