Description
This track shows single nucleotide variants (SNVs) from the
Mouse Genomes Project.
Display Conventions
In "dense" mode, a vertical line is drawn at the position of each
variant.
In "pack" mode, since these variants have been phased, the
display shows a clustering of haplotypes in the viewed range, sorted
by similarity of alleles weighted by proximity to a central variant.
The clustering view can highlight local patterns of linkage.
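As a rough illustration of "similarity of alleles weighted by proximity to a central variant", the sketch below scores two haplotypes by summing a weight for every position where their alleles differ, with the weight shrinking as the position moves away from the central variant. The function name `weighted_distance` and the weight `1 / (1 + distance)` are assumptions made for this example only; the browser's actual weighting scheme is not described here.

```python
from typing import Sequence


def weighted_distance(hap_a: Sequence[int], hap_b: Sequence[int], center: int) -> float:
    """Sum of proximity weights over positions where two haplotypes differ.

    hap_a, hap_b: alleles (0 = reference, 1 = non-reference) at the same variants.
    center: index of the central variant within those lists.
    The weight 1 / (1 + |index - center|) is a hypothetical choice.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same variants")
    total = 0.0
    for index, (a, b) in enumerate(zip(hap_a, hap_b)):
        if a != b:
            total += 1.0 / (1 + abs(index - center))
    return total
```

Under this scheme, a mismatch at the central variant counts fully, while a mismatch one variant away counts half as much, so haplotypes that agree near the center end up close together in the sort order.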
In the clustering display, each sample's phased diploid genotype is split
into two independent haplotypes.
Each haplotype is placed in a horizontal row of pixels; when the number of
haplotypes exceeds the number of vertical pixels for the track, multiple
haplotypes fall in the same pixel row and pixels are averaged across haplotypes.
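The two steps just described, splitting phased genotypes and averaging haplotypes that share a pixel row, can be sketched as follows. This is a minimal illustration and not the browser's drawing code: the function names and the rule for assigning haplotypes to rows (`index * n_rows // n_haplotypes`) are assumptions, and alleles are simplified to 0 (reference) and 1 (non-reference).

```python
from typing import List, Tuple


def split_phased_genotypes(genotypes: List[str]) -> Tuple[List[int], List[int]]:
    """Split one sample's phased genotypes (e.g. '0|1') into two haplotypes."""
    hap_a, hap_b = [], []
    for gt in genotypes:
        if "|" not in gt:
            raise ValueError(f"genotype {gt!r} is not phased")
        first, second = gt.split("|")
        # Treat any non-zero allele index as non-reference.
        hap_a.append(0 if first == "0" else 1)
        hap_b.append(0 if second == "0" else 1)
    return hap_a, hap_b


def average_into_rows(haplotypes: List[List[int]], n_rows: int) -> List[List[float]]:
    """Map haplotypes onto n_rows pixel rows, averaging those that share a row."""
    n_haps = len(haplotypes)
    if n_haps <= n_rows:
        # Enough rows: each haplotype keeps its own row.
        return [[float(a) for a in hap] for hap in haplotypes]
    n_vars = len(haplotypes[0])
    sums = [[0.0] * n_vars for _ in range(n_rows)]
    counts = [0] * n_rows
    for index, hap in enumerate(haplotypes):
        row = index * n_rows // n_haps  # hypothetical row assignment
        counts[row] += 1
        for j, allele in enumerate(hap):
            sums[row][j] += allele
    return [[value / counts[r] for value in sums[r]] for r in range(n_rows)]
```

An averaged value between 0 and 1 would correspond to a shade of gray between the white reference allele and the black non-reference allele.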
Each variant is a vertical bar with white (invisible) representing the reference allele
and black representing the non-reference allele(s).
Tick marks are drawn at the top and bottom of each variant's vertical bar
to make the bar more visible when most alleles are reference alleles.
The vertical bar for the central variant used in clustering is outlined in purple.
In order to avoid long compute times, the range of alleles used in clustering
may be limited; alleles used in clustering have purple tick marks at the
top and bottom.
The clustering tree is displayed to the left of the main image.
It does not represent relatedness of individuals; it simply shows the arrangement
of local haplotypes by similarity. When a rightmost branch is purple, it means
that all haplotypes in that branch are identical, at least within the range of
variants used in clustering.
Methods
Listed below are the strain names as they appear in the VCF header, the full
strain name, the sex of the samples sequenced, and the approximate sequence
fold-coverage of the genome, based on the number of read bases mapped to
the reference genome (excluding reads marked as PCR duplicates).
VCF header name | strain name | sex | sequence fold-coverage |
129P2_OlaHsd | (129P2/OlaHsd) | F | 52 |
129S1_SvImJ | (129S1/SvImJ) | F | 68 |
129S5SvEvBrd | (129S5SvEvBrd) | F | 22 |
A_J | (A/J) | F | 52 |
AKR_J | (AKR/J) | F | 57 |
BALB_cJ | (BALB/cJ) | F | 62 |
BTBR_T+_Itpr3tf_J | (BTBR T+ Itpr3tf/J) | M | 85 |
BUB_BnJ | (BUB/BnJ) | M | 49 |
C3H_HeH | (C3H/HeH) | F | 14 |
C3H_HeJ | (C3H/HeJ) | F | 63 |
C57BL_10J | (C57BL/10J) | M | 37 |
C57BL_6NJ | (C57BL/6NJ) | F | 61 |
C57BR_cdJ | (C57BR/cdJ) | M | 51 |
C57L_J | (C57L/J) | M | 64 |
C58_J | (C58/J) | M | 55 |
CAST_EiJ | (CAST/EiJ) | F | 53 |
CBA_J | (CBA/J) | F | 56 |
DBA_1J | (DBA/1J) | M | 49 |
DBA_2J | (DBA/2J) | F | 56 |
FVB_NJ | (FVB/NJ) | F | 73 |
I_LnJ | (I/LnJ) | M | 45 |
KK_HiJ | (KK/HiJ) | M | 55 |
LEWES_EiJ | (LEWES/EiJ) | F | 19 |
LP_J | (LP/J) | F | 54 |
MOLF_EiJ | (MOLF/EiJ) | M | 40 |
NOD_ShiLtJ | (NOD/ShiLtJ) | F | 66 |
NZB_B1NJ | (NZB/B1NJ) | M | 47 |
NZO_HlLtJ | (NZO/HlLtJ) | F | 72 |
NZW_LacJ | (NZW/LacJ) | M | 58 |
PWK_PhJ | (PWK/PhJ) | F | 53 |
RF_J | (RF/J) | M | 54 |
SEA_GnJ | (SEA/GnJ) | M | 49 |
SPRET_EiJ | (SPRET/EiJ) | F | 67 |
ST_bJ | (ST/bJ) | M | 81 |
WSB_EiJ | (WSB/EiJ) | F | 51 |
ZALENDE_EiJ | (ZALENDE/EiJ) | M | 19 |
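The fold-coverage column in the table above is described as the number of mapped read bases (excluding PCR duplicates) relative to the genome. A minimal sketch of that arithmetic is shown below; the function name and the example numbers are invented for illustration, and the exact genome length used by the Mouse Genomes Project is not stated here.

```python
def fold_coverage(mapped_bases: int, genome_length: int) -> float:
    """Approximate fold-coverage: mapped, non-duplicate read bases per genome base."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return mapped_bases / genome_length


# Hypothetical numbers: 135 Gb of mapped bases over a 2.7 Gb genome.
example = fold_coverage(135_000_000_000, 2_700_000_000)
```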
All SNP and indel calls are relative to the reference mouse
genome C57BL/6J (GRCm38/mm10). The reference genome used for the
alignment can be found here: ftp-mouse.sanger.ac.uk/ref/.
Gene models from Ensembl release 78 were used to predict the functional
consequences of the SNPs and indels.
SNPs and indels are annotated with rs IDs from dbSNP Build 142. The dbSNP
data was downloaded from:
ftp.ncbi.nlm.nih.gov/snp/organisms/mouse_10090/VCF/
and the 'vcf-annotate' Perl utility from the VCFtools package
(Danecek et al., 2011) was used to add the rs IDs to calls in this release
(see below for VCFtools information).
For SNPs, the position, reference allele and alternative alleles were all
compared:
e.g.
vcf-annotate -c CHROM,POS,ID,REF,ALT
For indels, only the positions were matched:
e.g.
vcf-annotate -c CHROM,POS,ID
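The difference between the two matching rules can be illustrated with a small sketch: SNPs are looked up by chromosome, position, reference allele and alternative allele, while indels are looked up by chromosome and position only. This is not how vcf-annotate is implemented; the record layout (plain dictionaries), the function names, and the simplified "different allele lengths means indel" test are all assumptions for the example.

```python
from typing import Dict, List, Tuple


def match_key(record: Dict) -> Tuple:
    """Build the lookup key for a variant record.

    SNPs: (chrom, pos, ref, alt). Indels: (chrom, pos) only.
    A record counts as an indel here when ref and alt differ in length.
    """
    chrom, pos = record["chrom"], record["pos"]
    ref, alt = record["ref"], record["alt"]
    if len(ref) != len(alt):
        return (chrom, pos)
    return (chrom, pos, ref, alt)


def annotate_with_rs_ids(calls: List[Dict], dbsnp_records: List[Dict]) -> List[Dict]:
    """Return copies of calls with an 'id' field ('.' when no dbSNP match)."""
    lookup = {match_key(rec): rec["id"] for rec in dbsnp_records}
    return [dict(call, id=lookup.get(match_key(call), ".")) for call in calls]
```

With position-only matching, an indel call can pick up an rs ID even when its alleles are written differently from the dbSNP record, whereas a SNP call with a different alternative allele at the same position stays unannotated.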
Sequencing was performed using the Illumina HiSeq platform. All reads are
100 bp paired-end reads except for strains 129P2 and 129S5, for which the
sequence data included reads of 75 bp or less. Also, a small amount of
the sequence data for MOLF_EiJ is single-end sequencing.
In version 3, all variant data was obtained from sequencing of female mice
only. In version 4, 10 new strains were included, for which all data was
obtained from sequencing of male mice. The data for the additional 8 strains
included in this release (version 5) was obtained from sequencing of male
mice for 5 strains and female mice for the remaining 3 strains. As such,
the SNP and indel VCF files contain calls on chromosomes 1-19, MT, X and Y.
The BAM files used to call SNPs and indels are located here:
ftp-mouse.sanger.ac.uk/REL-1502-BAM/.
Reads were aligned to the reference genome (GRCm38/mm10) using BWA-MEM v0.7.5-r406
(Li and Durbin, 2009; Li, 2013).
Reads were realigned around indels using GATK realignment tool v3.0.0
(McKenna et al., 2010) with default parameters.
SNP and indel discovery was performed with SAMtools mpileup v1.1 with parameters:
samtools mpileup -t DP,DV,DP4,SP,DPR,INFO/DPR -E -Q 0 -pm3 -F0.25 -d500
and calling was performed with BCFtools call v1.1 with parameters:
bcftools call -mv -f GQ,GP -p 0.99
Indels were then left-aligned and normalized using bcftools norm v1.1 with
parameters:
bcftools norm -D -s -m+indels
The vcf-annotate function in the VCFtools package was used to soft-filter the
SNP and indel calls.
The Variant Effect Predictor software from Ensembl (McLaren et al., 2010)
was used to predict the functional consequences of SNPs and indels queried
against Ensembl release 78 mouse gene models.
Definitions of consequence types can be found here:
http://www.ensembl.org/info/genome/variation/predicted_data.html#consequences.
SNP calling was performed for each strain independently. These strain-specific
VCF files can be found on the FTP site:
ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/strain_specific_vcfs/.
A single list of all polymorphic sites across the genome was then produced
from all of the 36 strains' SNP calls. This list was then used to call
SNPs again, this time across all 36 strains simultaneously, using the
'samtools mpileup -l' option. The calls from all 36 strains were then
merged into a single VCF file. All strain specific information was retained
in the sample columns for each strain. For indels, the same approach was
taken with the addition of the indel normalisation step after the
initial variant calling. The merged SNP VCF and indel VCF for version 5 can
be found here: ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/.
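The first half of this two-pass approach, pooling each strain's independent calls into one list of polymorphic sites, can be sketched as below. The function names and the in-memory data layout are assumptions for illustration; the tab-separated chromosome/position output is meant to resemble the kind of position list that would be passed to the second calling pass, not an exact reproduction of the project's files.

```python
from typing import Dict, Iterable, List, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position)


def union_of_sites(per_strain_sites: Dict[str, Iterable[Site]]) -> List[Site]:
    """Merge per-strain variant sites into one sorted, de-duplicated list."""
    all_sites = set()
    for sites in per_strain_sites.values():
        all_sites.update(sites)
    return sorted(all_sites)


def format_positions(sites: List[Site]) -> str:
    """Render sites as tab-separated 'chrom<TAB>pos' lines, one per site."""
    return "".join(f"{chrom}\t{pos}\n" for chrom, pos in sites)
```

Re-calling every strain at the pooled sites means a strain that matches the reference at a given site still gets an explicit genotype there, rather than simply being absent from the merged file.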
Information regarding the filtering of SNP and indel calls can be found
in the VCF file headers in the '##FILTER' and '##source' lines.
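One way to pull those header lines out of a downloaded VCF file is sketched below. The function name `read_header_lines` is hypothetical, and the sketch assumes a plain-text or gzip-compressed VCF on local disk.

```python
import gzip
from typing import List, Tuple


def read_header_lines(path: str,
                      prefixes: Tuple[str, ...] = ("##FILTER", "##source")) -> List[str]:
    """Return the header lines of a VCF file that start with the given prefixes."""
    opener = gzip.open if path.endswith(".gz") else open
    kept = []
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # header is over once the first data line appears
            if line.startswith(prefixes):
                kept.append(line.rstrip("\n"))
    return kept
```

Because the loop stops at the first data line, only the header is read even for very large files.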
Credits
Thanks to the
Mouse Genomes Project for supplying the data for this track.
See also: ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/README.
References
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT,
Sherry ST et al.
The variant call format and VCFtools.
Bioinformatics. 2011 Aug 1;27(15):2156-8.
PMID: 21653522; PMC: PMC3137218
Li H.
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
Preprint at http://arxiv.org/pdf/1303.3997v2.pdf. 2013.
Li H, Durbin R.
Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics. 2009 Jul 15;25(14):1754-60.
PMID: 19451168; PMC: PMC2705234
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome
Project Data Processing Subgroup.
The Sequence Alignment/Map format and SAMtools.
Bioinformatics. 2009 Aug 15;25(16):2078-9.
PMID: 19505943; PMC: PMC2723002
Li H.
A statistical framework for SNP calling, mutation discovery, association mapping and population
genetical parameter estimation from sequencing data.
Bioinformatics. 2011 Nov 1;27(21):2987-93.
PMID: 21903627; PMC: PMC3198575
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D,
Gabriel S, Daly M et al.
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing
data.
Genome Res. 2010 Sep;20(9):1297-303.
PMID: 20644199; PMC: PMC2928508
McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F.
Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor.
Bioinformatics. 2010 Aug 15;26(16):2069-70.
PMID: 20562413; PMC: PMC2916720