1000 Genomes Project dataset description

The data from the 1000 Genomes project is a very large catalog of human genetic variants. The project aims to determine genetic variants with frequencies higher than 1% in the populations studied. The data has been made openly available and freely accessible through public data repositories to scientists worldwide. Also, the data from the 1000 Genomes project is widely used to screen variants discovered in exome data from individuals with genetic disorders and in cancer genome projects.

The genotype dataset, in Variant Call Format (VCF), lists the human individuals (that is, samples) and their genetic variants, together with the global allele frequencies and the allele frequencies for each super-population. The data also denotes each sample's population region, which is used as the predicted category in our approach. Data for specific chromosomes (in VCF format) may carry additional information denoting the super-population of the sample or the sequencing platform used. For multiallelic variants, the allele frequency (AF) of each alternative allele is presented in a comma-separated list. The following record is biallelic (a single alternative allele, G), so each field holds a single value:

1 15211 rs78601809 T G 100 PASS AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP

The AF is the quotient of the allele count (AC) and the allele number (AN); for the record above, 3050/5008 ≈ 0.609026. NS is the total number of samples with data, whereas a prefixed field such as EAS_AF denotes the AF within a specific super-population.
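The relationship between AC, AN, and AF can be checked directly on the example record. The following is a minimal Python sketch, not a full VCF parser: the helper names parse_info and allele_frequencies are illustrative assumptions, and the record is split on whitespace as it is printed here (real VCF files separate columns with tabs).

```python
# Example record from the text, joined onto one line. Columns are separated
# by spaces here; real VCF files use tabs.
RECORD = (
    "1 15211 rs78601809 T G 100 PASS "
    "AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;"
    "EAS_AF=0.504;AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;"
    "SAS_AF=0.6401;AA=t|||;VT=SNP"
)


def parse_info(info):
    """Turn a 'KEY=value;KEY=value' INFO string into a dict of strings."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def allele_frequencies(info_fields):
    """Return AC / AN for every alternative allele.

    AC is split on commas so that multiallelic records, which list one
    count per alternative allele, are handled as well.
    """
    allele_number = int(info_fields["AN"])
    return [int(count) / allele_number for count in info_fields["AC"].split(",")]


# The INFO column is the eighth column (index 7) of a VCF record.
info = parse_info(RECORD.split()[7])
computed_af = allele_frequencies(info)

# Collect the per-super-population frequencies (EAS_AF, AMR_AF, and so on).
super_population_af = {
    key[:-3]: float(value) for key, value in info.items() if key.endswith("_AF")
}

print(f"computed AF = {computed_af[0]:.6f}, reported AF = {info['AF']}")
print(super_population_af)
```

Note that AN (5008) is twice NS (2504) in this record, which is what one would expect for diploid samples that all have data at this site.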

The 1000 Genomes Project started in 2008, and the consortium consisted of more than 400 life scientists. Phase 3 finished in September 2014, covering 2,504 individuals from 26 populations (that is, ethnic backgrounds) in total. Over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants) have been identified and phased onto high-quality haplotypes.

In short, 99.9% of the variants consist of SNPs and short indels. Variants that did not pass quality control (among them SNPs, indels, deletions, complex short substitutions, and other structural variant classes) have been removed. As a result, the third phase release leaves 84.4 million variants.

Each of the 26 populations has about 60-100 individuals from Europe, Africa, America (South and North), and Asia (South and East). The population samples are grouped into super-population groups according to their predominant ancestry: East Asian (CHB, JPT, CHS, CDX, and KHV), European (CEU, TSI, FIN, GBR, and IBS), African (YRI, LWK, GWD, MSL, ESN, ASW, and ACB), American (MXL, PUR, CLM, and PEL), and South Asian (GIH, PJL, BEB, STU, and ITU). For details, refer to Figure 1:

Figure 1: Geographic ethnic groups from 1000 Genomes project's release 3 (source http://www.internationalgenome.org/)
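The grouping of population codes into super-populations can be written down as a small lookup table. The sketch below only restates the lists given in the text; the names SUPER_POPULATIONS, POPULATION_TO_SUPER, and super_population_of are illustrative assumptions, and the three-letter super-population keys follow the prefixes of the EAS_AF, EUR_AF, AFR_AF, AMR_AF, and SAS_AF fields.

```python
# Super-population code -> population codes, as listed in the text.
SUPER_POPULATIONS = {
    "EAS": ["CHB", "JPT", "CHS", "CDX", "KHV"],
    "EUR": ["CEU", "TSI", "FIN", "GBR", "IBS"],
    "AFR": ["YRI", "LWK", "GWD", "MSL", "ESN", "ASW", "ACB"],
    "AMR": ["MXL", "PUR", "CLM", "PEL"],
    "SAS": ["GIH", "PJL", "BEB", "STU", "ITU"],
}

# Inverted view: population code -> super-population code.
POPULATION_TO_SUPER = {
    population: super_population
    for super_population, populations in SUPER_POPULATIONS.items()
    for population in populations
}


def super_population_of(population_code):
    """Look up the super-population for a population code (case-insensitive)."""
    return POPULATION_TO_SUPER[population_code.upper()]


print(len(POPULATION_TO_SUPER), "populations")
print("GBR belongs to", super_population_of("GBR"))
```

A lookup of this kind is convenient when the population code of a sample has to be reduced to one of the five super-population categories.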

The released datasets provide the data for 2,504 healthy adults (18 years and older, third project phase); only reads with at least 70 base pairs (bp) have been used until more advanced solutions are available. All genomic data from all samples were combined to attribute all variants to a region. However, note that specific haplotypes may not occur in the genomes of a particular region; that is, the multi-sample approach allows attributing variants to an individual's genotype even if the variants are not covered by sequencing reads from that sample.

In other words, overlapping reads are provided and the single sample genomes have not necessarily been consolidated. All individuals were sequenced using both of these:

  • Whole-genome sequencing (mean depth = 7.4x, where x is the number of reads, on average, that are likely to be aligned at a given reference bp)
  • Targeted exome sequencing (mean depth = 65.7x)
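To make the notion of mean depth concrete, the following sketch counts how many reads cover each position of a reference and averages the counts. It is a toy illustration under stated assumptions: the alignments are hypothetical (start, end) intervals on a 100 bp reference, not project data, and the function names per_base_depth and mean_depth are made up for this example.

```python
def per_base_depth(alignments, reference_length):
    """Count how many reads cover each reference position.

    alignments: list of (start, end) pairs, 0-based and half-open,
    giving the reference interval each read aligns to.
    """
    depth = [0] * reference_length
    for start, end in alignments:
        # Clip reads that hang over either end of the reference.
        for position in range(max(start, 0), min(end, reference_length)):
            depth[position] += 1
    return depth


def mean_depth(alignments, reference_length):
    """Average number of reads covering a reference position (the 'x' value)."""
    return sum(per_base_depth(alignments, reference_length)) / reference_length


# Hypothetical reads of 70 bp each; the third one overhangs the reference end.
toy_alignments = [(0, 70), (30, 100), (50, 120), (10, 80)]
toy_reference_length = 100

print(f"mean depth = {mean_depth(toy_alignments, toy_reference_length):.1f}x")
```

In this toy case the reads cover 260 reference positions in total, which gives a mean depth of 2.6x over the 100 bp reference.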

In addition, individuals and their first-degree relatives such as an adult offspring were genotyped using high-density SNP microarrays. Each genotype comprises all 23 chromosomes and a separate panel file denotes the sample and population information. Table 1 gives an overview of the different releases of the 1000 Genomes project:

Table 1 – Statistics of the 1000 Genomes project's genotype dataset (source: http://www.internationalgenome.org/data)

1000 Genomes release | Variants     | Individuals | Populations | File format
Phase 3              | 84.4 million | 2,504       | 26          | VCF
Phase 1              | 37.9 million | 1,092       | 14          | VCF
Pilot                | 14.8 million | 179         | 4           | VCF

The AFs in the five super-population groups (EAS = East Asian, EUR = European, AFR = African, AMR = American, and SAS = South Asian) are calculated from the allele counts and allele numbers (AN) within each group; each AF lies in the range [0, 1].
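A per-group allele frequency of this kind can be sketched by combining a sample-to-group mapping, such as the panel file provides, with the genotypes at one site. The example below is a minimal illustration with assumptions stated up front: the sample IDs, genotypes, and the function name group_allele_frequencies are hypothetical, the site is taken to be biallelic (0 = reference allele, 1 = alternative allele), and missing alleles are written as a dot.

```python
from collections import defaultdict

# Hypothetical panel: sample ID -> super-population code.
toy_panel = {"S1": "EUR", "S2": "EUR", "S3": "AFR", "S4": "AFR", "S5": "EAS"}

# Hypothetical genotypes at a single biallelic site.
toy_genotypes = {"S1": "0|1", "S2": "1|1", "S3": "0|0", "S4": "0|1", "S5": ".|."}


def group_allele_frequencies(panel, genotypes):
    """Return {group: AC / AN} for every group with at least one called allele."""
    allele_count = defaultdict(int)   # AC: alternative alleles seen per group
    allele_number = defaultdict(int)  # AN: called alleles seen per group
    for sample, genotype in genotypes.items():
        group = panel[sample]
        # Accept both phased ("|") and unphased ("/") separators.
        for allele in genotype.replace("/", "|").split("|"):
            if allele == ".":
                continue  # missing call: counts toward neither AC nor AN
            allele_number[group] += 1
            if allele == "1":
                allele_count[group] += 1
    return {group: allele_count[group] / allele_number[group] for group in allele_number}


toy_af = group_allele_frequencies(toy_panel, toy_genotypes)
print(toy_af)
```

Because AC can never exceed AN, every value returned lies between 0 and 1. A group whose samples have no called alleles, such as EAS in this toy input, is left out rather than reported as zero.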
