During management of SNPs, PanGEA calculates several benchmarks to assess the effect of the user-specified parameters. Using these benchmarks it is possible to roughly estimate the number of false positives and false negatives. The user may therefore hone the settings in an interactive and flexible manner to achieve the best possible results. These results will be a compromise between specificity and sensitivity. The basis for calculating specificity and sensitivity is the SNP-bias and some statistical assumptions. We caution the user, however: these benchmarks are based solely on statistical considerations, and the real number of true positives and false negatives may of course deviate significantly from these estimates. The SNP-benchmarks should act solely as a rough guide for users to assess the effect of the specified parameters.

SNP-bias

PanGEA displays the SNP-bias for all loaded SNPs and for the user-specified subset of SNPs.

Example of the SNP-bias in all SNPs and in the subset of SNPs, using real data:

                                    a : b
SNP-bias in all SNP-sites         832 : 364
SNP-bias in the subset SNP-sites   53 : 35

Here 'a' is the number of SNP-sites in which the most frequent allele is the reference sequence allele and 'b' is the number of SNP-sites in which the most frequent allele is any other allele. This example, generated with real 454-ESTs, demonstrates that the SNP-bias can be efficiently reduced using restrictive settings (here from about 2.3:1 to about 1.5:1). For 454-sequences the most effective way to reduce the SNP-bias is to use the 'low alignment quality tokens in neighborhood of the SNP' feature of PanGEA. With this feature, only SNPs which are not in the immediate neighborhood of a homopolymer may be selected.
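The definition of 'a' and 'b' can be illustrated with a minimal sketch. It is not PanGEA's own code: the function name snp_bias, the input layout (reference allele plus a dictionary of allele counts per SNP-site) and the tie handling are assumptions made only for this illustration.

```python
def snp_bias(sites):
    """Return (a, b) for an iterable of (reference_allele, allele_counts) pairs.

    a: SNP-sites whose most frequent allele is the reference sequence allele
    b: SNP-sites whose most frequent allele is any other allele
    Assumption: a tie that includes the reference allele is counted under a.
    """
    a = b = 0
    for reference_allele, allele_counts in sites:
        highest_count = max(allele_counts.values())
        if allele_counts.get(reference_allele, 0) == highest_count:
            a += 1
        else:
            b += 1
    return a, b


# Hypothetical toy data, not taken from PanGEA output.
example_sites = [
    ("A", {"A": 7, "G": 2}),  # reference allele is most frequent -> a
    ("C", {"C": 2, "T": 5}),  # another allele is most frequent   -> b
    ("G", {"G": 4, "A": 3}),  # reference allele is most frequent -> a
]
print(snp_bias(example_sites))  # (2, 1)
```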

SNP-benchmarks

From the SNP-bias in the initial set of SNPs and in the subset of SNPs, PanGEA further calculates the specificity and the sensitivity for the user-specified parameters. These calculations are based on a simple assumption: that the number of true SNP-sites is twice the number of SNP-sites in which the most frequent allele is not the reference sequence allele. This in turn is based on the notion that sequencing mistakes at a given site are unlikely to occur more frequently than correct reads. The factor of two implies that an equal number of true SNP-sites is assumed among the sites in which the reference allele is the most frequent.

The following equations are used to calculate specificity and sensitivity.
First, the parameters for all SNPs: the number of SNP-sites in which the most frequent allele is the reference sequence allele is called 'ra' (832 in the example above), and the number of SNP-sites in which the most frequent allele is any other allele is called 'oa' (364).

The same applies to the subset of SNPs: reference allele 'rs' (53) and other allele 'os' (35).

true positives (TP) = 2 * os
false positives (FP) = rs + os - TP
false negatives (FN) = 2 * oa - TP
true negatives (TN) = ra + oa - TP - FP - FN
Sensitivity = TP / (TP + FN)
Specificity = TN / (FP + TN)

Although these values are oversimplifications, we found them very useful in the analysis of the data.
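The equations above can be followed step by step in a short sketch that uses the numbers from the SNP-bias example. The function name snp_benchmarks and the dictionary it returns are assumptions made for this illustration and do not reflect how PanGEA is implemented internally.

```python
def snp_benchmarks(ra, oa, rs, os_):
    """Apply the SNP-benchmark equations.

    ra, oa: reference / other allele most frequent, all SNP-sites
    rs, os_: reference / other allele most frequent, subset of SNP-sites
    (os_ carries a trailing underscore to avoid clashing with Python's os module)
    """
    tp = 2 * os_
    fp = rs + os_ - tp
    fn = 2 * oa - tp
    tn = ra + oa - tp - fp - fn
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "TN": tn,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }


result = snp_benchmarks(ra=832, oa=364, rs=53, os_=35)
print(result["TP"], result["FP"], result["FN"], result["TN"])  # 70 18 658 450
print(f"{result['sensitivity']:.3f}")  # 0.096
print(f"{result['specificity']:.3f}")  # 0.962
```

Note that FP reduces to rs - os, so the estimate becomes negative whenever the subset contains more 'other allele' sites than 'reference allele' sites, which is one way in which the simplification shows.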

 

Remarks:

The SNP-bias was (and is) a matter of heated discussion: Prof. Schlötterer suspected inconsistencies in the pairwise alignments, whereas I argued that the bias is caused by sequencing errors.

At the moment several observations suggest that the SNP-bias is indeed caused by sequencing mistakes. However, if you, the users, spot any errors in PanGEA which might cause the SNP-bias, please contact me immediately at:

Observation 1:

We conducted an analysis using randomly generated ESTs from the Mycoplasma genitalium genome. We produced 57 000 random ESTs of length 100 and introduced one pseudo sequencing mistake into each.
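A data set of this kind might be produced along the following lines. This is only a sketch under stated assumptions: the start positions are drawn uniformly, the single mistake is a uniformly chosen substitution, the genome is a short random stand-in rather than the Mycoplasma genitalium sequence, and the sizes are scaled down by a factor of 100 (the coverage stays at 10x). The function name random_reads_with_one_error is hypothetical.

```python
import random

BASES = "ACGT"


def random_reads_with_one_error(genome, n_reads, read_length, rng):
    """Draw reads at random positions and substitute exactly one base in each."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        # Introduce one pseudo sequencing mistake (assumed: a substitution).
        position = rng.randrange(read_length)
        read[position] = rng.choice([b for b in BASES if b != read[position]])
        reads.append((start, "".join(read)))
    return reads


rng = random.Random(1)  # fixed seed so that the sketch is reproducible
# Stand-in genome of 5700 random bases, not a real sequence.
genome = "".join(rng.choice(BASES) for _ in range(5700))
reads = random_reads_with_one_error(genome, n_reads=570, read_length=100, rng=rng)
print(len(reads) * 100 / len(genome))  # 10.0, i.e. a 10x coverage
```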

The expected number of SNP-sites in which each allele occurs at least 2x can be calculated with a binomial distribution. The R code is as follows:

dbinom(2,10,1/300,log=FALSE)*3*570000

2.. refers to the number of sequencing mistakes that have to occur at a SNP-site. Of course this has to be exactly the same mistake.
10.. is the number of trials, i.e. the coverage
1/300.. with a phred of 20 the probability that a certain sequencing mistake occurs at a certain site is 1/300 (a phred of 20 corresponds to an error probability of 1/100, and 3 different mistakes are possible for each nucleotide)
3.. the number of possible sequencing mistakes.
570000.. the length of the Mycoplasma genitalium genome.

That is, 837 SNP-sites would be expected by mere chance, assuming one sequencing mistake in each EST of length 100 and a 10x coverage. PanGEA identified 660 SNP-sites, which suggests that the observed number of wrong SNP-sites is lower than the expected one. This also demonstrates that the SNP-bias can be caused by sequencing mistakes, because each of the 660 SNP-sites identified in this trial shows the SNP-bias.
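The expected count can be cross-checked without R. The sketch below re-implements the binomial probability mass function with the Python standard library (the helper is named dbinom here only to mirror the R call; the variable names are chosen for this illustration).

```python
from math import comb


def dbinom(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


coverage = 10            # number of trials
p_error = 1 / 300        # probability of one specific mistake at a site
n_mistakes = 3           # possible mistakes per nucleotide
genome_length = 570000   # genome length as used in the R code

expected = dbinom(2, coverage, p_error) * n_mistakes * genome_length
print(round(expected))  # 832
```

Evaluated with exactly the parameters listed above, the expression gives roughly 832, close to the 837 quoted above. In both cases the expected count exceeds the 660 SNP-sites identified by PanGEA.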

Observation 2:

Increasing demands on sequence quality and an increasing minimum distance from homopolymers effectively reduce the SNP-bias, which suggests that the bias is caused by sequencing mistakes. Using 454-sequences, the single parameter which reduces the SNP-bias most effectively is the distance from homopolymers (low alignment quality tokens in neighborhood of the SNP).

Observation 3:

I inspected hundreds of pairwise alignments manually and was not able to detect an error.