PanGEA uses four measures to assess the quality of a SNP at a SNP-site:
- The sequence quality at the SNP. This information is retrieved from the quality files.
- The average sequence quality in the neighborhood of the SNP. This information is again retrieved from the quality files. The extent of this 'neighborhood', i.e. how many nucleotides 5' and 3' of the SNP, has to be specified by the user. The quality at the SNP-site is not included for this measure as it is already contained in the first measure (quality at SNP).
- The distance from the alignment end. Border effects at the ends of pairwise alignments are known to cause suboptimally aligned bases, which can lead to wrongly identified SNPs. PanGEA therefore calculates the distance from the alignment end for each SNP. A minimum distance can be specified by the user (option 'Manage SNPs'). We found that a minimum distance of 10 bp is usually sufficient.
- The number of low-quality tokens in the neighborhood of the SNP. This is an 'ad hoc' solution that PanGEA provides to estimate the quality of a SNP from pairwise alignments. We found that indels and 'N' characters in the neighborhood of a SNP usually indicate that the SNP may be a false positive. PanGEA therefore counts the number of these low-quality tokens in the neighborhood of a SNP. The extent of this neighborhood can again be specified by the user.
We also implemented an adaptation of this 'ad hoc' solution for 454 sequences (more generally, sequencing-by-synthesis). In 454 sequences, adjacent homopolymers are usually also an indication that the SNP may be erroneous. This is especially true if the homopolymer consists of the same nucleotide as the neighboring SNP (carry-forward event).
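
As a rough illustration of how the four measures and the 454 adaptation could be computed for a single aligned read, here is a minimal Python sketch. It is not PanGEA's implementation. The function names, the representation of an aligned read as a string with '-' for gaps, the quality list with None where no score exists, the 0-based offset used as the distance from the alignment end, and the minimum run length of 3 for a homopolymer are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Assumption: gaps ('-', i.e. indels) and 'N' are the low-quality tokens.
LOW_QUALITY_TOKENS = {"-", "N"}


def neighborhood(pos: int, flank: int, length: int) -> List[int]:
    """Indices within `flank` positions 5' and 3' of `pos`, excluding `pos`."""
    lo, hi = max(0, pos - flank), min(length - 1, pos + flank)
    return [i for i in range(lo, hi + 1) if i != pos]


def neighborhood_quality(quals: Sequence[Optional[int]], pos: int,
                         flank: int) -> Optional[float]:
    """Mean quality around the SNP; positions without a score are skipped."""
    values = [quals[i] for i in neighborhood(pos, flank, len(quals))
              if quals[i] is not None]
    return sum(values) / len(values) if values else None


def distance_from_end(pos: int, aln_length: int) -> int:
    """Offset of the SNP from the nearer alignment end (0 = terminal column)."""
    return min(pos, aln_length - 1 - pos)


def low_quality_token_count(seq: str, pos: int, flank: int) -> int:
    """Number of gap and 'N' characters in the neighborhood of the SNP."""
    return sum(seq[i].upper() in LOW_QUALITY_TOKENS
               for i in neighborhood(pos, flank, len(seq)))


def run_length(seq: str, start: int, step: int) -> int:
    """Length of the run of identical characters from `start` in direction `step`."""
    if not 0 <= start < len(seq):
        return 0
    base, i, n = seq[start], start, 0
    while 0 <= i < len(seq) and seq[i] == base:
        n += 1
        i += step
    return n


def adjacent_homopolymer(seq: str, pos: int, min_run: int = 3) -> Tuple[bool, bool]:
    """Return (homopolymer directly next to the SNP, it has the SNP's base)."""
    found, same_base = False, False
    for step in (-1, 1):  # look 5' and 3' of the SNP
        start = pos + step
        if run_length(seq, start, step) >= min_run \
                and seq[start].upper() not in LOW_QUALITY_TOKENS:
            found = True
            same_base = same_base or seq[start] == seq[pos]
    return found, same_base


def assess_snp(seq: str, quals: Sequence[Optional[int]], pos: int,
               flank: int) -> Dict[str, object]:
    """Collect all measures for one SNP position in one aligned read."""
    found, same_base = adjacent_homopolymer(seq, pos)
    return {
        "snp_quality": quals[pos],
        "neighborhood_quality": neighborhood_quality(quals, pos, flank),
        "distance_from_end": distance_from_end(pos, len(seq)),
        "low_quality_tokens": low_quality_token_count(seq, pos, flank),
        "adjacent_homopolymer": found,
        "homopolymer_matches_snp": same_base,
    }
```

For the aligned read ACGAAAATC-NGT with a SNP at 0-based position 7 and a neighborhood of 3 nucleotides, this sketch counts two low-quality tokens ('-' and 'N') and reports the AAAA run 5' of the SNP as an adjacent homopolymer that does not share the SNP's base. The SNP lies 5 positions from the nearer alignment end, so a minimum distance of 10 bp would reject it.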