PanGEA - The comprehensive Gene Expression Analyzer

Chronology

PanGEA is the product of a cooperation between the team of Prof. Christian Schlötterer (Institut für Tierzucht und Genetik, Vienna) and the team of Prof. Tamas Lelley (IFA-Tulln). The idea of developing a user-friendly tool for gene-expression analysis and SNP identification with next-generation sequencing technologies came from Prof. Christian Schlötterer.

The project was initiated in Dec. 2007, when Prof. Schlötterer approached me about problems with SNP analysis using the 454 technology. The 454 platform frequently estimates the length of homopolymers inaccurately, which leads to faulty pairwise alignments and in turn causes problems in the downstream identification of SNPs. I first implemented PanGEA-Sw, a tool that generates improved pairwise alignments for 454 sequences (more generally, sequencing-by-synthesis sequences) because it is adapted to incorrect estimates of homopolymer length. We then decided to incorporate this algorithm into a special BlastN adaptation and into a user-friendly tool that also allows the identification and management of SNPs. I spent the following months, with strong support from Tatiana T. Torres, writing and improving the software, identifying bugs and implementing new features. After countless meetings and discussions with Prof. Schlötterer and his group, the tool gradually improved, and work was completed in April 2008.

I would like to take this occasion to thank Prof. Christian Schlötterer and Dr. Tatiana T. Torres for their invaluable contributions, their patience and the great ideas they brought to this project. I am also very thankful to the supervisor of my Ph.D. thesis, Prof. Tamas Lelley, for giving me the opportunity to work on this project.

 

Quick introduction

PanGEA can be used for the analysis of gene expression using next-generation sequencing technologies. Transcriptome profiling with next-generation sequencing has several advantages over the well-established microarray platforms. Using sequences, allele-specific gene expression, alternative splicing, inter-species differences in gene expression, and relative expression levels between genes can easily be investigated.

To facilitate these different tasks, PanGEA offers several features. ESTs can be mapped to genes or whole genomes using a modification of the BlastN algorithm. As PanGEA operates in batch mode, the number of ESTs that may be mapped is not limited. We found that 50 000 ESTs with an average length of 250 bp can be mapped in 1 h, and 1 000 000 ESTs with a length of 30 bp can be mapped in 3 h. To prevent complications in downstream SNP analysis with 454 sequences (more generally, sequencing-by-synthesis sequences), we developed a modified version of the Smith-Waterman algorithm that places gaps preferentially at homopolymers. This led to a dramatic improvement in pairwise alignment quality with 454 sequences: the number of wrongly identified SNPs was reduced by two thirds.
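The idea of placing gaps preferentially at homopolymers can be illustrated with a small sketch. This is not the PanGEA-Sw implementation; it is a plain Smith-Waterman local alignment with a linear gap penalty, in which a gap opposite a base that lies inside a homopolymer run is simply made cheaper. The function names, the scoring values (`match`, `mismatch`, `gap`, `gap_hp`) and the run threshold `min_run` are all assumptions chosen for illustration.

```python
def homopolymer_mask(seq, min_run=3):
    """Mark every position that lies inside a run of >= min_run identical bases."""
    mask = [False] * len(seq)
    start = 0
    for i in range(1, len(seq) + 1):
        if i == len(seq) or seq[i] != seq[start]:
            if i - start >= min_run:
                for k in range(start, i):
                    mask[k] = True
            start = i
    return mask


def align_homopolymer_aware(ref, read, match=2, mismatch=-3,
                            gap=-5, gap_hp=-1, min_run=3):
    """Local alignment in which gaps opposite homopolymer bases are cheap.

    All scoring values are hypothetical defaults, not PanGEA's settings.
    Returns (score, aligned_ref, aligned_read).
    """
    ref_hp = homopolymer_mask(ref, min_run)
    read_hp = homopolymer_mask(read, min_run)
    n, m = len(ref), len(read)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if ref[i - 1] == read[j - 1] else mismatch

    def gap_ref_base(i):   # ref base aligned to a gap in the read
        return gap_hp if ref_hp[i - 1] else gap

    def gap_read_base(j):  # read base aligned to a gap in the ref
        return gap_hp if read_hp[j - 1] else gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap_ref_base(i),
                          H[i][j - 1] + gap_read_base(j))
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best-scoring cell until the score drops to zero.
    i, j = best_pos
    a_ref, a_read = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            a_ref.append(ref[i - 1]); a_read.append(read[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap_ref_base(i):
            a_ref.append(ref[i - 1]); a_read.append('-')
            i -= 1
        else:
            a_ref.append('-'); a_read.append(read[j - 1])
            j -= 1
    return best, ''.join(reversed(a_ref)), ''.join(reversed(a_read))


# A read in which the AAAA run was under-called as AAA.
score, aligned_ref, aligned_read = align_homopolymer_aware(
    "ACGTAAAAGCTC", "ACGTAAAGCTC")
```

In this example the single gap ends up inside the A run rather than next to it, so no spurious mismatch appears at the flanking bases. Setting `gap_hp` equal to `gap` gives back the ordinary uniform-penalty behaviour for comparison.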

The pairwise alignments generated with PanGEA may subsequently be analysed interactively and flexibly using PanGEA's 'Manage pairwise alignments' option. Subsets of the alignments can be specified using the criteria minimum coverage, minimum similarity and minimum length. Furthermore, the selection can be restricted to alignments containing large gaps (with the gap size specified by the user), which can then be investigated separately. The selected alignments can be exported and used for downstream applications such as the identification of SNPs or of alternative splicing sites. At the moment only SNPs can be identified with PanGEA; a feature for intron identification is under discussion.
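The selection criteria named above can be sketched as a simple filter. The definitions used here are assumptions, since this page does not spell them out: length is taken as the number of alignment columns, similarity as the fraction of identical columns, coverage as the fraction of the read included in the alignment, and a "large gap" as the longest run of gap characters. The `Alignment` class and `select_alignments` function are hypothetical and not part of PanGEA.

```python
import re
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record of one pairwise alignment (gaps written as '-')."""
    read_id: str
    aligned_read: str
    aligned_ref: str
    read_length: int  # length of the full, unaligned read

    @property
    def length(self):
        return len(self.aligned_read)

    @property
    def similarity(self):
        same = sum(a == b and a != '-'
                   for a, b in zip(self.aligned_read, self.aligned_ref))
        return same / self.length if self.length else 0.0

    @property
    def coverage(self):
        bases = sum(c != '-' for c in self.aligned_read)
        return bases / self.read_length if self.read_length else 0.0

    @property
    def longest_gap(self):
        runs = (re.findall(r'-+', self.aligned_read)
                + re.findall(r'-+', self.aligned_ref))
        return max((len(r) for r in runs), default=0)


def select_alignments(alignments, min_coverage=0.0, min_similarity=0.0,
                      min_length=0, min_gap=None):
    """Keep alignments passing all thresholds; min_gap=None disables the gap filter."""
    selected = []
    for a in alignments:
        if (a.coverage < min_coverage or a.similarity < min_similarity
                or a.length < min_length):
            continue
        if min_gap is not None and a.longest_gap < min_gap:
            continue
        selected.append(a)
    return selected


alignments = [
    Alignment("r1", "ACGTACGTAC", "ACGTACGTAC", 10),
    Alignment("r2", "ACGT", "ACGA", 10),
    Alignment("r3", "ACGT------ACGT", "ACGTTTTTTTACGT", 8),
]
high_quality = select_alignments(alignments, min_coverage=0.9,
                                 min_similarity=0.9, min_length=8)
large_gaps = select_alignments(alignments, min_gap=5)
```

Here `high_quality` keeps only `r1`, while `large_gaps` keeps only `r3`, the kind of alignment one might inspect when looking for candidate splice sites.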

The pairwise alignments generated with PanGEA can be used to identify SNPs. The user can choose whether indels should be considered valid SNP alleles and whether quality files should be used. We furthermore developed ad-hoc solutions to estimate the quality of a SNP from the pairwise alignments or from the homopolymer terrain in the neighborhood of the SNP. It was recently demonstrated that sequencing errors of the 454 technology frequently occur close to homopolymers. In so-called 'carry forward events', the apparent SNP has the same nucleotide as the adjacent homopolymer. In the special 454 SNP-identification mode, penalty scores are assigned to SNPs that occur close to homopolymers; these penalties are especially high if the SNP allele is the same nucleotide as the homopolymer.
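A minimal sketch of such a homopolymer-based penalty is given below. The actual scoring scheme of the 454 SNP-identification mode is not described on this page, so the penalty values (0, 1, 2), the run threshold `min_run` and both function names are purely illustrative assumptions. The sketch only looks at the runs of identical bases directly flanking the SNP position in the reference.

```python
def flanking_run(ref, pos, step):
    """Base and length of the run of identical bases directly next to pos.

    step is -1 for the left flank and +1 for the right flank.
    """
    i = pos + step
    if not 0 <= i < len(ref):
        return None, 0
    base, length = ref[i], 0
    while 0 <= i < len(ref) and ref[i] == base:
        length += 1
        i += step
    return base, length


def homopolymer_penalty(ref, pos, allele, min_run=3):
    """Hypothetical penalty for a SNP allele at 0-based position pos.

    0: no flanking homopolymer of at least min_run bases
    1: a flanking homopolymer, but of a different base than the allele
    2: a flanking homopolymer of the same base as the allele
       (the pattern expected from a carry forward event)
    """
    penalty = 0
    for step in (-1, 1):
        base, length = flanking_run(ref, pos, step)
        if length >= min_run:
            penalty = max(penalty, 2 if base == allele else 1)
    return penalty


ref = "GCTAAAATCG"
same_base = homopolymer_penalty(ref, 7, "A")   # T>A right after AAAA
other_base = homopolymer_penalty(ref, 7, "C")  # T>C right after AAAA
no_run = homopolymer_penalty(ref, 9, "A")      # no homopolymer nearby
```

With these assumed values, the T>A candidate next to the AAAA run receives the highest penalty, the T>C candidate at the same position a lower one, and the candidate at the end of the sequence none.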

Subsequently, the SNPs can be analyzed interactively and a subset of them can be chosen. PanGEA tries to estimate the quality of the selected subset using two parameters, specificity and sensitivity. Specificity relates to the number of wrongly identified SNP sites, and sensitivity to the number of missed true SNP sites. When tuning the settings, users should try to maximize both values; specificity is especially important. These two benchmarks are calculated from a feature we discovered by accident. During analysis of the SNPs, the number of SNP sites in which the allele of the reference sequence is more frequent than any other allele is usually much higher than the number of SNP sites in which another allele is more frequent than the allele of the reference sequence. We termed this phenomenon SNP-bias, and PanGEA automatically calculates the SNP-bias for the initial data set and for the selected subset. This allows the user to estimate whether the active settings are restrictive enough to eliminate the SNP-bias efficiently. We believe that a reduction of the SNP-bias also entails a reduction in the number of false-positive SNP sites.
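The two counts behind the SNP-bias can be sketched as follows. How PanGEA turns them into a single SNP-bias value, and into the specificity and sensitivity estimates, is not stated on this page, so the sketch stops at the counts. The input layout (a reference allele plus a dictionary of allele counts per site) and the function name are assumptions, and sites where the reference allele ties with the best alternative are counted in neither class.

```python
def snp_bias_counts(snp_sites):
    """Count SNP sites by whether the reference allele is the most frequent one.

    snp_sites: iterable of (ref_allele, {allele: count}) pairs (assumed layout).
    Returns (ref_major, alt_major).
    """
    ref_major = alt_major = 0
    for ref_allele, counts in snp_sites:
        ref_count = counts.get(ref_allele, 0)
        best_alt = max((c for a, c in counts.items() if a != ref_allele),
                       default=0)
        if ref_count > best_alt:
            ref_major += 1
        elif best_alt > ref_count:
            alt_major += 1
    return ref_major, alt_major


sites = [
    ("A", {"A": 8, "G": 2}),  # reference allele is the major allele
    ("C", {"C": 1, "T": 5}),  # an alternative allele is the major allele
    ("G", {"G": 6, "A": 3}),  # reference allele is the major allele
    ("T", {"T": 4, "C": 4}),  # tie, counted in neither class
]
ref_major, alt_major = snp_bias_counts(sites)
```

For this toy input the counts are 2 and 1. Comparing the two counts for the initial data set with those for the selected subset shows whether the chosen settings bring them closer together.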