PanGEA Online Manual

ESTs are mapped to the genes or the whole genome of an organism with the PanGEA-BlastN algorithm, which is adapted to next-generation sequencing technologies in several ways:

The seeding, i.e. the heuristic search for approximate hits, has been optimized. Pairwise alignments are only created for the best seeds, which reduces the number of dynamic programming steps and thus computation time.
The need to map ESTs unambiguously is explicitly addressed; ambiguous mapping results are reported in a separate file.
A dynamic programming algorithm is provided that is specifically adapted to the idiosyncrasies of pyrosequencing.
Several adaptations addressing intron-related problems have been implemented.

The mapping algorithm of PanGEA-BlastN initially constructs a hash table of the database sequence and subsequently scans for approximate hits (seeds) between the query and the database sequence.
To reduce computation time, PanGEA estimates from the seeds alone which candidates are most likely to yield the highest-scoring hit, by computing the longest diagonal, i.e. the longest succession of shared words, between the query and the database sequence. Only the longest diagonals serve as the origin for the subsequent dynamic programming, which creates the pairwise alignments between the database and the query sequence.
PanGEA-BlastN provides two dynamic programming algorithms: the normal Smith-Waterman algorithm and the homopolymer Smith-Waterman algorithm, which is specifically adapted to the idiosyncrasies of sequencing-by-synthesis, i.e. incorrect estimates of homopolymer length.
Both dynamic programming algorithms implement the improvements described by Gotoh (1982) to reduce computation time.
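The seeding step can be illustrated with a minimal Python sketch. It is not PanGEA's actual code: the function names, the word size and the tie-breaking rule are assumptions made for illustration only. The sketch indexes every word of the database sequence in a hash table, collects shared words per diagonal (database position minus query position), and reports the longest run of consecutive shared words.

```python
from collections import defaultdict


def build_word_index(db_seq, word_size):
    """Hash table: word -> list of start positions in the database sequence."""
    index = defaultdict(list)
    for pos in range(len(db_seq) - word_size + 1):
        index[db_seq[pos:pos + word_size]].append(pos)
    return index


def longest_diagonal(query, index, word_size):
    """Return (diagonal, query_start, run_length_in_words) for the longest
    succession of shared words, or None if no word is shared.
    Ties are resolved in favour of the first run found (an assumption)."""
    # Group seed hits by diagonal; query positions arrive in ascending order.
    hits = defaultdict(list)
    for q_pos in range(len(query) - word_size + 1):
        for db_pos in index.get(query[q_pos:q_pos + word_size], ()):
            hits[db_pos - q_pos].append(q_pos)

    best = None
    for diagonal, q_positions in hits.items():
        run_start = prev = q_positions[0]
        # The trailing None closes the last run on each diagonal.
        for q_pos in q_positions[1:] + [None]:
            if q_pos is None or q_pos != prev + 1:
                run_len = prev - run_start + 1
                if best is None or run_len > best[2]:
                    best = (diagonal, run_start, run_len)
                run_start = q_pos
            prev = q_pos
    return best


index = build_word_index("GGGACGTTCAGGG", 3)
print(longest_diagonal("ACGTTCA", index, 3))
```

A run of n consecutive words of size w covers n + w - 1 bases, so the run length is a cheap proxy for how promising a seed is before any dynamic programming is done.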
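For readers unfamiliar with the Gotoh improvement, the following Python sketch shows the textbook form of Smith-Waterman with affine gap costs, computed in time proportional to the product of the sequence lengths. It is an illustration only and not PanGEA's implementation: the scoring values are hypothetical, and the homopolymer variant is not shown because its scoring scheme is not described here.

```python
def gotoh_local_score(a, b, match=2, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Best local alignment score with affine gaps.
    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All scoring values are illustrative assumptions."""
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # H: best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * cols  # F: best score ending with a vertical gap at (i-1, j)
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # E: best score ending with a horizontal gap at (i, j)
        for j in range(1, cols):
            # Either open a new gap from H or extend the existing one.
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(gotoh_local_score("AAAAACCCCC", "AAAAAGGGCCCCC"))
```

Keeping the separate E and F values is what lets a gap be extended in constant time per cell, instead of re-examining every possible gap length.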

Unambiguously mapped ESTs are identified by comparing the scores of the pairwise alignments. If the score difference between the best and the second-best hit exceeds a minimum threshold, the mapping result is considered unambiguous. Ambiguous results are written to a separate output file.
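The decision rule can be sketched in a few lines of Python. The function name, the return labels and the handling of ESTs with zero or one hit are assumptions for illustration; only the comparison of the best and second-best score against a threshold is taken from the description above.

```python
def classify_mapping(scores, min_score_difference):
    """Classify one EST from the alignment scores of all its hits."""
    if not scores:
        return "unmapped"
    ranked = sorted(scores, reverse=True)
    if len(ranked) == 1:
        # Assumption: a single hit has no competitor and counts as unambiguous.
        return "unambiguous"
    # "Exceeds" is read as strictly greater than the threshold.
    if ranked[0] - ranked[1] > min_score_difference:
        return "unambiguous"
    return "ambiguous"


print(classify_mapping([80, 50, 60], 10))
print(classify_mapping([80, 75], 10))
```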

Furthermore, PanGEA-BlastN offers an intron mode in which introns are, as a novelty, already considered during seeding. Exons flanking the introns are aligned separately by dynamic programming (partial alignments) and subsequently aggregated into a composite hit. Overlapping partial alignments are resolved by calculating the alignment score of each overlap individually and removing the overlap with the lowest score.
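One possible reading of the overlap resolution is sketched below in Python. The data layout (a partial alignment as a list of query positions with per-column scores), the function name and the tie-breaking rule are all assumptions; the actual procedure in PanGEA-BlastN may differ in detail.

```python
def resolve_overlap(left, right):
    """left and right are partial alignments given as lists of
    (query_position, column_score), sorted by query position, with left
    starting first. The overlapping query region is scored in both
    alignments and removed from the one where it scores lower."""
    if not left or not right:
        return left, right
    first, last = right[0][0], left[-1][0]
    if first > last:
        return left, right  # the alignments do not overlap on the query

    def overlap_score(columns):
        return sum(score for pos, score in columns if first <= pos <= last)

    # Assumption: on a tie the overlap is kept in the left alignment.
    if overlap_score(left) >= overlap_score(right):
        right = [(pos, score) for pos, score in right if pos > last]
    else:
        left = [(pos, score) for pos, score in left if pos < first]
    return left, right


exon_1 = [(0, 2), (1, 2), (2, -3), (3, -3)]
exon_2 = [(2, 2), (3, 2), (4, 2)]
print(resolve_overlap(exon_1, exon_2))
```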

PanGEA-BlastN does not, however, allow a minimum similarity threshold to be specified. Insignificant hits may instead be removed afterwards with the option 'manage pairwise alignments', which lets you specify a minimum similarity, a minimum alignment length and a minimum alignment coverage for mapped ESTs.
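The kind of post-filtering described here can be sketched as follows. This is not the code behind 'manage pairwise alignments': the Hit fields, the function name and the exact definitions of similarity (identical columns divided by alignment length) and coverage (aligned query bases divided by query length) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    identical_columns: int    # alignment columns with identical bases
    alignment_length: int     # total number of alignment columns
    aligned_query_bases: int  # query bases included in the alignment
    query_length: int         # full length of the EST


def passes_filters(hit, min_similarity, min_length, min_coverage):
    """True if the hit meets all three minimum thresholds."""
    if hit.alignment_length == 0 or hit.query_length == 0:
        return False
    similarity = hit.identical_columns / hit.alignment_length
    coverage = hit.aligned_query_bases / hit.query_length
    return (similarity >= min_similarity
            and hit.alignment_length >= min_length
            and coverage >= min_coverage)


hits = [Hit(95, 100, 100, 110), Hit(40, 50, 50, 200)]
kept = [h for h in hits if passes_filters(h, 0.9, 60, 0.8)]
print(len(kept))
```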

The upper limit on the number of database sequences that may be analyzed at once is currently 65,536. Since most organisms contain between 15,000 and 30,000 genes, we hope that this number is sufficient. Increasing this limit would be easy, but the RAM requirement would also increase drastically. If you really require this modification, please contact me.

The total length of the database sequences is limited only by the amount of available RAM, and no upper limit exists for the number of query sequences, as PanGEA-BlastN operates in batch mode.

PanGEA-BlastN is also available as a console application.