SNP-site file:

As mentioned before, a SNP-site has to contain the ID of the reference sequence, the position in the reference sequence and the indel-shift.

Example from a SNP-site output file, using the D. melanogaster genes as reference sequences and the ESTs published by Torres et al., 2008 as queries:

FBgn0000579 4431-0 16 2 0.22
FBgn0003279 1911-0 19 2 0.19
FBgn0003979 444-0 11 2 0.40
FBgn0003979 445-0 11 2 0.40
FBgn0013349 717-0 4 2 0.50
a b c d e

 

a The identity of the reference sequence (gene ID)
b The position within the reference sequence; the first part is the position in the reference sequence and the second part is the indel shift
c The number of SNPs at this SNP-site (proportional to the number of ESTs mapping to this SNP-site)
d The number of alleles at the SNP-site
e The PIC (polymorphism information content) at the SNP-site

HINT: when you use a SNP-site file as input, only the columns 'a' and 'b' are required. The columns have to be separated by a tab character.
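A minimal sketch of how one line of such a file could be read in Python, assuming the layout described above: tab-separated columns, with column 'b' written as position and indel shift joined by a hyphen. The names SnpSite and parse_snp_site_line are illustrative only and are not part of PanGEA.

```python
from typing import NamedTuple, Optional


class SnpSite(NamedTuple):
    reference_id: str                   # column a
    position: int                       # column b, part before the hyphen
    indel_shift: int                    # column b, part after the hyphen
    snp_count: Optional[int] = None     # column c (optional)
    allele_count: Optional[int] = None  # column d (optional)
    pic: Optional[float] = None         # column e (optional)


def parse_snp_site_line(line: str) -> SnpSite:
    """Parse one tab-separated SNP-site line; only columns a and b are required."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least two tab-separated columns: %r" % line)
    # Assumption: column b always has the form "<position>-<indel shift>".
    position, indel_shift = fields[1].split("-", 1)
    site = SnpSite(fields[0], int(position), int(indel_shift))
    if len(fields) >= 5:
        # Columns c, d and e are present, as in a SNP-site output file.
        site = site._replace(
            snp_count=int(fields[2]),
            allele_count=int(fields[3]),
            pic=float(fields[4]),
        )
    return site
```

A line containing only the two required columns is accepted as well; the optional fields are then left as None.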


SNP file:

As mentioned before, a SNP (according to PanGEA's terminology) is merely the nucleotide character in an EST at a SNP-site. Therefore many SNPs map to a single SNP-site, i.e. one SNP for each EST mapping to a SNP-site.

Example from a SNP file, using the D. melanogaster genes as reference sequences and the ESTs published by Torres et al., 2008 as queries:

FBgn0000579 EST_a +/- 4431-0 63-0 True 58 10 25 22
FBgn0000579 EST_b +/- 4431-0 58-0 True 51 3 27 24
FBgn0000579 EST_c +/+ 4431-0 20-0 True 19 3 20 19
FBgn0000579 EST_d +/+ 4431-0 20-0 True 19 3 20 19
FBgn0000579 EST_e +/+ 4431-0 58-0 True 46 3 23 20
a b c d e f g h i j

 

a Name of the database sequence (database sequence ID, reference sequence ID)
b Name of the query sequence (query sequence ID, EST ID)
c Strand: +/+ if the original EST is aligned with the reference sequence (plus strand), +/- if the reverse complement of the EST is aligned (minus strand)
d Position of the SNP in the reference sequence
e Position of the SNP in the EST. If the strand is +/+, the position is counted from the 5' end of the EST; if the strand is +/-, it is counted from the 3' end. The reason is that the EST had to be reverse-complemented for the alignment, so the position is simply counted from the new 5' end, which is the former 3' end.
f Whether this is a valid SNP (True/False). Invalid SNPs, for example an 'N', a 'Y' or an indel (unless indels are chosen to be valid), are merely used to determine the coverage of a SNP-site.
g Distance from the alignment end; the closest alignment end is considered
h Number of low alignment quality tokens in the neighborhood of the SNP
i Sequence quality at the SNP
j Sequence quality in the neighborhood of the SNP
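
The coordinate convention of column 'e' can be illustrated with a small helper that maps a reported position back onto the EST as it was originally sequenced. This is only a sketch under two assumptions that the format description does not state: positions are 1-based, and the EST length is known from elsewhere (it is not stored in the SNP file). The function name is hypothetical, and the indel shift after the hyphen is ignored here.

```python
def position_in_original_est(reported_position: int, est_length: int, strand: str) -> int:
    """Map a column 'e' position onto the EST in its original orientation.

    Assumes 1-based positions (an assumption, not confirmed by the format).
    """
    if not 1 <= reported_position <= est_length:
        raise ValueError("position lies outside the EST")
    if strand == "+/+":
        # The original EST was aligned, so the position is already
        # counted from its 5' end.
        return reported_position
    if strand == "+/-":
        # The reverse complement was aligned, so the position was counted
        # from the former 3' end; mirror it back.
        return est_length - reported_position + 1
    raise ValueError("unknown strand: %r" % strand)
```

For a hypothetical EST of length 100 aligned as +/-, a reported position of 63 would correspond to position 38 in the original orientation.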

We believe that these SNP files may be very helpful and could serve as the basis for subsequent processing. The SNP file contains all the necessary information and is easy to parse. It might, for example, be used as a basis for machine learning or AI approaches to SNP identification. The format is tab-delimited, i.e. all columns are separated by a tab character.
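
As an illustration of such subsequent processing, the following Python sketch parses SNP lines and counts, for every SNP-site, how many SNPs cover it and how many of them are valid. It assumes exactly ten tab-separated columns in the order a to j described above; the names SnpRecord, parse_snp_line and coverage_per_site are illustrative and not part of PanGEA, and the quality columns are read as floats as a precaution.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class SnpRecord(NamedTuple):
    reference_id: str            # column a
    est_id: str                  # column b
    strand: str                  # column c, "+/+" or "+/-"
    site: str                    # column d, e.g. "4431-0"
    est_position: str            # column e, e.g. "63-0"
    valid: bool                  # column f
    end_distance: int            # column g
    low_quality_tokens: int      # column h
    snp_quality: float           # column i
    neighborhood_quality: float  # column j


def parse_snp_line(line: str) -> SnpRecord:
    """Parse one tab-separated line of a SNP file."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) != 10:
        raise ValueError("expected 10 tab-separated columns, got %d" % len(f))
    return SnpRecord(f[0], f[1], f[2], f[3], f[4], f[5] == "True",
                     int(f[6]), int(f[7]), float(f[8]), float(f[9]))


def coverage_per_site(records: Iterable[SnpRecord]) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Return {(reference ID, site): (all SNPs, valid SNPs)}."""
    counts = defaultdict(lambda: [0, 0])
    for record in records:
        entry = counts[(record.reference_id, record.site)]
        entry[0] += 1                  # every SNP contributes to the coverage
        entry[1] += int(record.valid)  # only valid SNPs are counted here
    return {key: (total, valid) for key, (total, valid) in counts.items()}
```

The per-site counts, together with columns g to j of the individual records, could then be turned into feature vectors for a classifier; which features are useful is left open here.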

 

Torres TT, Metta M, Ottenwälder B, Schlötterer C.
Gene expression profiling by massively parallel sequencing.
Genome Res. 2008 Jan;18(1):172-7. Epub 2007 Nov 21.