Motif PSSM with Bio.motifs

How to Calculate Motif PSSM with Bio.motifs Module?

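PSSM (Position-Specific Scoring Matrix), also referred to as PSWM (Position-Specific Weight Matrix) or LSM (Log-odds Scoring Matrix), scores how the frequency of each letter at each position of a motif compares with a given background frequency. A positive score means the letter is more frequent at that position than in the background; a negative score means it is less frequent. A PSSM can be expressed as: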

PSSM[i,j] = log_{2}(PPM[i,j]/B[i])

where:

PPM[i,j] is the Position Probability Matrix entry for letter i at position j.
B[i] is the background frequency of letter i.
log_{2}() is the base-2 logarithm function.

The simplest background frequency model assumes that every letter is equally frequent across the entire population. So for DNA sequences, the simplest background frequency column is B = (Ba, Bc, Bg, Bt) = (0.25, 0.25, 0.25, 0.25).
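As a rough illustration of the formula, the following sketch computes a PPM and a PSSM using only the Python standard library. The sample sequences are the ones used in the Biopython session in this tutorial. The helper names position_probability_matrix and log_odds_matrix are hypothetical and are not part of Biopython. A zero probability is mapped to -inf here, since the logarithm of 0 is undefined.

```python
import math

def position_probability_matrix(samples, alphabet="ACGT"):
    """Return {letter: [probability at each position]} for equal-length sequences."""
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all sequences must have the same length")
    ppm = {}
    for letter in alphabet:
        # Fraction of sequences that carry this letter at position j.
        ppm[letter] = [
            sum(1 for s in samples if s[j] == letter) / len(samples)
            for j in range(length)
        ]
    return ppm

def log_odds_matrix(ppm, background):
    """Apply PSSM[i,j] = log2(PPM[i,j] / B[i]); zero probabilities become -inf."""
    return {
        letter: [
            math.log2(p / background[letter]) if p > 0 else -math.inf
            for p in row
        ]
        for letter, row in ppm.items()
    }

samples = ["AAGAAT", "ATCATA", "AAGTAA", "AACAAA", "ATTAAA", "AAGAAT"]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

ppm = position_probability_matrix(samples)
pssm = log_odds_matrix(ppm, uniform)

# Print one row per letter, two decimals per position.
for letter in "ACGT":
    print(letter + ":", " ".join(f"{x:6.2f}" for x in pssm[letter]))
```

Passing a different dictionary as background changes only the denominator B[i], so the same sketch can be reused with a non-uniform background model.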

In Biopython, we can use the log_odds() method of the PPM to calculate the PSSM. When no background is given, log_odds() uses the simplest background frequency model, B = (0.25, 0.25, 0.25, 0.25).

fyicenter$ python
>>> from Bio import motifs
>>> samples = [
...     "AAGAAT",
...     "ATCATA",
...     "AAGTAA",
...     "AACAAA",
...     "ATTAAA",
...     "AAGAAT"
... ]
>>> m = motifs.create(samples)
>>> ppm = m.counts.normalize()
>>> print(ppm)
        0      1      2      3      4      5
A:   1.00   0.67   0.00   0.83   0.83   0.67
C:   0.00   0.00   0.33   0.00   0.00   0.00
G:   0.00   0.00   0.50   0.00   0.00   0.00
T:   0.00   0.33   0.17   0.17   0.17   0.33

>>> pssm = ppm.log_odds()
>>> print(pssm)
        0      1      2      3      4      5
A:   2.00   1.42   -inf   1.74   1.74   1.42
C:   -inf   -inf   0.42   -inf   -inf   -inf
G:   -inf   -inf   1.00   -inf   -inf   -inf
T:   -inf   0.42  -0.58  -0.58  -0.58   0.42

We can verify the calculation using the math.log(x,2) function for a couple of locations in the matrix.

>>> import math
>>> math.log(ppm["A",0]/0.25, 2)
2.0
>>> math.log(ppm["A",1]/0.25, 2)
1.4150374992788437

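A letter that never occurs at a position has a probability of 0, which gives -inf in the PSSM. To avoid this, we can pass a set of pseudocounts to normalize(), so that they are added to the letter counts when the PPM is built.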

>>> pseudocounts = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
>>> ppm = m.counts.normalize(pseudocounts)
>>> background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
>>> pssm = ppm.log_odds(background)
>>> print(pssm)
        0      1      2      3      4      5
A:   1.84   1.28  -2.81   1.58   1.58   1.28
C:  -2.81  -2.81   0.36  -2.81  -2.81  -2.81
G:  -2.81  -2.81   0.89  -2.81  -2.81  -2.81
T:  -2.81   0.36  -0.49  -0.49  -0.49   0.36
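The pseudocount result can be checked by hand. The sketch below assumes, as the printed values suggest, that each pseudocount is added to the corresponding letter count before dividing by the adjusted column total. It only reproduces the arithmetic and is not Biopython's own code; the helper name smoothed_log_odds is hypothetical.

```python
import math

def smoothed_log_odds(count, column_total, pseudocount=0.25,
                      pseudocount_sum=1.0, background=0.25):
    """Log-odds score for one matrix cell after adding a pseudocount.

    Assumes probability = (count + pseudocount) / (column_total + pseudocount_sum),
    where pseudocount_sum is the sum of the pseudocounts over all letters.
    """
    probability = (count + pseudocount) / (column_total + pseudocount_sum)
    return math.log2(probability / background)

# Position 0 of the six samples: A occurs 6 times, C never occurs.
print(round(smoothed_log_odds(6, 6), 2))  # 1.84
print(round(smoothed_log_odds(0, 6), 2))  # -2.81
```

With four pseudocounts of 0.25, every column total grows from 6 to 7, so a letter that never occurs gets a probability of 0.25/7 instead of 0, and its score becomes a finite negative number.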

