Motif PSSM with Bio.motifs

Q

How to calculate a motif PSSM with the Bio.motifs module?

✍: FYIcenter.com

A

PSSM (Position-Specific Scoring Matrix), also referred to as PSWM (Position-Specific Weight Matrix) or LSM (Log-odds Scoring Matrix), measures how much more or less frequent each letter is at each position than a given background frequency predicts. A PSSM can be expressed as:

PSSM[i,j] = log2(PPM[i,j]/B[i])

where:
  PPM[i,j] is the Position Probability Matrix entry for letter i at position j.
  B[i] is the background frequency of letter i.
  log2() is the base-2 logarithm function.
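
As a worked instance of the formula, the sketch below scores one hypothetical PPM column against a uniform background using only the Python standard library. The column values and the helper name pssm_column are made up for illustration and are not part of Biopython.

```python
import math

def pssm_column(ppm_column, background):
    """Return log2(P/B) for each letter of one PPM column (hypothetical helper)."""
    scores = {}
    for letter, p in ppm_column.items():
        b = background[letter]
        # log2(0) is undefined, so a zero probability is scored as -inf
        scores[letter] = math.log2(p / b) if p > 0 else -math.inf
    return scores

# Hypothetical PPM column, chosen so the ratios are exact powers of two
column = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

scores = pssm_column(column, background)
print(scores)  # {'A': 1.0, 'C': 0.0, 'G': -1.0, 'T': -1.0}
```

A positive score means the letter is more frequent than the background predicts, zero means it matches the background, and a negative score means it is rarer.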

The simplest background frequency model assumes that each letter appears with equal frequency in the entire population. So for DNA sequences, the simplest background frequency column is B = (Ba, Bc, Bg, Bt) = (0.25, 0.25, 0.25, 0.25).
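
To see what the choice of background changes, the sketch below scores the same probability against the uniform background and against a hypothetical AT-rich background. The AT-rich values (0.3, 0.2, 0.2, 0.3) and the probability 0.5 are assumptions chosen for illustration, not measured frequencies.

```python
import math

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical AT-rich background; the values are illustrative only
at_rich = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}

p_a = 0.5  # assumed probability of "A" at some motif position

score_uniform = math.log2(p_a / uniform["A"])
score_at_rich = math.log2(p_a / at_rich["A"])

print(round(score_uniform, 2))  # 1.0
print(round(score_at_rich, 2))  # 0.74
```

The same observed frequency earns a lower score when the letter is already common in the background.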

In Biopython, we can use the log_odds() method to calculate the PSSM against a background frequency model. Note that log_odds() uses the uniform background B = (0.25, 0.25, 0.25, 0.25) by default.

fyicenter$ python
>>> from Bio import motifs
>>> samples = [
...   "AAGAAT",
...   "ATCATA",
...   "AAGTAA",
...   "AACAAA",
...   "ATTAAA",
...   "AAGAAT"
... ]

>>> m = motifs.create(samples)
>>> ppm = m.counts.normalize()
>>> print(ppm)
        0      1      2      3      4      5
A:   1.00   0.67   0.00   0.83   0.83   0.67
C:   0.00   0.00   0.33   0.00   0.00   0.00
G:   0.00   0.00   0.50   0.00   0.00   0.00
T:   0.00   0.33   0.17   0.17   0.17   0.33

>>> pssm = ppm.log_odds()
>>> print(pssm) 
        0      1      2      3      4      5
A:   2.00   1.42   -inf   1.74   1.74   1.42
C:   -inf   -inf   0.42   -inf   -inf   -inf
G:   -inf   -inf   1.00   -inf   -inf   -inf
T:   -inf   0.42  -0.58  -0.58  -0.58   0.42

We can verify the calculation using the math.log(x,2) function for a couple of locations in the matrix.

>>> import math
>>> math.log(ppm["A",0]/0.25, 2)
2.0

>>> math.log(ppm["A",1]/0.25, 2)
1.4150374992788437
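
The same check can be extended to the whole matrix without Biopython. The sketch below rebuilds the PPM entries and the PSSM from the sample sequences with the standard library only; the variable names are illustrative. One difference worth noting: the math logarithm functions raise ValueError for 0, so zero probabilities are mapped to -inf explicitly.

```python
import math
from collections import Counter

samples = ["AAGAAT", "ATCATA", "AAGTAA", "AACAAA", "ATTAAA", "AAGAAT"]
letters = "ACGT"
background = 0.25  # uniform background for every letter

# Count the letters in each column (zip(*samples) yields the columns)
columns = [Counter(col) for col in zip(*samples)]

pssm = {}
for letter in letters:
    row = []
    for col in columns:
        p = col[letter] / len(samples)  # PPM entry for this letter and position
        # math.log2(0) raises ValueError, so map zero probability to -inf
        row.append(math.log2(p / background) if p > 0 else -math.inf)
    pssm[letter] = row

for letter in letters:
    print(letter + ":", " ".join(f"{score:6.2f}" for score in pssm[letter]))
```

Rounded to two decimals, the printed rows should agree with the log_odds() matrix shown earlier.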

A letter that never occurs at a position has probability 0, and log2(0) gives -inf. To avoid -inf in the PSSM, we can add a set of pseudocounts to the counts when building the PPM.

>>> pseudocounts = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
>>> ppm = m.counts.normalize(pseudocounts)

>>> background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
>>> pssm = ppm.log_odds(background)
>>> print(pssm)
        0      1      2      3      4      5
A:   1.84   1.28  -2.81   1.58   1.58   1.28
C:  -2.81  -2.81   0.36  -2.81  -2.81  -2.81
G:  -2.81  -2.81   0.89  -2.81  -2.81  -2.81
T:  -2.81   0.36  -0.49  -0.49  -0.49   0.36
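
The effect of the pseudocounts can be checked by hand. The sketch below assumes that normalize() adds each letter's pseudocount to its raw count and then divides by the column total including the pseudocounts, which is consistent with the values printed above. The helper name column_log_odds is hypothetical and not a Biopython API.

```python
import math

def column_log_odds(counts, pseudocounts, background):
    """Score one column of raw counts (hypothetical helper, not a Biopython API)."""
    # Column total includes the pseudocounts: 6 + 4 * 0.25 = 7 for this motif
    total = sum(counts[letter] + pseudocounts[letter] for letter in counts)
    return {
        letter: math.log2(
            (counts[letter] + pseudocounts[letter]) / total / background[letter]
        )
        for letter in counts
    }

pseudocounts = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Raw counts for position 0 and position 2 of the sample motif
col0 = {"A": 6, "C": 0, "G": 0, "T": 0}
col2 = {"A": 0, "C": 2, "G": 3, "T": 1}

for col in (col0, col2):
    scores = column_log_odds(col, pseudocounts, background)
    print({letter: round(score, 2) for letter, score in scores.items()})
# {'A': 1.84, 'C': -2.81, 'G': -2.81, 'T': -2.81}
# {'A': -2.81, 'C': 0.36, 'G': 0.89, 'T': -0.49}
```

A letter with zero observations now scores log2((0.25/7)/0.25) = log2(1/7), about -2.81, instead of -inf.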

 


2023-07-01