Motif PCM, PFM, PPM, PWM with Bio.motifs

Q

How to Calculate Motif PCM, PFM, PPM, and PWM with Bio.motifs Module?

✍: FYIcenter.com

A

There are several statistical matrices associated with a sequence motif sample set.

PCM (Position Count Matrix) - PCM contains Counts of occurrences of each letter at each position. We can get PCM with "counts" property of a motif object in Biopython.

fyicenter$ python
>>> from Bio import motifs
>>> samples = [
...   "AAGAAT",
...   "ATCATA",
...   "AAGTAA",
...   "AACAAA",
...   "ATTAAA",
...   "AAGAAT"
... ]

>>> m = motifs.create(samples)
>>> print(m.counts)
        0      1      2      3      4      5
A:   6.00   4.00   0.00   5.00   5.00   4.00
C:   0.00   0.00   2.00   0.00   0.00   0.00
G:   0.00   0.00   3.00   0.00   0.00   0.00
T:   0.00   2.00   1.00   1.00   1.00   2.00

PFM (Position Frequency Matrix) - PFM is the same as PCM.

PPM (Position Probability Matrix) - PPM is the normalized (divided by the total number of samples) version of PCM. PPM can be expressed as:

PPM[i,j] = PCM[i,j]/s

where s is the size of sample pool.  

In Biopython, we can calculate PPM with the normalize() function.

>>> ppm = m.counts.normalize()
>>> print(ppm)
        0      1      2      3      4      5
A:   1.00   0.67   0.00   0.83   0.83   0.67
C:   0.00   0.00   0.33   0.00   0.00   0.00
G:   0.00   0.00   0.50   0.00   0.00   0.00
T:   0.00   0.33   0.17   0.17   0.17   0.33

With PPM, we can predict the probability of a given sequence in the entire population. For example, the consensus sequence "AAGAAA" has the highest probability of 15%.

>>> m.consensus
Seq('AAGAAA')

>>> p = ppm["A",0]*ppm["A",1]*ppm["G",2]*ppm["A",3]*ppm["A",4]*ppm["A",5]
>>> p
0.154320987654321

And the anticonsensus sequence "CCACCC" has the lowest probability of 0%.

>>> m.anticonsensus
Seq('CCACCC')

>>> p = ppm["C",0]*ppm["C",1]*ppm["A",2]*ppm["C",3]*ppm["C",4]*ppm["C",5]
>>> p
0.0

In fact, many sequences will have the 0% probability, if they contain a letter with 0 probability at any position.

To avoid this 0% probability issue, a pseudocount can be added at each position in the normalization process.

A standard approach is to assume that all letters should have an equal probability at each position. So PPM modified with standard pseudocounts, referred as pseudocount of 1, can be expressed as:

PPMp[i,j] = PCM[i,j]+(1/n)/(s+1) 

where: 
  s is the size of sample pool.  
  n is the number of letters. 

For the above motif example, the modified PPM can be calculated as below using the "numpy" library.

# PPMp[i,j] = (PCM[i,j]+0.25)/7

pcm = m.counts
>>> pcmp = numpy.array([pcm["A"], pcm["C"], pcm["G"], pcm["T"]])
>>> print(pcmp)
[[6 4 0 5 5 4]
 [0 0 2 0 0 0]
 [0 0 3 0 0 0]
 [0 2 1 1 1 2]]

>>> ppmp = (pcmp+0.25)/7
>>> print(ppmp)
[[0.89285714 0.60714286 0.03571429 0.75       0.75       0.60714286]
 [0.03571429 0.03571429 0.32142857 0.03571429 0.03571429 0.03571429]
 [0.03571429 0.03571429 0.46428571 0.03571429 0.03571429 0.03571429]
 [0.03571429 0.32142857 0.17857143 0.17857143 0.17857143 0.32142857]]

Or you can use "pseudocounts" parameter with the normalize() function to calculate modified PPM.

>>> ppmp = m.counts.normalize(pseudocounts=0.25)
>>> print(ppmp)
        0      1      2      3      4      5
A:   0.89   0.61   0.04   0.75   0.75   0.61
C:   0.04   0.04   0.32   0.04   0.04   0.04
G:   0.04   0.04   0.46   0.04   0.04   0.04
T:   0.04   0.32   0.18   0.18   0.18   0.32

A more accurate approach is to use known probabilities of letters in the entire population. For human genome, the GC letters is about 40%, and the AT letters is 60%.

>>> human = {"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6}
>>> ppmp = m.counts.normalize(pseudocounts=human)
>>> 
>>> print(ppmp)
        0      1      2      3      4      5
A:   0.82   0.57   0.07   0.70   0.70   0.57
C:   0.05   0.05   0.30   0.05   0.05   0.05
G:   0.05   0.05   0.42   0.05   0.05   0.05
T:   0.07   0.33   0.20   0.20   0.20   0.33

PWM (Position Weight Matrix) - PWM is PPM with non-uniform pseudocounts, where different weights are applied on different letters.

 

Motif PSSM with Bio.motifs

Read Motif in JASPAR Format with Bio.motifs

Biopython for Sequence Motif Analysis

⇑⇑ OBF (Open Bioinformatics Foundation) Tools

2023-07-01, 342🔥, 0💬