Collections:
Motif PCM, PFM, PPM, PWM with Bio.motifs
How to Calculate Motif PCM, PFM, PPM, and PWM with Bio.motifs Module?
✍: FYIcenter.com
There are several statistical matrices associated with a sequence motif sample set.
PCM (Position Count Matrix) - PCM contains Counts of occurrences of each letter at each position. We can get PCM with "counts" property of a motif object in Biopython.
fyicenter$ python >>> from Bio import motifs >>> samples = [ ... "AAGAAT", ... "ATCATA", ... "AAGTAA", ... "AACAAA", ... "ATTAAA", ... "AAGAAT" ... ] >>> m = motifs.create(samples) >>> print(m.counts) 0 1 2 3 4 5 A: 6.00 4.00 0.00 5.00 5.00 4.00 C: 0.00 0.00 2.00 0.00 0.00 0.00 G: 0.00 0.00 3.00 0.00 0.00 0.00 T: 0.00 2.00 1.00 1.00 1.00 2.00
PFM (Position Frequency Matrix) - PFM is the same as PCM.
PPM (Position Probability Matrix) - PPM is the normalized (divided by the total number of samples) version of PCM. PPM can be expressed as:
PPM[i,j] = PCM[i,j]/s where s is the size of sample pool.
In Biopython, we can calculate PPM with the normalize() function.
>>> ppm = m.counts.normalize() >>> print(ppm) 0 1 2 3 4 5 A: 1.00 0.67 0.00 0.83 0.83 0.67 C: 0.00 0.00 0.33 0.00 0.00 0.00 G: 0.00 0.00 0.50 0.00 0.00 0.00 T: 0.00 0.33 0.17 0.17 0.17 0.33
With PPM, we can predict the probability of a given sequence in the entire population. For example, the consensus sequence "AAGAAA" has the highest probability of 15%.
>>> m.consensus Seq('AAGAAA') >>> p = ppm["A",0]*ppm["A",1]*ppm["G",2]*ppm["A",3]*ppm["A",4]*ppm["A",5] >>> p 0.154320987654321
And the anticonsensus sequence "CCACCC" has the lowest probability of 0%.
>>> m.anticonsensus Seq('CCACCC') >>> p = ppm["C",0]*ppm["C",1]*ppm["A",2]*ppm["C",3]*ppm["C",4]*ppm["C",5] >>> p 0.0
In fact, many sequences will have the 0% probability, if they contain a letter with 0 probability at any position.
To avoid this 0% probability issue, a pseudocount can be added at each position in the normalization process.
A standard approach is to assume that all letters should have an equal probability at each position. So PPM modified with standard pseudocounts, referred as pseudocount of 1, can be expressed as:
PPMp[i,j] = PCM[i,j]+(1/n)/(s+1) where: s is the size of sample pool. n is the number of letters.
For the above motif example, the modified PPM can be calculated as below using the "numpy" library.
# PPMp[i,j] = (PCM[i,j]+0.25)/7 pcm = m.counts >>> pcmp = numpy.array([pcm["A"], pcm["C"], pcm["G"], pcm["T"]]) >>> print(pcmp) [[6 4 0 5 5 4] [0 0 2 0 0 0] [0 0 3 0 0 0] [0 2 1 1 1 2]] >>> ppmp = (pcmp+0.25)/7 >>> print(ppmp) [[0.89285714 0.60714286 0.03571429 0.75 0.75 0.60714286] [0.03571429 0.03571429 0.32142857 0.03571429 0.03571429 0.03571429] [0.03571429 0.03571429 0.46428571 0.03571429 0.03571429 0.03571429] [0.03571429 0.32142857 0.17857143 0.17857143 0.17857143 0.32142857]]
Or you can use "pseudocounts" parameter with the normalize() function to calculate modified PPM.
>>> ppmp = m.counts.normalize(pseudocounts=0.25) >>> print(ppmp) 0 1 2 3 4 5 A: 0.89 0.61 0.04 0.75 0.75 0.61 C: 0.04 0.04 0.32 0.04 0.04 0.04 G: 0.04 0.04 0.46 0.04 0.04 0.04 T: 0.04 0.32 0.18 0.18 0.18 0.32
A more accurate approach is to use known probabilities of letters in the entire population. For human genome, the GC letters is about 40%, and the AT letters is 60%.
>>> human = {"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6} >>> ppmp = m.counts.normalize(pseudocounts=human) >>> >>> print(ppmp) 0 1 2 3 4 5 A: 0.82 0.57 0.07 0.70 0.70 0.57 C: 0.05 0.05 0.30 0.05 0.05 0.05 G: 0.05 0.05 0.42 0.05 0.05 0.05 T: 0.07 0.33 0.20 0.20 0.20 0.33
PWM (Position Weight Matrix) - PWM is PPM with non-uniform pseudocounts, where different weights are applied on different letters.
⇐ Read Motif in JASPAR Format with Bio.motifs
2023-07-01, 342🔥, 0💬
Popular Posts:
Molecule Summary: ID: FYI-1000200 SMILES: COc1cccc(Nc2ncnn3ccc(CN4 CC[C@@H](N)[C@H](O)C4)c2 3)c1Recei...
Where to find FAQ (Frequently Asked Questions) in understanding what is molecule and its basic prope...
Molecule Summary: ID: FYI-1001939 SMILES: CC1=CN(C(=O)NC1=O)C2CC(C (O2)CO)N=[N+]=[N-]Received at FYI...
Molecule Summary: ID: FYI-1000315 SMILES: CCCOc1ccc(CNC(=O)CSc2nc( C)n[nH]2)cn1Received at FYIcenter...
Molecule Summary: ID: FYI-1000457 SMILES: Nc3nc1c(NCN1[C@@H]2O[C@H ](COP(=O)([O-])OP(=O)([O -])OC[C@H]...