Motif ICM with Bio.motifs

Q

How to Calculate Motif ICM with the Bio.motifs Module?

✍: FYIcenter.com

A

An ICM (Information Content Matrix) shows how much information each letter contributes at each position of a motif, so positions can be compared by how conserved they are. The ICM can be expressed as:

ICM[i,j] = PPM[i,j]*(ICt - U[j])

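where:
  i is the letter (row) index and j is the position (column) index
  PPM[i,j] is the Position Probability Matrix
  ICt is the total (maximum) IC per position: log2(n)
  n is the number of letters in the alphabet (4 for DNA)
  U[j] is the uncertainty per position: - sum_over_i(PPM[i,j]*log2(PPM[i,j])), with 0*log2(0) taken as 0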
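
To make the formula concrete, here is a minimal sketch that applies it to a single column using only the standard "math" module. The column values (A = 4/6, T = 2/6) are position 1 of the example PPM used later in this tutorial; the names "column", "u_j" and "icm_column" are illustrative only.

```python
import math

# One PPM column: position 1 of the tutorial's example (A = 4/6, T = 2/6).
column = {"A": 4 / 6, "C": 0.0, "G": 0.0, "T": 2 / 6}

# ICt = log2(n), the total IC for an alphabet of n letters.
ic_t = math.log2(len(column))

# U[j]: zero probabilities are skipped, i.e. 0*log2(0) is taken as 0.
u_j = -sum(p * math.log2(p) for p in column.values() if p > 0)

# ICM[i,j] = PPM[i,j] * (ICt - U[j])
icm_column = {letter: p * (ic_t - u_j) for letter, p in column.items()}

print(round(u_j, 4))
print({letter: round(value, 4) for letter, value in icm_column.items()})
```

The results should agree with column 1 of the "u_a" and "icm" arrays computed with numpy further down.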

To calculate motif ICM, we can get the PPM first using Biopython.

fyicenter$ python
>>> from Bio import motifs
>>> samples = [
...   "AAGAAT",
...   "ATCATA",
...   "AAGTAA",
...   "AACAAA",
...   "ATTAAA",
...   "AAGAAT"
... ]

>>> m = motifs.create(samples)
>>> ppm = m.counts.normalize()
>>> print(ppm)
        0      1      2      3      4      5
A:   1.00   0.67   0.00   0.83   0.83   0.67
C:   0.00   0.00   0.33   0.00   0.00   0.00
G:   0.00   0.00   0.50   0.00   0.00   0.00
T:   0.00   0.33   0.17   0.17   0.17   0.33
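
As a cross-check of the PPM above, the same probabilities can be reproduced without Biopython by counting letters per column. This is only a sketch of the arithmetic, assuming normalize() is called without pseudocounts as it is above; it is not how Biopython implements it, and the name "ppm_check" is made up for this example.

```python
from collections import Counter

samples = ["AAGAAT", "ATCATA", "AAGTAA", "AACAAA", "ATTAAA", "AAGAAT"]
letters = "ACGT"

# zip(*samples) yields one tuple of letters per motif position (column).
columns = [Counter(col) for col in zip(*samples)]

# Probability = count of the letter in the column / number of sequences.
ppm_check = {
    letter: [counts[letter] / len(samples) for counts in columns]
    for letter in letters
}

for letter in letters:
    print(letter, " ".join(f"{p:.2f}" for p in ppm_check[letter]))
```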

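Then we can calculate the ICM using the "numpy" and "math" libraries. Note that numpy.log2() returns -inf for zero probabilities (and may print a RuntimeWarning), so the products 0 * -inf become nan; these are replaced with 0 by numpy.nan_to_num().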

>>> import numpy
>>> import math 

>>> n = len(ppm)
>>> n
4

>>> ic_t = math.log(n, 2)
>>> ic_t
2.0

>>> ppm_a = numpy.array([ppm["A"], ppm["C"], ppm["G"], ppm["T"]])
>>> print(ppm_a)
[[1.         0.66666667 0.         0.83333333 0.83333333 0.66666667]
 [0.         0.         0.33333333 0.         0.         0.        ]
 [0.         0.         0.5        0.         0.         0.        ]
 [0.         0.33333333 0.16666667 0.16666667 0.16666667 0.33333333]]

>>> log2_ppm_a = numpy.log2(ppm_a)
>>> print(log2_ppm_a)
[[ 0.         -0.5849625         -inf -0.26303441 -0.26303441 -0.5849625 ]
 [       -inf        -inf -1.5849625         -inf        -inf        -inf]
 [       -inf        -inf -1.                -inf        -inf        -inf]
 [       -inf -1.5849625  -2.5849625  -2.5849625  -2.5849625  -1.5849625 ]]

>>> ppm_log2_ppm_a = ppm_a * log2_ppm_a
>>> print(ppm_log2_ppm_a)
[[ 0.         -0.389975           nan -0.21919534 -0.21919534 -0.389975  ]
 [        nan         nan -0.52832083         nan         nan         nan]
 [        nan         nan -0.5                nan         nan         nan]
 [        nan -0.52832083 -0.43082708 -0.43082708 -0.43082708 -0.52832083]]

>>> ppm_log2_ppm_a = numpy.nan_to_num(ppm_log2_ppm_a)
>>> print(ppm_log2_ppm_a)
[[ 0.         -0.389975    0.         -0.21919534 -0.21919534 -0.389975  ]
 [ 0.          0.         -0.52832083  0.          0.          0.        ]
 [ 0.          0.         -0.5         0.          0.          0.        ]
 [ 0.         -0.52832083 -0.43082708 -0.43082708 -0.43082708 -0.52832083]]

>>> u_a = - numpy.sum(ppm_log2_ppm_a, axis=0)
>>> print(u_a)
[-0.          0.91829583  1.45914792  0.65002242  0.65002242  0.91829583]

>>> icm = ppm_a * (ic_t - u_a)
>>> print(icm)
[[2.         0.72113611 0.         1.12498132 1.12498132 0.72113611]
 [0.         0.         0.18028403 0.         0.         0.        ]
 [0.         0.         0.27042604 0.         0.         0.        ]
 [0.         0.36056806 0.09014201 0.22499626 0.22499626 0.36056806]]
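
The steps above can also be collected into one helper. The following is a minimal sketch, not part of Biopython: the function name "information_content_matrix" is hypothetical, and the PPM is entered by hand from the counts of the six sample sequences. Instead of nan_to_num(), it uses the "where" and "out" arguments of numpy.log2() so that zero probabilities never produce -inf or nan.

```python
import numpy


def information_content_matrix(ppm_a):
    """Return PPM * (log2(n) - U) for an (n_letters, n_positions) array."""
    ppm_a = numpy.asarray(ppm_a, dtype=float)
    ic_t = numpy.log2(ppm_a.shape[0])
    # Take log2 only where the probability is positive; other cells stay 0,
    # which is the same as treating 0*log2(0) as 0.
    log2_ppm = numpy.zeros_like(ppm_a)
    numpy.log2(ppm_a, out=log2_ppm, where=ppm_a > 0)
    # Uncertainty per position (sum over the letter axis).
    u = -numpy.sum(ppm_a * log2_ppm, axis=0)
    return ppm_a * (ic_t - u)


# Letter counts per position (rows A, C, G, T) divided by 6 sequences.
ppm_a = numpy.array([
    [6, 4, 0, 5, 5, 4],
    [0, 0, 2, 0, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [0, 2, 1, 1, 1, 2],
]) / 6

icm = information_content_matrix(ppm_a)
print(numpy.round(icm, 4))
```

Masking the logarithm keeps the session free of warnings, at the cost of one extra array allocation.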

As you can see, the first position has the highest total ICM value (the sum of its column), 2, so it is the most important, or most conserved, position. The third position has the lowest total, about 0.54, so it is the least important, or least conserved, position.
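
The per-position totals behind this comparison can be checked by summing each column of the ICM. The sketch below copies the ICM values printed above into a plain list; the variable names are illustrative, and positions are reported with zero-based indexes.

```python
# ICM values copied from the output above (rows A, C, G, T).
icm_rows = [
    [2.0, 0.72113611, 0.0, 1.12498132, 1.12498132, 0.72113611],
    [0.0, 0.0, 0.18028403, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.27042604, 0.0, 0.0, 0.0],
    [0.0, 0.36056806, 0.09014201, 0.22499626, 0.22499626, 0.36056806],
]

# Total information content per position = sum of each column.
totals = [sum(col) for col in zip(*icm_rows)]

most_conserved = max(range(len(totals)), key=totals.__getitem__)
least_conserved = min(range(len(totals)), key=totals.__getitem__)

print([round(t, 4) for t in totals])
print(most_conserved, least_conserved)  # zero-based: 0 and 2
```

Because each PPM column sums to 1, each total equals ICt - U[j] for that position.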

 


2023-05-31