Bioinformatics FAQ (Frequently Asked Questions) - Glossary of bioinformatics terms
(Continued from previous part...)
What is a scoring matrix?
The following explanation was edited from a contribution by
Amelie Stein.
The aim of a sequence
alignment is to match "the most similar elements" of two
sequences. This similarity must be evaluated somehow. For example,
consider the following two alignments:
(a) AIWQH
    AL-QH
(b) AIWQH
    A-LQH
They seem quite similar: both contain one "indel" and one
substitution, just at different positions. However, if we think of
the letters as amino acid residues rather than elements of strings,
alignment (a) is the better one, because isoleucine (I) and leucine
(L) have similar side chains, while tryptophan (W) has a very
different structure. This is a physico-chemical measure; these days
we might prefer to say simply that leucine substitutes for
isoleucine more frequently, without giving an underlying "reason"
for this observation.
However we explain it, it is much more likely that a mutation
changed I into L and that W was lost, as in (a), than that W changed
into L and I was lost, as in (b). We would also expect a change from
I to L to affect the function less than a change from W to L, but
this deserves its own topic.
To quantify the similarity achieved by an alignment, scoring
matrices are used: they contain a value for each possible
substitution, and the alignment score is the sum of the
matrix's entries for each aligned amino acid pair. For gaps
(indels), a special gap score is necessary; a very simple
scheme is to add a constant penalty for each indel. The
optimal alignment is the one that maximizes the alignment
score.
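As a rough illustration of this summation, the sketch below scores alignments (a) and (b) from above in Python. The substitution values in SCORES and the constant GAP_PENALTY are made up for this example and are not taken from any published matrix; the names pair_score and alignment_score are likewise hypothetical.

```python
# Illustrative substitution scores (hypothetical values, NOT a published matrix).
# Only the residue pairs that occur in the two example alignments are listed.
SCORES = {
    ("A", "A"): 2,
    ("Q", "Q"): 4,
    ("H", "H"): 6,
    ("I", "L"): 2,   # similar side chains: mildly positive
    ("W", "L"): -2,  # dissimilar residues: negative
}
GAP_PENALTY = -3  # assumed constant penalty added for each indel


def pair_score(x, y):
    """Score one alignment column; '-' on either side marks a gap."""
    if x == "-" or y == "-":
        return GAP_PENALTY
    # The table is treated as symmetric, so try both orders.
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]


def alignment_score(top, bottom):
    """Sum the column scores of two equal-length aligned strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(top, bottom))


score_a = alignment_score("AIWQH", "AL-QH")  # alignment (a)
score_b = alignment_score("AIWQH", "A-LQH")  # alignment (b)
print(score_a, score_b)
```

With these assumed values, alignment (a) scores higher than (b), because the I/L column contributes a positive value while the W/L column contributes a negative one; the gap costs the same in both.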
PAM matrices are a common family of scoring matrices. PAM
stands for Percent Accepted Mutations (the name is also often
expanded as Point Accepted Mutation),
where "accepted" means that the mutation has been retained in the
sequence in question. Thus, using the PAM 250
scoring matrix assumes that about 250 mutations per 100 amino acids
may have happened (more than one per position is possible, because
the same position can mutate repeatedly), while PAM 10 assumes only
10 mutations per 100 amino acids, so that only very similar
sequences will reach useful alignment scores.
PAM matrices contain positive and negative values. If the
alignment score is greater than zero, the sequences are considered
to be related (they are similar with respect to the scoring
matrix used); if the score is negative, it is assumed that they are
not related. "Relationship" here may refer to evolution as well as
to the function of the proteins. The choice of matrix affects the
result, so one has to make an assumption about the similarity of
the sequences in order to obtain a useful result:
rather distant sequences won't produce a good alignment using PAM
10, and the optimal alignment of two very similar sequences with PAM
500 may be less useful than that with PAM 50.
Finally, it should be noted that only some scoring matrices use
similarity to evaluate alignments; others use
distance, so be careful when interpreting the results!
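To make the difference concrete, here is a minimal Python sketch of one simple distance measure, the unit-cost edit distance (every substitution or indel costs 1). This measure is chosen only as an example of distance-based scoring; it is not one of the matrices discussed above, and the function name edit_distance is hypothetical. With a distance, a lower value means more similar sequences and identical sequences score 0, the opposite reading from a similarity score.

```python
def edit_distance(s, t):
    """Unit-cost edit distance between s and t: lower means more similar."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


distance = edit_distance("AIWQH", "ALQH")
print(distance)
```

For the example sequences the distance corresponds to one substitution plus one indel, and it cannot tell alignments (a) and (b) apart, since unit costs ignore which residues are exchanged.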
After this brief and necessarily superficial overview, you
might want to read some more about scoring matrices.