Bioinformatics basics - Sequence analysis

BioTech FYI Center

Bioinformatics- Sequence analysis

Sequence analysis is the application of information technologies to molecular biology. It deals with biological sequences and processes them to extract significant information that may yield new insights and guidelines for understanding biological organisms.

Basics for sequence analysis

Proteins

A protein is typically built of a series of basic blocks called amino acids, chained together in a linear sequence. Amino acids come in a variety of shapes and properties: they may be small or bulky, hydrophobic or hydrophilic, electrically charged or neutral, and so on, which allows very complex shapes and interactions to be produced.

Amino acids are commonly referred to by name or by an abbreviation, usually of three letters or one letter. This allows for more efficient descriptions of how they are chained together to build a protein:

Amino acid	3-letter	1-letter
Neutral-Nonpolar
Glycine	Gly	G
L-Alanine	Ala	A
L-Valine	Val	V
L-Isoleucine	Ile	I
L-Leucine	Leu	L
L-Phenylalanine	Phe	F
L-Proline	Pro	P
L-Methionine	Met	M
Neutral-Polar
L-Serine	Ser	S
L-Threonine	Thr	T
L-Tyrosine	Tyr	Y
L-Tryptophan	Trp	W
L-Asparagine	Asn	N
L-Glutamine	Gln	Q
L-Cysteine	Cys	C
Acidic
L-Aspartic acid	Asp	D
L-Glutamic acid	Glu	E
Basic
L-Lysine	Lys	K
L-Arginine	Arg	R
L-Histidine	His	H
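As a small illustration of how the abbreviations shorten a description, the sketch below converts a chain written in three-letter codes into one-letter codes. The mapping is copied from the table; the function name `to_one_letter` and the hyphen-separated input format are assumptions made for this example.

```python
# Three-letter to one-letter amino acid codes, copied from the table.
THREE_TO_ONE = {
    "Gly": "G", "Ala": "A", "Val": "V", "Ile": "I", "Leu": "L",
    "Phe": "F", "Pro": "P", "Met": "M", "Ser": "S", "Thr": "T",
    "Tyr": "Y", "Trp": "W", "Asn": "N", "Gln": "Q", "Cys": "C",
    "Asp": "D", "Glu": "E", "Lys": "K", "Arg": "R", "His": "H",
}


def to_one_letter(three_letter_chain, sep="-"):
    """Convert e.g. 'Met-Ala-Gly' into 'MAG' (hypothetical helper)."""
    residues = three_letter_chain.split(sep)
    # capitalize() normalises 'MET' or 'met' to the 'Met' form used as key
    return "".join(THREE_TO_ONE[res.capitalize()] for res in residues)


print(to_one_letter("Met-Lys-Trp-Val-Thr"))  # MKWVT
```

An unknown abbreviation raises a `KeyError`, which is usually preferable to silently dropping a residue.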

Nucleic Acids

For nucleic acids the number of basic building blocks is much smaller: each nucleic acid chain is composed of a series of only four different nucleotides, which furthermore provide for a very limited set of interactions.

Nucleic acids come in two flavors: DNA (DeoxyriboNucleic Acid) and RNA (RiboNucleic Acid). Both of them consist of a series of nucleotides that are glued one after the other to constitute the sequence of blocks that make up the functional chain.

Nucleotides are composed of a phosphate group, a sugar (ribose in RNA, deoxyribose in DNA) and a base, which is what distinguishes one nucleotide from another. The base may be guanine, cytosine, adenine or thymine in DNA, and guanine, cytosine, adenine or uracil in RNA. They are referred to by their one-letter abbreviations G, C, A, T and U. Interactions are mainly driven by the establishment of hydrogen bonds, which form only between thymine (or uracil) and adenine (two hydrogen bonds) and between cytosine and guanine (three hydrogen bonds).
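The pairing rules can be written down in a few lines. This is a minimal sketch for DNA only: the function names are invented for the example, and the reverse complement relies on the two strands of a duplex running in opposite directions, a standard fact that the text here does not discuss.

```python
# Pairing partners and hydrogen bonds per pair (A-T: 2, G-C: 3).
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(dna):
    """Return the strand that pairs with `dna`, read in its own direction."""
    return "".join(DNA_PAIR[base] for base in reversed(dna.upper()))


def hydrogen_bonds(dna):
    """Count hydrogen bonds in a fully paired duplex built on `dna`."""
    return sum(H_BONDS[base] for base in dna.upper())


def transcribe(dna):
    """Rewrite a DNA sequence in the RNA alphabet (T becomes U)."""
    return dna.upper().replace("T", "U")


print(reverse_complement("GATTACA"))  # TGTAATC
print(hydrogen_bonds("GATTACA"))      # 16
print(transcribe("GATTACA"))          # GAUUACA
```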

The main role of nucleic acids is to convey all the genetic information needed to make proteins and to control the building process. Protein sequences are coded by nucleic acids using groups of three nucleotides (codons), each of which codes for a given amino acid. The code is more or less universal, with few exceptions, and includes redundancy to increase the fidelity of the reading process when making duplicates or translating the information:

UUU	Phe	UCU	Ser	UAU	Tyr	UGU	Cys
UUC	Phe	UCC	Ser	UAC	Tyr	UGC	Cys
UUA	Leu	UCA	Ser	UAA	Stop	UGA	Stop
UUG	Leu	UCG	Ser	UAG	Stop	UGG	Trp
CUU	Leu	CCU	Pro	CAU	His	CGU	Arg
CUC	Leu	CCC	Pro	CAC	His	CGC	Arg
CUA	Leu	CCA	Pro	CAA	Gln	CGA	Arg
CUG	Leu	CCG	Pro	CAG	Gln	CGG	Arg
AUU	Ile	ACU	Thr	AAU	Asn	AGU	Ser
AUC	Ile	ACC	Thr	AAC	Asn	AGC	Ser
AUA	Ile	ACA	Thr	AAA	Lys	AGA	Arg
AUG	Met	ACG	Thr	AAG	Lys	AGG	Arg
GUU	Val	GCU	Ala	GAU	Asp	GGU	Gly
GUC	Val	GCC	Ala	GAC	Asp	GGC	Gly
GUA	Val	GCA	Ala	GAA	Glu	GGA	Gly
GUG	Val*	GCG	Ala	GAG	Glu	GGG	Gly

* GUG may also code for the initiator Met. This triplet is therefore "ambiguous".
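The codon table can be used directly to translate a message. The sketch below rebuilds the same 64 entries compactly and reads an mRNA three bases at a time; the names `CODON_TABLE` and `translate` are assumptions for the example, `*` stands for Stop, and the special case of GUG acting as an initiator Met is deliberately not modelled.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids in the order UUU, UUC, UUA, UUG, UCU, ..., GGG.
# '*' marks a Stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(mrna):
    """Translate codon by codon until a Stop codon or the end of the input."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA spelling as well
    protein = []
    for i in range(0, len(mrna) - 2, 3):   # incomplete last codon is ignored
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("AUGUUUGGCUAA"))  # MFG
```

The redundancy mentioned in the text is visible in the table itself: for instance, six different codons map to Leu.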

Regulation of expression is encoded as specific patterns that are to be recognized by the translation machinery under appropriate circumstances.

Sequence databases:

For an overview of databases - bioinformatics database - an overview

For a complete listing of sequence databases - Biological databases

Overview of sequence analysis tools

Sequence Comparison
An alignment is an arrangement of two sequences that shows where the two sequences are similar and where they differ. An optimal alignment is one that exhibits the most similarities and the fewest differences. Broadly, there are three categories of methods for sequence comparison.

- Segment methods compare all overlapping segments of a predetermined length (e.g., 10 amino acids) from one sequence to all segments from the other. This is the approach used in dotplots.

- Optimal global alignment methods allow the best overall score for the comparison of the two sequences to be obtained, including a consideration of gaps. These programs align sequences over their whole length.

- Optimal local alignment algorithms seek to identify the best local similarities between two sequences, also including explicit consideration of gaps. Alignment may only be over a short span of sequence.

Dotplots
The most intuitive representation of the comparison between two sequences is using dotplots. One sequence is represented on each axis and significant matching regions are distributed along diagonals in the matrix.

There are two algorithms commonly used to create dotplots. The first matches identical regions of sequence and plots a dot in these areas. The second uses "sliding windows" to compare the two sequences against a threshold score. A window size is selected as a run of adjacent nucleotide or amino acid residues, and a threshold score is chosen to reflect the degree of similarity required. Each window of sequence A is compared to each window of sequence B, and a dot is placed in that region only if the match score meets or exceeds the threshold.
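The sliding-window method can be sketched as follows. This is an illustration under simple assumptions rather than the code behind any particular dotplot program: a window scores one point per identical residue pair, and the names `dotplot` and `render` are invented for the example.

```python
def dotplot(seq_a, seq_b, window=3, threshold=3):
    """Return the set of (i, j) window start positions that earn a dot.

    A window scores one point per identical residue pair; a dot is placed
    when the score meets or exceeds `threshold`.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            pairs = zip(seq_a[i:i + window], seq_b[j:j + window])
            score = sum(a == b for a, b in pairs)
            if score >= threshold:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, residue in enumerate(seq_b):
        row = "".join(
            "*" if (i, j) in dots else "." for i in range(len(seq_a))
        )
        lines.append(residue + " " + row)
    return "\n".join(lines)


dots = dotplot("ACGTACGT", "ACGT", window=4, threshold=4)
print(sorted(dots))  # [(0, 0), (4, 0)]
print(render("ACGTACGT", "ACGT", dots))
```

Setting `threshold` equal to `window` reproduces the first method (identical regions only); lowering it tolerates mismatches inside a window. For protein sequences the identity count would normally be replaced by scores from a substitution matrix.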

Online tool links:

Dotlet Programme

Learn dotlet by example

Sequence alignment
The algorithms we will be using are more rigorous than those used for searching databases, so it is worth aligning your sequences with them even if you have retrieved a sequence from a database using something like BLAST. The basic idea behind the sequence alignment programs is to align the two sequences in such a way as to produce the highest score: a scoring matrix is used to add points to the score for each match and subtract them for each mismatch. The matrices commonly used for scoring protein alignments are more complex than the simple match/mismatch matrices used for DNA sequences; the scores that form the protein matrices are designed to reflect similarity between the different amino acids rather than simply scoring identities.

Over time various mutations occur in sequences. The scoring matrices attempt to cope with substitutions, but insertions and deletions require some extra parameters to allow the introduction of gaps in the alignment. There are penalties both for the creation of gaps and for the extension of existing ones. The default gap parameters given in alignment programs have been found empirically to work well with test sequences, but you should experiment with different gap penalties.
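To make the scoring scheme concrete, the sketch below scores an alignment that has already been written out, using a simple match/mismatch scheme plus separate penalties for opening and extending a gap. The parameter values and the function name `score_alignment` are assumptions chosen for illustration; real programs take their values from a scoring matrix and from their own defaults.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score two equal-length aligned rows in which '-' marks a gap.

    The first position of a gap costs `gap_open`; every further position
    of the same gap costs `gap_extend`.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap_a = in_gap_b = False  # inside a run of gaps in row a / row b?
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-":
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif b == "-":
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if a == b else mismatch
            in_gap_a = in_gap_b = False
    return score


print(score_alignment("ACGTTT", "AC--TT"))  # -2: one gap of length two
print(score_alignment("ACGTTT", "A-G-TT"))  # -6: two separate gaps
```

With these values a single long gap is cheaper than several short ones, which is the reason for distinguishing gap creation from gap extension.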

BLAST
BLAST (Basic Local Alignment Search Tool) is a heuristic method to find the highest scoring locally optimal alignments between a query sequence and a database. Previous versions of BLAST did not allow gapped alignments, but BLAST2 (from the HGMP-RC telnet and www menus) does. A gapped BLAST search allows gaps (deletions and insertions) to be introduced into the alignments that are returned. Allowing gaps means that similar regions are not broken into several segments. The scoring of these gapped alignments tends to reflect biological relationships more closely.

The BLAST algorithm and family of programs rely on work on the statistics of local sequence alignments by Altschul et al. The statistics allow us to estimate the probability of obtaining an alignment with a particular score. The BLAST algorithm permits nearly all sequence matches above a cutoff score to be located efficiently in a database.
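As a rough illustration of what such statistics look like, the sketch below uses the form usually quoted for local alignment scores, E = K·m·n·exp(-λS), where m and n are the query and database lengths, together with P = 1 - exp(-E). The formula is not given in this text, and the default values of `lam` and `k` are placeholders chosen for the example; in practice they depend on the scoring matrix and gap penalties.

```python
import math


def expected_hits(score, query_len, db_len, lam=0.318, k=0.134):
    """E-value: expected number of chance alignments scoring >= `score`.

    `lam` and `k` are illustrative placeholder values, not universal ones.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of seeing at least one such alignment by chance."""
    return 1.0 - math.exp(-e_value)


for s in (30, 50, 70):
    e = expected_hits(s, query_len=250, db_len=1_000_000)
    print(s, e, p_value(e))
```

The expected number of chance hits falls off exponentially with the score and grows with the size of the search space, which is why the same score is less convincing against a larger database.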

The algorithm operates as follows:

- BLAST scans the database for words (typically 3-mers for proteins) that score at least T (a designated threshold value) when aligned with a word in the query sequence; such aligned pairs are called hits.

- If a second non-overlapping hit is found within a distance A of the first and on the same diagonal, the first hit is extended between the database and query sequences in both directions. Extension continues, scoring all the time, until the running score drops below the maximum score seen so far by a value X. The resulting local alignment is called an HSP (high-scoring segment pair) or MSP (maximum scoring segment pair).

- If the alignment score of the HSP exceeds a given value Sg (the gapped score), then a gapped extension of the HSP is initiated.
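The first two steps can be sketched in a few dozen lines. This is a toy illustration, not BLAST itself: it scans one subject sequence exhaustively instead of using an indexed word list, it scores with a plain match/mismatch scheme rather than a substitution matrix, and it stops before the gapped extension. The function names and the parameters `w`, `t`, `a` and `x` (word length, T, A and X) are assumptions for the example.

```python
from collections import defaultdict


def pair_score(p, q, match=1, mismatch=-1):
    return match if p == q else mismatch


def find_hits(query, subject, w=3, t=3):
    """Word pairs (query offset, subject offset) scoring at least t."""
    hits = []
    for i in range(len(query) - w + 1):
        for j in range(len(subject) - w + 1):
            s = sum(pair_score(query[i + k], subject[j + k]) for k in range(w))
            if s >= t:
                hits.append((i, j))
    return hits


def two_hit_seeds(hits, w=3, a=10):
    """Keep hits followed by a non-overlapping hit on the same diagonal
    no more than `a` positions further along."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)
    seeds = []
    for diag, offsets in by_diagonal.items():
        offsets.sort()
        for first in offsets:
            if any(w <= second - first <= a for second in offsets):
                seeds.append((first, first + diag))
    return seeds


def _extend_one_way(query, subject, qi, sj, step, x):
    """Walk in one direction; return (best gain, residues kept)."""
    best = running = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        running += pair_score(query[qi], subject[sj])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running >= x:  # dropped X below the best score so far
            break
        qi += step
        sj += step
    return best, best_len


def extend(query, subject, i, j, w=3, x=3):
    """Ungapped extension of the word at (i, j) in both directions.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    seed = sum(pair_score(query[i + k], subject[j + k]) for k in range(w))
    right, right_len = _extend_one_way(query, subject, i + w, j + w, 1, x)
    left, left_len = _extend_one_way(query, subject, i - 1, j - 1, -1, x)
    return seed + left + right, i - left_len, i + w + right_len


query, subject = "TTTGATTACATTT", "CCCGATTACACCC"
for i, j in two_hit_seeds(find_hits(query, subject)):
    score, start, end = extend(query, subject, i, j)
    print(score, query[start:end])
```

Both seeds found in this example extend to the same segment pair, the shared GATTACA; a real implementation would avoid extending the same region twice.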

Earlier versions of BLAST looked only for single hits and extended them all; however, the extensions did not incorporate gaps and thus missed some potentially interesting matches. The gapped extension currently used takes much longer to execute, but speed is improved overall by the requirement for two non-overlapping close hits before the initial extension is triggered, and the value of Sg is chosen so that only about one extension is triggered per 50 database sequences.

These modifications to BLAST mean that it now runs three times faster than earlier versions, and in trials it found more statistically significant alignments than the old BLAST.

BLAST FAMILY OF PROGRAMS
The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases. (Most of the time these programs are used through an interface.)

- blastp: compares an amino acid query sequence against a protein sequence database.

- blastn: compares a nucleotide query sequence against a nucleotide sequence database.

- blastx: compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

- tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

- tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

- PSI-Blast: Position-Specific Iterated BLAST. This is potentially a very sensitive method to pull out significant hits in a protein-protein database search. It first performs a gapped BLAST database search and then uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-Blast may be iterated until no new significant alignments are found. We'll look at this tomorrow when we do some protein analysis.
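The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be illustrated as follows. This is a sketch of the idea only, using the standard genetic code in its DNA spelling; the frame numbering (+1 to +3 for the given strand, -1 to -3 for the reverse complement) and the function names are assumptions for the example.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids in the order TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate_frame(dna):
    """Translate every complete codon; '*' is Stop, 'X' an unknown codon."""
    return "".join(
        CODONS.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return {frame: protein} for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper().replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate_frame(dna[offset:])
        frames[-(offset + 1)] = translate_frame(reverse[offset:])
    return frames


for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(frame, protein)
```

Each of the six products can then be compared against protein sequences, which is how a nucleotide query can be searched against a protein database without knowing its reading frame in advance.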

Online tool links:

NCBI - Blast

WU - Blast@ EBI

Global sequence alignment
A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length. The alignment maximises regions of similarity and minimises gaps using the scoring matrices and gap parameters provided to the program.
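A minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch method) is shown below. It is not the algorithm of any specific online tool: it uses a plain match/mismatch score and a single linear gap penalty, all of which are assumed values, where real programs use scoring matrices and separate gap opening and extension penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = global_align("GATTACA", "GATACA")
print(score)
print(top)
print(bottom)
```

Because every residue of both sequences must appear in the result, the shorter sequence is padded with gaps wherever that costs least.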

Online tool link:

Clustalw @ EBI

Local sequence alignment
Global sequence alignment algorithms align sequences over their entire lengths. You do need to think about whether that type of alignment makes sense for your sequences. For our example, where we expect each exon to be represented in the sequences and in the same order, it has worked well. However, how well do you think this approach would work with, for example, multidomain proteins that share one domain but not others, or sequences where there have been regions of duplication? A second comparison method, local alignment, searches for regions of local similarity and need not include the entire length of the sequences.
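Local alignment can be sketched with the Smith-Waterman dynamic programming method, which differs from the global case in two ways: cell scores are never allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scores below are assumed values for illustration, and the two test sequences are invented to mimic proteins that share one domain but nothing else.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one best local alignment.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two made-up sequences sharing only the segment AAGGSTR.
print(local_align("MKVLAAGGSTRWYHE", "PPQQAAGGSTRDDNN"))
```

Only the shared segment is reported; the unrelated flanking regions, which a global alignment would be forced to pair up, are simply left out.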

Online tool links:

Pairwise alignment at EBI

Pairwise alignment at NCBI

Protein Sequence Analysis
You can get a variety of clues by looking for patterns and motifs in your sequence:

- These are often derived from multiple sequence alignments.

- Conserved protein domains or regions can be very useful in trying to determine which protein family a sequence belongs to, catalytic sites, carbohydrate binding sites etc.

- Various research groups have created their own databases and search tools; it might be worth using a variety of these.
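Searching a sequence for a short pattern can be sketched with a regular expression. The pattern used here, N-{P}-[ST]-{P} (asparagine, anything but proline, serine or threonine, anything but proline), is the commonly cited consensus for N-glycosylation sites and is brought in only as an example; it does not come from this text, and the function name `find_motif` and the test sequence are invented.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a regular expression.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (1-based position, matched residues) for every occurrence."""
    return [
        (m.start() + 1, m.group(1))
        for m in pattern.finditer(protein.upper())
    ]


print(find_motif("MKNGSAANPTLNNTSW"))
# [(3, 'NGSA'), (12, 'NNTS'), (13, 'NTSW')]
```

A match to such a short pattern is only a hint: patterns this small occur by chance in many sequences, so hits are best treated as candidates to check against other evidence.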

FIND HOMOLOGOUS (PARALOGOUS AND ORTHOLOGOUS) SEQUENCES

Using a database similarity search can give you a great deal of information:

- Homologues may be well annotated and their function documented in the literature.

- Simply comparing your sequence with homologues can tell you a lot.

- Phylogenetic analysis may reveal evolutionary relationships between proteins and help you decide which family or superfamily a protein belongs to.

- N.B. Be aware of convergent evolution.

HAVING SOME IDEA OF STRUCTURE MAY HELP YOU PREDICT POSSIBLE FUNCTIONS

Knowing the protein fold(s) together with conserved domains (or even residues) may tell you what type of functions this protein could have.
