BioTech FYI Center - Resources

Protein Sequence Databases

There are various ways to obtain certain types of information about a protein. The method(s) and database(s) that you will want to use will depend a great deal on what kind of information you want.

Databases such as Swiss-Prot can be used to obtain the primary structures (amino acid sequences) of proteins. As with nucleic acids, there are two basic methods for searching sequence databases: (1) you can search for a particular keyword or label (e.g., "cytochrome c"), and PIR in particular allows a variety of different types of keyword characteristics to be searched; or (2) search engines can be used to hunt for sequences that are similar to one another. As an analogy, you could search a phonebook for a number (sequence) associated with a particular name (protein), or you could search a phonebook to determine the names of all the people who had a phone number ending in 7675.
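The two search methods can be contrasted in a small Python sketch. Everything here is an assumption made for illustration: the toy database, its entry labels, the short placeholder sequences, and the simple percent-identity score. Real similarity searches use alignment programs rather than a position-by-position comparison.

```python
# Toy "database": labels mapped to short placeholder sequences.
# These strings are illustrative only and are not curated database records.
TOY_DB = {
    "cytochrome c (hypothetical entry A)": "MGDVEKGKKIFVQKCAQCH",
    "cytochrome c (hypothetical entry B)": "MGDVEKGKKIFIMKCSQCH",
    "lysozyme (hypothetical entry C)": "KVFGRCELAAAMKRHGLDN",
}


def keyword_search(db, keyword):
    """Method 1: return the labels that contain the keyword."""
    keyword = keyword.lower()
    return [name for name in db if keyword in name.lower()]


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_search(db, query, threshold=0.8):
    """Method 2: return the labels whose sequence resembles the query."""
    return [
        name
        for name, seq in db.items()
        if len(seq) == len(query) and identity(seq, query) >= threshold
    ]


if __name__ == "__main__":
    print(keyword_search(TOY_DB, "cytochrome c"))
    print(similarity_search(TOY_DB, "MGDVEKGKKIFVQKCAQCH"))
```

The keyword search only ever looks at the labels, while the similarity search ignores the labels and looks only at the sequences, which is the phonebook distinction drawn above.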

Databases such as Blocks or ProDom provide information about sequence and structural patterns in proteins. These databases group proteins that contain similar active sites or substructures, and thus differ from search engines that blindly compare all primary sequences. They are particularly useful because the structure of a protein will determine its function, yet there are no good ways to compare overall structure. As an analogy, if you wanted to know who was related to whom in a town, it might be easier to look at last names in a phonebook rather than pictures in a yearbook.

Databases such as NRL-3D and Entrez contain information about the overall three-dimensional structures of proteins. As an analogy, if you looked up a name in the phonebook, this would be the yearbook that would show his or her picture.


To find out virtually anything you want to know about an enzyme, use the EC Enzyme Database. This database can be searched using the name of the enzyme. A series of different enzymes (with associated EC numbers) will come up. When you click on a particular type of enzyme (EC number), you will be taken to a page that contains links to other types of information, including the reaction catalyzed, associated metabolic diseases (OMIM), and what is known about enzymes from particular organisms. When you in turn click on an enzyme from a particular organism, you will be led to a page that contains links to protein sequence, pattern, and structure for that enzyme.
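Because the search results are organized by EC number, it can help to see how such a number breaks down. The sketch below is only an illustration, not part of the EC Enzyme Database itself: the function name `parse_ec_number` is hypothetical, and the class names follow the standard top-level Enzyme Commission classes.

```python
# Top-level Enzyme Commission classes, keyed by the first digit.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec_number(text):
    """Split an EC number such as 'EC 3.4.21.4' into its four levels.

    Hypothetical helper for illustration; levels are kept as strings
    because incomplete numbers may use '-' as a placeholder.
    """
    cleaned = text.strip()
    if cleaned.upper().startswith("EC"):
        cleaned = cleaned[2:].strip()
    levels = cleaned.split(".")
    if len(levels) != 4:
        raise ValueError(f"expected four dot-separated levels, got: {text!r}")
    top = int(levels[0])
    if top not in EC_CLASSES:
        raise ValueError(f"unknown top-level EC class: {top}")
    return {"class": EC_CLASSES[top], "levels": levels}


if __name__ == "__main__":
    print(parse_ec_number("EC 3.4.21.4"))
```

Each further level narrows the classification, so two enzymes that share the first three levels catalyze closely related reactions.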

CATH -- The CATH database, maintained at University College London, provides a hierarchical domain classification of protein structures in the Brookhaven Protein Data Bank. The site's glossary - like BioTech, a "work in progress" - may prove helpful for those new to the language of protein classification. On the other hand, if you are new to protein classification, CATH may be too arcane for you.

DBGET -- DBGET is a science links database that summarizes the major databases for nucleic acids, proteins, ligands, medicine, etc. It could prove useful for those trying to cross-reference information.

Genobase -- Incorporating the information on EMBL, GenBank, Swiss-Prot and others, Genobase is a comprehensive molecular biology database covering nucleic acid, proteins, structure, etc.

Swiss-Prot -- Ideal for initial searches for protein information, Swiss-Prot generates search returns that are straightforward and very informative. The database is organized by EMBL accession numbers but is searchable by description, identification, author, date, and more.

PIR -- The Protein Information Resource - A collection of other databases, PIR compiles protein information based on what is known about each protein. As such, this could be a very useful tool for anyone seeking data on obscure proteins. Conversely, it could overwhelm those looking for information on well-studied proteins with too much information.

OWL-Web - Rather simple, this database compiles information from some of the bigger databases such as Swiss-Prot, GenBank, and others. Unlike the more comprehensive databases, though, OWL tends to present more concise versions of the information. This makes OWL ideal for students who do not want massive amounts of data returned or for anyone who wants just the basic facts about a protein.

GenQuest - This server allows for quick comparisons between unknown sequences and those found within the Genome Sequence Database, Swiss-Prot, Prosite and PDB.

ExPASy Molecular Biology Server - Aside from the sequence identification provided by most databases, the ExPASy site provides a number of tools for protein analysis including peptide mass calculations, amino acid matching between sequences, nucleic acid sequence translation to protein, and much more. There are also tools to aid in structure projections and visualization.
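Two of the analyses mentioned above, translating a nucleic acid sequence to protein and calculating a peptide mass, can be sketched in a few lines of Python. This is an illustration of the calculations only, not the ExPASy tools or their interface. It assumes the standard genetic code, a single reading frame starting at the first base, and monoisotopic residue masses; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056


def translate(dna):
    """Translate DNA in frame 1, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]  # KeyError on non-ACGT input
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide: residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide.upper()) + WATER_MASS


if __name__ == "__main__":
    peptide = translate("ATGGCCAAATGA")
    print(peptide, round(peptide_mass(peptide), 4))
```

Average masses, modified residues, and the other five reading frames are deliberately left out to keep the sketch short.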

PRODOM - Based on the homologous domains from Swiss-Prot, this database provides information on the domain arrangement of proteins and consensus sequences.

Blocks - This is a database that searches for sequence homology based on blocks, "multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins." These blocks are determined by cross-referencing several databases for highly conserved sequence regions.
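To make the idea of "multiply aligned ungapped segments" concrete, the Python sketch below scans a small multiple alignment for runs of columns that contain no gaps and are well conserved. It is a simplified illustration and not the procedure used to build the Blocks database; the alignment rows, the function name, and the thresholds are all assumptions made for the example.

```python
def conserved_blocks(alignment, min_len=3, min_identity=1.0):
    """Find ungapped, conserved column runs in a multiple alignment.

    alignment: equal-length strings using '-' for gaps.
    Returns (start, end) column ranges, end exclusive.
    A column qualifies if it has no gap and its most common residue
    makes up at least min_identity of the rows.
    """
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and equal length")
    ncols = len(alignment[0])
    blocks, start = [], None
    # Iterate one step past the last column to close a trailing run.
    for col in range(ncols + 1):
        ok = False
        if col < ncols:
            column = [row[col] for row in alignment]
            if "-" not in column:
                top = max(column.count(ch) for ch in set(column))
                ok = top / len(column) >= min_identity
        if ok and start is None:
            start = col
        elif not ok and start is not None:
            if col - start >= min_len:
                blocks.append((start, col))
            start = None
    return blocks


if __name__ == "__main__":
    # Made-up alignment rows, for illustration only.
    rows = [
        "MKT-AGHWCELLQ",
        "MRTPAGHWCE-LQ",
        "MKTSAGHWCDLLQ",
    ]
    for start, end in conserved_blocks(rows):
        print(start, end, rows[0][start:end])
```

Lowering `min_identity` admits columns with some variation, which lengthens the reported segments at the cost of strict conservation.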

Kabat Database of Proteins of Immunological Interest - Because this is a gopher-based database, searchers should have some idea of what they are looking for.

SBASE - A protein domain sequence database, SBASE is cross-referenced with most of the other major sequence databases. This database, therefore, can provide the same information as others, but the interface is not as user-friendly.

Protein Structure Resources

Chemical MIME Connection - All the molecules and chemical structure images on the Web can be quite confusing without a little understanding of some of the image formats and viewing software. Marilynn Dunker's site briefly explains the different types of computer images one might find and provides links to other sites where software and images can be downloaded. This site could be particularly valuable for beginners.

SCOP - Structural Classification of Proteins - As its name implies, SCOP classifies all proteins for which science has structural information in order to examine the relationships between proteins. The result is a database with a wealth of structural information on folding patterns, sequence, phylogeny and more.

NRL-3D - Not only does NRL-3D provide protein sequence and structural information, it also serves as a link between the Protein Data Bank and certain structural manipulation software which cannot interpret the information from PDB. Although the database is very informative, searching can be a little tricky, and reading the instructions is highly recommended.

Image Library of Biological Macromolecules - This gopher-based site contains hundreds of images of molecules and complexes along with the reference information. Images are well categorized, although it helps to arrive with a specific goal in mind.

Protein Motions Database - This database provides information on the movements of proteins.

Amino Acids

Amino Acid Properties - This site contains a grotesquely large amount of information about amino acids including structure, pKa, geometry, solubility, images, etc.

Enzyme Databases

EC Enzyme - This database was designed to provide information on enzymes as they were discovered and characterized. Recently, though, the server has been finicky in responding and may not provide much, if any, information at all.

REBASE: The Restriction Enzyme Database - REBASE is a comprehensive database with everything one would want to know about a restriction enzyme. User-friendly, it provides references and other resources in addition to the expected sequence, function and structural information.

Electrophoresis Databases

SWISS-2D: Two-Dimensional Polyacrylamide Gel Electrophoresis Database - Swiss-2D references proteins to 2D PAGE maps. This provides a source for comparisons of proteins relative to others by size, shape, etc.

Quest - This center focuses on the design and analysis of protein databases. As such, this site provides links to other similar resources. Protein information at this site is derived from 2D PAGE gels.
