BioTech FYI Center

Biological databases

Biological Database/Molecular biology database/bioinformatics database:

Summary: Overview of biological databases, their uses, the major biological databases, and the principal requirements of a biological database

A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence.

For researchers to benefit from the data stored in a database, two additional requirements must be met:
1. Easy access to the information; and
2. A method for extracting only the information needed to answer a specific biological question.
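
As a minimal sketch of these ideas, the Python below models a simple database as a list of records that all carry the same set of fields, plus a query helper that extracts only the records matching a question. The class name, field names and toy entries are invented for illustration and do not correspond to any real database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SequenceRecord:
    """One record of a toy sequence database (field names are illustrative)."""
    contact_name: str
    sequence: str
    molecule_type: str
    organism: str
    citations: List[str] = field(default_factory=list)


def select(records, **criteria):
    """Return only the records whose fields equal every given criterion."""
    return [
        record for record in records
        if all(getattr(record, name) == value for name, value in criteria.items())
    ]


# Invented example data: every record has the same set of fields.
DATABASE = [
    SequenceRecord("A. Researcher", "ATGGCCAAGT", "DNA", "Homo sapiens"),
    SequenceRecord("B. Curator", "AUGGCCAAGU", "mRNA", "Mus musculus"),
    SequenceRecord("C. Student", "ATGTTTGGCA", "DNA", "Mus musculus"),
]

# "Which DNA sequences do we hold for the mouse?"
mouse_dna = select(DATABASE, organism="Mus musculus", molecule_type="DNA")
print([record.contact_name for record in mouse_dna])
```

Real database systems add indexing, persistence and update handling on top of this, but the two requirements are the same: the records must be reachable, and a query must return only what the question needs.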

Currently, a lot of bioinformatics work is concerned with the technology of databases. These databases include both "public" repositories of gene data like GenBank or the Protein DataBank (the PDB), and private databases like those used by research groups involved in gene mapping projects or those held by biotech companies. Making such databases accessible via open standards like the Web is very important since consumers of bioinformatics data use a range of computer platforms: from the more powerful and forbidding UNIX boxes favoured by the developers and curators to the far friendlier Macs often found populating the labs of computer-wary biologists. RNA and DNA are the nucleic acids that store the hereditary information about an organism. These macromolecules have a fixed structure, which can be analyzed by biologists with the help of bioinformatic tools and databases.
A few popular databases are GenBank from NCBI (National Center for Biotechnology Information), SwissProt from the Swiss Institute of Bioinformatics and PIR from the Protein Information Resource.

GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of known genetic sequences. Its records are stored in a flat-file format: plain ASCII text that is readable by both humans and computers. In addition to sequence data, GenBank files contain information like accession numbers and gene names, phylogenetic classification and references to published literature. As of June 1994 it held approximately 191,400,000 bases in 183,000 sequences.
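
To illustrate why a flat text file is readable by both humans and computers, here is a hedged Python sketch that parses a heavily simplified, GenBank-style record. The sample record, its accession number and the function name are invented; the parser only handles single-line keyword fields plus the sequence block, and ignores the continuation lines and feature tables found in real entries.

```python
# Invented, simplified record in the general style of a GenBank flat file.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA
DEFINITION  Hypothetical example gene, partial sequence.
ACCESSION   X00000
SOURCE      Homo sapiens
ORIGIN
        1 atggccaagt ttggcacgta cgta
//
"""


def parse_flat_record(text):
    """Split one record into keyword fields and a single sequence string."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("ORIGIN"):      # sequence block starts here
            in_sequence = True
            continue
        if in_sequence:
            # Keep the blocks of bases, drop the leading position number.
            sequence_parts.extend(tok for tok in line.split() if tok.isalpha())
        elif line and not line[0].isspace():
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["sequence"] = "".join(sequence_parts).upper()
    return fields


record = parse_flat_record(SAMPLE_RECORD)
print(record["ACCESSION"], record["SOURCE"], len(record["sequence"]))
```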

The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and directly submitted by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). The database doubles in size every 18 months and, as of June 1994, contains nearly 200 million bases from 182,615 sequence entries.

SWISS-PROT is a protein sequence database that provides a high level of integration with other databases and has a very low level of redundancy (meaning that few identical sequences are present in the database).

The PROSITE dictionary of sites and patterns in proteins is prepared by Amos Bairoch at the University of Geneva.

The 'ENZYME' data bank contains the following data for each type of characterized enzyme for which an EC number has been provided: EC number, recommended name, alternative names, catalytic activity, cofactors, pointers to the SWISS-PROT entry or entries that correspond to the enzyme, and pointers to any diseases associated with a deficiency of the enzyme.

The RCSB PDB contains 3-D biological macromolecular structure data from X-ray crystallography, NMR, and Cryo-EM. It is operated by Rutgers, The State University of New Jersey and the San Diego Supercomputer Center at the University of California, San Diego.

The GDB Human Genome Data Base supports biomedical research, clinical medicine, and professional and scientific education by providing for the storage and dissemination of data about genes and other DNA markers, map location, genetic disease and locus information, and bibliographic information.

The Mendelian Inheritance in Man data bank (MIM) is prepared by Victor McKusick with the assistance of Clair A. Francomano and Stylianos E. Antonarakis at Johns Hopkins University.

PIR (Protein Information Resource) produces and distributes the PIR-International Protein Sequence Database (PSD). It is the most comprehensive and expertly annotated protein sequence database. The PIR serves the scientific community through on-line access, distributing magnetic tapes, and performing off-line sequence identification services for researchers. Release 40.00 (March 31, 1994): 67,423 entries, 19,747,297 residues.

Protein sequence databases are classified as primary, secondary and composite depending upon the content stored in them. PIR and SwissProt are primary databases that contain protein sequences as 'raw' data. Secondary databases (like Prosite) contain information derived from protein sequences. Primary databases are combined and filtered to form non-redundant composite databases.
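
The combine-and-filter step can be sketched as follows. This is only an illustration under a simplifying assumption: two entries count as redundant when their sequences are exactly identical. The function name, entry identifiers and sequences are invented, and real composite databases apply their own, more elaborate merging rules.

```python
def build_composite(*primary_databases):
    """Merge primary databases, keeping the first entry seen for each sequence.

    Each database is a list of (entry_id, sequence) pairs. Entries whose
    sequence is identical to one already kept are treated as redundant.
    """
    kept = {}  # sequence -> entry_id of the first entry with that sequence
    for database in primary_databases:
        for entry_id, sequence in database:
            key = sequence.upper()
            if key not in kept:
                kept[key] = entry_id
    return [(entry_id, sequence) for sequence, entry_id in kept.items()]


# Invented toy data standing in for two primary protein databases.
primary_a = [("A1", "MKTAYIAK"), ("A2", "MSLLTEVE")]
primary_b = [("B1", "MKTAYIAK"), ("B2", "MGSSHHHH")]

composite = build_composite(primary_a, primary_b)
print(composite)
```

Entry B1 is dropped because its sequence duplicates A1, which is the sense in which the composite database is non-redundant.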

Genethon Genome Databases:
PHYSICAL MAP: computation of the human genetic map using DNA fragments in the form of YAC contigs.
GENETIC MAP: production of micro-satellite probes and the localization of chromosomes, to create a genetic map to aid in the study of hereditary diseases.
GENEXPRESS (cDNA): catalogue the transcripts required for protein synthesis obtained from specific tissues, for example neuromuscular tissues.

21Bdb: LBL's Human Chr 21 database:
This is a W3 interface to LBL's ACeDB-style database for Chromosome 21, 21Bdb, using the ACeDB gateway software developed and provided by Guy Decoux at INRA.

MGD: The Mouse Genome Databases:
MGD is a comprehensive database of genetic information on the laboratory mouse. This initial release contains the following kinds of information: Loci (over 15,000 current and withdrawn symbols), Homologies (1300 mouse loci, 3500 loci from 40 mammalian species), Probes and Clones (about 10,000), PCR primers (currently 500 primer pairs), Bibliography (over 18,000 references), Experimental data (from 2400 published articles).

ACeDB (A Caenorhabditis elegans Database) :
ACeDB contains data from the Caenorhabditis Genetics Center (funded by the NIH National Center for Research Resources), the C. elegans genome project (funded by the MRC and NIH), and the worm community. Contacts: Mary O'Callaghan and Richard Durbin.

ACeDB is also the name of the generic genome database software in use by an increasing number of genome projects. The software, as well as the C. elegans data, can be obtained via ftp.

ACeDB databases are available for the following species: C. elegans, Human Chromosome 21, Human Chromosome X, Drosophila melanogaster, mycobacteria, Arabidopsis, soybeans, rice, maize, grains, forest trees, Solanaceae, Aspergillus nidulans, Bos taurus, Gossypium hirsutum, Neurospora crassa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Sorghum bicolor.

MEDLINE is NLM's premier bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, and the preclinical sciences. Journal articles are indexed for MEDLINE, and their citations are searchable, using NLM's controlled vocabulary, MeSH (Medical Subject Headings). MEDLINE contains all citations published in Index Medicus, and corresponds in part to the International Nursing Index and the Index to Dental Literature. Citations include the English abstract when published with the article (approximately 70% of the current file).

The Database Industry:
Because of the high rate of data production and the need for researchers to have rapid access to new data, public databases have become the major medium through which genome sequence data are published. Public databases and the data services that support them are important resources in bioinformatics, and will soon be essential sources of information for all the molecular biosciences. However, successful public data services suffer from continually escalating demands from the biological community. Waterman describes the current situation in the following way: It is probably important to realize from the very beginning that the databases will never completely satisfy a very large percentage of the user community. The range of interest within biology itself suggests the difficulty of constructing a database that will satisfy all the potential demands on it. There is virtually no end to the depth and breadth of desirable information of interest and use to the biological community.

EMBL and GenBank are the two major nucleotide databases. EMBL is the European version and GenBank is the American. EMBL and GenBank collaborate and synchronize their databases so that the databases will contain the same information. The rate of growth of DNA databases has been following an exponential trend, with a doubling time now estimated to be 9-12 months. In January 1998, EMBL contained more than a million entries, representing more than 15,500 species, although most data is from model organisms such as Saccharomyces cerevisiae, Homo sapiens, Caenorhabditis elegans, Mus musculus and Arabidopsis thaliana. These databases are updated on a daily basis, but you may still find that a sequence referred to in the latest issue of a journal is not accessible. This is most often because the release date of the entry did not coincide with the publication date, or because the authors forgot to tell the databases that the sequences had been published. If you find such a case, please report it to EMBL and/or GenBank.
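
To make the exponential trend concrete: with a doubling time T, a database of size N grows to N * 2^(t/T) after a time t. The short Python sketch below applies that formula. The function name is invented, and the starting size of one million entries is used purely as a round illustrative figure, not as a forecast.

```python
def projected_size(current_size, doubling_time_months, months_ahead):
    """Size after `months_ahead` months of steady exponential growth."""
    return current_size * 2 ** (months_ahead / doubling_time_months)


# Compare the two ends of the 9-12 month doubling-time estimate over 3 years.
start = 1_000_000
for doubling_time in (9, 12):
    size = projected_size(start, doubling_time, 36)
    print(f"doubling every {doubling_time} months: about {size:,.0f} entries")
```

Over 36 months the difference between the two estimates is a factor of two (16-fold growth against 8-fold), which shows how sensitive capacity planning is to the doubling time.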

The principal requirements on the public data services are:

Data quality - data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
Supporting data - database users will need to examine the primary experimental data, either in the database itself, or by following cross-references back to network-accessible laboratory databases.
Deep annotation - deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.
Timeliness - the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.
Integration - each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.

The Creation of Sequence Databases:
Most biological databases consist of long strings of nucleotides (guanine, adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine, glycine, etc.). Each sequence of nucleotides or amino acids represents a particular gene or protein (or section thereof), respectively. Sequences are represented in shorthand, using single letter designations. This decreases the space necessary to store information and increases processing speed for analysis.
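
The single-letter shorthand can be shown with a small Python sketch that converts standard three-letter amino acid abbreviations into the one-letter codes. The table holds the 20 standard amino acids; the function name is invented for this example.

```python
# Standard one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter residue names to a one-letter string."""
    return "".join(THREE_TO_ONE[name.capitalize()] for name in residues)


peptide = ["Thr", "Ser", "Gly", "Met", "Lys"]
print(to_one_letter(peptide))
```

Five residues that take fifteen characters in three-letter form shrink to five characters, which is the storage and processing saving described above.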
While most biological databases contain nucleotide and protein sequence information, there are also databases which include taxonomic information such as the structural and biochemical characteristics of organisms. The power and ease of using sequence information has, however, made it the method of choice in modern analysis.
In the last three decades, contributions from the fields of biology and chemistry have facilitated an increase in the speed of sequencing genes and proteins. The advent of cloning technology allowed foreign DNA sequences to be easily introduced into bacteria.
In this way, rapid mass production of particular DNA sequences, a necessary prelude to sequence determination, became possible.
Oligonucleotide synthesis provided researchers with the ability to construct short fragments of DNA with sequences of their own choosing. These oligonucleotides could then be used in probing vast libraries of DNA to extract genes containing that sequence. Alternatively, these DNA fragments could also be used in polymerase chain reactions to amplify existing DNA sequences or to modify these sequences. With these techniques in place, progress in biological research increased exponentially.

For researchers to benefit from all this information, however, two additional things were required:
1) ready access to the collected pool of sequence information and
2) a way to extract from this pool only those sequences of interest to a given researcher.

Simply collecting, by hand, all necessary sequence information of interest to a given project from published journal articles quickly became a formidable task. After collection, the organization and analysis of this data still remained. It could take weeks to months for a researcher to search sequences by hand in order to find related genes or proteins.
Computer technology has provided the obvious solution to this problem. Not only can computers be used to store and organize sequence information into databases, but they can also be used to analyze sequence data rapidly. The evolution of computing power and storage capacity has, so far, been able to outpace the increase in sequence information being created. Theoretical scientists have derived new and sophisticated algorithms which allow sequences to be readily compared using probability theory. These comparisons become the basis for determining gene function, developing phylogenetic relationships and simulating protein models. The physical linking of a vast array of computers in the 1970s provided a few biologists with ready access to the expanding pool of sequence information. This web of connections, now known as the Internet, has evolved and expanded so that nearly everyone has access to this information and the tools necessary to analyze it. Databases of existing sequencing data can be used to identify homologues of new molecules that have been amplified and sequenced in the lab. The property of sharing a common ancestor, homology, can be a very powerful indicator in bioinformatics.

Acquisition of sequence data:
Bioinformatics tools can be used to obtain sequences of genes or proteins of interest, either from material obtained, labelled, prepared and examined in electric fields by individual researchers/groups or from repositories of sequences from previously investigated material.

Analysis of data:
Both types of sequence can then be analysed in many ways with bioinformatics tools. They can be assembled. Note that this is one of the occasions when the meaning of a biological term differs markedly from a computational one. Computer scientists, banish from your mind any thought of assembly language. Sequencing can only be performed for relatively short stretches of a biomolecule and finished sequences are therefore prepared by arranging overlapping "reads" of monomers (single beads on a molecular chain) into a single continuous passage of "code". This is the bioinformatic sense of assembly.
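
A toy version of assembly in this sense is sketched below in Python. It assumes the reads are error-free, come from the same strand, and are already listed in left-to-right order; real assemblers have to cope without any of those assumptions. The function names and reads are invented for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble_in_order(reads, min_overlap=3):
    """Join ordered, overlapping reads into one continuous sequence."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap_length(contig, read, min_overlap)
        if size == 0:
            raise ValueError(f"read {read!r} does not overlap the sequence so far")
        contig += read[size:]  # append only the part that is new
    return contig


# Three invented reads, each overlapping the next by four bases.
reads = ["ATGGCCAAG", "CAAGTTTGG", "TTGGCACGT"]
print(assemble_in_order(reads))
```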

They can be mapped (that is, their sequences can be parsed to find sites where so-called "restriction enzymes" will cut them). They can be compared, usually by aligning corresponding segments and looking for matching and mismatching letters in their sequences. Genes or proteins which are sufficiently similar are likely to be related and are therefore said to be "homologous" to each other, although the whole truth is rather more complicated than this. Such cousins are called "homologues". If a homologue (a related molecule) exists then a newly discovered protein may be modelled; that is, the three-dimensional structure of the gene product can be predicted without doing laboratory experiments.
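
Both operations can be sketched in a few lines of Python. The example searches for GAATTC, the recognition sequence of the restriction enzyme EcoRI, and then counts matching and mismatching letters between two sequences that are assumed to be already aligned and of equal length (producing the alignment itself is a separate, harder problem). The function names and test sequences are invented.

```python
def find_sites(sequence, recognition_site):
    """Return the 0-based start position of every occurrence of the site."""
    positions = []
    start = sequence.find(recognition_site)
    while start != -1:
        positions.append(start)
        start = sequence.find(recognition_site, start + 1)
    return positions


def count_matches(seq_a, seq_b):
    """Count (matches, mismatches) between two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches, len(seq_a) - matches


ECORI_SITE = "GAATTC"  # recognition sequence of EcoRI
print(find_sites("AAGAATTCGGGAATTCTT", ECORI_SITE))
print(count_matches("ATGGCC", "ATGACC"))
```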

Bioinformatics is used in primer design. Primers are short sequences needed to make many copies of (amplify) a piece of DNA as used in PCR (the Polymerase Chain Reaction). Bioinformatics is used to attempt to predict the function of actual gene products. Information about the similarity, and, by implication, the relatedness of proteins is used to trace the "family trees" of different molecules through evolutionary time. There are various other applications of computer analysis to sequence data, but, with so much raw data being generated by the Human Genome Project and other initiatives in biology, computers are presently essential for many biologists just to manage their day-to-day results.
Molecular modelling / structural biology is a growing field which can be considered part of bioinformatics. There are, for example, tools which allow you (often via the Net) to make pretty good predictions of the secondary structure of proteins arising from a given amino acid sequence, often based on known "solved" structures and other sequenced molecules acquired by structural biologists. Structural biologists use "bioinformatics" to handle the vast and complex data from X-ray crystallography, nuclear magnetic resonance (NMR) and electron microscopy investigations and create the 3-D models of molecules that seem to be everywhere in the media.
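
As a small illustration of the primer design task mentioned above, the Python sketch below computes two quantities commonly checked for a candidate primer: its reverse complement and a rough melting temperature from the Wallace rule, Tm = 2(A+T) + 4(G+C) degrees Celsius, a rule of thumb intended for short oligonucleotides. The function names and the primer sequence are invented, and real primer design software considers many more factors.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def wallace_tm(primer):
    """Rough melting temperature in degrees Celsius by the Wallace rule."""
    primer = primer.upper()
    at_count = primer.count("A") + primer.count("T")
    gc_count = primer.count("G") + primer.count("C")
    return 2 * at_count + 4 * gc_count


candidate = "ATGCATGCATGCATGCAT"  # invented 18-base primer
print(reverse_complement(candidate), wallace_tm(candidate))
```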

Note: Unfortunately the word "map" is used in several different ways in biology/genetics/bioinformatics. The definition given above is the one most frequently used in this context, but a gene can be said to be "mapped" when its parent chromosome has been identified, when its physical or genetic distance from other genes is established and, less frequently, when the structure and locations of its various coding components (its "exons") are established.
