BioTech FYI Center - Resources

Nucleic Acid Sequence Databases

Introduction to Nucleic Acid Sequence Databases

There are three major sites for finding information about nucleic acids (DNA and/or RNA sequences) on the Web, and all of them contain basically the same information. The methods and databases that you will want to use will depend mainly on how much data you want and in what form.

GenBank is your best bet for most sequence searches; it is updated daily, has detailed online help, and lets you search by keyword, such as the name of an organism or enzyme, to get sequence information. This service can be very slow during peak hours, however.

EMBL (the nucleotide sequence database of the European Molecular Biology Laboratory) is a flat-file database that isn't quite as easy to use as GenBank and is usually slow for people in North America, since it's based in Europe. It can still be useful if you're looking for a limited amount of data and are not trying to identify a gene by sequence analysis.

DDBJ (the DNA Data Bank of Japan) is hard for beginners to use, but it is the best choice for people who prefer a Japanese-language interface.

Within GenBank and similar databases, use BLAST (Basic Local Alignment Search Tool) if you want to find sequences that are similar to a sequence you already have. If you want to locate Expressed Sequence Tags ("single-pass" cDNA sequences), use NCBI's dbEST; if you want to locate Sequence Tagged Sites, use dbSTS.
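To give a feel for what "local alignment" means in a similarity search, here is a minimal Python sketch. It is not BLAST: BLAST is a fast heuristic, while this sketch uses the slower, exact Smith-Waterman dynamic-programming method. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not settings used by any database.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Rank a few made-up "database" sequences against a query.
query = "TTACGTTT"
database = {"seqA": "GGACGTGG", "seqB": "CCCCCCCC"}
ranked = sorted(database, key=lambda name: local_alignment_score(query, database[name]), reverse=True)
print(ranked)
```

A higher score means a longer or cleaner shared region; a real search tool also reports where the match lies and how statistically significant it is.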

Another option is Entrez, which lets you do keyword searches across the databases of the National Center for Biotechnology Information (NCBI). It retrieves molecular biology citations and records, as well as nucleotide sequences from GenBank in both text and graphical format.
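For readers who would rather script a keyword search than use the web forms, the sketch below builds a query URL in Python. It assumes NCBI's E-utilities `esearch` endpoint and the `nucleotide` database name, neither of which is mentioned in the text above; the helper name and the example search term are hypothetical. No network request is made, so the URL can be inspected before it is used.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities "esearch" endpoint (not named in the guide itself).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_query(term, db="nucleotide", retmax=20):
    """Return a keyword-search URL; this function does not contact the server."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical example term: an organism name combined with a gene name.
url = build_entrez_query("Escherichia coli[Organism] AND lacZ[Gene]")
print(url)
```

Check NCBI's current usage guidelines before sending automated requests to the resulting URL.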

About online nucleic acid databases

GenBank


Summary: For most sequence searches, GenBank is your best bet. It exchanges information daily with the other major sequence databases and offers a variety of user interfaces, fairly detailed online help (with e-mail addresses to contact if the available help is not sufficient), and a normally speedy interface. Because of its popularity, however, GenBank can be very slow during peak research hours. Very detailed searches, or searches with massive amounts of output, might be completed more quickly after hours.

Maintained by the National Center for Biotechnology Information (NCBI), GenBank is a collection of publicly available DNA sequences submitted by scientists around the world. As of July 1, 1996, approximately 286,000,000 bases in 352,400 sequences were stored in GenBank, and many more are added each day.
Searching GenBank is fairly straightforward and can be done with a variety of search tools. If you are using a forms-capable WWW browser (such as Netscape 1.0 or higher) and if you have never used GenBank before, you will probably want to start your search with a general query. Other means of searching GenBank include:

  • BLAST (Basic Local Alignment Search Tool) Searches
  • dbEST (Database of Expressed Sequence Tags)
  • dbSTS (Database of Sequence Tagged Sites)

Submitting sequences to GenBank is also very easy. Most journals require submission before they will publish an article about a sequence, so that the journal's readers have easy access to the data. You can submit sequences via the WWW with BankIt.

EMBL


Summary: EMBL is good to use when you need a limited amount of data and when you are not trying to identify a gene by sequence analysis. However, because EMBL and all of its mirror sites are located in Europe, your connection will be slow more often than not. All of the information submitted to EMBL is mirrored daily in both GenBank and DDBJ, so searching elsewhere might provide the same amount of information in less time.

EMBL is the nucleotide sequence database of the European Molecular Biology Laboratory. It is a flat-file database that can be searched with a variety of search engines. EMBL sequences are stored in a form corresponding to the biological state of the information in vivo. Thus, cDNA sequences are stored in the database as RNA sequences, even though they usually appear in the literature as DNA.
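The difference between the two notations is small. As a minimal Python sketch, assuming the published cDNA is written as the sense strand in the DNA alphabet, rewriting it in the RNA alphabet only requires replacing T with U. The function name is hypothetical and this is not a tool provided by EMBL.

```python
def cdna_to_rna(seq):
    """Rewrite a DNA-alphabet sequence in the RNA alphabet (T -> U), keeping case."""
    return seq.translate(str.maketrans("Tt", "Uu"))


print(cdna_to_rna("ATGGCTTAA"))  # AUGGCUUAA
```

No reverse complement is taken here; if your sequence is the antisense strand, it must be reverse-complemented first.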

DBGET is a science links database that summarizes the major databases for nucleic acids, proteins, ligands, medicine, etc. It could prove useful for those trying to cross-reference information.

dbEST is a subdivision of GenBank specifically for queries on expressed sequence tags ("single-pass" cDNA sequences).

DDBJ


Summary: DDBJ exchanges its data daily with GenBank and EMBL, so beginning sequence searchers might want to try one of those databases, which have friendlier search interfaces. However, DDBJ also offers all of its pages in Japanese, so it can be very useful if you are more comfortable reading Japanese.

DDBJ, the DNA Data Bank of Japan, was established in 1986 as one of the major international DNA databases (along with GenBank and EMBL). It is certified to collect information from researchers and to assign accession numbers to submitted entries.

Searching DDBJ is somewhat awkward, as the only way to access most of the data is by its accession number via anonymous FTP.

GSDB: Genome Sequence Databases

Ribosomal Database Project
The RDP is a Gopher collection of ribosomal sequence data.

Nucleic Acid Structure Resources
  • RNA Secondary Structures - This site provides information on secondary structures of rRNAs and group I introns. This site also contains some great links to other ribosomal and structural sites.

  • Image Library of Biological Macromolecules - This Gopher-based site contains hundreds of images of molecules and complexes, along with reference information. The images are well categorized, although it helps to enter the site with a specific goal in mind.

Codon Usage Tables
  • Indiana University's Gopher-Based Codon List - Indiana University's codon tables summarize amino acid information for over 50 organisms.

  • Codon Lists at Harvard - More comprehensive than Indiana University's codon lists, Harvard's codon lists include more organisms and more information. For the novice, though, they can be quite intimidating.
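
For readers who have not seen a codon usage table before, the following minimal Python sketch shows how one could be tallied from a single coding sequence. It assumes the sequence is already in frame, starting at its first base; the function name and the example sequence are made up for illustration and are not taken from either site above.

```python
from collections import Counter


def codon_usage(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


# Made-up example: ATG GCT GCT GCA TAA
counts = codon_usage("ATGGCTGCTGCATAA")
total = sum(counts.values())
for codon, n in sorted(counts.items()):
    print(codon, n, round(n / total, 2))
```

Published tables are compiled from many genes per organism, so counts from a single short sequence like this one are not representative.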
