BioTech FYI Center - Resources

BioPerl FAQ (Frequently Asked Questions)

Part:   1  2  3  4  5  6  7 

(Continued from previous part...)

What was wrong with Bio::Tools::Blast?

Bio::Tools::Blast* is no longer supported, as of BioPerl version 1.1. Nothing is really wrong with it, it has just been outgrown by a more generic approach to reports. This generic approach allows us to just write pluggable modules for FASTA and BLAST parsing while using the same framework. This is completely analogous to the Bio::SeqIO system of parsing sequence files. However, the objects produced are of the Bio::SearchIO rather than Bio::Seq variety.

I want to parse FASTA or NCBI -m7 (XML) format, how do I do this?

It is as simple as parsing text BLAST results - you simply need to specify the format as fasta or blastxml and the parser will load the appropriate module for you. You can use the exact logic and code for all of these formats as we have generalized the modules for sequence database searching. The page describing Bio::SearchIO provides a table showing how the formats match up to particular modules. Note that, for parsing BLAST XML output, you will need XML::SAX and that XML::SAX::ExpatXS is recommended to speed up parsing.

How can I generate a pairwise alignment of two sequences?

Look at Bio::Factory::EMBOSS to see how to use the water and needle alignment programs that are part of the EMBOSS suite. Bio::Factory::EMBOSS is part of the bioperl-run package.

Or you can use the pSW module for DNA alignments or the dpAlign module for protein alignments. These are part of the bioperl-ext package; download it via Getting BioPerl.

You can also use prss34 (part of FASTA package) to assess the significance of a pairwise alignment by shuffling the sequences.

How do I get the frame for a translated search?

I'm using Bio::Search* and its frame() to parse BLAST but I'm seeing 0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am I seeing these different numbers and how do I get the frame according to BLAST?

These are GFF frames - so +1 is 0 in GFF, -3 will be encoded with a frame of 2 with the strand being set to -1.

Frames are relative to the hit or query sequence so you need to query it based on sequence you are interested in:

$hsp->hit->strand;
$hsp->hit->frame;

or

$hsp->query->strand;
$hsp->query->frame;

So the value according to a blast report of -3 can be constructed as:

my $blastframe = ($hsp->query->frame + 1) * $hsp->query->strand;
Can I get domain number from hmmpfam or hmmsearch output?

For example:

 SH2_5: domain 2 of 2, from 349 to 432: score 104.4, E = 1.9e-26

Not directly but you can compute it since the domains are numbered by their order on the protein:

my @domains = $hit->domains;
my $domainnum = 1;
my $total = scalar @domains;
foreach my $domain ( sort { $a->start <=> $b->start } $hit->domains ) {
  print "domain $domainnum of $total,\n";
  $domainnum++;
}
Does Bio::SearchIO parse the HTML output that BLAST creates using the -T option?

Yes, with a twist. You can modify Bio::SearchIO's _readline() method such that it reads in the HTML and strips it of tags using the HTML::Strip module.

Please note: We do not suggest parsing BLAST HTML output if it can be avoided. We actively support XML, tabular, and text output parsing of NCBI BLAST reports only; we have never supported parsing of NCBI BLAST HTML output directly through BioPerl and will not attempt to rectify problems where HTML output parsing post-stripping of the tags breaks but parsing text output works. Consider this fair warning.

use Bio::SearchIO;
use HTML::Strip;
my $hs = HTML::Strip->new();
# replace the blast parser's _readline method with one that
# auto-strips HTML:
package Bio::SearchIO::blast;

sub Bio::SearchIO::blast::_readline {
 my ($self, @args) = @_;
 my $line = $self->SUPER::_readline(@args);
 return unless defined $line;
 return $hs->parse($line);
}
# now parse using the BLAST format module
 my $in = new Bio::SearchIO(-format => 'blast', -file   => $file);

Annotations and Features

How do I retrieve all the features from a sequence?

How about all the features which are exons or have a /note field that contains a certain gene name?

To get all the features:

my @features = $seq->all_SeqFeatures();

To get all the features filtering on only those which have the primary tag (ie. feature type) exon.

my @genes = grep { $_->primary_tag eq 'exon'}
$seq->all_SeqFeatures();

To get all the features filtering on this which have the /note tag and within the note field contain the requested string $noteval.

my @f_with_note = grep {  my @a = $_->has_tag('note') ? $_->each_tag_value('note') : ();
                                         grep { /$noteval/ } @a;  }  $seq->all_SeqFeatures();
How do I parse the CDS join or complement statements in GenBank or EMBL files to get the sub-locations?

For example, how can I get the the coordinates 45 and 122 in join(45..122,233..267) ?

You could use primary_tag to find the CDS features and the Bio::Location::SplitLocationI object to get the coordinates:

foreach my $feature ($seqobj->top_SeqFeatures){
  if ( $feature->location->isa('Bio::Location::SplitLocationI') and $feature->primary_tag eq 'CDS' ) {
     foreach my $location ( $feature->location->sub_Location ) {
       print $location->start , ".." , $location->end, "\n";
     }
  }
}

(Continued on next part...)

Part:   1  2  3  4  5  6  7 

BioPerl FAQ (Frequently Asked Questions)