Bioinformatics for Biologists - BIOS 599 - Spring 2004, course project

Analyzing Coding regions of nucleic sequences using EMBOSS package tools

A DNA sequence contains both coding and non-coding regions. Coding regions are those that code for a protein. The EMBOSS package includes a number of tools for analyzing the coding regions of a DNA sequence. These tools are important for determining the function of genes and for finding related genes in other species. EMBOSS has tools for finding open reading frames (ORFs), determining codon usage statistics, translating sequences, and profiling. Tools for each of these functions are described below.

Open Reading Frames
Defining the ORF for a sequence is the first step in characterizing its coding regions. There are a number of programs in EMBOSS to help predict the correct ORF. Some give graphical output that displays the possible ORFs, and some extract the sequences of possible ORFs based on length. All of these programs locate start and stop codons in the sequence to predict the ORF.

Plotorf
A first step in finding an ORF can be to view the sequence graphically, in all possible reading frames, with a program such as plotorf. This program takes an input sequence and produces a graph showing the ORFs in all six frames (forward 1, 2, 3 and reverse 1, 2, 3). It defines an ORF as the region between a start codon and a stop codon, so it will miss exons that do not begin with a start codon. Although plotorf helps visualize the possible ORFs, its output is a picture only and cannot be used directly for further investigation of those ORFs.

Using Plotorf
Input file – A DNA sequence in any format. The program reads only one sequence at a time.
Parameters – You can change the start and stop codons used by the program and specify where in the input sequence the program should start and stop reading. These options are not available in the Windows interface, but they can be set from the command line.
Output file – The default output is a PostScript file, but this can be changed. The program outputs a graph, which I found easiest to open in Photoshop. The sample output for one sequence file shows three possible sets of ORFs; the dark regions indicate possible ORFs.
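To make the idea behind the plot concrete, here is a minimal Python sketch of locating start-to-stop ORFs in all six frames. It is not plotorf's code: the function names are hypothetical, ATG is assumed as the only start codon, and the standard stop codons TAA, TAG and TGA are assumed.

```python
# Hypothetical sketch of the six-frame ORF search that plotorf draws.
# Assumptions: ATG is the only start codon; TAA, TAG, TGA are the stops.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orf_spans(strand):
    """Return {frame: [(start, end), ...]} for frames 1-3 of one strand.

    Offsets are 0-based on the strand passed in; end is exclusive and
    includes the stop codon.
    """
    spans = {}
    for frame in range(3):
        found = []
        orf_start = None
        for i in range(frame, len(strand) - 2, 3):
            codon = strand[i:i + 3]
            if orf_start is None and codon == START:
                orf_start = i                    # first start codon opens an ORF
            elif orf_start is not None and codon in STOPS:
                found.append((orf_start, i + 3))  # in-frame stop closes it
                orf_start = None
        spans[frame + 1] = found
    return spans


def six_frame_orfs(seq):
    """ORF spans on the forward strand and on its reverse complement."""
    seq = seq.upper()
    return {
        "forward": orf_spans(seq),
        "reverse": orf_spans(reverse_complement(seq)),
    }


print(six_frame_orfs("ATGAAATAG"))
```

For the toy input above, the only ORF is in forward frame 1 and covers the whole sequence. Note that spans on the reverse strand are reported relative to the reverse complement, not the original sequence.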

Getorf
Getorf extracts the sequences of possible ORFs. You can specify the minimum length of sequence that the program will return. An ORF can be defined as the region between two stop codons or between a start codon and a stop codon. The program can return either the nucleotide sequence or the translated sequence. This makes it useful for defining the ORF for a given sequence.

Using getorf
Input file – A DNA sequence in any format.
Parameters – Choose whether the program returns the sequence between stop codons or between start and stop codons. It can also return a number of bases flanking a start or stop codon, but you must specify the length of the flanking region.
There is also an option for changing the genetic code used by the program. The choices include a variety of mitochondrial and bacterial codes. These genetic code tables are from NCBI. There is also an option for circular DNA.
You can have the program return translated protein sequence or nucleic acid sequence.
Output file – The program returns only the sequences of the possible ORFs. For each ORF, the output gives the name of the sequence, the position of the ORF it extracted, and the sequence of that section.
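The "between stop codons" option combined with a minimum length can be sketched in a few lines of Python. This is only an illustration of the idea, not getorf itself: the function name and the min_size parameter are hypothetical, only one forward frame is scanned, and the standard stop codons are assumed.

```python
# Hypothetical sketch: stop-free regions of one forward frame that are at
# least min_size nucleotides long (the "between stop codons" definition).
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def regions_between_stops(seq, frame=0, min_size=30):
    """Return [(start, end, nucleotides), ...] for one forward frame.

    frame is a 0-based offset (0, 1 or 2); start and end are 0-based
    positions in seq, with end exclusive. Stop codons are not included.
    """
    seq = seq.upper()
    # Last position that still completes a whole codon in this frame.
    frame_end = frame + ((len(seq) - frame) // 3) * 3
    regions = []
    region_start = frame
    for i in range(frame, frame_end, 3):
        if seq[i:i + 3] in STOPS:
            if i - region_start >= min_size:
                regions.append((region_start, i, seq[region_start:i]))
            region_start = i + 3  # next region begins after the stop
    # Region running from the last stop codon to the end of the frame.
    if frame_end - region_start >= min_size:
        regions.append((region_start, frame_end, seq[region_start:frame_end]))
    return regions


for start, end, nucleotides in regions_between_stops(
        "AAACCCTAAGGGTTTAAATGACC", frame=0, min_size=6):
    print(start, end, nucleotides)
```

Raising min_size to 9 in this toy example drops the first region (6 nucleotides) and keeps only the second (9 nucleotides), which is the same filtering effect the minimum length setting has in the program.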

Codon Usage
The genetic code is redundant: more than one combination of nucleotides can code for a particular amino acid residue. These alternative synonymous codons are not used randomly, and the non-random use of codons can be taxon specific. By looking at the codon usage bias of a sequence you can assess whether you are looking at a coding region and also get some information about where that gene may have originated, based on the taxon-specific nature of codon usage.

CAI
The EMBOSS tool cai calculates a codon adaptation index (CAI). It compares the sequence it is given to a codon usage table for the organism you specify in the program parameters. For example, if a Drosophila sequence is entered you can choose the Drosophila codon usage table. The codon usage table is created from highly expressed genes in the organism, because codon bias is most often found in highly expressed genes due to natural selection: codons that are translated more efficiently are selected for, which raises their frequency.
The Codon Adaptation Index is a normalized codon preference statistic. The codon preference statistic gives the ratio of the likelihood of finding a particular codon in a highly expressed gene to the likelihood of finding that codon in a random sequence with the same amino acid composition. A high CAI value therefore means that the codon composition of the sequence is more likely to be found among the highly expressed genes in the codon usage table (for that organism) than in a random sequence. In other words, a high CAI value suggests that you are looking at a coding region and also gives information on the expression level of that gene (high values indicate high expression).
This statistic is harder to interpret for a gene that is not highly expressed. A low CAI value could mean that the sequence codes for a gene with low expression or that the sequence is non-coding. The values for the two cases can differ slightly: as a rough guide, non-coding regions have a CAI value below 0.2, while low-expression genes have a value of 0.2 to 0.4. This is a grey area, and further investigation is needed to distinguish low-expression genes from non-coding regions.
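One common way to compute the index, following my reading of Sharp and Li (1987), is to give each codon a weight equal to its count in the reference set divided by the count of the most used synonymous codon, and then take the geometric mean of the weights over the gene. The Python sketch below illustrates that formulation only. The reference counts are invented toy numbers, the function names are hypothetical, replacing zero counts with 0.5 is an assumption of this sketch, and the EMBOSS program may differ in its details.

```python
# Hypothetical sketch of a CAI calculation (geometric mean of codon weights).
# The reference counts below are invented for illustration only.
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ("*" = stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def relative_adaptiveness(reference_counts, pseudocount=0.5):
    """Weight of each codon: its count / count of the best synonymous codon."""
    by_amino_acid = {}
    for codon, aa in CODON_TABLE.items():
        by_amino_acid.setdefault(aa, []).append(codon)
    weights = {}
    for aa, codons in by_amino_acid.items():
        if aa == "*" or len(codons) == 1:
            continue  # stops, Met and Trp have no synonymous choice
        # Assumption: codons absent from the reference get a small count.
        counts = {c: reference_counts.get(c, 0) or pseudocount for c in codons}
        best = max(counts.values())
        for codon in codons:
            weights[codon] = counts[codon] / best
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the scored codons in frame 1."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("sequence contains no codons that can be scored")
    return math.exp(sum(logs) / len(logs))


toy_reference = {"CTG": 80, "CTC": 20, "AAG": 90, "AAA": 10, "GCC": 60, "GCT": 30}
toy_weights = relative_adaptiveness(toy_reference)
print(cai("ATGCTGAAGGCCTAA", toy_weights))  # uses only the preferred codons
print(cai("CTCGCT", toy_weights))           # uses less preferred codons
```

In this toy example the first sequence uses only the most frequent codon for each amino acid and scores 1.0, while the second scores about 0.35, which shows how less preferred codons pull the index down.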

How to Use CAI
Input data – DNA sequence (one or more sequences can be entered).
Codon usage file – The program gives a list of available codon usage files. The file chosen should be from the organism of interest or from a similar organism available in the program. There are no other parameters for this program.
Output file – Gives a numeric Codon Adaptation Index value for each sequence that is run through the program.

Translation
Translating a DNA sequence into protein sequence can be useful for a number of reasons. In this paper I address the advantages of using protein sequence for profiling. The translation programs are very straightforward and will translate in a given reading frame or in all frames. When translating a sequence it is important to use the correct reading frame in order to obtain the correct protein sequence.

Transeq
Transeq is an EMBOSS program that translates a nucleic acid sequence into protein sequence. The sequence can be translated in any single reading frame or in all frames. The input just needs to be a nucleic acid sequence, and the output file gives the protein sequence for the designated frame.

Using Transeq
Input file – DNA sequence file.
Parameters – You can choose the frame in which the sequence is translated.
The program also lets you choose between a variety of genetic codes, including different mitochondrial and nuclear codes. You can also choose the region of the sequence to translate.
Output file – Translated protein sequence.
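The effect of the frame choice is easy to see in a small Python sketch. This is an illustration under stated assumptions, not transeq's implementation: it uses the standard genetic code only, the function name is hypothetical, and negative frames are taken here as offsets on the reverse complement, which may not match the way EMBOSS numbers its reverse frames.

```python
# Hypothetical sketch of frame-specific translation (standard code only).
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq, frame=1):
    """Translate seq in frame 1, 2, 3 (forward) or -1, -2, -3 (reverse).

    Assumption of this sketch: frame -n is offset n-1 on the reverse
    complement. Codons with unknown bases are translated as "X".
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    seq = seq.upper()
    if frame < 0:
        seq = seq.translate(COMPLEMENT)[::-1]
    offset = abs(frame) - 1
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(offset, len(seq) - 2, 3)
    )


for frame in (1, 2, 3, -1, -2, -3):
    print(frame, translate("ATGGCCTAA", frame))
```

For this toy sequence frame 1 gives MA* (a start, one residue, and a stop) while frame 2 gives WP, which shows why translating in the wrong frame produces an unrelated protein sequence.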

Profiles
Profiles can be used to analyze sequence data and find proteins of similar function, or to assign function to an unknown protein sequence of interest. This method takes a number of sequences and makes a profile that can be used to search for other sequences of similar structure. It lets you search for areas of conservation in a protein family while still allowing flexibility in the areas that are more variable within that family. This can be more beneficial than searching with one sequence, where variation in conserved regions and variation in variable regions would be penalized equally.
By picking the sequences that are used to make a profile, this method can return distantly related protein sequences (by creating a profile from a wider variety of sequences) or be more specific (by creating a profile from more closely related sequences). The nature of the profile you create determines the variation you will find in your search results.

Prophecy
Prophecy is an EMBOSS program that creates a frequency matrix from a multiple alignment. This is a simple frequency matrix that gives a score for finding each residue at a specific position. The program can also create Gribskov and Henikoff matrices, but a simple frequency matrix is required as input for profit (a related tool described next). The input must be a multiple alignment file. Prophecy will run on a gapped alignment, but the resulting matrix will not be suitable input for profit, so the alignment should not contain any gaps.

Using Prophecy
Input file – Multiple sequence alignment of protein sequences (such as the output from emma or ClustalW, also available through EMBOSS).
Parameters – Choose the type of matrix the program will create. Use frequency for input into profit (the other types are Henikoff and Gribskov). You can change the gap opening and extension penalties as well as the threshold value to report.
Output file – A frequency matrix. The output also gives a consensus sequence for your multiple alignment input.
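A simple frequency matrix is just a count of each residue in each alignment column, and the consensus is the most common residue per column. The Python sketch below shows that idea on an invented three-sequence alignment. The function names are hypothetical and the actual prophecy file format is not reproduced here. Following the text above, the sketch rejects gapped alignments.

```python
# Hypothetical sketch: per-column residue counts from an ungapped alignment.
from collections import Counter


def frequency_matrix(alignment):
    """Return one Counter of residue counts per alignment column."""
    if not alignment:
        raise ValueError("alignment is empty")
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    if any("-" in row for row in alignment):
        raise ValueError("this sketch expects an ungapped alignment")
    # zip(*alignment) walks the alignment column by column.
    return [Counter(column) for column in zip(*alignment)]


def consensus(matrix):
    """Most frequent residue in each column, joined into one string."""
    return "".join(column.most_common(1)[0][0] for column in matrix)


toy_alignment = ["MKVL", "MKIL", "MRVL"]  # invented example sequences
matrix = frequency_matrix(toy_alignment)
for position, column in enumerate(matrix, start=1):
    print(position, dict(column))
print(consensus(matrix))
```

In the toy alignment the first and last columns are fully conserved, while the second and third each contain one variant residue, so the consensus is MKVL.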

Profit
Profit is an EMBOSS program that searches a sequence or database using a frequency matrix created with prophecy. The matrix is moved through each position of a sequence, and each residue is scored according to the matrix value for that residue at that position. The sum of the scores over all positions is used as the score for the sequence. Only sequences that score above a particular threshold are returned by the program.

Using Profit
Input file – The input must be a simple frequency matrix (which can be created from a multiple alignment in prophecy). The program also requires a protein sequence or database to search with the profile.
Parameters – There are no other parameters for this program.
Output file – Reports sequences from the query sequence or database that score above the threshold. It returns three columns of data: the first gives the name of the returned sequence, the second gives the start position of the match, and the third gives the percentage of the maximum possible score that the sequence received.
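The sliding and scoring described above can be sketched in Python as follows. This is only an illustration: the small matrix and query are invented, the function name and the 75 percent default threshold are assumptions of the sketch, and profit's exact scoring may differ.

```python
# Hypothetical sketch: slide a frequency matrix along a sequence and report
# window start positions whose score reaches a percentage of the maximum.
def scan(matrix, sequence, threshold=75.0):
    """Return [(start, percent_of_max), ...] with 1-based start positions.

    matrix is a list of {residue: count} dicts, one per profile column.
    The threshold default is an assumption made for this example.
    """
    width = len(matrix)
    max_score = sum(max(column.values()) for column in matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the matrix count for each residue at its column position.
        score = sum(column.get(residue, 0)
                    for column, residue in zip(matrix, window))
        percent = 100.0 * score / max_score
        if percent >= threshold:
            hits.append((start + 1, percent))
    return hits


# Invented toy matrix (counts per column) and query sequence.
toy_matrix = [{"M": 3}, {"K": 2, "R": 1}, {"V": 2, "I": 1}, {"L": 3}]
toy_query = "AAMKVLAAMRILAA"
for start, percent in scan(toy_matrix, toy_query):
    print(start, percent)
```

In the toy query the window starting at position 3 (MKVL) matches the most common residue in every column and scores 100 percent, while the window at position 9 (MRIL) uses two less common residues and scores 80 percent. Raising the threshold to 90 would report only the first hit.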

References

Gribskov, M., McLachlan, A.D. and Eisenberg, D. 1987. Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. Vol 84: 4355-4358.

Jansen, R., Bussemaker, H. and Gerstein, M. 2003. Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Research. Vol 31: 8: 2242-2251.

Sharp, P.M. and Li, W. 1987. The codon adaptation index – a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research. Vol 15: 3: 1281-1294.

EMBOSS instructional package. http://egg.isu.edu/inquiry

Lesk, A. and Helmer-Citterich, M. Integrated Approaches for Functional Genomics. http://www.functionalgenomics.org.uk/sections/programme/structural.htm

Sequence conservation. http://helix.biology.mcmaster.ca/721/outline2/node58.html

Wu, G. and Freeland, S. The CAI Calculator: What's CAI and how to calculate? http://www.evolvingcode.net/tool_link.php

Devereux, J. Associating distantly related proteins and finding structural motifs. http://www-igbmc.u-strasbg.fr/BioInfo/GCGdoc/Program_Manual/Multiple_Sequence_Analysis/profileanalysis.html