

A User's guide for analyzing the non-coding regions of a genome using iNquiry tools on EGG

I) Introduction:

Genomic sequences consist of coding and non-coding regions.
A non-coding region is a segment of DNA that does not form part of a gene and therefore does not code for a protein. Non-coding regions are interspersed throughout the DNA of all genomes.
There are a number of reasons to analyze the non-coding regions of a genome. They possess characteristic features that help to delimit the coding regions, and they provide important information about processes such as the regulation of gene expression and function.

II) iNquiry Tools:

iNquiry tools designed for the analysis of the non-coding region are grouped under seven major classes. They are:

  1. CpG islands.
  2. Primers.
  3. Restriction.
  4. Transcription.
  5. Composition.
  6. Motifs.
  7. Repeats.

These tools range from the simple, such as finding the composition of a given nucleic acid sequence, to the complex, such as designing primers and hybridization oligos. Each class consists of a set of tools that can be tuned to suit the user's needs.

III) CpG Islands:

A) Introduction:
CpG islands can be defined as "regions of DNA with a high G+C content and a high frequency of CpG dinucleotides relative to the bulk genome". CpG clusters are useful landmarks in genome sequences for identifying genes in vertebrates and in plants, especially plants with large genomes. Most CpG islands are located upstream of transcription initiation sites, while some lie downstream of them. In addition, CpG islands are generally resistant to methylation, and this property is believed to aid the creation and maintenance of "CpG rich DNA regions" within the CpG-depleted bulk DNA.

B) Using the CpG tools:
The input is the nucleic acid sequence, in fasta format, that is to be searched for CpG islands. The minimum sequence length required by all of these tools is 200 bp. Almost all of them use a sliding window approach: the user can specify the window length or accept the default value. The output is a table or a graph, depending on the tool used.
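The sliding window idea can be sketched in a few lines of Python. This is only an illustration, not the code behind the iNquiry tools: the function name, the default window of 100 bases and the step size are assumptions, and the Observed/Expected ratio follows the commonly used Gardiner-Garden and Frommer style formula (CpG count times window length, divided by the product of the C and G counts).

```python
def cpg_window_stats(seq, window=100, step=1):
    """Return (start, gc_fraction, obs_exp) for each window along seq.

    Hypothetical helper for illustration; window and step defaults are assumptions.
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        c = w.count("C")
        g = w.count("G")
        cpg = w.count("CG")  # CpG dinucleotides in this window
        gc_fraction = (c + g) / window
        # Observed/Expected = (CpG count * window length) / (C count * G count)
        obs_exp = (cpg * window) / (c * g) if c and g else 0.0
        results.append((start, gc_fraction, obs_exp))
    return results


if __name__ == "__main__":
    demo = "CGCGCGCGCG" + "ATATATATAT"
    for row in cpg_window_stats(demo, window=10, step=10):
        print(row)
```

A window with a G+C fraction and Observed/Expected ratio well above the rest of the sequence is a candidate CpG island; the cut-off values to apply are left to the user.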

C) An example – cpgreport:
This tool offers a simple form as well as an expert form, which allows the user to modify the parameters.
It uses a running score rather than a window, and hence a threshold CpG score must be specified. The running approach gives the instantaneous CpG score at any given position, accumulated from the start of the sequence. The output is a table that lists each CpG-rich region, the number of CpGs in the region, the percentage (G + C) in the region, and the ratio of CpG to GpC in the region.
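To make the running approach concrete, here is a minimal Python sketch. It is not the cpgreport implementation: the function name, the reward of 17 per CpG, the penalty of 1 for every other position and the floor at zero are all assumptions chosen for illustration.

```python
def running_cpg_score(seq, cpg_score=17, penalty=1):
    """Return the running score after each dinucleotide position.

    Assumed scoring for illustration only: add cpg_score when the
    dinucleotide is CG, otherwise subtract penalty, never going below zero.
    """
    seq = seq.upper()
    total = 0
    scores = []
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "CG":
            total += cpg_score
        else:
            total = max(0, total - penalty)
        scores.append(total)
    return scores


if __name__ == "__main__":
    print(running_cpg_score("ATCGCGAT"))
```

The score climbs quickly inside a CpG-rich stretch and decays slowly outside it, so positions where the score stays above the chosen threshold mark a candidate region.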

D) Other CpG tools :

Other CpG Tools Brief Description
cpgplot
  • The input is the nucleic acid sequence. The window length can be modified as required.
  • The output is a plot.
  • The plot shows the position in base pairs on the X-axis and the Observed/Expected ratio on the Y-axis.
geecee
  • It sums the total G + C content of the sequence and reports it as a fraction of the whole sequence.
  • The value ranges from 0.0 to 1.0.
newcpgseek
  • It is similar to cpgreport, but the threshold is fixed at 17.
  • Its output is easier to read than that of cpgreport.

E) Literature relevant to CpG islands:

  1. Ashikawa, I. (2001) Plant Journal , 26 (6) : 617-25.
  2. Rombauts S, Florquin K, Lescot M, Marchal K, Rouze P, van de Peer Y. (2003) Plant Physiology , 132 (3) : 1162-76

IV) Primers:

A) Introduction:
A primer is a short synthetic oligonucleotide used in many molecular techniques, including PCR, mutagenesis and DNA sequencing.
Primers can be designed for any region of a genomic sequence, including an STS (sequence tagged site). An STS is defined as a unique DNA sequence of approximately 100 to 500 bp. The major role of STSs is in the construction of physical maps. The uniqueness of an STS is debated: a given sequence is not guaranteed to be unique, but the probability of its being duplicated elsewhere in the genome may be very low.

B) Using Primer tools:
These tools are used to design primers (forward, reverse, or both) as well as hybridization oligos. Some important points should be considered before designing primers.
The fragment to be amplified should not be larger than 1 kb, and the actual region of interest should be 600-700 bp in size, because the sequence data obtained for the first ~200 bp from the primer is often less accurate. In addition, sequence obtained more than 600-700 bp from the primer has lower accuracy. Thus, the sequence of interest should fall within the most accurate region of the chromatogram. A chromatogram is defined as "a graphical or other presentation of detector response, concentration of analyte in the effluent or other quantity used as a measure of effluent concentration versus effluent volume or time". For larger fragments, it is a good idea to design overlapping primers. Other important factors that influence primer design are:

        1. Length of the primer: It should range from 18 to 30 bases for optimal results.
        2. Base composition, (G + C) content: If the GC content is below 50%, the primer may have to be extended beyond the recommended length to keep the Tm in the recommended range, which affects the priming efficiency and stringency of the primer.
        3. Base composition at the primer ends: Ends with lower GC content are recommended to avoid primer mispairing.
        4. Melting temperature (Tm): It can be calculated from the base content. The optimal melting temperature lies between 52 °C and 65 °C. Higher values should be avoided, especially for high-GC templates, as they lead to secondary priming artifacts and noisy sequence.
        5. 3' complementarity of the two primers: If the primers are complementary at their 3' ends, primer dimers will form and be amplified instead of the fragment of interest.
        6. Self-complementarity of the primer: It should be avoided to prevent the formation of primer-dimer complexes.
        7. Stretches of Cs/Gs at the 3' end: Long stretches of G and C are not recommended, to avoid primer mispairing.

All of these factors can be adjusted to serve the user's needs.
Details can be accessed at "http://frodo.wi.mit.edu/primer3/primer3_code.html".
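Several of the factors listed above are easy to check by hand or with a short script. The Python sketch below is an illustration only and is not what eprimer3 does internally: the function names are hypothetical, the Tm uses the simple Wallace rule (2 °C per A/T, 4 °C per G/C), which is a rough estimate intended for short oligos, and the 3' complementarity test only compares the last few bases of each primer.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough Tm estimate in degrees C using the Wallace rule (an assumption here)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def three_prime_complementary(primer_a, primer_b, n=4):
    """True if the last n bases of the two primers can pair with each other."""
    tail_a = primer_a.upper()[-n:]
    tail_b = primer_b.upper()[-n:]
    return tail_a == reverse_complement(tail_b)


def check_primer(primer, min_len=18, max_len=30, tm_range=(52, 65)):
    """Check a primer against the length and Tm ranges given in the text."""
    tm = wallace_tm(primer)
    return {
        "length_ok": min_len <= len(primer) <= max_len,
        "gc_percent": gc_percent(primer),
        "tm": tm,
        "tm_ok": tm_range[0] <= tm <= tm_range[1],
    }


if __name__ == "__main__":
    print(check_primer("ATGCATGCATGCATGCATGC"))
    print(three_prime_complementary("ACGTACGTAAGG", "TTGCATGCCCTT"))
```

A primer pair that passes these quick checks still needs to be evaluated with a dedicated design tool, which uses more accurate Tm models and also screens for self-complementarity.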

C) An example – eprimer3:
This tool offers a simple form and an expert form. The expert form allows the parameters to be modified to suit the user's needs, while the simple form uses the default settings for primer design.
There are five choices for task selection: design forward primers only, reverse primers only, PCR primers only, PCR primers and hybridization oligos, or hybridization oligos only.
The output gives the product size, the start and end positions, the Tm, the GC percentage and the actual sequence of each primer.

D) Other Primer tools:

Other Primer Tools Brief Description
primersearch
  • The output reports whether the given set of primers is complementary to the nucleic acid of interest.
  • The input consists of a file containing an existing set of primers and the nucleic acid sequence against which the primers are to be tested.
  • Mismatches can be allowed.
stssearch
  • The output reports whether a set of primers is unique (an STS) for that particular database.
  • The input consists of a file containing an existing set of STS primers and the DNA database against which the STS primers are to be tested.

E) Literature relevant to Primer design:

1) S. Rozen, H. Skaletsky. (2000) In S. Krawetz and S. Misener, eds. Bioinformatics Methods and Protocols in the series Methods in Molecular Biology. 365-86.

2) Wang J, Li KB, Sung WK. (2004) Bioinformatics, 0 : 2591-0.

V) Restriction:

A) Introduction:
Restriction enzymes are enzymes of bacterial origin that cut DNA at highly specific sites. They have a variety of uses in molecular biology techniques.
They are used for preparative purposes in techniques such as cloning, and for analytical purposes, for example to make restriction maps, a type of physical map of a DNA molecule.
Restriction sites are also used as molecular markers, especially in RFLPs (restriction fragment length polymorphisms), to distinguish between different alleles of the same gene.
Isoschizomers are defined as a pair of restriction enzymes specific for the same recognition sequence.

B) Using Restriction tools:
The tools in this class are designed to give different outputs according to the requirements of the experiment. All of them need a nucleic acid sequence in fasta format as input. There are flexible parameters for selecting the fragment size, the overhang, sticky or blunt ends, and so forth. The output is in the form of tables and charts. All the tools are linked to the Rebase.com restriction enzyme database.

C) An example – restrict:
This tool comes in a simple and an expert form. The expert form lets users control the parameters that govern restriction site identification. The simple form lets one change only two options: the length of the recognition site (4-base cutters, 6-base cutters) and the enzyme list (specific or general). The output is a table giving the start and end of each restriction site, the recognition sequence of the enzyme, the 5' and 3' cut sites and the name of the restriction enzyme.
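The core of a restriction site search can be sketched as a simple string scan. The Python example below is an illustration under stated assumptions, not the restrict program: the function names are hypothetical, the enzyme table is a small hand-written sample rather than the REBASE data, the sequence is treated as linear, and only palindromic sites are included so that scanning one strand is enough.

```python
# Recognition sequence and top-strand cut offset for a few well-known enzymes.
# This small table is a hand-picked sample for illustration only.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
    "AluI": ("AGCT", 2),       # AG^CT (blunt)
    "MboI": ("GATC", 0),       # ^GATC
}


def find_sites(seq, enzymes=ENZYMES, site_length=None):
    """Return sorted (start, end, enzyme, site, cut_position) tuples.

    start and end are 1-based; cut_position is the number of bases to the
    left of the top-strand cut. site_length optionally restricts the search
    to 4-base or 6-base cutters.
    """
    seq = seq.upper()
    hits = []
    for name, (site, cut_offset) in enzymes.items():
        if site_length is not None and len(site) != site_length:
            continue
        pos = seq.find(site)
        while pos != -1:
            hits.append((pos + 1, pos + len(site), name, site, pos + cut_offset))
            pos = seq.find(site, pos + 1)
    return sorted(hits)


def fragment_lengths(seq, cut_positions):
    """Lengths of the fragments produced by cutting a linear sequence."""
    bounds = [0] + sorted(set(cut_positions)) + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    dna = "AAAAGAATTCTTTTGGATCCAAAA"
    sites = find_sites(dna, site_length=6)
    print(sites)
    print(fragment_lengths(dna, [hit[4] for hit in sites]))
```

Comparing the fragment lengths obtained for two alleles of the same region is the basis of the RFLP analysis mentioned in the introduction.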

D) Other Restriction tools:

Other Restriction Tools Brief Description
recoder
  • This tool is useful in designing point mutations.
  • The input is a nucleic acid sequence, which recoder scans for single base changes.
  • It reports the single base in each restriction site that can be changed to remove the site while leaving the translation unchanged.
redata
  • This tool gives output that includes the cut site, isoschizomers, relevant literature references, a list of commercial suppliers, etc.
  • The input is the name of the restriction enzyme for which the information is required.
restover
  • This tool looks for the specified overhang in the input nucleic acid sequence and reports the enzymes that can be used to produce it.
  • The input includes the nucleic acid sequence of interest and the sequence of the required overhang.
silent
  • It does exactly the opposite of recoder: it scans the nucleic acid sequence for single base changes that introduce an enzyme cut site without changing the translation.
  • The input is the nucleic acid sequence.

E) Literature relevant to Restriction analysis:

1) Mummey DL, Stahl PD (2004) Microbial Ecology, 003: 1000-4.

2) Garros C, Koekemoer LL, Kamau L, Awolola TS, Van Bortel W, Coetzee M, Coosemans M, Manguin S (2004) American Journal of Tropical Medicine and Hygiene, 70 : 260-265.

VI) Transcription:

A) Introduction:
Determining how gene expression is regulated is a challenging area.
In a given set of genes, it is generally assumed that almost all of the genes are likely to be regulated by some commonly known transcription factors. Many of these factors act by binding to specific DNA sequences known as cis-acting elements. Once a gene is identified, it is necessary to determine the regulatory elements that control its expression, chiefly the binding sites for transcription factors.

B) Using Transcription tools:
Only one tool is available for this purpose on iNquiry, and it is called tfscan. The input is a nucleic acid sequence in fasta format, which the tool scans to uncover the binding sites and binding factors involved in gene regulation. Only a few parameters are available. The output is in tabular form.

C) An example – tfscan:
The input is a nucleic acid sequence, and only two parameters can be changed: the class selection and the number of mismatches allowed.
The first parameter, class selection, allows the user to select the transcription factor class from five available choices: vertebrate, insect, fungi, plant and others.
For the number of mismatches, the higher the number allowed, the lower the stringency and the more false positives are reported.

D) Literature relevant to Transcription:

1) Pickert L, Reuter I, Klawonn F, Wingender (1998) Bioinformatics , 14(30) : 244-51.

2) Caselle M, Di Cunto F, Provero P (2002) BMC Bioinformatics , 3(1) : 7-14.

VII) Motifs:

A) Introduction:
A motif is a short conserved sequence or pattern of nucleotides or amino acids, often suggesting conservation of a function. Motif finding tools are used to build consensus sequences in order to determine structure and function, and to find the binding sites for various regulatory elements and enzymes.

B) Using Motif tools:
There are three motif tools, all of which require a nucleic acid sequence in fasta format as input. In addition, a specific nucleic acid or protein pattern must be specified. One other parameter that can be modified is the number of mismatches allowed; the higher the number of mismatches allowed, the higher the chance of false positives. The output is in tabular form and reports the start and end positions of each motif and the number of mismatches, if any.
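The effect of the mismatch parameter can be seen with a small pattern scanner. The Python sketch below is an illustration only (the function name and output layout are hypothetical, and it supports plain sequences rather than the richer pattern syntax of the real tools): it slides the pattern along the sequence and reports every position where the number of mismatching bases is within the allowed limit.

```python
def find_with_mismatches(seq, pattern, max_mismatches=0):
    """Return (start, end, mismatches, matched_text) for each hit, 1-based."""
    seq = seq.upper()
    pattern = pattern.upper()
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        # Count positions where the window differs from the pattern.
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start + 1, start + len(pattern), mismatches, window))
    return hits


if __name__ == "__main__":
    dna = "ACGTTATAAAGGTACAAACC"
    print(find_with_mismatches(dna, "TATAAA", max_mismatches=0))
    print(find_with_mismatches(dna, "TATAAA", max_mismatches=1))
```

With no mismatches allowed, only the exact TATAAA site is reported; allowing one mismatch also reports TACAAA, which shows how relaxing the parameter increases the number of hits and with it the chance of false positives.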

C) An example – fuzztran:
The input is a nucleic acid sequence, while the pattern searched for is a protein motif. Because the search is carried out on the translated protein, there are a number of parameters that can be modified. The number of mismatches, the translation frame and the codon usage can be changed as needed. A given nucleic acid sequence can be translated in six different frames, three forward and three reverse. The codon usage option offers a choice of tables for different organisms or a "standard" table. The output gives a score that depends on the number of mismatches allowed. It also reports the translation frame used, the start and end positions on the nucleic acid, the translated protein and the resulting motif.
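The six-frame translation step can be sketched as follows. This Python example is an illustration, not the fuzztran code: the function names are hypothetical, only the standard genetic code is included, the search is for an exact protein motif without mismatches, and the frame labels (1 to 3 forward, -1 to -3 on the reverse complement) are an assumed convention that may differ from the one the tool reports.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become X, stops become *."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return a dict mapping frame label to translated protein."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def find_protein_motif(seq, motif):
    """Return (frame, protein_start) for each exact motif hit, 1-based."""
    hits = []
    for frame, protein in six_frames(seq).items():
        pos = protein.find(motif)
        while pos != -1:
            hits.append((frame, pos + 1))
            pos = protein.find(motif, pos + 1)
    return hits


if __name__ == "__main__":
    dna = "ATGAAAGGGTAA"
    print(six_frames(dna))
    print(find_protein_motif(dna, "KG"))
```

Choosing a different codon table would only require replacing the AMINO string with the one for the organism of interest.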

D) Other Motif tools:

Other Motif Tools Brief Description
fuzznuc
  • The input is a nucleic acid sequence and the nucleic acid pattern to be searched for. The number of mismatches can be changed, and the complementary strand can also be scanned for the same pattern.
  • The output is in tabular form, displaying the start and end positions of the pattern, the mismatches, if any, and the sequence found.

dreg

  • The input is a nucleic acid sequence and the regular expression pattern to be searched for.
  • The output is a table giving the start and end positions of the pattern along with the sequence.

E) Literature relevant to Motif:

1) Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S (2004) Genome Research, 14(3) : 451-8.

2) Conlon EM, Liu XS, Lieb JD, Liu JS. (2003) Proceedings of the National Academy of Sciences , 100(6) : 3339-44.

VIII) Composition:

A) Introduction:
Composition tools are used to determine various physicochemical properties of nucleic acids. These properties are important for designing primers, building phylogenies, finding regulatory elements, etc. Isochores can be defined as large regions of the genome that contain local similarities in base composition.

B) Using Composition tools:
There are eight tools on iNquiry for determining nucleic acid composition. The input is the nucleic acid sequence, in fasta format, for which the physicochemical property is to be determined. There are a number of parameters, depending on the tool used, that can be modified to suit the user's needs. Many of the tools require two input sequences, one of which is used as the training set to build the consensus.

C) An example – compseq:
The input is the nucleic acid sequence, and this tool can be used to find the dimers or trimers in a given sequence. There is also an option to enter two sequences: the first is used to build the consensus, and the second is then compared with it to find the degree of conservation between the two sequences. The sequence used to build the consensus should be a known sequence from a known organism. The second sequence can be an unknown sequence from the same or a different organism. The output is a table giving the word size, the number of occurrences, the observed and expected frequencies and the ratio of observed to expected frequency.
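A word composition table of this kind can be sketched in Python as follows. This is an illustration only, not the compseq algorithm: the function name is hypothetical, and the expected frequency of a word is assumed here to be the product of the frequencies of its bases in the same sequence, which may differ from the way the tool computes it.

```python
from collections import Counter


def word_composition(seq, word_size=2):
    """Return rows of (word, count, observed, expected, obs/exp), sorted by word.

    Expected frequency is assumed to be the product of single-base
    frequencies measured on the input sequence itself.
    """
    seq = seq.upper()
    total = len(seq) - word_size + 1
    # Count every overlapping word of the requested size.
    counts = Counter(seq[i:i + word_size] for i in range(total))
    base_freq = {base: seq.count(base) / len(seq) for base in set(seq)}
    table = []
    for word, n in sorted(counts.items()):
        observed = n / total
        expected = 1.0
        for base in word:
            expected *= base_freq[base]
        table.append((word, n, observed, expected, observed / expected))
    return table


if __name__ == "__main__":
    for row in word_composition("ACGTACGT", word_size=2):
        print(row)
```

Setting word_size to 3 gives the trimer table; a ratio well above or below 1 marks a word that is over- or under-represented relative to the base composition.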

D) Other Composition tools:

Other Composition Tools Brief Description

banana

  • The input is the B-DNA sequence for which bending and curvature are to be determined.
  • The output is in graphical as well as tabular form, showing the position of each bend and the base at which the bending occurs.

btwisted

  • The input is the B-DNA sequence for which twisting is to be calculated.
  • The output consists of the total number of twists, the average number of base pairs per turn and the stacking energy.

chaos

  • The input is the nucleic acid sequence, for which a chaos game representation plot is drawn.
  • The output reports the percentage and number of each base in the sequence.

wordcount

  • The input is the nucleic acid sequence. This tool counts the words of a specified size in the given sequence.
  • The output is a table listing the words of the specified size and the number of times each occurs in the sequence.

dan

  • The input is the nucleic acid sequence. This tool is used to calculate the melting temperature of the given sequence.
  • The output is a table showing the start and end positions of each segment as well as its Tm and GC percentage.

freak

  • The input is one or more nucleic acid sequences, and the tool looks for the specified bases in the input. It uses a sliding window approach.
  • The output is a table giving the frequency of the specified residues in each window.

isochore

  • The input is the nucleic acid sequence. This tool looks for isochores in the given sequence and calculates their occurrence.
  • The output is a table and a graph.

E) Literature relevant to Composition:

1) Manolis Kellis, Bruce W. Birren, Eric S. Lander (2004) Nature , 428 , 617–24.

2) Severson DW, DeBruyn B, Lovin DD, Brown SE, Knudson DL, Morlais I. (2004) Journal of Heredity, 95(2) : 103-13.

IX) Repeats:

A) Introduction:
A tandem repeat in DNA consists of two or more adjacent, approximate copies of a pattern of nucleotides. Tandem repeats are grouped into three classes depending on their size:
Satellites – size ranging from 100 kb to over 1 Mb.
Minisatellites – size ranging from 1 kb to 20 kb.
Microsatellites – repeat units ranging from 1 to 6 bp long.
Short Tandem Repeats (STRs), a type of microsatellite, are short sequences of DNA, normally 2-5 base pairs long, that are repeated numerous times in a head-to-tail manner. The polymorphism in STRs is due to the different numbers of copies of the repeat element that occur in a population of individuals. This high degree of polymorphism makes STRs very useful for DNA analysis in forensic cases and paternity testing.

B) Using Repeat tools:
Four tools in iNquiry find different types of repeats.
All of the tools need a nucleic acid sequence in fasta format as input. Additionally, the consensus of the repeat sequence, the number of mismatches allowed, the maximum and minimum repeat size and a threshold value need to be specified. The score is calculated by adding the scores for matches (positive) to the gap and mismatch penalties (negative), and it is compared with the threshold.

C) An example – etandem:
The input is a nucleic acid sequence in which repeats are to be found. The maximum and minimum repeat size as well as the threshold score need to be specified. A perfect match scores +1 and a mismatch scores -1, and the score is the sum of these values. Lowering the threshold allows a higher number of mismatches and thus a higher chance of false positives. The output is a table giving the start and end of each site, the score, the repeat size and the consensus sequence.
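The +1/-1 scoring can be illustrated with a short Python sketch. It is not the etandem algorithm: the function name is hypothetical, a single fixed repeat size is tested instead of a size range, each base is simply compared with the base one repeat unit earlier, and the first repeat unit of the best run is reported in place of a true consensus.

```python
def best_tandem_run(seq, repeat_size, threshold=1):
    """Return (score, start, end, first_unit) for the best run, or None.

    Each base equal to the base repeat_size positions earlier scores +1,
    each mismatch scores -1, and the running score never drops below zero.
    start and end are 1-based.
    """
    seq = seq.upper()
    best_score, best_start, best_end = 0, None, None
    score, run_start = 0, None
    for i in range(repeat_size, len(seq)):
        if seq[i] == seq[i - repeat_size]:
            if score == 0:
                run_start = i - repeat_size  # a new run begins one unit back
            score += 1
        else:
            score = max(0, score - 1)
        if score > best_score:
            best_score, best_start, best_end = score, run_start, i
    if best_start is None or best_score < threshold:
        return None
    first_unit = seq[best_start:best_start + repeat_size]
    return (best_score, best_start + 1, best_end + 1, first_unit)


if __name__ == "__main__":
    dna = "TTACACACACACGG"
    print(best_tandem_run(dna, repeat_size=2, threshold=5))
    print(best_tandem_run(dna, repeat_size=2, threshold=20))
```

With a threshold of 5 the AC repeat is reported, while a threshold of 20 rejects it, which mirrors the trade-off between sensitivity and false positives described above.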

D) Other Repeat tools:

Other Repeat Tools Brief Description

einverted

  • The input is the nucleic acid sequence. The gap score, mismatch score and threshold score need to be specified.
  • The output is an alignment of the sequence against its own reverse complement, similar to a self-by-self blast.

equicktandem

  • The input is a nucleic acid sequence, and the parameters include the size of the repeat and the threshold value.
  • The output is a table giving the start and end of each site, the repeat size and the score for each repeat.

palindrome

  • The input is a nucleic acid sequence, and the parameters include the maximum and minimum size of the repeat and the gaps and mismatches allowed.
  • The output is an alignment of the palindromic repeats.

E) Literature relevant to Repeats:

1) F. Denœud, G. Vergnaud (2004) BMC Bioinformatics , 5: 4.

2) Christian M. Ruitberg, Dennis J. Reeder and John M. Butler. (2001) Nucleic Acids Research, 29(1) : 320-22.

X) CONCLUSION:

Tools for analyzing the non-coding regions of a genome play an important role in in-silico gene identification, building phylogenies, the study of gene expression patterns and many other applications. They are often linked together into a chain of tools for analyzing genomes in the genome sequencing projects of vertebrates and other higher organisms.

REFERENCES:

1) Ashikawa, I. (2001) Plant Journal , 26 (6) : 617-25.

2) Rombauts S, Florquin K, Lescot M, Marchal K, Rouze P, van de Peer Y. (2003) Plant Physiology , 132 (3) : 1162-76.

3) Benson, G. (1999) Nucleic Acids Research 27(2) : 573-80.

4) wehi.edu.au/dsl/primer.html

5) http://cancer-seqbase.uchicago.edu/primers.html

6) bioweb.uwlax.edu/GenWeb/Molecular/Seq_Anal/seq_anal.htm

7) www.primerdesign.co.uk/primerdesign/primer_design.html

8) www.flybase.org

9) www.accessexcellence.org/AE/AEC/CC/restriction.html

10) opbs.okstate.edu/~melcher/MG/MGW4/MG421.html

11) EMBOSS instructional package http://egg.isu.edu/inquiry

12) www.web-books.com/MoBio/Free/Ch3G1.htm

13) Gardiner-Garden M, Frommer M. (1987) Journal of Molecular Biology, 196(2) : 261-82.

14) www.iupac.org/goldbook/C01071.pdf