Introduction to Recombinant Genetics - Biology 350
Coding Potential of DNA Sequences

Once a genome or large DNA segment has been sequenced, one of the first questions to be answered is, "Are there genes encoded in this sequence?" First, the DNA can be compared to all known sequences in the hope that a match will be found with corresponding genes. You can use the NCBI ORF Finder, which identifies and translates open reading frames (ORFs) and then searches for matching proteins with a BLAST search. If a match is found, you can use the matching ORF as evidence that the gene is real, and if the match from the database has a known function, you can even assign a putative function to your newly identified gene. If matches come from cDNA or ESTs from the same organism, this is strong evidence that the mRNA is expressed.

If the protein is not in the database, you will not find matches. That does not mean that there are no potential coding regions in the sequence. If the sequence is from a prokaryote, then prokaryotic signals for transcription initiation, ribosome binding, and transcription termination can aid in identifying ORFs that are likely to be transcribed and translated. If the sequence is from a eukaryote, promoter sequences are less well defined, and you have the additional problem of identifying coding regions in exons that are interspersed with introns.

Codon bias can aid in identifying true coding regions. Rare codons, those serviced by very low abundance tRNAs, are not usually found in expressed ORFs. To get a codon usage table, take known protein-coding reading frames (for example, SBMV c2 and c4) and place them into CodonW. Codon usage for whole genomes can also be calculated (see the Excel file for E. coli codon usage). The codon usage of a candidate ORF can then be analysed and compared to genomic usage tables.
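The ORF scan and codon usage comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how NCBI ORF Finder or CodonW are implemented. The function names (`find_orfs`, `codon_usage`, `rare_codon_fraction`), the minimum ORF length, and the rarity threshold are all hypothetical choices made for this example. It assumes the standard stop codons, ATG as the only start codon, and an unambiguous ACGT sequence.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code assumed
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Yield (strand, frame, start, end, orf_seq) for ATG...stop ORFs.

    All six reading frames are scanned. start/end are 0-based offsets on
    the strand that was scanned; min_codons excludes the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None                   # look for the next ATG


def codon_usage(orfs):
    """Return codon frequency per 1000 codons over a list of ORF strings."""
    counts = Counter()
    for orf in orfs:
        counts.update(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    total = sum(counts.values())
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


def rare_codon_fraction(orf, usage, threshold=5.0):
    """Fraction of codons in orf rarer than threshold (per 1000) in usage."""
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    if not codons:
        return 0.0
    rare = sum(1 for c in codons if usage.get(c, 0.0) < threshold)
    return rare / len(codons)
```

In use, a reference table would be built with `codon_usage` from known genes of the organism, and each candidate from `find_orfs` scored with `rare_codon_fraction`. A candidate with a high fraction of rare codons is less likely to be a true coding region, although the cutoff is a judgment call.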
Several programs combine multiple search techniques, training, and refinement to locate eukaryotic genes. GRAIL is the most widely used ORF identification tool; it analyses the protein coding potential of a DNA sequence.
GRAIL uses variable-length windows tailored to each potential exon candidate,
defined as an open reading frame bounded by a pair of start/donor, acceptor/donor
or acceptor/stop sites. This scheme facilitates the use of more genomic
context information (splice junctions, translation starts, non-coding
scores of 60-base regions on either side of a putative exon) in the exon
recognition process. GRAIL finds about 91% of all coding regions with
an apparent false positive rate of 8.6%.

GenLang is a syntactic pattern recognition system, which uses the tools
and techniques of computational linguistics to find genes and other higher-order
features in biological sequence data. Patterns are specified by means
of rule sets called grammars, and a general purpose parser, implemented
in the logic programming language Prolog, then performs the search.

GeneID is a program that predicts genes in genomic sequences using a hierarchical approach. It first identifies start and stop codons as well as splice sites; next, exons are built; and last, the gene structure is constructed. The latest version also takes into account sequence similarities and experimental evidence or predictions from other programs.

GenScan is a general-purpose eukaryotic gene prediction program that identifies complete gene structures in genomic DNA and is highly accurate in its predictions.

GeneFinder offers some unusual custom algorithms: "The algorithm first predicts all possible potential internal exons, and potential 5' and 3'-exon for each internal by linear discriminant functions combining characteristics describing various contextual features of these exons. Then by method of dynamic programming it searches for optimal combination of these exons and construct gene model."

Glimmer (Gene Locator and Interpolated Markov ModelER) is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. It uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA.

Gene prediction tools and servers:
AAT (Analysis and Annotation Tool for Finding Genes in Genomic Sequences), Michigan (USA)
Eukaryotic Promoter Prediction by Neural Network, LBNL (USA)
FramePlot, NIH-NET (Japan)
Gene Finder (Human, Mouse, Arabidopsis, Fission Yeast), CSHL (USA)
Geneid (Gopher interface and e-mail), SDSC (USA)
GeneID-3 Search Form, IMIM, Barcelona (Spain)
Genie (gene finder based on Hidden Markov Models), LBNL (USA)
GenLang (tRNA, group I intron, protein gene: "Linguistic Methods"), CBIL, Pennsylvania (USA)
GenScan (identification of complete gene structures in genomic DNA), Stanford (USA)
GenView (protein-coding gene prediction), ITBA (Italy)
GRAIL (exons, repeats, poly-A sites, CpG), ORNL, Oak Ridge (USA)
GLIMMER: Gene Locator and Interpolated Markov Modeler (a system for finding genes in microbial DNA)
HCpolya (Hamming Clustering poly-A prediction in eukaryotic genes), ITBA (Italy)
HCtata (Hamming Clustering method for TATA signal prediction in eukaryotic genes), ITBA (Italy)
NetGene2, CBS (Denmark)
NetPlantGene V2.0 (neural network prediction of splice sites in Arabidopsis thaliana DNA), CBS (Denmark)
ORF Finder (graphical analysis tool), NCBI (USA)
ORFGene (gene structure prediction using homologous proteins), ITBA (Italy)
PredictGenes, CBRG, Zurich (Switzerland)
Procrustes WWW server (gene recognition via spliced alignment), USC (USA)
Proscan II (predicts putative eukaryotic Pol II promoter sequences)
Putative DNA Sequencing Errors Check, EMBL-Bork (Germany)
RecSta (coding region (CDS or exon) prediction program using COA), Lyon (France)
Splice Site Prediction by Neural Network, LBNL (USA)
SpliceView (splice prediction using consensus sequences), ITBA (Italy)
tRNAscan (search for transfer RNA genes in genomic sequence), Washington (USA)
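The Glimmer description above mentions Markov models for separating coding from noncoding DNA. A much simpler relative of that idea is sketched below: two fixed-order Markov chains, one trained on known coding sequence and one on noncoding sequence, and a log-odds score comparing them. This is not Glimmer's algorithm. Glimmer interpolates between models of several orders, whereas this sketch uses a single fixed order with additive smoothing. All names and parameters (`train_markov`, `coding_log_odds`, `order`, `pseudocount`) are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate log P(next base | previous `order` bases) from training DNA."""
    counts = defaultdict(lambda: {b: 0.0 for b in ALPHABET})
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in counts[context]:        # skip ambiguous bases such as N
                counts[context][base] += 1.0
    model = {}
    for context, row in counts.items():
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        model[context] = {b: math.log((n + pseudocount) / total)
                          for b, n in row.items()}
    return model


def log_likelihood(seq, model, order=2):
    """Sum of log transition probabilities; unseen contexts count as uniform."""
    uniform = math.log(1.0 / len(ALPHABET))
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        row = model.get(seq[i - order:i])
        total += row.get(seq[i], uniform) if row else uniform
    return total


def coding_log_odds(seq, coding_model, noncoding_model, order=2):
    """Positive values mean seq looks more like the coding training set."""
    return (log_likelihood(seq, coding_model, order)
            - log_likelihood(seq, noncoding_model, order))
```

With realistic training sets (annotated genes versus intergenic regions of the same organism), a candidate ORF scoring well above zero resembles coding sequence more than noncoding sequence. The same `order` must be used for training and scoring.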
© 2005 by CA Rinehart
This material is intended for use only by WKU students registered for Biology 350. Other uses prohibited.