
Gene 461 (2010) 1–4

Contents lists available at ScienceDirect

Gene

journal homepage: www.elsevier.com/locate/gene

Review

An overview of the current status of eukaryote gene prediction strategies

Roy D. Sleator ⁎

Department of Biological Sciences, Cork Institute of Technology, Rossa Avenue, Bishopstown, Cork, Ireland

Article info

Article history:
Received 22 March 2010
Received in revised form 15 April 2010
Accepted 16 April 2010
Available online 27 April 2010

Received by A.J. van Wijnen

Keyword:
Gene prediction

Abstract

As sequence data continues to be generated at a logarithmic rate our dependence on effective in silico gene prediction methods is also increasing. Herein, I review the current state of eukaryote gene prediction methods; their strengths, weaknesses and future directions.

© 2010 Elsevier B.V. All rights reserved.

Abbreviations: MM, Markov model; PWM, Positional weight matrices; WAM, Weight array model; IMM, Interpolated Markov models; HMM, Hidden Markov models; GHMM, Generalized Hidden Markov models; SMCRF, Semi-Markov conditional random field; EHMM, Evolutionary Hidden Markov Models; UTRs, un-translated regions; FASTA, Fast-all; BLAST, basic local alignment search tool; TSS, Transcriptional signal sensors; GFF, general feature format; ncRNA, non-coding RNAs; miRNAs, MicroRNAs.

⁎ Fax: +353 21 432 6851.
E-mail address: [email protected].

1. Introduction

The publication of the draft human genome sequence in 2001 marked a watershed in the genomics era (Lander et al., 2001; Venter et al., 2001). However, rather than heralding the end of large scale sequencing projects, the completion of the human genome project enabled sequencing facilities to turn their considerable resources to even more ambitious projects (Sleator et al., 2008). One exciting example is the human microbiome initiative, which aims to sequence the totality of all microbes in, or on, the human body, thus providing us with an extended view of ourselves as super-organisms (Turnbaugh et al., 2007; Sleator, 2010).

However, as the sequence data continues to increase logarithmically, our ability to annotate the information and to accurately pinpoint coding regions has lagged considerably behind. While prokaryote gene annotation can be complicated by overlapping regions, which make identification of translation start sites difficult (Palleja et al., 2008), eukaryote gene structure prediction is even more complex (Lewis et al., 2000). In addition to low-density coding sequence (∼3% in human DNA), eukaryote coding regions (exons) are often widely interspersed with non-coding intervening sequences (introns) (Lander et al., 2001; Venter et al., 2001). Furthermore, eukaryote coding sequences are subject to alternative splicing, a process of shuffling genetic information which facilitates the synthesis of more than one protein from a single gene sequence (Schellenberg et al., 2008). Indeed, it is estimated that more than 95% of human genes show evidence of at least one alternative splice site.

Current in silico gene prediction methods involve two distinct aspects: the first centres on the type of information utilized by gene prediction programs (i.e. the evidence for the existence of a gene based on functional signal recognition), while the second involves the algorithms employed by these programs to accurately predict gene structure and organisation (Fig. 1).

Although gene prediction has been the subject of several excellent review papers (Do and Choi, 2006; Brent, 2007; Flicek, 2007), the current study is designed to appeal to both the expert and non-expert reader alike, providing a concise overview of the current status of eukaryote gene prediction models. It begins with a brief overview of sensor recognition and gene-finders, their strengths and weaknesses, before concluding with an outline of some outstanding issues which still need to be addressed.

2. Functional sensor recognition

Just as large-scale genome sequencing projects relied on the existence of molecular markers to facilitate genome assembly, gene annotation strategies rely on sensors within the DNA sequence to allow accurate delineation of gene structure and organisation (Mathe et al., 2002). Two types of sensor (content and signal) are routinely used to locate genes in the genomic sequence:

(i) Content sensors classify DNA into coding regions and non-coding regions (introns, intergenic regions and un-translated regions (UTRs)). Content sensors can be further divided into extrinsic and intrinsic sensors. Based on the assumption that coding sequences are more conserved than non-coding ones (Mathe et al., 2002), extrinsic content sensors exploit homology searches to identify highly conserved exons. Using local alignment methods (ranging from the optimal Smith–Waterman algorithm

0378-1119/$ – see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.gene.2010.04.008


Fig. 1. Schematic overview of eukaryote gene prediction methods and the underlying sensors routinely used to locate genes in genomic sequences.

to fast heuristic approaches such as FASTA and BLAST), two approaches can be employed. The first relies on inter-genomic (cross-species) comparisons; while this strategy may be effective, it is limited by the constraints of phylogenetic distance. The second approach overcomes this limitation by employing intra-genomic comparisons, which provide data for multigenic families representing a large percentage of existing genes (e.g. up to 80% for Arabidopsis). A significant failing of extrinsic approaches is that they are limited to homologies within the database; if no homologs exist no data can be extracted.

Intrinsic content sensors, on the other hand, focus on specific innate characteristics of the DNA sequence itself, which help to predict the likelihood of whether the sequence in question “codes” for a protein or not. The most obvious indicator of coding versus non-coding sequence identified to date is hexamer frequency (i.e. 6 nucleotide long words) (Mathe et al., 2002). Other useful intrinsic content sensors include nucleotide composition, codon usage and base occurrence periodicity. Coding regions are defined by three Markov models (MMs; see Box 1), one for each position inside a codon. These three-periodic MMs are based on the k-mer (especially hexamer) composition of coding sequence and are trained on a set of known sequences before being used to detect a particular content. Region-specific content sensors for coding and non-coding regions or even for different subtypes of non-coding regions have been developed (Mathe et al., 2002).

(ii) Signal sensors detect the presence of functional sites specific to a gene. To date signals relating to transcription, translation and splicing have all been employed to facilitate gene identification and structure prediction. Transcriptional signal sensors (TSS) include the initiator or cap signal located at the transcriptional start site and the upstream TATA box promoter signal, as well as the polyadenylation signal (a consensus AATAAA hexamer) located 20 to 30 bp downstream of the coding region. Translational signals include the “Kozak signal” located immediately upstream of the start codon (Kozak, 1996). However, given that higher eukaryote genes in particular harbour multiple exons, accurate gene structure prediction in these organisms relies heavily on the identification of splice site signals (Stamm, 2008), specifically donor and acceptor sites (GT-AG on the intron sequence) and branch points (CU[A/G]A[C/U] located 20–50 bp upstream of the AG acceptor).

3. Gene predictor programs: strengths and weaknesses

Gene prediction programs can be divided into two classes: empirical and ab initio gene-finders.

(i) Empirical gene predictors, also referred to as sequence similarity based gene-finders, identify genes based on homology searches of known databases (genomic DNA, cDNA, dbEST, or protein). The comparison of two (or more) homologous genomic sequences (either inter- or intraspecies) facilitates the identification of conserved exons. When combined with signal sensors this information allows us to refine region boundaries and more accurately model gene structure and organisation.

(ii) Ab initio (or de novo) gene-finders rely on sequence information afforded by both signal and content sensors (Do and Choi, 2006). The algorithms employed by these programs to model gene structure include neural networks, Fourier transforms and, most commonly, Markov models (outlined in Box 1). Ab initio gene-finders can be categorized based on the number of genome sequences employed for gene analysis and include single, dual and multiple-genome predictors (Brent and Guigo, 2004). Single-genome predictors, such as GENSCAN (Burge and Karlin, 1997), which focus exclusively on one genome, are comparatively faster and easier to run than the equivalent dual or multi-genome predictors. Furthermore, given that only one genome is considered, single-genome predictors are not restricted by phylogenetic distance, i.e. the availability of closely related genomic sequences. However, dependence on a single genome can be restrictive, particularly given that newly sequenced genomes may contain as few as 50% known genes from which to estimate model parameters (Guigo et al., 2000). To help overcome this limitation, dual-genome predictors, such as TWINSCAN (Flicek et al., 2003), have been developed to exploit sequence conservation between two related genomes (e.g. mouse and man). Alignments are performed first and the resulting data is used to inform prediction algorithms such as Hidden Markov Models (outlined in Box 1). However, there remain inherent uncertainties in reconstructing the lineages of genomic regions for two such distantly related organisms as human and mouse, owing to the extent of genomic restructuring which has occurred since their last common ancestor. Given that the genomes of


Box 1
An overview of Markov Models in sequence analysis and gene prediction.

A Markov model (MM) is a stochastic model which assumes that the probability of a particular nucleotide occurring at a given position depends only on the k previous nucleotides. In this case k is the order of the MM, the larger k the finer the MM can characterize dependencies between adjacent nucleotides. Such a model is defined by the conditional probabilities P(X | k previous nucleotides), where X = A, T, G or C. In order to build a Markov model, a learning set of sequences, on which these probabilities will be estimated, is required.

The most frequently used categories of MMs in eukaryote gene prediction methods are outlined below:

Positional weight matrices (PWM): The simplest MMs are homogeneous zero order MMs which assume that each base occurs independently with a given frequency. Such simple models are often used for non-coding regions.

Weight array model (WAM): An inhomogeneous higher order MM capable of capturing potential dependencies between adjacent positions of a signal.

Three-periodic Markov model: Characterize coding sequences. Coding regions are defined by three MMs, one for each position inside a codon.

Interpolated Markov models (IMM): IMMs combine statistics from several MMs, from order zero to a given order k (typically k = 8), according to the information available.

Hidden Markov models (HMM): HMMs allow for insertions and deletions and so variation in signal length.

Generalized Hidden Markov models (GHMM): GHMMs allow a string, rather than a single symbol, as the output of a state.

Semi-Markov conditional random field (SMCRF): A more flexible variation of GHMM which allows a wider range of biological features to be incorporated with fewer technical concerns (Bernal et al., 2007).

Evolutionary Hidden Markov Models (EHMM): EHMMs model molecular evolution as a Markov process in two dimensions: a substitution process over time at each site in the aligned genomes, which is guided by a phylogenetic tree; and a process by which the rate of evolution changes from one site to the next (Brent and Guigo, 2004).
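
The definition at the top of Box 1 can be illustrated with a minimal Python sketch. It estimates the conditional probabilities P(X | k previous nucleotides) from a learning set and then compares a coding-trained model with a non-coding-trained model using a log-likelihood ratio. The function names, the pseudocount smoothing and the use of a log-likelihood ratio as a content score are assumptions made for illustration only; they are not taken from any of the programs discussed in this review.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_model(sequences, k, pseudocount=1.0):
    """Estimate P(X | k previous nucleotides) from a learning set.

    Returns a function prob(context, base). The pseudocount is an
    illustrative smoothing choice so that unseen contexts never give
    a probability of zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(k, len(seq)):
            context, base = seq[i - k:i], seq[i]
            # Skip windows containing ambiguous symbols such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1.0

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, k):
    """Sum of log P(base | k previous bases) over the sequence."""
    seq = seq.upper()
    return sum(math.log(prob(seq[i - k:i], seq[i]))
               for i in range(k, len(seq)))


def coding_score(seq, coding_prob, noncoding_prob, k):
    """Log-likelihood ratio: positive values favour the coding model."""
    return (log_likelihood(seq, coding_prob, k)
            - log_likelihood(seq, noncoding_prob, k))


# Toy learning sets (hypothetical sequences, far too small for real use).
coding_model = train_markov_model(["ATGGCCGCCGCCGCCGCC"], k=2)
noncoding_model = train_markov_model(["ATATATATATATATAT"], k=2)
score = coding_score("GCCGCCGCC", coding_model, noncoding_model, k=2)
```

A three-periodic model of the kind listed in the box could be sketched along the same lines by training three such models, one per codon position, and selecting the model according to `i % 3`. Real gene-finders use far larger learning sets and higher orders (the box mentions hexamers and orders up to k = 8).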

the higher primates can be aligned much more accurately than those of human and mouse, it would appear that multi-genome predictors involving several closely related species (using EHMMs) are significantly more attractive than dual-genome predictors focusing on two distantly related genomes (Boffelli et al., 2003).

(iii) Combining gene-predictor outputs. Coupling the extrinsic approach of empirical gene-finders with intrinsic ab initio prediction programs significantly improves gene prediction protocols (Allen et al., 2004). GenomeScan, developed by Yeh and others (2001), is an extension of GENSCAN which incorporates similarity with a protein retrieved by BLASTX, thus combining extrinsic and intrinsic approaches to gene identification. Using GenomeScan, regions of higher similarity (on the basis of BLASTX E-values) are accorded more confidence than comparable regions of lower similarity. Thus, GenomeScan predictions may sometimes ignore a region that has either weak intrinsic properties (e.g. poor splice signals) or is inconsistent with other extrinsic information. As a result, GenomeScan accuracy is significantly higher than that of GENSCAN when related sequences are available.

4. Conclusions and future prospects

Although significant advances continue to be made in the gene prediction arena, several issues still need to be addressed (Do and Choi, 2006). As outlined previously by Claverie (1997), existing sensors relying on known sequences, either in the form of training sets or databases, are highly conservative and as such relatively inflexible. Furthermore, accuracy of gene prediction is highly dependent on database quality; while in extrinsic gene prediction erroneous data only affects the analysed data itself, in intrinsic prediction it can lead to corrupted training sets which dramatically affect overall program performance. In addition to problems with the data, the design of the programs themselves can be problematic. Until recently little commonality existed between newly developed gene prediction programs (Mathe et al., 2002). Little or no equivalence in outputs or vocabulary made cooperative data analysis by more than one program difficult if not impossible. By designing a general feature format (GFF) to standardise all gene predictor outputs it will be possible to develop common tools for downstream analysis: evaluation, graphical representation and ultimately the development of combination predictors.

Other factors complicating eukaryote genome prediction include the presence of extended introns (e.g. the human dystrophin gene consists of >99% introns, some of which are >100 kb). This is particularly problematic when such introns border short exons; for example, some Arabidopsis genes contain exons which are only 3 bp long, making them extremely difficult to detect, especially given that missing such exons does not introduce a frame shift (Mathe et al., 2002).

In addition, unusual examples of eukaryote gene structure and function continue to be identified; overlapping genes for example, although more characteristic of prokaryotes, have nonetheless been reported in the genomes of both plants and animals (Quesada et al., 1999). Furthermore, though originally believed to occur exclusively in prokaryotes, polycistronic genes have also been identified in the eukaryote Caenorhabditis elegans (Blumenthal, 1998). As non-canonical cases continue to be uncovered, ever increasing levels of sophistication will be required from newly designed gene prediction methods.

To further complicate the issue, in addition to protein-coding genes, a large proportion of the human genome gives rise to RNA sequences that do not encode proteins (Taft et al., 2010). Known as non-coding RNAs (ncRNAs), these genes are predicted to play an important role in the regulation of eukaryote gene expression (Forrest et al., 2009; Oulas et al., 2009). Indeed, microRNAs (miRNAs), a subgroup of ncRNAs, are predicted to control the activity of approximately 30% of all protein-coding genes in mammals (Li et al., 2009). Given that a significant fraction of ncRNAs are short and/or poorly


conserved in sequence, the conceptually simple approach of homology-based transfer becomes a complex and technically demanding task, one which is further complicated by a paucity of information on RNA families. Although several recent efforts to customize sequence-based search tools for ncRNA applications have shown some success, such as the use of semi-global alignments and the development of methods for fragmented pattern search (Mosig et al., 2009), much still needs to be achieved in this area.

Finally, irrespective of the level of sophistication achieved, or the reliability of the data obtained, gene prediction methods remain just that: predictions. In silico analysis must always be followed by in vitro and/or in vivo “wet lab” experimentation to confirm the existence of a putative gene and the functionality of its predicted protein product.

Acknowledgments

The author wishes to acknowledge the continued financial assistance of the Health Research Board (HRB), the Food Institutional Research Measure (FIRM) through the Department of Agriculture and the Alimentary Pharmabiotic Centre (APC) through Science Foundation Ireland (SFI).

References

Allen, J.E., Pertea, M., Salzberg, S.L., 2004. Computational gene prediction using multiple sources of evidence. Genome Res. 14, 142–148.
Bernal, A., Crammer, K., Hatzigeorgiou, A., Pereira, F., 2007. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput. Biol. 3, e54.
Blumenthal, T., 1998. Gene clusters and polycistronic transcription in eukaryotes. Bioessays 20, 480–487.
Boffelli, D., et al., 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299, 1391–1394.
Brent, M.R., 2007. How does eukaryotic gene prediction work? Nat. Biotech. 25, 883–885.
Brent, M.R., Guigo, R., 2004. Recent advances in gene structure prediction. Curr. Opin. Struct. Biol. 14, 264–272.
Burge, C., Karlin, S., 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94.
Claverie, J.M., 1997. Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6, 1735–1744.
Do, J.H., Choi, D.K., 2006. Computational approaches to gene prediction. J. Microbiol. 44, 137–144.
Flicek, P., 2007. Gene prediction: compare and CONTRAST. Genome Biol. 8, 233.
Flicek, P., Keibler, E., Hu, P., Korf, I., Brent, M.R., 2003. Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res. 13, 46–54.
Forrest, A.R.R., Abdelhamid, R.F., Carninci, P., 2009. Annotating non-coding transcription using functional genomics strategies. Brief. Funct. Genomics Proteomics 8, 437–443.
Guigo, R., Agarwal, P., Abril, J.F., Burset, M., Fickett, J.W., 2000. An assessment of gene prediction accuracy in large DNA sequences. Genome Res. 10, 1631–1642.
Kozak, M., 1996. Interpreting cDNA sequences: some insights from studies on translation. Mamm. Genome 7, 563–574.
Lander, E.S., et al., 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.
Lewis, S., Ashburner, M., Reese, M.G., 2000. Annotating eukaryote genomes. Curr. Opin. Struct. Biol. 10, 349–354.
Li, M., Marin-Muller, C., Bharadwaj, U., Chow, K.H., Yao, Q., Chen, C., 2009. MicroRNAs: control and loss of control in human physiology and disease. World J. Surg. 33, 667–684.
Mathe, C., Sagot, M.F., Schiex, T., Rouze, P., 2002. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 30, 4103–4117.
Mosig, A., Zhu, L., Stadler, P.F., 2009. Customized strategies for discovering distant ncRNA homologs. Brief. Funct. Genomics Proteomics 8, 451–460.
Oulas, A., Reczko, M., Poirazi, P., 2009. MicroRNAs and cancer—the search begins! IEEE Trans. Inf. Technol. Biomed. 13, 67–77.
Palleja, A., Harrington, E.D., Bork, P., 2008. Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions? BMC Genomics 9, 335.
Quesada, V., Ponce, M.R., Micol, J.L., 1999. OTC and AUL1, two convergent and overlapping genes in the nuclear genome of Arabidopsis thaliana. FEBS Lett. 461, 101–106.
Schellenberg, M.J., Ritchie, D.B., MacMillan, A.M., 2008. Pre-mRNA splicing: a complex picture in higher definition. Trends Biochem. Sci. 33, 243–246.
Sleator, R.D., 2010. The human superorganism—of microbes and men. Med. Hypotheses 74, 214–215.
Sleator, R.D., Shortall, C., Hill, C., 2008. Metagenomics. Lett. Appl. Microbiol. 47, 361–366.
Stamm, S., 2008. Regulation of alternative splicing by reversible protein phosphorylation. J. Biol. Chem. 283, 1223–1227.
Taft, R.J., Pang, K.C., Mercer, T.R., Dinger, M., Mattick, J.S., 2010. Non-coding RNAs: regulators of disease. J. Pathol. 220, 126–139.
Turnbaugh, P.J., Ley, R.E., Hamady, M., Fraser-Liggett, C.M., Knight, R., Gordon, J.I., 2007. The human microbiome project. Nature 449, 804–810.
Venter, J.C., et al., 2001. The sequence of the human genome. Science 291, 1304–1351.
Yeh, R.F., Lim, L.P., Burge, C.B., 2001. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816.

