Genomics and Its Significance – I

Every biological organism has an underlying map that governs its many complex attributes, from anatomy and physiology to the behavioral patterns characteristic of it. Understanding this map is therefore crucial to solving many of the problems and challenges surrounding organisms. The map is generally referred to as the genome, i.e., the complete ensemble of DNA in a haploid set of chromosomes of an organism, and genomics is the field devoted to its study.

This first part of the article briefly explores the field of genomics and how it has evolved over the years; the second part will discuss its significance and impact on society.

Practices central to genomics

Mendel’s observations on heritable traits paved the way for the field of genetics, which primarily focuses on the elements (genes) responsible for the different traits that characterize a particular organism or species, and on their heritability from one generation to the next. Genomics, in contrast, revolves around the complete genetic make-up, the genome; it therefore describes all genes and their interactions with one another and with the environment. To achieve this, genomics relies on several central practices: sequencing, mapping and assembly of genomes; development of technologies to analyze the raw sequence data; and analysis of that data, using databases and computational methods, to produce useful information.

Sequencing and assembly

Rosalind Franklin’s X-ray crystallographic observations, followed by Watson and Crick’s discovery of the structure of DNA, which also suggested how DNA could store genetic information, reinforced the need to identify the exact order of deoxyribonucleotides in a given DNA sequence. From then on, sequencing became crucial to genomics.
Frederick Sanger, Allan Maxam and Walter Gilbert were among the pioneers in the development of DNA and RNA sequencing methods. However, the first generation of sequencing methods was limited to relatively small nucleic acid molecules and genes.

Figure 1. Bacteriophage MS2 viewed from outside of its protein capsid

The first complete genome to be sequenced was the RNA genome of the phage MS2 (3569 nucleotides), in 1976. As the need to sequence the genomes of larger organisms grew, scientists used genomic (DNA) libraries to develop genome sequencing methods such as the clone-by-clone method and whole-genome shotgun sequencing.

Increased automation and improved sequencing machines later reduced the time and cost of sequencing.

These sequencing technologies were complemented by methods of genome mapping, in which the locations of genes are mapped according to different criteria. The first form of genetic mapping emerged from Thomas Hunt Morgan’s experiments on the fruit fly, in which he observed gene linkage and recombination. By using linkage (more precisely, the frequency of recombination between linked genes) to identify the relative positions of genes, a map of a fruit fly chromosome was created, which is referred to as a genetic linkage map. The discovery of polymorphic DNA markers allowed for the creation of genetic (linkage) maps with higher resolution. Physical maps were another development: instead of locating the relative positions of genes, overlapping physical fragments of DNA are aligned to create a map that predicts the true positions of genes.
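The logic of linkage mapping can be illustrated with a short Python sketch. It assumes the common convention that 1% recombination corresponds to one map unit (centimorgan), which holds only approximately and for closely linked genes. The gene names, offspring counts and function names are hypothetical and chosen purely for illustration.

```python
def recombination_frequency(recombinants, total):
    """Percentage of offspring that are recombinant for a pair of genes.

    For closely linked genes this percentage is commonly read as the
    map distance in centimorgans (an approximation, assumed here).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * recombinants / total


def order_three_genes(distances):
    """Infer the linear order of three linked genes.

    `distances` maps frozenset({gene_a, gene_b}) to a map distance.
    The pair with the largest distance is taken to be the two outer
    genes, so the remaining gene lies between them.
    """
    outer_pair = max(distances, key=distances.get)
    all_genes = set().union(*distances)
    middle = (all_genes - outer_pair).pop()
    left, right = sorted(outer_pair)
    return [left, middle, right]


# Hypothetical cross: 1000 offspring scored for three genes a, b, c.
distances = {
    frozenset({"a", "b"}): recombination_frequency(30, 1000),
    frozenset({"b", "c"}): recombination_frequency(70, 1000),
    frozenset({"a", "c"}): recombination_frequency(95, 1000),
}
print(order_three_genes(distances))
```

Note that the a-c distance (9.5) is slightly smaller than the sum of a-b and b-c (3 + 7), because double crossovers go undetected; this is one reason recombination percentages only approximate true map distance.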

Figure 2. Illustration of a genetic and physical map.

These developments, together with improved genomic libraries, paved the way for the sequencing and assembly of larger genomes such as those of phage λ and the B95-8 strain of Epstein-Barr virus. They led to even larger genome projects, including the sequencing of the yeast genome and, most widely known, the Human Genome Project, which ran from 1990 to the early 2000s. Furthermore, all these sequencing and genome assembly efforts were made possible by the advent of computing technologies (efficient algorithms) in the 1980s. Some of the first genome assembly algorithms were greedy assemblers; later, assemblers based on overlap-layout-consensus, Eulerian paths (based on de Bruijn graphs) and align-layout-consensus (reference genome-based assembly) were introduced.
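To give a feel for the greedy approach, here is a minimal Python sketch. It is a toy illustration under simplifying assumptions (error-free reads, a single strand, no repeats), not a reconstruction of any particular historical assembler; the function names and the reads are made up for the example.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Stops when one sequence remains or no pair overlaps by at least
    `min_len` bases, and returns the remaining contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three hypothetical overlapping reads from one short sequence.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Because it always takes the locally best merge, a greedy assembler can be misled by repeated sequences, which is one motivation for the graph-based approaches that followed.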

Processing raw sequence data

Given the enormous amounts of data generated by sequencing efforts, scientists were now confronted with the problem of what could be done with all this raw data. The first efforts to resolve this resulted in the formation of global databases that allowed sequence data to be stored and accessed worldwide. The first such database was the Nucleotide Sequence Data Library of the European Molecular Biology Laboratory (EMBL), which is now part of the European Nucleotide Archive (ENA). It was followed by the creation of GenBank, now maintained by the NCBI at the NIH, and the DNA Data Bank of Japan (DDBJ). By joint agreement of these three parties, the International Nucleotide Sequence Database Collaboration (INSDC) was formed to give universal access to the sequence information in all three databases, regardless of which one is queried. In fact, it later became compulsory for researchers to submit their sequence data to one of these databases before publishing their results.

Concurrent developments in information technology and computer science in the latter part of the 20th century provided the technology needed to analyze sequence data. They eased the processing of raw sequence data, enabling the extraction and visualization of the valuable information encoded in the genome of an organism. The process of extracting, identifying, and defining the features associated with a genome is called genome annotation. Bioinformatics plays a major role in this regard.

The information or features to be extracted and identified include genomic regions that do not encode proteins, protein-coding regions and, finally, the functions of these elements in relation to a single cell or organism. Genes are identified using homology-based (extrinsic) methods or ab initio (intrinsic) methods. The development of algorithms that produce alignments between sequences (global or local) provided the stepping stones to achieving this task.
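As an illustration of global alignment, the following Python sketch implements the classic Needleman-Wunsch dynamic programming scheme. The scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions made for the example; real tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of sequences `a` and `b`.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(top)
print(bottom)
```

A local alignment (in the style of Smith-Waterman) follows the same pattern, except that cell scores are not allowed to drop below zero and the traceback starts from the highest-scoring cell rather than the corner.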

Extrinsic methods rely on sequence similarity between the genomic DNA and available protein sequences, ESTs (expressed sequence tags), cDNA, or other genomic DNA (i.e., comparative genomics) to identify genes. Software programs like GeMoMa (Gene Model Mapper) use this method. Intrinsic methods utilize properties characteristic of the sequence itself, such as GC content, codon composition (codon usage bias), and start and stop codons. Early gene-prediction programs like DAGGER, GeneMark and GeneModeler used this method, as do GENSCAN and FGENESH, whereas programs such as GenomeScan and Twinscan take an integrative approach, combining homology-based and ab initio methods to predict genes.
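Two of the intrinsic signals mentioned above, GC content and start/stop codons, are simple enough to sketch in Python. This is a deliberately naive illustration and not how the programs named above work internally: it scans only the forward strand, assumes the standard stop codons and ATG as the only start codon, and uses a made-up sequence and hypothetical function names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of bases in `seq` that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_orfs(seq, min_codons=2):
    """Find open reading frames on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon.
    Returns (start, end) pairs as 0-based, end-exclusive coordinates
    that include the stop codon. `min_codons` counts the codons before
    the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


sequence = "CCATGAAATGGTAACC"  # hypothetical example sequence
print(gc_content(sequence))
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

Real ab initio predictors combine many such signals in probabilistic models, and in eukaryotes they must also account for introns, which a plain ORF scan ignores.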

Non-coding regions, such as those transcribed into rRNAs and tRNAs, and regulatory regions have been predicted using sequence similarity and programs like tRNAscan-SE (for de novo prediction of tRNAs). Their prediction is nevertheless still seen as a challenge, since only a limited proportion of them have been identified even through experimental methods. Finally, functional annotation of genes is carried out using a standard vocabulary called the Gene Ontology, which was created through the joint effort of researchers from three databases: the Saccharomyces Genome Database, FlyBase, and the Mouse Genome Database. The Gene Ontology describes genes in terms of molecular functions, broader biological processes, and the cellular components in which the gene products are found or function.

The preceding exploration certainly does not do justice to the efforts taken to develop the relevant technologies, or to the ongoing work on decoding the genomes of organisms; it offers only a peek into the world of genomics. The next part of the article will dive into the significance and impact that genomics has had on present-day society and its advancement.

Savindu Weerathunga

3rd Year

References

  1. Chaitanya, K. V. (2019). From Archaea to Eukaryotes. In Chaitanya, K. V. (Ed.). Genome and genomics. Springer Singapore. https://doi.org/10.1007/978-981-15-0702-1
  2. García-Sancho, M., & Lowe, J. (2023). A History of Genomics across Species, Communities and Projects. Springer Nature. https://doi.org/10.1007/978-3-031-06130-1
  3. Giani, A. M., Gallo, G. R., Gianfranceschi, L., & Formenti, G. (2020). Long walk to genomics: History and current approaches to genome sequencing and assembly. Computational and Structural Biotechnology Journal, 18, 9–19. https://doi.org/10.1016/j.csbj.2019.11.002
  4. Mathé, C. (2002). Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Research, 30(19), 4103–4117. https://doi.org/10.1093/nar/gkf543
  5. (2019, March 9). Genetics vs. Genomics Fact Sheet. Genome.gov. https://www.genome.gov/about-genomics/fact-sheets/Genetics-vs-Genomics

Image Courtesy

  1. Figure 1 – https://shorturl.at/cnAV0
  2. Figure 2 – https://shorturl.at/mBJQ2
