Team:UNAM-Genomics Mexico/Optimization





See also our optimization control.


Synthetic biology often aims to bring genes from different organisms together in a single host in order to combine the most efficient components of each system. However, different organisms have different codon usage, which can affect the fitness of the host and reduce the efficiency of the system. To avoid this problem, our team developed software that changes the codon usage of one or more genes to mimic the codon usage of the host organism and to obtain a better ΔG for translation initiation.

Optimization description

Some tRNAs are more abundant than others. The most abundant tRNAs are those used by constitutively expressed genes, while the less abundant ones are used by genes involved in specific pathways that are not needed all the time.

The codon adaptation index (CAI) measures how closely the codon usage of a sequence matches the codon usage of a reference gene set; this set is chosen according to the codon usage you want to mimic. To avoid knocking down a particular pathway and to ensure the availability of charged tRNAs, the reference set we chose to mimic consists of constitutively expressed genes.
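As we understand the definition given by Sharp and Li (1987, listed in the references), the index can be written as below. The symbols are our own notation: x_{ij} is the number of occurrences of the j-th synonymous codon of amino acid i in the reference gene set, w_l is the relative adaptiveness of the l-th codon of the gene being scored, and L is the number of codons in that gene.

```latex
w_{ij} = \frac{x_{ij}}{\max_{k} x_{ik}},
\qquad
\mathrm{CAI} = \left( \prod_{l=1}^{L} w_{l} \right)^{1/L}
```

A gene made up only of the most used codon for each amino acid therefore has a CAI of 1, and the value decreases as rarer codons are used.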

For the iGEM project of the UNAM-Genomics_Mexico team, we wanted to optimize the CAI of our sequences so as not to compromise the cell, as well as increase the . Because we wanted to express our genes during the nodulation stage, we asked whether the nodulation genes have a codon usage different from that of the whole genome. To answer this we computed the w values of the CAI, which give the observed usage of a codon relative to its expected usage, divided by the same ratio for the most used synonymous codon, so higher values indicate more preferred codons. These w values were calculated for the codons of the nodulation genes and for the codons of all genes, and we then evaluated the difference between the two w values for each of the 64 codons. The same procedure was applied to compare the constitutively expressed genes with all genes.
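The comparison described above can be sketched as follows. This is a minimal Python illustration, not the team's Perl code: the function names `w_values` and `w_difference` are hypothetical, the standard genetic code is assumed, and the plain difference is returned because the normalization used for the figure is not specified here.

```python
from collections import Counter

# Standard genetic code, with codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def w_values(genes):
    """Return the w value of each of the 64 codons for a set of coding sequences.

    w is the count of a codon divided by the count of the most used
    synonymous codon in the same gene set (0.0 if the amino acid is absent).
    """
    counts = Counter()
    for seq in genes:
        seq = seq.upper()
        # Read the sequence in frame, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    w = {}
    for codon, aa in CODON_TABLE.items():
        best = max(counts[c] for c, a in CODON_TABLE.items() if a == aa)
        w[codon] = counts[codon] / best if best else 0.0
    return w


def w_difference(subset, genome):
    """Per-codon difference between the w values of a gene subset and of all genes."""
    w_sub, w_all = w_values(subset), w_values(genome)
    return {codon: w_sub[codon] - w_all[codon] for codon in CODON_TABLE}
```

Calling `w_difference(nodulation_genes, all_genes)` and `w_difference(constitutive_genes, all_genes)`, with each argument a list of in-frame coding sequences, would give the two sets of 64 values that are compared.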

The comparisons (Fig. 1) showed, on the one hand, that the nodulation genes have almost the same codon usage as the whole genome and, on the other hand, that the constitutively expressed genes have a very different codon usage. Therefore, optimizing against the nodulation genes would have little effect, whereas optimizing against the constitutively expressed genes is expected to have a positive effect.

Figure 1. Normalized difference between the w values of each codon for two gene sets (constitutively expressed and nodulation) and the w values of the corresponding codons for the whole genome.

With this information and the ΔG of the 5' UTR, we used software that we developed to mimic the codon usage of the constitutively expressed genes and to avoid secondary structures in the 5' UTR.

This program bundles a series of scripts that together optimize one or more sequences for codon usage and for the ΔG of the 5' UTR, which avoids secondary structures and enables correct ribosome binding. This optimization ensures effective translation initiation and that the translation of the gene(s) will not compromise the fitness of the host. The program operates in two steps:

The optimization of codon usage by the Codon Adaptation Index (CAI)

In this step the program calls two Perl scripts. The first computes the codon usage of the gene set you want to mimic. We recommend using constitutively expressed genes, because it has been demonstrated that those genes have a codon usage capable of maintaining cellular fitness. This script produces a table with the w value of each codon. The table is then passed to the second script, which reads it and replaces every codon of your sequence with the best synonymous codon according to the given gene set.
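The replacement step can be illustrated with a short Python sketch. It is not the Perl script itself: the names `best_codons` and `optimize` are hypothetical, the w table is assumed to be already loaded into a dictionary (the on-disk format of the table is not described here), and the standard genetic code is assumed.

```python
# Standard genetic code, with codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def best_codons(w):
    """Map each amino acid (and stop, '*') to its codon with the highest w value.

    Codons missing from w count as 0.0; ties keep the first codon encountered.
    """
    best = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in best or w.get(codon, 0.0) > w.get(best[aa], 0.0):
            best[aa] = codon
    return best


def optimize(seq, w):
    """Replace every codon of an in-frame coding sequence with the best synonymous codon."""
    seq = seq.upper()
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    best = best_codons(w)
    # A codon containing characters other than A, C, G, T raises KeyError.
    return "".join(best[CODON_TABLE[seq[i:i + 3]]]
                   for i in range(0, len(seq), 3))
```

Because only synonymous codons are swapped, the encoded protein is unchanged; only the nucleotide sequence differs.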

The optimization of the 5' UTR

The adaptiveness of each of the 64 codons in the desired organism is calculated. A genetic code table is then generated in which each codon is marked as allowed when its adaptiveness is higher than a threshold level, or as prohibited when it is not. A different table is generated for each organism and each threshold.
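A table of this kind could be built as in the Python sketch below. The function name `allowed_table`, the dictionary layout, and the toy w values are assumptions made for illustration; only the rule "allowed when adaptiveness is higher than the threshold" comes from the description above.

```python
def allowed_table(w, codon_to_aa, threshold):
    """Group codons by amino acid, keeping only codons with w above the threshold.

    An amino acid whose codons are all prohibited maps to an empty list.
    """
    table = {}
    for codon, aa in sorted(codon_to_aa.items()):
        table.setdefault(aa, [])
        if w.get(codon, 0.0) > threshold:
            table[aa].append(codon)
    return table


# Toy example with two amino acids; the w values are made up for illustration.
codon_to_aa = {"GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
               "AAA": "K", "AAG": "K"}
w = {"GCT": 0.35, "GCC": 1.0, "GCA": 0.2, "GCG": 0.8,
     "AAA": 1.0, "AAG": 0.25}
table = allowed_table(w, codon_to_aa, threshold=0.3)
# table["A"] holds GCC, GCG and GCT; table["K"] holds only AAA.
```

Raising the threshold shrinks the lists and therefore the number of candidate sequences generated later; lowering it has the opposite effect.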

A Perl subroutine receives as input the amino acid sequence of the gene to optimize. It takes the first seven amino acids and, using the genetic code table, generates all the nucleotide sequences that encode these seven amino acids using only codons with adaptiveness above the threshold. These sequences are printed to a temporary file.
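The enumeration amounts to a Cartesian product over the allowed codons of each position. The Python sketch below shows the idea; the function name `candidate_prefixes` and the shape of the `allowed` dictionary (amino acid mapped to a list of allowed codons) are assumptions, and the result is returned as a list rather than written to a temporary file.

```python
from itertools import product


def candidate_prefixes(protein, allowed, n_residues=7):
    """Return every nucleotide sequence encoding the first n_residues amino acids
    using only allowed codons.

    If any amino acid has no allowed codon, the result is an empty list.
    """
    head = protein[:n_residues]
    choices = [allowed[aa] for aa in head]
    return ["".join(combo) for combo in product(*choices)]


# Toy table: the codon lists are illustrative, not real adaptiveness results.
allowed = {"M": ["ATG"], "K": ["AAA", "AAG"], "A": ["GCC", "GCG"]}
prefixes = candidate_prefixes("MKAKAKAKK", allowed)
```

The number of candidates is the product of the number of allowed codons at each of the seven positions, here 1 x 2 x 2 x 2 x 2 x 2 x 2 = 64, which is why the threshold has a strong effect on running time.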

A second Perl subroutine receives as input a sequence length and the number of sequences to generate. It calls RSATools, which generates the required number of sequences of the given length. These sequences are random, following an Escherichia coli nucleotide model, with controlled parameters such as GC content and complexity. The subroutine processes the RSAT output and prints the sequences to a second temporary file.
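For readers without access to RSAT, the role of this step can be approximated by the stand-in below. It is not a call to RSAT and does not reproduce its organism-specific background model: it simply draws each base independently with a chosen GC content. The function name `random_sequences` and its parameters are hypothetical.

```python
import random


def random_sequences(n, length, gc_content=0.5, seed=None):
    """Generate n random DNA sequences of the given length.

    Each base is drawn independently; G and C share gc_content equally,
    A and T share the remainder. This is a simplification of a trained
    nucleotide model.
    """
    rng = random.Random(seed)
    at = (1.0 - gc_content) / 2
    gc = gc_content / 2
    weights = [at, gc, gc, at]  # weights for A, C, G, T in that order
    return ["".join(rng.choices("ACGT", weights=weights, k=length))
            for _ in range(n)]
```

Passing a fixed `seed` makes the generated set reproducible between runs, which helps when comparing ΔG results.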

A third subroutine combines all the coding sequences with all the random sequences, adding an RBS sequence (provided by the user) at the beginning of each string. It prints the composed sequences to a third file and deletes the first and second files.
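The combination step is again a Cartesian product, sketched below in Python. The function name `compose` is hypothetical, and the order of the parts after the RBS (random sequence first, then coding sequence) is an assumption: the description only states that the RBS goes at the beginning of each string.

```python
from itertools import product


def compose(rbs, random_seqs, coding_seqs):
    """Pair every random sequence with every coding sequence, prefixed by the RBS.

    Assumed layout of each string: RBS + random sequence + coding sequence.
    """
    return [rbs + spacer + cds
            for spacer, cds in product(random_seqs, coding_seqs)]


# Toy inputs for illustration only.
constructs = compose("AGGAGG", ["AAAA", "CCCC"], ["ATGAAA"])
```

With r random sequences and c coding sequences the output contains r x c strings, each of which is later folded to obtain its ΔG.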

A fourth subroutine calls Hybrid-ss to calculate the ΔG of each sequence. The subroutine receives a temperature and salt concentrations as parameters for Hybrid-ss. Hybrid-ss takes a DNA or RNA string as input and calculates the folding free energy of the transcribed sequence at the given temperature and salt concentrations. More positive values mean less folding, a desirable property for an mRNA to be translated efficiently. Hybrid-ss generates a file with the sequences and their ΔG values. A fifth subroutine converts the Hybrid-ss file into tables ready to be analyzed with R. All five subroutines are contained in the same Perl script.

Finally, an R script analyzes the tables in search of sequences with the least folding. These sequences will have been optimized to avoid, to a certain degree, folding after their transcription.
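The selection itself is simple to express. The sketch below uses Python rather than R, and assumes the table has already been read into a list of (sequence, ΔG) pairs; the function name `least_folded` and the example values are made up for illustration.

```python
def least_folded(records, top=5):
    """Return the `top` records with the highest (least negative) folding free energy.

    records is a list of (sequence, delta_g) pairs, with delta_g in the
    units reported by the folding program.
    """
    return sorted(records, key=lambda r: r[1], reverse=True)[:top]


# Made-up values for illustration only.
records = [("AGGAGGAAAAATGAAA", -3.2),
           ("AGGAGGCCCCATGAAA", -0.5),
           ("AGGAGGGGGGATGAAA", -7.1)]
selected = least_folded(records, top=2)
```

Sorting in descending order puts the sequences with the weakest predicted folding first, since a more positive ΔG corresponds to less stable secondary structure.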


  • Paul M. Sharp and Wen-Hsiung Li. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 1987.
  • Grzegorz Kudla, Andrew W. Murray, David Tollervey, Joshua B. Plotkin. Coding-Sequence Determinants of Gene Expression in Escherichia coli. Science, 2009.
  • Mark Welch, Sridhar Govindarajan, Jon E. Ness, Alan Villalobos, Austin Gurney, Jeremy Minshull, Claes Gustafsson. Design Parameters to Control Synthetic Gene Expression in Escherichia coli. PLoS ONE, 2009.
  • Sivan Navon, Yitzhak Pilpel. The role of codon selection in regulation of translation efficiency deduced from synthetic libraries. Genome Biology, 2011.
  • Morgane Thomas-Chollier, Olivier Sand, Jean-Valéry Turatsinze, Rekin's Janky, Matthieu Defrance, Eric Vervisch, Sylvain Brohée and Jacques van Helden. RSAT: regulatory sequence analysis tools. Nucleic Acids Research, 2008.