Team:Harvard/Bioinformatics

From 2011.igem.org


Revision as of 19:51, 10 August 2011

Terminology

  • Backbone: contains most of the amino acids of a zinc finger protein. Zif268 is the most famous backbone.
  • Fingers: each contains a backbone and a helix, and binds to a 3-base DNA triplet.
  • Helix: the alpha helix in a finger. It is responsible for binding to a DNA triplet. Helices are made up of 7 amino acids, and fit into a specified position in a backbone.
  • Zinc finger proteins (ZFPs): arrays of three fingers that bind to 9 bases (3 triplets) of DNA.

[Diagram]

Past Zinc Finger Designers

Designing new zinc finger proteins (ZFPs, which are arrays of three fingers) is not an easy task: how they bind and interact with DNA bases is not fully understood, and is an active area of research [Persikov]. Notable past attempts to create novel ZFPs [CODA, OPEN] tried two distinct methods. CODA took a modular approach:
[Figure: HARVCODA diagram.png]

OPEN took two known fingers from an array and randomized protein sequences to try to generate a third finger that binds a new triplet.

Both techniques were successful in finding ZFPs to bind to novel DNA sequences [how successful?]

Our Approach

Improving on the concept of OPEN, we decided to design ZFPs for target sequences where the first two DNA triplets can be bound, but the third cannot. For example, if the sequence GTG GGA CCA can be bound but GTG GGA TGG cannot, we would use the first two fingers and generate the third. OPEN simply randomized amino acid sequences to try to create a third finger; we wrote software that uses data from known fingers to "intelligently" generate new fingers.

Data and Analysis

OPEN provided us with a spreadsheet of ZFPs produced by their research. During his own ZFP research, Anton Persikov compiled a database of ZFPs from studies dating from 1980 to 2005, which he shared with us.

From these two datasets, we distilled over [3000] unique ZFPs which contained approximately [1400] unique fingers.

We analyzed this dataset for frequency (how often a given amino acid appears in a given position in the helix) and pairing (if amino acid A is in position 1, how often is amino acid B next to it).
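As a rough illustration of the two statistics just described, the sketch below tallies per-position frequencies and adjacent-position pairing frequencies from a list of helix strings. It is not our actual analysis code: the function names, the position labels (-1, 1, 2, 3, 4, 5, 6 for the 7 helix residues), and the four example helices are assumptions made up for this example.

```python
from collections import Counter, defaultdict

# Assumed labels for the 7 helix residues (hypothetical convention for this sketch).
POSITIONS = [-1, 1, 2, 3, 4, 5, 6]


def position_frequencies(helices):
    """Return {position: {amino_acid: fraction of helices with it there}}."""
    counts = {pos: Counter() for pos in POSITIONS}
    for helix in helices:
        for pos, aa in zip(POSITIONS, helix):
            counts[pos][aa] += 1
    total = len(helices)
    return {pos: {aa: n / total for aa, n in c.items()}
            for pos, c in counts.items()}


def pairing_frequencies(helices):
    """Return {(pos, next_pos): {aa_at_pos: {aa_at_next_pos: fraction}}}.

    For each pair of adjacent positions, this answers: given amino acid A
    at the first position, how often is amino acid B next to it?
    """
    pair_counts = defaultdict(lambda: defaultdict(Counter))
    for helix in helices:
        for i in range(len(POSITIONS) - 1):
            key = (POSITIONS[i], POSITIONS[i + 1])
            pair_counts[key][helix[i]][helix[i + 1]] += 1
    result = {}
    for key, by_left in pair_counts.items():
        result[key] = {}
        for left_aa, right_counts in by_left.items():
            total = sum(right_counts.values())
            result[key][left_aa] = {aa: n / total
                                    for aa, n in right_counts.items()}
    return result


# Made-up example helices, one letter per amino acid (not taken from our dataset).
example_helices = ["RSDHLTR", "RSDNLTR", "QSSHLTR", "RSDHLAR"]
print(position_frequencies(example_helices)[-1])
print(pairing_frequencies(example_helices)[(1, 2)])
```

Storing pairing data as a conditional frequency (B given A) rather than a raw joint count keeps it directly comparable across amino acids that appear with very different overall frequencies.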

Helix Dependencies

Amino acids do not exist in a vacuum: they must somehow be affected by amino acids around them. Besides pairing data, we realized that other interactions could be taking place, and that we would miss these interactions if we only looked at pairing.

We created the graphs below of the frequency of amino acids in each position, and then blinked the graphs against each other (in the manner of a blink comparator: http://en.wikipedia.org/wiki/Blink_comparator) to see what changed. We looked at the probability graphs to determine which amino acid positions on the finger's helix interact with which bases. Some interactions are fairly well established, while others have been proposed more recently [Persikov].

Probability data for the 783 fingers that bind to GNN triplets. Note the high probability of leucine at position 4 and arginine at position 6.
Probability data for the 128 fingers that bind to TNN triplets. Note the high probability of leucine at position 4.
Probability data for the 16 fingers that bind to CNN triplets. There may not be enough data to consider this information statistically significant.
Probability data for the 29 fingers that bind to ANN triplets. There may not be enough data to consider this information statistically significant.
Probability data for the 298 fingers that bind to NGN triplets. The position 4 leucine motif remains. There is also a high probability (> 0.5) of a histidine at position 3 and an arginine at position 6.
Probability data for the 177 fingers that bind to NTN triplets. The position 4 leucine motif remains.
Probability data for the 244 fingers that bind to NCN triplets. The position 4 leucine motif remains. There is also a very high probability of an arginine at position 6.
Probability data for the 248 fingers that bind to NAN triplets. The position 4 leucine motif remains. There is also a very high probability (> 0.75) of an asparagine at position 3 and an arginine at position 6.
Probability data for the 234 fingers that bind to NNG triplets. The position 4 leucine motif remains. There is also a very high probability (> 0.75) of an asparagine at position 1 and a high probability (> 0.5) of an aspartic acid at position 2 and an arginine at position 6.
Probability data for the 247 fingers that bind to NNT triplets. The position 4 leucine motif remains. There is also a high (> 0.5) probability of an arginine at position 6.
Probability data for the 262 fingers that bind to NNC triplets. The position 4 leucine motif remains. There is also a very high (> 0.75) probability of an arginine at position 6.
Probability data for the 218 fingers that bind to NNA triplets. The position 4 leucine motif remains. There is also a very high (> 0.75) probability of a glutamine at position -1 and an arginine at position 6.

(A more rigorous approach would be to calculate the entropy change as you change the amino acids in each position, but that is computationally intensive.)

By doing this, we were able to see several patterns.

  • xNN (vary base 1): Amino acid 6 changes
  • NxN (vary base 2): Amino acid 3 changes
  • NNx (vary base 3): Amino acids -1 and 2(?) change
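One possible way to put a number on the patterns listed above is to measure, for each helix position, how much knowing a given base reduces the entropy of the amino acid at that position (the mutual information, in bits). This is only one reading of the "entropy change" idea and is not something we ran on our data. The function names, the position labels, and the tiny example dataset are all assumptions for illustration.

```python
import math
from collections import Counter

# Assumed labels for the 7 helix residues (hypothetical convention for this sketch).
POSITIONS = [-1, 1, 2, 3, 4, 5, 6]


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n)


def information_by_position(fingers, base_index):
    """Return {position: H(amino acid) - H(amino acid | base)} in bits.

    fingers: list of (helix, triplet) string pairs.
    base_index: 0, 1 or 2, the base of the triplet that is varied.
    """
    n = len(fingers)
    result = {}
    for i, pos in enumerate(POSITIONS):
        overall = Counter(helix[i] for helix, _ in fingers)
        by_base = {}
        for helix, triplet in fingers:
            by_base.setdefault(triplet[base_index], Counter())[helix[i]] += 1
        # Entropy left over once the base is known, weighted by class size.
        conditional = sum(sum(c.values()) / n * entropy(c)
                          for c in by_base.values())
        result[pos] = entropy(overall) - conditional
    return result


# Made-up data: position 6 tracks base 1 exactly, nothing else varies.
example = [("RSDHLTR", "GAA"), ("RSDHLTR", "GCC"),
           ("RSDHLTK", "TAA"), ("RSDHLTK", "TCC")]
print(information_by_position(example, base_index=0))
```

Positions with a high value for a given base would be candidates for interacting with that base. With small classes such as CNN and ANN, the estimate would be noisy and should be read with caution.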

Our program looks at dependencies between amino acids when generating sequences.

We decided on these amino acid dependencies, using both established data and patterns we saw in the OPEN data:

  • -1 and 2
  • 2 and 1
  • 6 and 5

Because there is not much data for 'CNN' and 'ANN' sequences (with 16 and 29 known fingers that bind to each triplet, respectively), we should use pseudocounts for these sequences, so that our frequency generator is not too biased toward probabilities that may not be significant.
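A minimal sketch of what pseudocounts could look like here: every one of the 20 amino acids gets a small extra count before the counts are normalized, so an amino acid never seen in a small class still has a nonzero probability. The function name, the pseudocount value of 1, and the example observations are assumptions, not values we settled on.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def smoothed_probabilities(observed, pseudocount=1.0):
    """Turn observed amino acids at one position into smoothed probabilities.

    observed: iterable of one-letter amino acid codes seen at that position.
    pseudocount: extra count given to every amino acid (assumed value).
    """
    counts = Counter(observed)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


# Made-up observations for one position in a 16-finger class.
observed = ["R"] * 12 + ["K"] * 4
probs = smoothed_probabilities(observed)
print(probs["R"], probs["K"], probs["A"])
```

The smaller the class, the more the pseudocounts pull the distribution toward uniform, which is the intended effect for sparse classes; for a large class such as GNN the same pseudocount changes the probabilities very little.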


Programming

Probabilities and Randomization

Our generation program uses these amino acid frequencies as probabilities that position X contains amino acid Y, given the triplet we are trying to bind. Using the dependencies we found, we change which frequency tables are used to generate the new helix.

See the image (HARVBins.png) at right for a more thorough explanation.
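The generation step described in this section might be sketched as follows. This is a simplified illustration, not our program: the table layout, the function names, the sampling order, and the direction of each dependency (which position is drawn first) are all assumptions. Each position is drawn from a frequency table for the target triplet, and a dependent position switches to a table conditioned on the amino acid already chosen for its partner.

```python
import random

# Assumed labels for the 7 helix residues (hypothetical convention for this sketch).
POSITIONS = [-1, 1, 2, 3, 4, 5, 6]

# Dependent position -> position it depends on. The pairs come from the list
# above (-1 and 2, 2 and 1, 6 and 5); the direction is an assumption.
DEPENDENCIES = {2: -1, 1: 2, 5: 6}

# Order chosen so every position is drawn after the one it depends on.
SAMPLING_ORDER = [-1, 2, 1, 6, 5, 3, 4]


def draw(table, rng):
    """Pick one amino acid from {amino_acid: probability}."""
    amino_acids = list(table)
    weights = [table[aa] for aa in amino_acids]
    return rng.choices(amino_acids, weights=weights, k=1)[0]


def generate_helix(independent, conditional, rng=random):
    """Generate one 7-residue helix string.

    independent: {position: {amino_acid: probability}} for the target triplet.
    conditional: {(position, partner): {partner_aa: {amino_acid: probability}}}.
    When no conditional table exists for the partner's amino acid, the
    independent table is used instead.
    """
    chosen = {}
    for pos in SAMPLING_ORDER:
        table = independent[pos]
        partner = DEPENDENCIES.get(pos)
        if partner is not None:
            by_partner = conditional.get((pos, partner), {})
            table = by_partner.get(chosen[partner], table)
        chosen[pos] = draw(table, rng)
    return "".join(chosen[pos] for pos in POSITIONS)


# Made-up tables: position -1 is always R, and position 2 becomes D after an R.
independent = {pos: {"A": 0.5, "S": 0.5} for pos in POSITIONS}
independent[-1] = {"R": 1.0}
conditional = {(2, -1): {"R": {"D": 1.0}}}
print(generate_helix(independent, conditional, random.Random(2011)))
```

Passing a seeded `random.Random` instance makes a run reproducible, which is useful when comparing generated helices across different frequency tables.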