Team:Harvard/Brainstorming
From 2011.igem.org
Bioinformatics Brainstorming Notes
June 6th
- While many different zinc fingers exist naturally in organisms, the human transcription factor zinc finger zif268 (Fig.1) has undergone the most extensive characterization and manipulation. Consequently, zif268 has shaped the interaction between synthetic biology and zinc fingers more than any other single zinc finger protein.
- Zif268 displays the Cys2His2 motif (Fig.2) characteristic of many zinc fingers
June 13th
Functions to design:
- generate(matrix, pseudocounts (lambda), dependency tuples)
- takes a matrix of zinc-finger AA position counts, a list of dependent amino acid pairs, and a pseudocount multiplier and generates a list of potential amino acid sequences weighted by independent and dependent probabilities
- add_pseudo(dependent matrix row,independent matrix row)
- given a matrix row of dependent counts (i.e. how many times 'a' occurs at position n when 'b' is set to some AA at position m) and a row of independent matrix counts (how many times 'a' occurs at n regardless of b's AA) return an adjusted matrix row, based on the dependent matrix row, that has pseudocounts added to the values that are empty in the dependent matrix row but filled in the independent matrix row.
- generate_indep(matrix)
- randomly pick an amino acid, given a matrix row, from a weighted random distribution based on the values in the row
- generate_dep(indep_row, dep_row, lambda)
- add pseudo counts (call add_pseudo) and generate a dependent random call for a position (using generate_indep on the adjusted matrix)
New Triplet List File
I created a new triplet list from the zfDB table that we got. Its columns are:
Triplet Upstream Downstream Helix Sequence F Number
Upstream and Downstream are the two bases in those directions. One note of possible confusion; because the triplets are in reverse order, downstream is the next two bases in the following open triplet, and upstream is the previous two bases at the end of the previous open triplet. If you were reading 5' to 3', these would be reversed.
For future reference (mostly mine), the regex I used is:
find: \d+\t\d+\t\d+\t\d+\t(\w)(\w)(\w)\t(\w)(\w)(\w)\t(\w)(\w)(\w)\t(\w+)\t(\w+)\t(\w+)
replace: $1$2$3\t\t$4$5\t$10\t1\n$4$5$6\t$2$3\t$7$8\t$11\t2\n$7$8$9\t$5$6\t\t$12\t3
Persikov 2009 SVM
There's a paper in particular that I think would be useful for you to read, its 2009 Persikov on the dropbox. One of the things we need to do is get the SVM up and running with SVMlight. Check out the instructions here:
http://compbio.cs.princeton.edu/zf/help.html
It goes along with the persikov 2009 paper.
Input example for SVM:
testin.txt
0 3:1 59:1 94:1 318:1 # abcdef
command line:
./svm_classify testin.txt SVMp.mod testout.txt
testout.txt
2.5055193
The first input column states that this is a row to be classified. The 4 following columns indicate amino acid to base relationships as described in the model's README file:
Conversion table from SVM features to amino acid - nucleotide contacts. Canonials contacts are listed as (see Fig 1 at Persikov et al., 2008):
01 - between amino acid a6 and nucleotide b1 02 - between amino acid a3 and nucleotide b2 03 - between amino acid a-1 and nucleotide b3 04 - between amino acid a2 and nucleotide b4'
1 01Aa 2 01Ac 3 01Ag 4 01At 5 01Ca
316 04Wt 317 04Ya 318 04Yc 319 04Yg 320 04Yt
The first line in the models must be changed from:
SVM-light Version V6.01 to SVM-light Version V6.02
Zinc Finger Backbones
http://compbio.cs.princeton.edu/zf/sources.html
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR007087
Zif268 Middle Finger with no adjacent linker:
FQCRICMRNFS_RSDHLTTHIR_TH
Type IIs Design
Good points Dan, hopefully there is still time to address. Absolutely we need positive controls on the chip, lets make sure those make it on there.
I've done some legwork on using they Type IIs cutters to do the plasmid assembly. Since I'm going to be out of town tomorrow and over the weekend (back sometime Sunday), I thought it was best to share this with you guys now.
The strategy would be to use GAAG as the 5 overhang site and CCGT as the 3' overhang site. Note that all overhangs are 5' overhangs on their strand, since there are no type IIs cutters listed on NEB that have short recognition sites and more than 1 base 3' overhang. I had initially wanted to have one end be 5' overhangs, and the other end 3' overhangs to prevent the vector from self-ligating non-specifically during assembly, but this may not be possible.
5' side of insert = gAAG overhang, with the AAG encoding the C-terminal most residue (lysine) of the upstream part. I realize this ends up adding an extra lysine at the omega subunit-F1 junction, but this is unlikely to make the linker unusable, and allows us to use our library sequences as F2 or F3 fingers, if we ever desired to. This is because the C-terminal residue of ZiF268 F2 (which we will use for non-library fingers) is K, and the g in the gAAG can be the last base of omega subunit (threonine) or the glutamic acid upstream of lysine in ZiF268 F2.
3' side of insert = CCGt, with CCG encoding the N-terminal proline of the next finger (or adding an extra proline before the stop codon, if used as an F3).
If there are strong objections to adding a K between omega and finger 1, we can also use a 5'-NNN overhang-producing type IIs cutter and put the 5' overlap in the c-terminal threonine of omega subunit, and the 3' overlap in the n-terminal proline of F2. However, this prevents us from using the library members as anything but F1 (unless we add a threonine between fingers, possibly).
Here I've used BsaI enzyme since it has a 6 bp recognition site not found in omega subunit or ZiF268, but there are several dozen other 5'-NNN or 5'-NNNN type IIs cutters on NEB alone.
Other options are to use USER cloning, which means we have to do separate PCRs for each backbone, and then mix them in equal ratios before inserting them. This may be more efficient than the ligation based approach, but means more PCR work that can also possibly introduce bias (since USER primers are annealing to different sequences).
If this is all terribly confusing, I apologize, and here are some Seqbuilder files that might be helpful (also in the dropbox under Sequences/chip sequence files).
Good luck finalizing the target sequences and the outputting sequences.