Team:METU-BIN Ankara/Project

From 2011.igem.org

Revision as of 22:53, 21 November 2011 by Gngr (Talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

METU-BIN iGEM Software TeamProject: Mining for BioBricks

Project

Introduction

While the interest to the field of synthetic biology is uprising, iGEM and Parts Registry is getting more important for researchers interested working in the field of synthetic biology. As the number of attendees and teams to iGEM competition increases each year, number of BioBricks submitted to the Parts Registry are increasing too. As we all know, Parts Registry is an inimitable library for synthetic biologists. Although it is an important source of information and tool for biologists, the parts are not organized very well. Therefore, while constructing devices, synthetic biologists usually faces following difficulties at the pre-experimental stage:

  • Searching through the Parts
  • Deciding if the parts needed are already in the database or not
  • Deciding which BioBrick will work effectively with which
  • Finding the most accurate combination between the parts
  • Constructing the most effectively working devices

Figuring out all these answers is time consuming and takes too much effort. Our main goal as 2011 METU-BIN iGEM Software team was to provide a web based tool that helps synthetic biologists at the pre-experimental step to search the Part Registry with input and output keywords and to design their genetic constructs using the biobricks provided in the 2011 distribution of DNA constructs.

M4B: Mining for Biobricks is developed to aid synthetic biologist. Our software proposes a method which can be utilized in pre-experimental step as a supporter data mining tool. Following the user defined input and output parameters, M4B lists all possible composite devices and ranks them according to the novel scoring measurement developed by METU-BIN.

Aim

Main goal of our 2011 project is to provide a web based tool that helps synthetic biologists at the pre-experimental step to design their genetic constructs according to their input and output parameters.

  • A network of all bioparts in 2011 distribution will be generated, which describes the functional relations between the subatomic bioparts.
  • A search algorithm will be developed to reveal all possible device combinations for the user defined input and output within bioparts of 2011 distribution.
  • Visualization tools will be applied for graphical representation of the results.
  • A web-based user interface will be provided for the developed software.

Algorithm

M4B is developed using Java Web Technologies. Web programming enables user- and PC-friendly products. Users do not need to install any program on their PC, and so they do not need to waste PC resources for programs. A web application can easily be reached using a simple web-browser from any PC and from anywhere in the world . It is the most important property of M4B.

Flowcharts for the Algorithms

In detail, we have used Java Applet to serve M4B on the net. And to store data of M4B, we have used an instance of MySQL database. We've created tables for all XML files of PartsRegistry. For example, they contain part info, deep subparts and specific subparts info, scars info, etc. In addition we've added a few more tables to implement algorithm. They are for interaction relations between parts, scores of main parts, scores of edges, etc.

For the implementation of M4B, a huge biobrick network is constructed using information in the PartsRegistry database. With information of interactions between edges, we are creating a part network. It is a hashtable of hashtables. By the use of hashtables, we can easily find the input part on network to start searching. All useful information about a part is stored in a type of Part. It has following fields.

  • partID
  • partType
  • promoterType
  • output
  • terminator
  • nextParts

After constructing the network, we are making a search for the list of all possible composite devices beginning with given input and output parameters. To traverse whole graph, we are using a customized version of the famous graph traversal algorithm, DFS (Depth First Search). In DFS, it is making a search on one way until the leaf node, then if it cannot find it goes back to last disjoint point and goes on searching until next leaf node.

Using the quick access property of hashtable we are finding the input part on network. If the given input is found on the graph once, we go over all the descendant nodes until finding the expected output. While searching an output node, all nodes traversed are being recorded. When the output is found, the recorded path is added to the “results list”. Else, it is discarded until the last disjoint point, and a new path is searched continuing from that point. After all possible paths are found, the list of paths is shown on the GUI.

One can also search for the devices starting with a constitutive promoter. To do this, user shouldn't enter the input parameter. By this empty input information, algorithm looks for constitutive promoters to start as an input point. "$" sign is used as symbolic input for constitutive promoters.

When we are searching for the expected output, we may find different outputs. If that output continues with a Promoter as an Input, it means that we are starting to add the parts of next device. A composite part can have more than one device. So we've added an input field to specify the maximum device count of results. User can change the number of device count using this field.

Let's talk about visualization of the graph. If user selects an input from the list, it parses all parts from selected device and looks for it is type from network. Then, it creates a node of that type. We are using icons of Parts Registry for the types of parts. And we are using yFiles for visualization.

If user selects a node in the graph, it queries the information about that part from tables created using XML files of Parts Registry.

Quick search when typing the input name is the user friendly property of M4B. No need for a “Find” button.

The other feature of the program is printing the results. Most of users wants to save their results. It enables saving.

Extraction of Relations of Bioparts and Construction of Relational Database

Parts Registry contains various types of biobricks and connections in between parts. In our software, we aimed to develop an algorithm which goes over the devices and finds connections between biobricks to present users all possible constructed devices. However, it was very difficult to go over biobricks and find the relations in not very well organized Parts Registry. For that reason, we have decided to form a ”Part Connections Database” by examining all the bilateral relations between biobricks which will be used by our program, M4B. By this way, the complex information in Parts Registry will be simplified and our algorithm will work faster and more effectively.

The biology group focused on all of the four 384 well plates of 2011 distribution and extracted information between connections and devices in a standard way. This information is stored into a relational database and enables queries from remote computer programs.

Our software group developed a toolbox to help biology group in storing the part relations into the new database of connections between biobricks. Using this tool, we were able to minimize the possibility of mistakes while entering bilateral relations manually. More importantly, this organized the database entries in a standardized way, according to the requirements of Mining for BioBricks to run the searching algorithm easily.

Figure 1: The toolbox which was used while building the “Part Connections Database” by our biology group. Figure 2: A screen shot from our temporary file of “relational database”

As there are various types of biobricks in the Parts Registry, we set up rules to organize the “relational database” in an appropriate way.

Rules

The most important rule is “A constructed device formed by combination of promoter-ribosome binding site-gene-terminator, respectively, is the minimal functional device.”

This is the major rule, because without these 4 major biobrick types, a device cannot work. In addition to this, without knowing the description of “basic device” it is impossible for us to present accurate results to users. Although Parts Registry contains various types of biobricks, in our database there are 4 basic biobricks which are promoters, ribosome binding sites, genes, and terminators. This helped us to simplify the complex information in the Parts Registry in our initial attempt of mining.

In addition to 4 basic biobrick types, there are inputs and outputs which are our software’s parameters. Inputs are chosen according to promoters. Both activators and inhibitors of promoters were listed in our database. Main source of activators and inhibitors was the information from the Parts Registry. The reason of listing inhibitors was to filter out constructed devices which do not produce the required outputs. When a constructed device is formed by multiple devices, although the input activates first device’s promoter, it may inhibit other promoters of following devices. Outputs were chosen according to the genes, the product they synthesize. Although, most of the time, input and output information was extracted from Parts Registry, sometimes external sources were used as well because of the uninformative part descriptions at Parts Registry. External sources such as NCBI, PubMed and wiki of the team who submitted the biobrick to iGEM were used.

Constitutive promoters are entered without any input, but especially their “constitutiveness” is recorded as a property in our relations database.

In our software, after constructing a device by combining biobricks to construct multiple devices, combination between devices is formed according to this rule: “Output of the first gene will be the input of the following promoter(s)”

According to this rule, first promoter will be activated by given Input and the product of first gene will be used as activator of the following promoters. By activation of the promoters, genes will synthesize the output. By this way, the constructed complex is expected to work effectively.

In Parts Registry, there are composite parts, translation units, generators, reporters, inverters, intermediate devices, and signaling devices. As Parts Registry 2011 distribution was being examined in detail, to simplify these parts, their connection information was simplified to promoter-ribosome binding site-gene-terminator relation by our biological knowledge. This rule is explained with details as seen below:

  • Reporters

    The biobricks which synthesize fluorescent proteins are classified as “reporters” as they give out light to report something. We accepted these parts as “genes” because they are synthesizing proteins.

  • Translational Units

    Combination of ribosome binding site and gene is classified as “translational units”. We accepted this as two biobricks rather than one. Therefore the bilateral relation between ribosome binding site and gene was entered in our database.

  • Inverters

    Inverters were challenging parts; they are formed from more than 2 biobricks. Rather than accepting this as one biobrick, we accepted this part as a combination of four biobricks. Therefore, we entered bilateral relations between each biobricks as ribosome binding site-gene, gene-terminator, and terminator- promoter. We also entered gene-output and input-promoter connections.

  • Composites
  • Generators
  • Signalling Devices
  • Intermediate Devices

    Composite parts, generators, signaling devices, and intermediate devices were also challenging parts like inverters. They don’t have a specific rule about how many and which types of biobricks they consist of. However, we accepted them as complex subdevices and entered bilateral relation between each connected biobricks.

In METU-BIN 2011 project, our main priority was to focus on (minimal) Functional devices. Therefore we named DNAs, RNAs, and Tags as “accessories” because they are not vital for a device to work. However, we know that using these accessories during construction helps devices work more effectively by adding some extra features. For that reason, it’s in on future plan to form a database for these accessories and present them to our users in their query results.

EDGE-SUM Scoring

In order to provide the biologists a reliability measure about the M4B suggested composite devices, we devised a scoring mechanism. Composite devices found by M4B could be ranked by simple measures such as device count, path length, or whether a specific part is included in the device or not. Instead of these simple scoring mechanisms, we decided to propose a more eloborate scoring algorithm that takes into account status, rating, quality, and results of a device available in the Parts Registry. Such a scoring mechanim makes use of information provided by the community and it will be useful to decide which devices are more “valuable” than others if a set of alternative devices are available for a given input-output pair.

To compute a score for a novel composite device, we first calculate a score for each edge in the composite device. To compute an edge score, we find all the existing devices in the Parts Registry that contain that specific edge and compute an accumulated score for that edge by summing up the scores of these existing Parts Registry devices that contain the edge. All the edge scores are normalized to the interval [0,1] by a sigmoid function. After finding the scores of all edges in a novel composite device found by M4B, we sum the logarithms of these edge scores which is equal to logarithm of the multiplication of scores. If we cannot compute a score for an edge (when the edge is completely new for example) we use a small number for that edge, similar to the idea of pseudocounts in profile HMMs. Since, this logarithm is a negative number we raise it to the power of e again to report the final score between 0.0 and 1.0. Values close to 1.0 indicate more reliable devices. EDGE-SUM SCORES are categorized into three classes according to their numeric range only. Gold-silver and bronze categories are assigned just for easy visualization of the results.

To calculate the score of an existing Parts Registry device, we use the following features of devices. In addition, we give a relative importance to each of these features as suggested by our biologist team members. These relative importance weights is just a heuristic and can be optimized in future versions of M4B.

part_status Available Planning Missing Deleted Unavailable Informational
1.0 0.1 0.0 0.0 0.0 0.2

best_quality Partially Confirmed None Confirmed Long Part Questionable Bad Sequencing
0.5 0.0 1.0 0.2 0.1 0.0

part_rating 1 NULL
1.0 0.0

part_results None Works Fails Issues NULL
0.0 1.0 0.0 0.2 0.0

For example, if a device is Available, Partially Confirmed and Works, its score will be "1.0+0.5+1.0 = 2.5". And, if an edge is included in this device, that edge is going to inherit the score of 2.5 and all these inherited scores from other devices that contain the edge is accumulated to compute the raw (unnormalized) score of that edge.

Results

Our software, M4B: Mining for BioBricks, provides many features saving of time, money and effort for scientists. With this software, scientists don't need to do long-term research and have expensive lab tools.

Since it's online and can be used via any updated browser, users don't need to install it and carry it with themselves in order to use it on another computer. With this feature, they can use our software on all the computers having updated system. Also, they don't have to provide any place in their storage device.

Recommended Minimum System Specifications

  • Intel Pentium 2Ghz
  • 512 MB Ram
  • Java Runtime Environment 1.6
  • 128K Internet Speed
  • Supported web browsers: Firefox, Chrome, Safari, Internet Explorer