Team:St Andrews/review
From 2011.igem.org
An Internal Review of iGEM
During the course of our project, we became interested in what makes a successful iGEM team. A quick internet search suggested that no one had looked at this question objectively before. We realized that over the years, teams have entered a vast amount of data about themselves through the judging forms and on their wiki pages, so we decided to collate this data in a database and run statistical tests to examine correlations between various factors. We examined variables such as a team's projected budget, number of advisors, and number of students, amongst many others that might affect a team's chances of success in iGEM. We thought that this might give future teams an idea of the resources required to do well in iGEM, and also let us examine how 'fair' iGEM is as a competition using hard, objective outcomes.
We have also made our dataset available to the public, so that future teams may carry out different sub-group analyses and add data that will be available in the coming years to examine how the trends we have identified may vary over time. The data table can be found here.
Data Collected
Information on a variety of variables was compiled and tabulated in a spreadsheet. The number of variables differed only slightly across the three years, as it depended on the availability of the data for each team.
For 2008, there were 33 primary variables originally created which included the following:
The 2009 data incorporated one extra variable, which compared the team's predicted award with the award the team actually received. This information was not available for 2008, as judging forms were not yet used online for the competition at that point. Finally, a university research score was included in the 2010 selection.
The next step in the data analysis was to create an ordinal measurement of the medal criteria and of the predicted versus awarded medals, in addition to creating a student to advisor ratio. An ordinal scale assigns a rank order to the categories of the data. The scaling for the medal criteria, chosen so that teams that withdrew could be included in our samples, is as follows:
0 = Withdrew, 1 = No Medal, 2 = Bronze, 3 = Silver, 4 = Gold, 5 = Finalist
This scale now allows us to analyse the relative success of each team.
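The ordinal coding and the student to advisor ratio can be sketched in Python as follows. The numeric codes are those given in the Figure 2 caption of this review; the function names and the two example teams are hypothetical and are not taken from our dataset.

```python
# Ordinal award scale; the codes follow the Figure 2 caption of this review.
AWARD_SCALE = {
    "Withdrew": 0,
    "No Medal": 1,
    "Bronze": 2,
    "Silver": 3,
    "Gold": 4,
    "Finalist": 5,
}


def encode_award(award):
    """Return the ordinal code for an award label."""
    return AWARD_SCALE[award]


def student_advisor_ratio(students, advisors):
    """Students per advisor; None when the advisor count is zero or missing."""
    if not advisors:
        return None
    return students / advisors


# Hypothetical teams, for illustration only (not rows from the dataset).
teams = [
    {"team": "A", "award": "Gold", "students": 10, "advisors": 4},
    {"team": "B", "award": "Withdrew", "students": 6, "advisors": 0},
]
for t in teams:
    t["award_code"] = encode_award(t["award"])
    t["ratio"] = student_advisor_ratio(t["students"], t["advisors"])
```

Because the codes are ordered, a higher code always means a better result, which is what allows the award to be used as an outcome in correlation and regression.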
All monetary variables were also converted to US dollars so that the values remained comparable. Historical exchange rates from an online currency website (http://www.xe.com/) were used, taken at a specific time in each respective year.
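A minimal sketch of this conversion step is shown below. The rates in the table are round placeholder numbers chosen for readability, not the historical rates we took from xe.com, and the names `RATES_TO_USD` and `to_usd` are hypothetical.

```python
# Placeholder rates (units of USD per one unit of currency), keyed by
# (year, currency). These are NOT the historical rates used in the review.
RATES_TO_USD = {
    (2008, "GBP"): 2.0,
    (2009, "GBP"): 2.0,
    (2009, "EUR"): 1.5,
}


def to_usd(amount, currency, year, rates=RATES_TO_USD):
    """Convert an amount to US dollars using one fixed rate per year."""
    if currency == "USD":
        return amount
    return amount * rates[(year, currency)]


budget_usd = to_usd(1000, "GBP", 2009)
```

Using a single rate per currency and year keeps budgets from the same year comparable with each other, at the cost of ignoring movements in the rate within that year.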
In total, 303 iGEM teams' worth of information was collected over the past three years. The majority of the data was found on the iGEM website for each respective year. Data regarding student and advisor numbers, sponsors, parts submitted, degrees and advisor specialties were obtained from the various teams' wikis, which were accessed through the iGEM website. For advisors' specialties that couldn't be found there, we used the Google search engine to locate the information from their universities or laboratories. After a thorough search, anyone who couldn't be firmly identified was placed in the 'no data' category. The projected budget and the budget at time of registration were taken from the resource description page on the team information page on the iGEM website. The medals or prizes awarded, as well as the number of withdrawals, were found on the iGEM results page for each year. Each university's citation score and world rank were found using 'The Times Higher Education World Rankings'. Finally, endowment and public/private status were obtained from the universities' websites.
Analyzed Variables
Only certain variables were chosen for review, as we did not have time to look at the specific breakdown of individual parameters such as the various student, advisor and sponsor types. The variables we thought interesting to analyse were the total students, total advisors, total sponsors, student to advisor ratio, overall university rank, projected budget and the two outcomes: biobricks and the award received (based on our scale). To include software teams in the success scaling, we allocated a value of -1 to their biobricks tally so that they could be considered as a separate category.
Computational Programs
The data was compiled in a Microsoft Office Excel spreadsheet. PASW (Predictive Analytics SoftWare), the successor of SPSS (Statistical Package for the Social Sciences), was used to produce basic correlation data and to convert the file into a CSV (comma-separated values) document for use in R. R was used to run the statistical models. Some graphs were drawn in GraphPad Prism.
Statistical Background and Theory
A correlation coefficient is a measure of the strength of association or relationship between two variables. (Field, 2009)
Statistical significance is based on the p-value and relates to the existence of an effect. When a variable has a low p-value (0.05 or 0.01 are typically taken as critical thresholds), the variable is said to be statistically significant, although this says nothing about the 'size' of its effect. Thus, a variable can have a significant p-value and yet have only a small effect on the outcome. Simply put, a small p-value provides evidence that an effect exists but says nothing about how large it is. The p-value is the probability, calculated assuming the null hypothesis is true, that sampling variation alone would produce data at least as discrepant as our data set.
Bivariate analysis examines two variables together and uses their correlation coefficient to describe the association between them.
General linear mixed regression models were used to analyse our data. These allowed us to analyse the variables together in a single model and to assess each variable's effect and importance from the summary in the model output. Where a dependency was assumed between two variables used in the model, this was also included in the model's coding.
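To make the ideas of a correlation coefficient and a p-value concrete, here is a small Python sketch. It computes Pearson's correlation coefficient and then estimates a p-value with a permutation test, which is a stand-in chosen for transparency and is not the procedure PASW uses. The advisor counts and award codes are made-up numbers, and the function names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_perm=5000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up example data (not from the dataset): advisors per team and award code.
advisors = [1, 2, 2, 3, 4, 5, 6, 8]
awards = [1, 1, 2, 2, 3, 4, 4, 5]
r = pearson_r(advisors, awards)
p = permutation_p_value(advisors, awards)
```

Shuffling the award codes simulates the null hypothesis of no relationship, so the returned value approximates how often sampling variation alone would give a correlation at least as strong as the one observed.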
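One common way to write a model of this kind is given below. This is a sketch under assumptions: it treats year and division as random intercepts and all other variables as fixed effects, which may not match the exact specification that was coded in R.

```latex
y_i = \beta_0 + \sum_{k=1}^{p} \beta_k x_{ki}
      + u_{\mathrm{year}(i)} + v_{\mathrm{division}(i)} + \varepsilon_i,
\qquad
u \sim \mathcal{N}(0, \sigma_u^2),\quad
v \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Here $y_i$ is the ordinal award code of team $i$, the $x_{ki}$ are its predictor values (for example total advisors or total students), the $\beta_k$ are the fixed effects whose p-values appear in the model summary, and $u$ and $v$ absorb variation between years and between divisions.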
Model Method
This modelling technique requires the data to be 'complete': every variable field must have an entry for a case to be accepted by the statistical package. If any fields are blank, that specific team, or case, has to be removed from the sample being studied, as there is insufficient data to analyse. This leads to a complication: the sample size that can be run through the model varies depending on which parameters are being investigated. We also attempted to remove the random effects between the years and the divisions in the model's coding. Each model examined different variables against each other in terms of the award and therefore, in our context, the success of the team.
The p-values were taken from the model output summary, where they are reported against the reference (2008 South African). For the statistical analysis performed, it does not matter which reference is used.
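The effect of this 'complete case' requirement on sample size can be illustrated with a short Python sketch. The three records and the field names are hypothetical; `None` stands for a blank spreadsheet field.

```python
def complete_cases(rows, variables):
    """Keep only the rows in which every selected variable has a value."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]


# Hypothetical records; None marks a blank field.
rows = [
    {"team": "A", "advisors": 4, "students": 10, "budget_usd": 20000},
    {"team": "B", "advisors": 2, "students": 8, "budget_usd": None},
    {"team": "C", "advisors": None, "students": 12, "budget_usd": 15000},
]

# The more variables a model uses, the fewer teams survive the filter.
all_vars = complete_cases(rows, ["advisors", "students", "budget_usd"])
two_vars = complete_cases(rows, ["advisors", "students"])
one_var = complete_cases(rows, ["students"])
```

In this toy example the usable sample shrinks from three teams to two and then to one as variables are added, which mirrors why different models in this review are run on different numbers of cases.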
The general outline for the procedure used in processing the data can be found in the following.
Analysis
A general correlation matrix was used initially to find basic relationships between the examined variables; this approach was employed with care, as chance relationships may be over-emphasized. The variables that correlate with the award received by the team at the 1% significance level (equivalently, a p-value threshold of 0.01) are biobricks submitted, projected budget, total sponsors, total advisors and student to advisor ratio. University research ranking and university citation score are an example of two variables with very strong significance between them, which shows how some relationships can be overly accentuated: these two are obviously dependent on each other when it comes to ranking, so the correlation is not particularly useful information for our project. With this in mind, we selected the relationships that we thought would have a significant bearing on the project and placed them into a general mixed linear model.
When all of the relevant variables, including the year and division, were initially incorporated into the model, the sample size was only 122 cases. Although this is a small proportion of the original 303, it is still able to display significance in the data. Two variables show significance at the 0.01 level: the number of biobricks submitted and the total number of advisors. The total number of students is also significant at the 0.05 level.
Figure 1 - 1 = No Medal, 2 = Bronze, 3 = Silver, 4 = Gold, 5 = Finalist. (Teams that withdrew were excluded, as they intrinsically did not submit biobricks.) The plot is of means with 95% confidence intervals. Almost every step up in medal level is accompanied by a significant increase in the number of biobricks submitted. In particular, the distance between the confidence intervals for 2 (Bronze) and 4 (Gold) indicates a very significant difference between the mean number of biobricks submitted by these groups.
Because biobricks are themselves an outcome of iGEM, we decided to focus on the input factors that may affect a team's success.
By reselecting the data, our sample increased to 131 cases. This time the only variables that show significance are the total number of advisors and, once again at the 0.05 p-value threshold, the total number of students. There is strong evidence to suggest that the total number of advisors a team has does affect the success of the team.
The next stage was to investigate whether the total number of students ultimately had an effect on awards received. This was investigated via two separate variables, the student total and the ratio of students to advisors, to examine the varying consequences.
With only these variables to consider, we could use the whole sample of 303 cases. The statistical analysis clearly showed that the total number of advisors is the key significant variable this time. The student to advisor ratio and total students are no longer significant. Yet again, we have strong evidence to suggest that total advisor number is an important factor.
Although we have shown that there is a strong positive correlation between the number of advisors on a team and their success, we were unable to analyse the breakdowns further to see if there may be more applicable information to be gleaned.
Figure 2 - 0 = Withdrew, 1 = No Medal, 2 = Bronze, 3 = Silver, 4 = Gold, 5 = Finalist. The plot is of means with 95% confidence intervals. The difference in the mean number of advisors between groups 4 (Gold) and 2 (Bronze) is significant, since the confidence intervals do not overlap. In addition, there appears to be a general trend of more advisors bringing better results for teams, which is concerning in an undergraduate competition.
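The comparison of group means with 95% confidence intervals used in the figures can be sketched in Python as below. The advisor counts for the two groups are made-up numbers, the function names are hypothetical, and the interval uses a normal approximation, which may differ slightly from the intervals drawn by the plotting software.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_ci(values, level=0.95):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    m = mean(values)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    half_width = z * stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width


def intervals_overlap(a, b):
    """True if the (mean, lower, upper) intervals a and b share any point."""
    return a[1] <= b[2] and b[1] <= a[2]


# Made-up advisor counts for two award groups (not from the dataset).
bronze_advisors = [2, 3, 2, 4, 3, 2, 3, 3]
gold_advisors = [6, 7, 5, 8, 6, 7, 6, 9]

bronze_ci = mean_ci(bronze_advisors)
gold_ci = mean_ci(gold_advisors)
separated = not intervals_overlap(bronze_ci, gold_ci)
```

Non-overlapping intervals indicate a significant difference between the group means. The reverse does not hold: intervals that overlap slightly can still correspond to a significant difference, so overlap alone is not proof of no effect.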
What does this mean for iGEM?
There is a statistically significant correlation between the number of advisors on an iGEM team and the medal that team receives. This doesn’t bode well for iGEM claiming itself to be a competition aimed at undergraduate students. Understandably, the advisors should play some role in the guidance of the team, but we believe that this statistical significance shows more than simple guidance. The fear is that what is meant to be, at its core, an undergraduate lab exercise, is becoming a competition that is overly influenced by the number of advisors a team has acquired.
When constructing an iGEM team, the iGEM website states that they “recommend finding 8 - 12 students from various disciplines, backgrounds, and levels of expertise” (1). They do not, however, give a recommended number of advisors, and this lack of guidance is what we feel has led to a large variation in the advisor count amongst teams, ranging from 1 or 2 to as many as 19. There have been teams with more advisors than students, which on its own is a slightly worrying statistic, but we have found instances where the ratio of advisors to students has been as high as 2:1 in non-Software tracks and 3:1 in Software tracks.
Setting aside ethical opinions about whether large numbers of advisors are appropriate in an undergraduate competition, it is apparent that what might once have been an even playing field is becoming considerably less so. The fact that we can identify any significant correlation between the number of advisors and the medal received is disconcerting enough, but the effect advisors have on the final medal is actually greater than that of any single type of student within the team, including both biologists and engineers. Evidence like this suggests that teams with large numbers of advisors may have an advantage over teams with fewer advisors.
This raises an ethical question that forces iGEM to define its true nature: is iGEM a competition, or is it a Registry of Standard Biological Parts? iGEM has not put a limit on the maximum number of advisors, and in order to continue receiving the number and caliber of biobrick parts it has in the past, perhaps it would be best never to set that limit. If a large number of advisors correlates with a better medal, which, as described in the iGEM medal criteria, correlates with the quality and number of biobricks the Registry receives from that team each year, then limiting the number of advisors may have a negative effect on the Registry's future contents. However, if iGEM is truly a competition, then placing a limit on the maximum number of advisors per team would help to remove the advantage that some groups have over others. It is important to ask, if iGEM is not a competition at heart, how this fares in the minds of the students who spend 10 weeks (and often more) in laboratories working to achieve a gold medal in a system designed to produce the highest number of, and the most characterized, biobricks possible.
There is also a statistically significant correlation between the number of biobricks submitted and the level of medal awarded. This may seem rather straightforward, as the competition is ultimately based on the submission of DNA into the Registry of Standard Biological Parts, and the more DNA submitted, the better chance a team has of registering a functioning part that earns them a gold medal. However, we feel this is the reason behind a growing concern for iGEM: the lack of part characterization. Working within the ten-week confines of the iGEM competition means that there are a limited number of man-hours available to assign to various tasks, and a team looking to secure a gold medal might be tempted to spend that time producing more biobricks than they have time to characterize. Teams in the past have produced upwards of 100 biobricks, including one team with 423 submitted parts. While submissions to the Registry are always appreciated and accepted, it detracts from a biobrick's true value when the proper research has not been done to find information that could be vital to that biobrick's usage in the future. If the point of the Registry is to be a source of genetic parts used to synthesize biological systems, but these parts lack the information required for others to utilize them, then the Registry of Standard Biological Parts cannot function as a registry at all.
There is, however, a gold medal criterion that awards iGEM teams a gold medal if they characterize another team's biobrick. In the 2011 competition year, iGEM changed its medal standards to include biobrick characterization as a silver medal criterion, ensuring that teams aiming to receive a gold medal would need to successfully characterize at least one of their submitted parts. We hope that measures like these will allow the number of biobrick submissions to remain at an all-time high, while helping the Registry to become well-characterized for use by both members of academia and future iGEM participants.
Acknowledgments
Invaluable assistance was given by Miss Lorna Sibbett and Dr Will Cresswell, both of whom are resident biology lecturers who deal with statistics. They helped develop theory, methods and code for statistical modelling to investigate the data.
References
Field, Andy. "Discovering Statistics Using SPSS". London: SAGE Publications Ltd, 2009.
(1) - "Synthetic Biology: Start A Team" Link to paper.