Team:St Andrews/review

From 2011.igem.org


Revision as of 23:52, 21 September 2011

An Internal Review of iGEM

Intro, reasons behind compilation

Data Collected

Information was compiled on a variety of variables and placed into a tabulated spreadsheet. The number of variables collected differed only slightly between the three years, as it depended on the availability of the data for each team.

For 2008, there were 33 primary variables originally created which included the following:

The 2009 data incorporated one extra variable, which compared each team’s predicted award with the award the team actually received. This information was not available for 2008, as judging forms were not yet used online for that year’s competition. Finally, university research score was included in the 2010 selection.

The next step in the data analysis was to create an ordinal measurement of the medal criteria and of the predicted versus awarded medals, in addition to creating a student to advisor ratio. An ordinal scale assigns a ranking order to the data. The scaling for the medal criteria, chosen so that teams that withdrew are included in our samples, is as follows:

This scale now allows us to analyse the relative success of each team.
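As a rough illustration of this kind of ordinal coding, the following Python sketch maps award outcomes to ranks and computes a student to advisor ratio. The category labels and numbers in `AWARD_SCALE` are hypothetical placeholders and are not the scale used in this review; the function names are likewise assumptions made for the example.

```python
# Hypothetical ordinal scale: the labels and numbers are placeholders,
# NOT the scale actually used in the review.
AWARD_SCALE = {"withdrew": 0, "no medal": 1, "bronze": 2, "silver": 3, "gold": 4}


def award_rank(award):
    """Return the ordinal rank for an award label (case-insensitive)."""
    return AWARD_SCALE[award.strip().lower()]


def student_advisor_ratio(students, advisors):
    """Students per advisor; None when a team lists no advisors."""
    if advisors == 0:
        return None
    return students / advisors
```

Only the ordering of the ranks carries meaning on an ordinal scale; the gaps between the numbers do not.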

The monetary variables were also all converted into US dollars. To keep these values comparable, historical exchange rates from an online currency website (www.xe.com) were used, taken at a specific time of each respective year.
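A minimal sketch of such a conversion, assuming one fixed rate per currency and year. The rates in `RATES_TO_USD` are made-up numbers for illustration only, and the table layout and function name are assumptions rather than the procedure actually followed.

```python
# Hypothetical exchange rates (US dollars per one unit of the currency),
# keyed by (currency, year). These values are invented for illustration;
# real ones would be looked up from a historical rate source.
RATES_TO_USD = {
    ("GBP", 2008): 1.85,
    ("EUR", 2009): 1.40,
    ("USD", 2008): 1.0,
}


def to_usd(amount, currency, year):
    """Convert an amount to US dollars using that year's fixed rate."""
    rate = RATES_TO_USD[(currency, year)]
    return amount * rate
```

Using a single reference date per year keeps budgets from the same competition year comparable with each other.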

In total, information on 303 iGEM teams was collected over the past three years. The majority of the data was found on the iGEM website for each respective year. Data regarding student and advisor numbers, sponsors, parts submitted, degrees and advisor specialities were obtained from the various teams' wikis, which were accessed through the iGEM website. For advisors’ specialities that couldn’t be found, we used the Google search engine to locate the information from their various universities or laboratories. After a thorough search, anyone that couldn’t be firmly identified was placed in the ‘no data’ category. The projected budget and budget at time of registration were taken from the resource description page on the team information page on the iGEM website. The medals or prizes awarded, as well as the number of withdrawals for each year, were found on the iGEM results page for each year. Each university’s citation score and world rank were found using ‘The Times Higher Education World Rankings’. Finally, endowment and public/private status were obtained from the universities’ websites.

Analyzed Variables

Only certain variables were chosen for review, as we didn’t have the time to look at the specific breakdown of individual parameters such as the various student, advisor and sponsor types. The variables we thought interesting to analyse were the total students, total advisors, total sponsors, student to advisor ratio, overall university rank, projected budget and the two outcomes: biobricks and the award received (based on our scale). To include software teams in the success scaling, we allocated a value of -1 to their biobricks tally so that they formed a separate category.

Computational Programs

The data was compiled in a Microsoft Office Excel spreadsheet. PASW (Predictive Analytics SoftWare), the successor of SPSS (Statistical Package for the Social Sciences), was used to produce basic correlation data and to convert the file into a csv (comma separated values) document for use in R. R was used to run the statistical models.

Statistical Background and Theory

A correlation coefficient is a measure of the strength of association or relationship between two variables (Field, 2009).
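For illustration, one common correlation coefficient is Pearson's r. The text does not state which coefficient was computed, so the choice of Pearson's r here is an assumption, as is the function name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

The result lies between -1 and 1. The function raises `ZeroDivisionError` if either sample is constant, since the coefficient is undefined in that case.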

Statistical significance is based upon the p-value and relates to the existence of an effect. A low p-value for a variable (0.05 or 0.01 are typically taken as critical thresholds) indicates that the variable is statistically significant, but it does not describe the ‘size’ of the effect. A variable can therefore have a significant p-value while having only a small effect on the data itself. Simply put, a small p-value provides evidence that an effect exists but says nothing about its size. The p-value is the probability, calculated assuming the null hypothesis is true, that sampling variation alone would produce data at least as discrepant as our data set.
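The definition of the p-value can be made concrete with a permutation test, sketched here in Python. This is not how the p-values in this review were obtained (those came from the model summaries); it is an assumed, simplified stand-in in which the null hypothesis is "no association between the two variables" and the function names are hypothetical.

```python
import math
import random


def _pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation of xs and ys.

    Shuffling ys simulates the null hypothesis of no association; the
    p-value estimates how often chance alone gives a correlation at
    least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(_pearson_r(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(_pearson_r(xs, shuffled)) >= observed - 1e-12:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (extreme + 1) / (n_perm + 1)
```

A small returned value says the observed correlation would be unusual under the null hypothesis; it says nothing about how large the effect is.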

Bivariate analysis examines two variables together, using their correlation coefficient to indicate how strongly they depend on each other.

General linear mixed regression models were used to analyse our data. These allowed us to analyse the variables together as a single model, and to assess their effect and importance from the summary produced in the output. Where a dependency was assumed between two variables used in the model, this was also accounted for in the model’s coding.

Model Method

including pdf

When using this modelling technique the data has to be ‘complete’: every variable field must have an entry for the case to be accepted by the statistical package. If any fields are blank, that specific team, or case, has to be removed from the particular sample being studied, as there is insufficient data to analyse. This leads to a complication: the sample size that can be run through the model varies with the parameters being investigated. We also attempted to remove the random effects between the years and the divisions in the model’s coding. Each model examined different variables against each other in terms of the award and therefore, in our context, the success of the team.
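The effect of this completeness requirement on sample size can be sketched as follows. This is a minimal Python illustration, not the PASW selection that was actually performed; the function name and the field names in the usage note are hypothetical.

```python
def complete_cases(rows, variables):
    """Keep only rows with a non-blank entry for every listed variable."""

    def is_blank(value):
        # Treat missing keys, None and empty/whitespace strings as blank.
        return value is None or (isinstance(value, str) and value.strip() == "")

    return [
        row
        for row in rows
        if not any(is_blank(row.get(name)) for name in variables)
    ]
```

For example, given three teams of which one lacks a budget and another lacks an advisor count, `complete_cases(rows, ["advisors"])` keeps two teams while `complete_cases(rows, ["advisors", "budget"])` keeps only one, which is why models with more variables run on fewer cases.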

The p-values were taken from the model’s output summary, where they are reported against the reference (2008 South African). For the statistical analysis performed, it doesn’t matter which reference team is used.

The general outline of the procedure used in processing the data is as follows.

The data was sorted into a single csv file through PASW. The appropriate variables and cases were then selected in PASW, which narrows down the sample size as mentioned previously, and the new selection was output as a second csv file. This file was read into R and used in a mixed linear regression model. The output summary from the model shows the p-value for each variable against the reference, from which it can be seen which variables are statistically significant. The subsequent step was to remove a variable to see whether the model was ‘improved’. The improvement was assessed by comparing the two (or more) models via the AIC (Akaike information criterion), which measures how well a model fits the data while penalising the number of parameters it uses; a lower AIC indicates the preferred model. Having noted which version of the model had the ‘better’ AIC, we then dredged it to see the relative importance of the variables. Dredging is a technique used in data mining, sometimes inappropriately; here it was used only to highlight combinations of variables that might make a better model, and with caution, as it can be seen as statistical misuse.
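The AIC comparison step can be illustrated with a much simpler model than the one used here. The Python sketch below swaps the mixed model in R for ordinary least squares with one predictor, assuming normally distributed errors, and compares it against an intercept-only model; all function names are hypothetical.

```python
import math


def gaussian_aic(residuals, n_coefficients):
    """AIC for a least-squares fit with normally distributed errors."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    # Maximised Gaussian log-likelihood with variance estimate rss / n.
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = n_coefficients + 1  # +1 for the estimated error variance
    return 2 * k - 2 * log_lik


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


def compare_models(xs, ys):
    """Return (AIC of intercept-only model, AIC of model including x)."""
    my = sum(ys) / len(ys)
    aic_null = gaussian_aic([y - my for y in ys], n_coefficients=1)
    a, b = fit_line(xs, ys)
    aic_full = gaussian_aic(
        [y - (a + b * x) for x, y in zip(xs, ys)], n_coefficients=2
    )
    return aic_null, aic_full
```

The model with the lower AIC is preferred: dropping a variable 'improves' the model only if the loss of fit is smaller than the penalty saved by having one parameter fewer.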

Analysis

A general correlation matrix was used initially, and this found very basic relations between the examined variables. This approach was employed with care, as chance relationships may be over-emphasized. The variables that correlate with the award received by the team at the 1% significance level (equivalently, with a p-value below 0.01) are biobricks submitted, projected budget, total sponsors, total advisors and student to advisor ratio. University research ranking and university citation score are an example of two variables with a very strongly significant correlation between them; however, this illustrates how some relations are overly accentuated. This correlation is not particularly useful for our project, as the two are obviously dependent on each other when it comes to ranking. With this in mind, we selected the relationships that we thought would have a significant bearing on the project and placed them into a general mixed linear model.
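The caution about chance relationships can be quantified with a short calculation. The Python sketch below assumes every pair of variables in the matrix is tested at the same threshold; the function names and the example variable count are hypothetical, since the number of variables in the matrix is not stated.

```python
def pairwise_count(n_variables):
    """Number of distinct variable pairs in a correlation matrix."""
    return n_variables * (n_variables - 1) // 2


def expected_chance_hits(n_variables, alpha):
    """Expected number of 'significant' pairs if no real relationship exists."""
    return pairwise_count(n_variables) * alpha
```

With 10 variables there are 45 pairs, so at the 0.05 level roughly two 'significant' correlations would be expected by chance alone, which is why a correlation matrix is best treated as a screening step.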

When all of the relevant variables, including the year and division, were initially incorporated into the model, the sample size was only 122 cases. Although this is less than half of the original 303, it is still able to display significance in the data. Two very prominent variables show significance at the 0.01 level: the number of biobricks submitted and the total number of advisors. The total number of students is also significant at the 0.05 level.

As biobricks were adopted as an outcome of iGEM, we decided to focus on the input factors that may affect the team’s success.

Reselecting the data increased our sample to 131 cases. This time the only variables that show significance are the total number of advisors and, once again at the 0.05 p-value threshold, the total number of students. This is strong evidence to suggest that the total number of advisors a team has does affect the success of the team.

The next stage was to investigate whether the total number of students ultimately had an effect on awards received. This was investigated via two separate variables, the student total and the ratio of students to advisors, to examine the varying consequences.

With only these variables to consider, we could use the whole population of 303 cases. The statistical analysis showed quite clearly that the total number of advisors is the key significant variable this time. The student to advisor ratio and the total number of students are no longer significant. Yet again, we have strong evidence to suggest that the total number of advisors is an important factor.

Although we have shown that there is a strong positive correlation between the number of advisors on a team and its success, we were unable to analyse the breakdowns further to see whether more applicable information could be gleaned.

What does this mean for iGEM?

Get 15 advisors.

Acknowledgments

Invaluable assistance was received from Miss Lorna Sibbett and Dr Will Cresswell, both of whom are resident biology lecturers who deal with statistics. They helped develop the theory, methods and code for the statistical modelling used to investigate the data.

References

Field, Andy. “Discovering Statistics Using SPSS”. London: SAGE Publications Ltd, 2009.