Thanks to special guest blogger this week, Diane Genereux of the Biology Department, for providing an introduction to the use of R (open source stats computing language) in conjunction with Westfield's new Bioinformatics class! If you'd like to contribute in the future, simply email Nicholas Aieta at naieta@westfield.ma.edu.
Sequences of DNA molecules recovered from various subway stations in New York City reflect intrusion of marine microbes during Hurricane Sandy, and mirror the distribution of various ethnic groups across the City's boroughs (http://www.sciencedirect.com/science/article/pii/S2405471215000022; Afshinnekoo et al. 2015). Temporal and geographic patterns in the occurrence of Tweets containing words like "flu" and "muscle aches" are at least as successful as survey-based data from the Centers for Disease Control and Prevention in predicting local flu outbreaks (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0019467; Signorini et al. 2014). Population-level data on alertness in people of various ages reveal that nighttime use of light-emitting e-readers has severely reduced the duration and quality of sleep for the average American (http://www.pnas.org/content/112/4/1232.long; Chang et al. 2014), with possible consequences for the epidemiology of cancer, obesity, and type 2 diabetes.
What do these diverse findings have in common? For one thing, all reflect the capacity of data sets that are new, enormous, and often publicly available to reveal population-level patterns that would have gone undetected just a few years ago. In many cases, these new-found patterns provide unique and actionable insights into factors that can improve or impair health on a population scale. Best of all, the sheer volume of the data available ensures that individuals with strong data-analysis skills have the opportunity to uncover such patterns, even without funding for data collection.
The advent of Big Data inspired this semester's new Bioinformatics course here in the Biology Department. Each week, we read and discuss a Big Data paper, and then analyze some data to work on a specific set of computational and graphing skills useful for analyzing such data.
In developing this semester's course, I was not at all worried about how I'd find relevant, primary-source readings that could be engaging for students. Today, one need only open the newspaper to learn about the latest findings made possible by Big Data. I was, however, very worried about how I'd help students with diverse backgrounds in math, statistics, and computational biology to develop a common set of skills in data analysis. In the past, I've tried to develop mathematical-biology courses around R (http://www.r-project.org), an open-source statistical-computing language that is very popular among computational biologists, with limited success. Learning statistical programming is admittedly challenging, and all the more so for beginning biology students who have not yet developed a strong background in statistical concepts.
Once again, Big Data papers suggested a solution. Nearly all of the papers we would be reading in our course report statistical analyses and include graphics made in R. Would it work, I wondered, if I were to write some code to approximate the published analyses, and then ask students to modify the existing code to address a question of their choice?
To my great surprise (and, admittedly, relief!) the answer often seems to be "yes!". So far this semester, students have modified and amended existing R code to perform linear regression, make box plots, and infer confidence intervals to analyze publicly available data sets on batting averages, rainfall, and recovery time following knee-replacement surgery in males as compared to females. We've discussed false positives and false negatives (statistical concepts that are often elusive for beginning students) in the context of the annual football combine, in which NFL recruiters seek to assess players' potential for success as professional athletes using information on their sprinting speeds and bench-press abilities. Before the end of the semester, we'll spend a class session working on analyses using geography-encoded Tweets.
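For readers curious what such starter code might look like, here is a minimal sketch of the three kinds of analysis mentioned above. It is not the code used in class. It assumes R's built-in mtcars data set as a stand-in for the course data, and the variable names (fit, slope, ci) are chosen only for illustration.

```r
# Built-in example data; a stand-in for the course data sets, which are not shown here.
data(mtcars)

# Linear regression: how does fuel economy (mpg) change with car weight (wt)?
fit <- lm(mpg ~ wt, data = mtcars)
slope <- coef(fit)[["wt"]]

# Box plot: compare mpg among cars with 4, 6, and 8 cylinders.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders", ylab = "Miles per gallon")

# 95% confidence interval for the mean mpg (based on a one-sample t-test).
ci <- t.test(mtcars$mpg, conf.level = 0.95)$conf.int

print(slope)
print(ci)
```

A student adapting a sketch like this would typically swap in a different data frame and change the column names in the formulas (for example, mpg ~ wt) to match the question they want to ask.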
One of the nice things about asking students to adapt earlier analyses to address data sets of their own choosing, I think, is that students approach the project with some existing ideas about what could go wrong in the analysis. For example, during our discussion of the football combine, many students mentioned specific football players who had been terrific sprinters in the combine, but later had only limited success in the NFL. In this sense, the results of the football combine were useful in motivating the statistical concept of false positives, which we then went on to discuss in a more rigorous way. How nice to be able to engage students' existing knowledge and intuition in clarifying what is often a challenging concept.
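The idea of a false positive can be made concrete with a small tally. The sketch below uses made-up values purely for illustration (they are not real combine data), and the vector names are hypothetical. It counts players who were predicted to succeed but did not (false positives) and players who were predicted not to succeed but did (false negatives).

```r
# Hypothetical, made-up values for illustration only; not real combine data.
# predicted_success: did the combine results suggest the player would succeed?
# actual_success:    did the player go on to succeed in the NFL?
predicted_success <- c(TRUE, TRUE,  TRUE, FALSE, FALSE, TRUE,  FALSE, FALSE)
actual_success    <- c(TRUE, FALSE, TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE)

# False positive: predicted to succeed, but did not.
false_positives <- sum(predicted_success & !actual_success)

# False negative: predicted not to succeed, but did.
false_negatives <- sum(!predicted_success & actual_success)

# Cross-tabulate predictions against outcomes.
outcome_table <- table(Predicted = predicted_success, Actual = actual_success)
print(outcome_table)
print(c(false_positives = false_positives, false_negatives = false_negatives))
```

With these invented values, two of the eight players are false positives and one is a false negative.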
I've also been impressed to see that students retain a substantial skepticism in interpreting their results. For example, in their subway-microbes paper, the authors report that the abundance of Pseudomonas, a bacterial genus, changes markedly from morning to night. The authors interpret this to reflect shifts in the mean ages of subway users across the morning, noon, rush hour, and late-night sampling intervals. Several of the students pointed out that it was not reasonable to interpret the data in this way without knowing a bit more about whether, when, and how subway turnstiles were cleaned over the data-collection interval. In noting this problem, the students identified a critical caveat for nearly all Big Data analyses: unless one has collected a given data set oneself, one simply cannot have complete information about potential pitfalls in analyzing it. I was very happy to see that the students were at once enthusiastic in their analyses, and cautious in their interpretation!
In addition to challenges arising from the potential for over-hasty interpretation of results from Big Data, some challenges remain in the integration of R as a key tool in our classroom. R, like most programming languages, is exquisitely sensitive to coding syntax, meaning it's easy for a beginning student to become deeply frustrated when a stray or missing comma prevents a graph from being rendered from otherwise perfect code. Many times this semester, though, a student's intuition about a sports data set and encouragement from classmates have proven sufficient for him or her to push through such intermittent frustrations. The schedule for our class (three hours, once per week) is also helpful in providing adequate time for students to become immersed in identifying and resolving such programming issues.
Our class meets in the Natural History Museum (Wilson 223) each Thursday from 12:45 to 3:30 p.m. Next week, we'll be discussing data from a population-level survey that revealed major differences in the diversity of gut microbes in Americans as compared to individuals from elsewhere in the world. We'll also be exploring strategies for presenting data using heat maps. Please do join us for an enjoyable, but cautious, foray into the world of Big Data!