Hypothesis tests end with clear answers stated in terms of probability. We either decide to reject H0 or fail to reject H0, based on our p-value. This is as solid an answer as we get in statistics. Changes in the way that data are collected have changed the way hypotheses are evaluated. Data collection has exploded, and data storage costs have decreased dramatically. Some of the approaches taught in a traditional introductory statistics course are reversed by this trend.
For instance, I teach that a random sample is more powerful in predicting population outcomes than a poorly attempted census. With large datasets this may no longer be as true. In certain instances, the sheer volume of data drives the standard error so small that the traditional precautions about data collection are less relevant. Students should become familiar with the ways these trends have changed data production.
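The point about shrinking uncertainty can be sketched with the standard-error formula sd/√n: as the sample size grows, the standard error of the mean collapses toward zero. The population standard deviation below is an invented value for illustration.

```python
# A minimal sketch of why huge samples shrink uncertainty: the standard
# error of a sample mean is sd / sqrt(n). The population sd here (15.0)
# is an invented illustration, not a figure from the text.
population_sd = 15.0

for n in (100, 10_000, 1_000_000):
    standard_error = population_sd / n ** 0.5
    print(f"n = {n:>9,}: standard error = {standard_error:.4f}")
# n =       100: standard error = 1.5000
# n =    10,000: standard error = 0.1500
# n = 1,000,000: standard error = 0.0150
```

A hundredfold increase in sample size buys only a tenfold reduction in standard error, which is why the gains become dramatic only at very large n.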
Other changes brought by big data allow for real-time prospective studies. The speed with which data accumulate and can be turned around makes for extremely timely observations.
An example of this is the correlation data developed by Google to predict the spread of flu in the United States. By comparing the incidence of searches made in a geographic region with the incidence of flu that developed in the days following those searches, Google developed a powerful real-time flu predictor. It was nearly as accurate as, and available significantly earlier than, information gathered by the Centers for Disease Control, which relied on surveillance reports from doctors who had seen flu patients.
Viktor Mayer-Schönberger and Kenneth Cukier relate this groundbreaking work in the book Big Data. They describe how Google's "…software found a combination of 45 search terms that, when used together in a mathematical model, had a strong correlation between their prediction and the official figures nationwide. Like the CDC they could tell where the flu had spread, but unlike the CDC they could tell it in near real-time, not a week or two after the fact."[7]
This method provides a new type of powerful exploratory analysis. Although some of the tools may not be within classroom reach, imagining how one could use searchable and stored data to create models is within the realm of classroom discussion and will add a valuable component to the study of data production. There are many ways that accumulated data can be repurposed for analysis.
Researchers, including Karen Seto at Yale, have used satellite images from NASA to measure urban sprawl and greenspace. By observing coloration in the images and how it changes over time, they can survey geographical regions and estimate the areas of different land uses. This is a form of data mining that uses historical data to create a time-series study of changes in the environment.[8]
An enormous study of brain cancer and cell phone use was completed in Denmark and published in 2011. Rather than using a random sample, the study attempted to observe all cell phone users in Denmark, using mobile phone records. Although some records were excluded, the study was able to accumulate 3.8 million person-years of observation. By linking the cell phone records with the national health registry, researchers were able to relate length of exposure to cell phone use to cancers of the brain. They concluded that there was no correlation between brain cancer and cell phone use.[9]
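The cohort arithmetic behind a study like this can be sketched in a few lines: with person-years of follow-up and observed case counts, the incidence rates of exposed and unexposed groups can be compared directly. The case counts below are invented; only the 3.8 million person-years figure echoes the text.

```python
# Hedged sketch of cohort-study arithmetic, not the Danish study's actual
# numbers: incidence rates per 100,000 person-years, then their ratio.
def incidence_per_100k(cases, person_years):
    return cases / person_years * 100_000

exposed_rate   = incidence_per_100k(356, 3_800_000)  # hypothetical subscriber group
unexposed_rate = incidence_per_100k(95, 1_000_000)   # hypothetical comparison group

rate_ratio = exposed_rate / unexposed_rate
print(f"rate ratio = {rate_ratio:.2f}")  # a ratio near 1.0 suggests no association
```

With these invented counts the ratio comes out close to 1, which is the shape of the "no correlation" conclusion the researchers reported.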
This search for correlations, and the attempt to use ALL the data rather than just a sample, is the essence of Big Data. Again, Cukier and Mayer-Schönberger describe the findings of a study by the University of Ontario Institute of Technology and IBM that searched for correlations between health outcomes and patient data for premature infants. The study used real-time patient monitoring systems that captured about 1,260 data points per second for a single child. The researchers found an unexpected correlation: very constant vital signs could be an early warning of a serious infection.[10]
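One way to sketch the "very constant vital signs" signal is a sliding-window variability check: compute the standard deviation of recent readings and flag windows that are unusually flat. The heart-rate readings, window size, and threshold below are all invented for illustration.

```python
import statistics

# Hypothetical sketch of the monitoring idea: flag stretches where the
# variability of a vital sign drops below a threshold. Values invented.
heart_rate = [142, 138, 145, 140, 136, 141, 141, 141, 140, 141, 141, 140]
WINDOW, THRESHOLD = 4, 1.0

for i in range(len(heart_rate) - WINDOW + 1):
    window = heart_rate[i : i + WINDOW]
    sd = statistics.pstdev(window)  # population standard deviation of the window
    if sd < THRESHOLD:
        print(f"readings {window}: unusually flat (sd = {sd:.2f})")
```

The real systems ingest on the order of a thousand readings per second, but the underlying pattern test is the same kind of rolling computation.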
Searchable text is another example of the innovative ways in which the accumulated data of the internet can be used. Google offers a data analysis tool based on Google Books, the Ngram Viewer (http://books.google.com/ngrams). Words and phrases can be searched and their frequency over time quantified, predicting the popularity of frequently mentioned items (for business purposes) and allowing searches for the first occurrences of phrases (for historical purposes).
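The underlying idea of n-gram counting can be demonstrated in a few lines: tally how often each short word sequence occurs in a body of text. The sample sentence here is invented; Google's tool applies the same tallying to the scanned Google Books corpus.

```python
from collections import Counter

# Minimal sketch of n-gram counting: tally every two-word phrase (bigram)
# in a text. The sample sentence is invented for illustration.
text = "big data needs big ideas and big data needs careful analysis"
words = text.split()
bigrams = Counter(zip(words, words[1:]))

print(bigrams[("big", "data")])  # → 2
```

Scaling the same tally across millions of books, and bucketing the counts by year of publication, yields the frequency-over-time curves the Ngram Viewer plots.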
Correlation in statistics indicates the degree to which two variables are associated. It can't prove causation, but it can indicate relatedness. As statistics grows to incorporate new practices made possible by the enormity of available data, the search for correlations may provide a new and powerful means of exploratory analysis. These exploratory analyses will undoubtedly lead to more questions about the nature of the relationships that have been observed.
It is easier than ever to use our interactions as data, as many of them have become codified and searchable electronically. This electronic trail offers many possibilities for analysis of what has occurred. The "what" that we see in correlations can help us to point towards the future, and yet knowing the "what", we are still left with the question of why.