Limitations of Predictive Analysis
In my blog post on traditional statistics, I discussed
inferential statistics and the limitations of using a sample to make
estimates about the population the sample is drawn from. In the world of Big
Data this translates to the concept of predictive analytics: the use of
huge data sets, statistical algorithms, and machine learning
to predict future outcomes.
Predictive analytic models come in two types: classification
and regression models. Classification models assign data objects to categories and
make predictions based on those categories, whereas regression models predict continuous values.
A classification model may sort customers into categories and predict
how receptive they would be to marketing, whereas a regression model would
predict how much money a customer will generate during their
relationship with the company. (1)
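The difference between the two model types can be sketched in a few lines of Python. This is only a toy illustration: the customer figures, the visit threshold, and the function names are all made up for the example, and a real model would be trained on far more data with a proper library.

```python
# Toy customer records: (monthly_visits, lifetime_spend).
# All values are invented for illustration.
customers = [(1, 40.0), (2, 55.0), (4, 95.0), (6, 130.0), (8, 170.0), (10, 215.0)]


def classify_receptiveness(monthly_visits, threshold=5):
    """Classification: place a customer into a discrete category.

    The threshold of 5 visits is an arbitrary assumption.
    """
    return "receptive" if monthly_visits >= threshold else "unreceptive"


def fit_line(points):
    """Regression: ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(customers)


def predict_spend(monthly_visits):
    """Regression output: a continuous estimate of lifetime spend."""
    return slope * monthly_visits + intercept


print(classify_receptiveness(7))    # a category label
print(round(predict_spend(7), 2))   # a continuous number
```

The classifier returns a label, while the regression returns a number on a continuous scale. Both are still estimates built from past data.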
Predictive analytics is becoming an increasingly important
part of Big Data, and the theory makes sense: with larger data sets or
larger sample sizes, the predictions made from analysing these samples become
more accurate. The issue is that no matter how large your data set is, the result will
always be a prediction and never a guarantee. Despite this, businesses and
organisations have become overconfident in the accuracy of large data sets when
applying them to individuals.
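A small simulation can make this point concrete. It is a sketch under stated assumptions: the "true" average spend of 100 and the spread of 30 between individual customers are invented numbers, and the data is drawn from a normal distribution purely for convenience.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 100.0  # hypothetical average customer spend
TRUE_SD = 30.0     # hypothetical spread between individual customers


def summarize(n):
    """Draw n simulated customers and return two measures of uncertainty.

    standard_error: how uncertain the estimated *average* is.
    sd: how far a single *individual* typically sits from that average.
    """
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    sd = statistics.stdev(sample)
    standard_error = sd / n ** 0.5
    return standard_error, sd


for n in (100, 10_000, 100_000):
    se, sd = summarize(n)
    print(f"n={n:>7}  uncertainty of the average={se:6.2f}  "
          f"spread of individuals={sd:6.2f}")
```

As the sample grows, the uncertainty of the average shrinks, but the spread of individuals stays roughly where it started. More data gives a better estimate of the group; it does not make any single person more predictable.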
Predictive analytics, like all technologies, is a tool that
can be used to gain insight into data, but it is not a crystal ball that can
be used to see into the future. This is the nature of all statistics: they are
educated guesses when all is said and done, even if the educated guess is
made with exabytes of data backing it up.
One example of where predictive analytics failed is
Google Flu Trends (GFT). GFT was a service developed by Google that aimed to use Big
Data to analyse Google search query data and predict influenza activity in
various regions of the world. The idea was that people would search for certain
terms when they were sick, allowing the service to detect rising flu cases. This
ultimately failed for a few reasons. One is that people search for these
terms even when they are not sick, and increased media coverage
of the flu can lead to increased search activity, creating a feedback loop. The
algorithm also relied on historical correlations, but each flu season tends to be different
and human behaviour changes. Search behaviour, media
attention, and even the language people use to describe their symptoms all evolve
over time, which affected the accuracy of the results. (2)
1. https://cloud.google.com/learn/what-is-predictive-analytics?hl=en
2. https://theconversation.com/googles-flu-fail-shows-the-problem-with-big-data-19363