Author Archives: Manan Shah

Can Data Science Predict The Future?

The big industry for the 2010s has been all things data: big data, data engineering, data science, machine learning, artificial intelligence. Underlying a lot of this work is statistics (and of course algorithms and techniques). Overlying (is that a word?) all of this is hype. Or maybe a better framing is “confirmation-bias induced expectations”. It’s not hard to find a viral “data science” use case or application, especially when it comes to how startups market themselves.

Here’s your generic data science marketing line:

We use state-of-the-art machine learning techniques to produce actionable insights and enhance the user experience.

So what’s an actionable insight? What does it mean to enhance the user experience? What is a use case? Well, both “actionable insight” and “enhance the user experience” are phrases that are a bit difficult to pin down in terms of business measurement. We can always specify that something like “a correlation coefficient of 0.7 or greater” is an actionable insight, or that “increasing retention rate by \(x\) percentage points” is a sign of enhanced user experience, but typically the non-technical manager has loftier and sometimes mathematically unreasonable aspirations.

When a layperson hears words like “prediction” or “forecast”, they don’t hear those words in a statistical framing. Rather, they hear something more like “fortune-telling” or “prognostication”. However, in the data science or statistics context, prediction and forecasting are more a statement of expectation, variance, and outcome types (true positives and true negatives, along with the two kinds of error: false positives and false negatives). Perfect prediction has zero false positives and zero false negatives. But if you have perfect prediction, then either there is something wrong with your underlying data and modeling or you don’t have a process with randomness.
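To make that last point concrete, here is a minimal sketch in Python. Everything in it is a made-up illustration: the simulated users, the uniform spread of purchase probabilities, and the 0.5 decision threshold are all assumptions, and the function names are hypothetical. The “model” is handed each user’s true purchase probability, which is the best information any model could hope for, and it still produces false positives and false negatives because the outcome itself is random.

```python
import random


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for two boolean sequences."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1
        elif p and not a:
            counts["fp"] += 1
        elif not p and a:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def simulate(n_users=10_000, seed=0):
    """Simulate purchases and score an idealized predictor (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: each user has a true purchase probability, drawn uniformly.
    true_probs = [rng.random() for _ in range(n_users)]
    # The actual outcome is a coin flip weighted by that probability.
    actual = [rng.random() < p for p in true_probs]
    # The idealized model knows the true probability and predicts "buy" at 0.5 or more.
    predicted = [p >= 0.5 for p in true_probs]
    return confusion_counts(actual, predicted)


counts = simulate()
print(counts)
```

Both error counts come out well above zero, even though the predictor has nothing left to learn. The only way to drive them to zero is to remove the randomness from the process.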

The reality is that most business problems that end up in a data science department’s queue tend to revolve around answering questions with the data available. And the data available is almost never complete with respect to the (non-degenerate) question being asked. Thus, there is inherent variance in the system; variance that cannot be reduced or accounted for. This is where the face meets the fist.

It’s a difficult thing for the non-statistician to resolve. I mean, we have a lot of complicated mathematics working for us, so why is there still error or uncertainty? Doesn’t more sophistication mean less uncertainty? It’s an interesting cognitive dissonance that is worth researching. But if I had to put my math educator hat on for a moment, I’d say that one of the underlying root causes of this dissonance is probably the mentality of “it’s either correct or it isn’t” that is pounded into children when it comes to mathematics. For sure, there are elements of this that are hard to argue with. \(2 + 2 = 4\) is the only correct statement (unless we get really, really snarky about the definition of \(+\), for example), and a lot of the mathematics that students endure (yes, endure) is of the deterministic variety.

Statistics education ends up coming very late and isn’t really done in a manner that mixes micro-business needs with statistical uncertainty. Macro business needs and statistics tend to work out well. The law of large numbers, working with aggregates, and so on are better mind-melded in the world of industrial statistics. But when we try to go from broad-stroke averages and other central tendency measures on populations to surgical prediction of a single user’s behavior, we’re in a world of hurt and discomfort. We have to recognize that the success of such data science initiatives depends on the reasonableness of the expectations placed on them. And those expectations have to be tempered through an understanding of the inherent uncertainty in the system and the completeness of the data.
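The gap between aggregate and single-user prediction can be sketched with a small simulation. The numbers here are assumptions chosen for illustration (100,000 hypothetical users, each buying independently with probability 0.3), and the function name is made up. The aggregate purchase rate lands very close to 0.3, as the law of large numbers suggests, while the best possible guess for any one user (“won’t buy”) is still wrong roughly 30% of the time.

```python
import random


def aggregate_vs_individual(n_users=100_000, p=0.3, seed=1):
    """Compare error on the population rate with error on per-user guesses."""
    rng = random.Random(seed)
    # Assumption: every user buys independently with the same probability p.
    purchases = [rng.random() < p for _ in range(n_users)]

    # Aggregate question: how far is the observed rate from the expected rate?
    observed_rate = sum(purchases) / n_users
    aggregate_error = abs(observed_rate - p)

    # Individual question: the best single guess is the more likely outcome.
    best_guess = p >= 0.5
    wrong = sum(1 for bought in purchases if bought != best_guess)
    individual_error_rate = wrong / n_users

    return aggregate_error, individual_error_rate


agg_err, ind_err = aggregate_vs_individual()
print(f"aggregate rate is off by {agg_err:.4f}")
print(f"per-user guess is wrong {ind_err:.1%} of the time")
```

Same data, same model of the world, and two very different levels of comfort, depending only on whether the question is about the population or about one person.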

User behaviors are notoriously difficult to predict. And acquiring ever more complete data can become very expensive, both in terms of dollars and in terms of speed of acquisition. I can’t tell you how many times I’ve been contacted about building a predictive model that will tell whether a user will buy a product. There is no predictive model that can do this with 100% accuracy.

What’s reasonable here is to supply a probability with some uncertainty on the probability! And sometimes extracting an uncertainty can be difficult as well.
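As one illustration of “a probability with some uncertainty on the probability”, here is a sketch that computes a Wilson score interval for a purchase rate. The Wilson interval is my choice for the example, not the only option, and the counts used (30 buyers out of 100 similar users) are invented. The function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Invented example: 30 of 100 similar users bought.
low, high = wilson_interval(30, 100)
print(f"estimated purchase probability: 0.30, interval: [{low:.3f}, {high:.3f}]")

# With 100 times more data at the same rate, the interval tightens.
low_big, high_big = wilson_interval(3_000, 10_000)
print(f"with more data: [{low_big:.3f}, {high_big:.3f}]")
```

The answer to “will this user buy?” then becomes “probably not, with a purchase probability somewhere in this range”, which is a much more honest deliverable than a yes or a no. Note that this only captures sampling uncertainty; it says nothing about whether the 100 users really are similar to the one being scored.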

Regardless, the problem of prediction first comes down to having a philosophical agreement on the meaning of the word between the technical doers and the business askers. After that, it’s worth understanding what questions are answerable and which aren’t.

And the final pitch: if you are looking to understand how prediction ought to work, get in touch. I’ve done a lot of work around building a data science team, working with organizations to rationalize their objectives, and so forth. You can read more here. Feel free to send me a message.