The Data Science Lifecycle and Process

Data science has come a long way since the buzzword took off in the late 2000s and early 2010s. Since then, companies have settled on their own ways of adopting the craft into their organizations. And every company has a data science trauma in its collective psyche: the years-long effort that produced nothing and wasted millions, the failed rollouts of what would otherwise have been significant academic breakthroughs, the broad relabeling of any solution that used any mathematics as "AI" or "Machine Learning". This is one of a series of articles I've written in various forms over the last few years discussing data science and analytics.

In this article, I want to give a high level overview of the data science lifecycle.

I named my consulting business "Think. Plan. Do." because that is the order in which analysis work must be done, and on this I have a strong opinion. But within that process there are loops, and that, in general, is how the data science process ought to work as well. Rather than explicitly paralleling the Think, Plan, Do philosophy, I'll write the rest of this in the context of data science, but I wanted you to know where it stems from.

There are three “top level” phases in chronological order of execution: (1) Scoping, (2) Modeling & Analysis, (3) Deploying

Phase 1: Scoping

Every project should start here. If we start modeling and analysis without scoping, it is guaranteed we will find ourselves back in scoping. We want to do several things when scoping and these are intended to be iterative.

1a Understanding the Question

What is the question? Often we pose questions in a colloquial fashion: “Can we optimize our margin?”, “Can we find the optimal price of the asset?” Questions like these need to be translated from business speak to data science speak. This translation process is a sequence of questions in response to the business-level ask and generally involves getting to the quantitative heart of the matter.

Without getting too deep into the technical weeds, in order to answer a business question that we believe will require mathematical or statistical modeling, we need to be able to specify the measures by which we want to monitor the underlying modeling error (technical term). Typically, in classical models this is least-squares error, wherein the goal, loosely, is to find the solution that minimizes the aggregate squared error between data observations and model predictions.
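To make that loose statement concrete, the standard least-squares objective can be written as follows, where the symbols are generic notation rather than anything specific to this article: observations y_i, inputs x_i, a model f with parameters theta, and n data points.

$$
\hat{\theta} \;=\; \arg\min_{\theta} \; \sum_{i=1}^{n} \bigl( y_i - f(x_i; \theta) \bigr)^2
$$

Different business questions call for different error measures; least squares is simply the most common default in classical modeling.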

It has been my experience that business questions as asked are often not the real questions. This is sometimes called the XY problem. Thus, when working with clients, I do my best to steer all initial conversations toward the actual question rather than the a priori desired modeling solution posed as a question. This means getting to the end use case: what business actions / decisions do we want to take and what risk are we willing / able to tolerate?

This leads us to 1b, Gathering Data.

1b Gathering Data

It is rare to skip this step, as all models eventually have to be pressure tested against reality: observed data. It is to be expected that we will return to this step several times, as obtaining relevant data is often initially an exercise in intuition [and a type of XY problem]. There is no such thing as "all the data", only "reasonably enough relevant data". Knowing what data to gather requires a business understanding of the question. For example, in order to price an asset, there should be a strong understanding of what typically drives the asset's price. If we want to estimate the market price of a residential property [because we are in the business of pricing such assets or of trading such assets], for example, we would want relevant features of the property in question, recent transactions of geographically near properties preferably with a similar set of features, and current economic conditions [interest rates, economic sentiment, etc.], to name a few. However, if we were interested in a forward-looking time series model of the single-family housing market in a specific metropolitan region, the modeling input assumptions [aka data] would necessarily be different from those for pricing a single asset.

The punchline: more data isn’t a panacea. It’s about having enough relevant data.
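To make the "relevant features" idea concrete, a single gathered record for the residential-property example might look like the sketch below. The field names and values are purely illustrative assumptions, not a recommended feature set.

```python
from dataclasses import dataclass


@dataclass
class PropertyRecord:
    """Hypothetical feature set for pricing one residential property."""
    square_feet: float
    bedrooms: int
    year_built: int
    zip_code: str
    recent_comp_prices: list[float]   # recent transactions of nearby, similar homes
    mortgage_rate: float              # current economic conditions
    local_sentiment_index: float


record = PropertyRecord(
    square_feet=1_850, bedrooms=3, year_built=1994, zip_code="30301",
    recent_comp_prices=[355_000, 372_500, 349_900],
    mortgage_rate=0.0675, local_sentiment_index=0.42,
)
print(record)
```

A time series model of an entire metropolitan market would, as noted above, gather a very different set of inputs.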

1c Scrubbing and Cleaning

As a general rule, data sources are not clean. Data points can be outright wrong [boundary condition violations], missing, or correct but extreme. What we do in each of these cases begins to form implicit "modeling assumptions". It can be correct to discard non-relevant data points or records, partially or in their entirety. It can also be correct to impute where needed [https://en.wikipedia.org/wiki/Imputation_%28statistics%29].
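As a minimal sketch of those three cases, assuming a pandas DataFrame with hypothetical price and square_feet columns (not from any particular dataset):

```python
import pandas as pd

# Hypothetical raw data: a boundary-condition violation (negative price),
# a missing value, and an extreme but possibly valid observation.
df = pd.DataFrame({
    "price":       [350_000, -1, 410_000, 5_000_000, None],
    "square_feet": [1_800, 2_100, None, 6_500, 1_500],
})

# Discard records that violate a boundary condition (price must be positive).
df = df[df["price"].isna() | (df["price"] > 0)]

# Impute missing square footage with the median -- an implicit modeling assumption.
df["square_feet"] = df["square_feet"].fillna(df["square_feet"].median())

# Flag (rather than drop) extreme but plausibly correct prices for later review.
df["price_is_extreme"] = df["price"] > df["price"].quantile(0.99)

print(df)
```

Each of these choices (drop, impute, flag) is itself an assumption that should be recorded, since it shapes everything downstream.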

1d Exploratory Data Analysis

Depending on the school of thought, exploratory data analysis can be broad, very narrow, or somewhere in between. The general idea, however, is to extend the scrubbing and cleaning part of working with data to do further sanity checks as well as to get a sense of the “surface area” of the data. Do certain input variables have large variability? Do certain input variables behave fairly predictably? Are there implicit underlying correlations among input variables? Are input variables roughly on the same scale? Will they need to be rescaled? Are there overtly strong correlations between some input variables and the response variables? Is there adequate representation of different underlying segments in the data? [To use the housing example, if we want to price a single family home with 3BR, do we have enough relevant samples of 3BR single family homes or is our entire data set of 4BR single family homes?]
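A minimal EDA sketch that touches several of these questions, using a tiny made-up single-family-home table purely for illustration:

```python
import pandas as pd

# Hypothetical single-family-home data (illustrative values only).
df = pd.DataFrame({
    "square_feet": [1_500, 1_800, 2_100, 2_400, 3_900],
    "bedrooms":    [3, 3, 4, 4, 4],
    "price":       [310_000, 355_000, 420_000, 470_000, 740_000],
})

# Variability and scale of each variable: do some need rescaling?
print(df.describe())

# Correlations among inputs and between inputs and the response.
print(df.corr(numeric_only=True))

# Representation of underlying segments: do we have enough 3BR samples,
# or is the data set dominated by 4BR homes?
print(df["bedrooms"].value_counts())
```

In practice this step is far more open-ended, but even these three summaries surface scale, correlation, and segment-representation issues early.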

Having an underlying bias in sample data is a common problem and there is no single, catchall solution. Both the stakeholder and modeler have to be equally vigilant in ensuring that adequate representation of relevant data segments exists. Here is some additional reading on biases in data and modeling.

This exploratory data analysis phase tends to bleed into Phase 2, Modeling & Analysis.

Phase 2: Modeling & Analysis

In this phase, the analyst / data scientist / statistician / quant drives the technical details. While the conventional belief may be that analysis is akin to "here is a bunch of data, please feed it into 'the algorithm' and tell me what to do", the truth is anything but. The quest for algorithmic automation [that is, automating the modeling & analysis part so that it is "free of human intervention"] will, in the opinion of this author, never be fully solved. It may be solved for narrow, well-defined problems with a long history of stable data and general immutability in process, but not for arbitrary problems in arbitrary contexts. This is a variation on the "No Free Lunch Theorem".

However, that does not mean we can't have some rules of thumb. And this chart gives a good overview of how one could go about choosing the right estimator.
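As a rough illustration of the rule-of-thumb approach (the dataset and the particular estimators below are assumptions for illustration, not a prescription), one might start with a simple baseline and then try a more flexible model, comparing both with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic labeled data standing in for a real business dataset.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Rule of thumb: start simple (a linear model), then try a more flexible ensemble.
for model in (LogisticRegression(max_iter=1_000),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{type(model).__name__}: mean accuracy = {scores.mean():.3f}")
```

The point is not the specific models but the habit of comparing cheap candidates before committing to anything elaborate.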

There are also solutions that look to automate machine learning pipelines. TPOT is one such example.
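For flavor, here is a minimal sketch along the lines of TPOT's documented quick-start; exact arguments vary by installed TPOT version, and the settings below are illustrative only:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# TPOT searches over preprocessing + model pipelines via genetic programming.
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=0)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
tpot.export("tpot_digits_pipeline.py")  # writes the best pipeline as plain scikit-learn code
```

Even here, a human still has to decide what "best" means, which data goes in, and whether the exported pipeline is fit for the business question.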

There are also companies built around Enterprise AI. DataRobot is a good example. And here is a good branded article on Enterprise AI at Forbes.

The general goal in this phase, however, is to develop a model that is simultaneously a business-usable algorithmic solution and one that can be deployed in a scalable fashion. Here, "business-usable algorithmic solution" means a solution that allows business stakeholders to take data-informed actions. This is an important caveat in that it is entirely possible that even the best algorithmic solution may not have much business value. This can happen even after a strong vetting process of the business problem, because it may turn out that the response variables have more underlying variability than can be modeled and this was unknown at inception.

The second part, "can be deployed in a scalable fashion", is also a strong consideration when developing a model. What works in a Jupyter notebook may be divorced from what is possible in a server-side implementation. Similarly, if the expectation is that a server-side implementation be able to ingest and process terabytes of data, but the modeling solution was developed in the analyst's local environment, then there could be serious implications in terms of model accuracy. Hence, it's advisable to have an additional round of "live" testing in the model development process before a full deployment to production. For example, while it may be straightforward to run some type of clustering algorithm in "real time" on a dataset that fits on a personal laptop, the computational complexity of the algorithm may not scale well to a terabyte (or, worse, a petabyte-sized) data set.
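One common mitigation, sketched below with scikit-learn's KMeans and MiniBatchKMeans (a tooling assumption, not a prescription), is to swap a full-batch algorithm for a mini-batch variant that can be fed data in chunks when the full dataset no longer fits in memory:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for a dataset that fits comfortably on a laptop.
X, _ = make_blobs(n_samples=100_000, n_features=10, centers=5, random_state=0)

# Full-batch KMeans: fine locally, but memory and runtime grow with the data.
full = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# MiniBatchKMeans processes small chunks and also supports partial_fit,
# so it can be fed data incrementally in a server-side setting.
mini = MiniBatchKMeans(n_clusters=5, batch_size=1_024, n_init=10, random_state=0).fit(X)

print(full.inertia_, mini.inertia_)
```

The trade-off is typically a small loss in solution quality in exchange for an algorithm that survives contact with production-scale data.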

Odds are that as modeling work progresses, there will be additional data and modeling discoveries of the negative kind: [i] the collected data may actually not be sufficient for meaningful prediction, [ii] there may not be enough data for the desired accuracy and acquiring new data may be prohibitive [while there are resampling techniques such as bootstrapping to help alleviate such a problem, they have limited use], [iii] the best model may be the least explainable model, and if there are regulatory concerns or matters of transparency that need to be ensured, additional fallback contingencies will have to be accounted for, to name a few.
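For reference, here is a minimal bootstrap sketch in NumPy (generic notation, not tied to any dataset in this article): it estimates the uncertainty of a sample mean by resampling with replacement, which is helpful but cannot manufacture information that the original sample lacks.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small observed sample standing in for scarce real data.
sample = rng.normal(loc=100.0, scale=15.0, size=50)

# Resample with replacement many times and recompute the statistic of interest.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# Percentile confidence interval for the mean.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```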

There is an ebb and flow between the Scoping and Modeling & Analysis phases. It's not unusual to have to go back to the drawing board or to allow for slight tweaks to the original ask. Both stakeholder and analyst should be open to this. The mantra, though, should be to "fail often and fail quickly". Fine-tuning and refining models should be a distant last step. I certainly encourage as many proof-of-concept solutions as possible, steadily pressure testing them on the stability of and dependence on underlying assumptions, the scalability and maintainability of the algorithm, the explainability of the algorithm, the reliability of the data [not just the currently collected data, but new data coming in], as well as more technical details such as model sensitivity to changes in (hyper)parameters and to extreme but valid data points.
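As one small example of that last, more technical item, a quick hyperparameter sensitivity check might look like the sketch below; the estimator and the grid are assumptions chosen purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# How sensitive is cross-validated performance to the regularization strength?
alphas = np.logspace(-3, 3, 7)
train_scores, val_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=alphas, cv=5, scoring="r2"
)

for a, s in zip(alphas, val_scores.mean(axis=1)):
    print(f"alpha={a:g}: mean CV R^2 = {s:.3f}")
```

A model whose performance swings wildly across reasonable parameter values deserves suspicion before it deserves fine-tuning.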

Phase 3: Deploying the Solution

Not all solutions need to be fully integrated at the product software level. Some solutions can reliably exist as a local script that is run on demand, others as scripts that live in the cloud and are run on demand, and still others as part of an ETL process populating data tables that can be queried by a technical end user or delivered in a digestible, client-facing manner. What this solution looks like should have been part of the Scoping phase. Without a serious plan or reasonable consideration of what "done" means, it is possible for high quality data analysis work to be ultimately unusable because the end user is unable to run the solution.

If a software engineering team is going to be needed, then appropriate time and resources should be allocated. Not all data analysts / scientists are software engineers, even though there may be significant overlap in programming ability and a common understanding of the roles. If a trader needs a local solution, then they probably just need a local environment set up so that the model can be updated and run, and odds are only minimal engineering effort will be needed. If, however, a fully specified web interface with two-factor authentication and customized user settings is needed, then this becomes a much larger effort requiring the coordination of staff across multiple disciplines and roles.
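To make the lightest-weight end of that spectrum concrete, here is a minimal sketch of an on-demand local script with a command-line entry point; the file names, columns, and the placeholder pricing function are hypothetical.

```python
"""price_homes.py -- hypothetical on-demand pricing script run locally by an analyst or trader."""
import argparse

import pandas as pd


def price_home(features: pd.DataFrame) -> pd.Series:
    # Placeholder model: in practice this would load a fitted model artifact.
    return 150.0 * features["square_feet"] + 25_000.0 * features["bedrooms"]


def main() -> None:
    parser = argparse.ArgumentParser(description="Score a CSV of homes on demand.")
    parser.add_argument("input_csv", help="CSV with square_feet and bedrooms columns")
    parser.add_argument("output_csv", help="where to write the predicted prices")
    args = parser.parse_args()

    features = pd.read_csv(args.input_csv)
    features["predicted_price"] = price_home(features)
    features.to_csv(args.output_csv, index=False)


if __name__ == "__main__":
    main()
```

Anything beyond this (shared infrastructure, authentication, user-facing interfaces) is where dedicated engineering time has to be budgeted.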

Summary and Other Matters

While this is a high level outline of what a data science process should look like, there are plenty more details. Enough, in fact, that a book could be written. Additionally, these thoughts represent one philosophy. What works and what doesn't can also depend on the area of research and the risk being taken. For example, algorithmic solutions for health care should go through extreme scrutiny and have strong regulatory oversight, especially if the output of the models could have life or death implications. On the other hand, if we're trying to determine whether a player will buy the Helmet of the Stars or the Undying Boots of Doom for $7.99, the modeling and data assumptions can be a little more cavalier. Similarly, in the financial markets, a model-developed trading strategy that assumes the sums of money being traded are not an appreciable percentage of the overall trade volume can have catastrophic consequences if, in fact, the opposite is true.

Finally, mine is not the only opinion. Here are a few other perspectives in the space: Microsoft, CRISP-DM.

And more can be found in the links earlier in this article.

I tangentially mentioned ethical concerns when discussing bias in data. Ethics in AI is a large topic of discussion, as is concern about user privacy (and the right to privacy). Whether something can be modeled is different from whether something should be modeled. Facial recognition and its use by federal, state, and local governments, as well as by law enforcement or the military, is a hot, click-driving topic. However, other less overt but no less worrisome uses come in the form of user targeting for government propaganda directed at a government's own citizens or those of other nations [a form of information warfare]. There may very well be a hierarchy of ethics, but these examples should give a flavor of some of the more egregious uses of data.