This is the first in a series of blog posts about building time-series forecasting models. At clypd, we use forecasting models to help media owners and buyers forecast future TV audiences. A successful forecasting model depends on many factors. In this post, we focus on algorithms, and how we tap into both modern Machine Learning (ML) models and classical statistical models to take advantage of what both offer.
The advancement of Machine Learning and Artificial Intelligence has been creating amazing stories every day, from AI assistants and self-driving vehicles to computer programs beating professional Go players. At clypd, we also have many success stories of using ML models. With the benefits of better accuracy and better automation, these ML models are an integral part of our forecasting models. At the same time, we continue to find great value in “conventional” statistical models. So, instead of pitting Data Scientist against Statistician, let us look at ML models vs. statistical models, and how we can leverage both types of approaches in building a TV audience forecasting model.
A common concern about ML models is their perceived “black box” nature. ML models can be mysterious in terms of how they work, even to a trained expert. As an example, object recognition is one of the most heavily researched fields, and today’s AI models can outperform humans in both image and audio object recognition. However, even the most sophisticated AI models can sometimes make simple, silly mistakes. Below is a widely cited example of what the AI literature calls “adversarial examples.”
The key difficulty with adversarial inputs is that they can be very hard to understand – it is unclear why the panda was classified as a gibbon. As discussed in depth by this article from OpenAI, “Adversarial examples show us that even simple modern algorithms, for both supervised and reinforcement learning, can already behave in surprising ways that we do not intend.” (In case you want to see more, there is a YouTube compilation.) This means we have to approach such models with caution: the black box can be problematic for clients who want to understand how and why TV audience forecasts are created.
Statistical models, on the other hand, are built on a solid foundation of explicit theories and assumptions. Yes, the theoretical jargon can be difficult to understand and sometimes intimidating, but the benefits lie in transparency and interpretability. When used and presented properly, statistical models can provide useful business insight and recommendations for decision making.
We love how ML models can deliver amazing accuracy in predictive modeling. There are, however, areas where conventional statistics methods can still provide attractive alternatives to ML models.
Small data – ML models require lots of data (big data). When the size of the data is limited (small data), ML models are more susceptible to overfitting. In this situation, statistical models can be a very effective alternative. In a recent Kaggle competition (Leaf Classification Competition), the top-ranking solution used logistic regression, a classical statistical model, alongside random forest as the foundation of the model.
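To make this concrete, here is a minimal sketch – synthetic data and scikit-learn, not our actual clypd pipeline – comparing logistic regression with a random forest on a deliberately small dataset:

```python
# Sketch: on small datasets, a regularized logistic regression can rival
# more complex learners. The dataset here is synthetic, for illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A deliberately small dataset: 60 samples, 20 features.
X, y = make_classification(n_samples=60, n_features=20, n_informative=5,
                           random_state=0)

logreg = LogisticRegression(C=1.0, max_iter=1000)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("logistic regression", logreg), ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```

With so few samples per fold, the simpler, more constrained model often holds its own against the ensemble – which is the point of reaching for classical models in the small-data regime.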
Extrapolation – Researchers have noticed that Random Forest and Gradient Boosting Machine models, two of the most popular tree-based ML models, can perform poorly when predicting data outside the range of the training data. Here is one example. Statistical models can be used to fill this gap.
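The extrapolation problem is easy to demonstrate. In this hedged sketch on synthetic data, a random forest trained on a linear trend cannot predict beyond the range it has seen, while a simple linear regression can:

```python
# Sketch: tree ensembles predict by averaging training-set leaves, so they
# cannot extrapolate beyond the training range; a linear model can.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 100).reshape(-1, 1)
y_train = 2.0 * X_train.ravel() + rng.normal(0, 0.1, 100)  # y ≈ 2x

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# Predict well outside the training range (x = 20; true value ≈ 40).
x_new = np.array([[20.0]])
print("forest:", forest.predict(x_new)[0])  # stuck near the edge of training y (~20)
print("linear:", linear.predict(x_new)[0])  # extrapolates the trend (~40)
```

The forest's prediction plateaus at roughly the largest target it saw in training, while the linear model follows the trend – one reason to keep a statistical model in the mix when forecasts must reach beyond historical ranges.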
Time-series analysis – ML has also been applied to time-series forecasting in the last few years. Traditional time-series models (such as ARIMA) have served as important benchmarks against which more complex ML models are measured. Additionally, lessons from time-series models have been applied to guide the development of ML models.
ML models can be time-consuming for a few reasons: comparing alternative models is a trial-and-error process, many of the candidate models are computation-intensive, and tuning the parameters of ML models can be painful. A statistical model usually takes much less time to set up and run, which means researchers can spend more time examining the data, designing better predictive features, and testing hypotheses much more quickly.
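To see why tuning gets expensive, consider how quickly even a modest hyperparameter grid grows. This sketch (the grid values are hypothetical) just counts the candidate configurations:

```python
# Sketch: a small 3x3x3 grid for a tree ensemble already yields 27
# configurations; with 5-fold cross-validation that is 135 model fits.
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5, 10],
})
print(len(grid))  # 27 candidate configurations
```

A typical statistical model, by contrast, has few or no hyperparameters to search over, which is where much of the setup-time advantage comes from.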
Given the pros and cons of ML vs. statistical models, how should we make the choice? Fortunately, we do not have to choose one over the other; instead, we can leverage both types of approaches as we build forecasting models. In fact, the line between ML and statistical models is often vague and sometimes artificial, and we would rather consider them both part of a toolset we choose from.
Let us look at a few examples of the hybrid approach at different stages of model development:
Exploratory Analysis. When exploring structured, quantitative data sets, statistics offers many well-designed techniques to choose from. These approaches come from a wide range of fields, including Econometrics, Biostatistics, and Psychometrics. When dealing with unstructured data (e.g. words and text), Natural Language Processing (NLP) models are often desirable.
Building a Forecast Model. In this stage, we combine multiple models to achieve better overall accuracy. For example, when forecasting TV audiences, we noticed that different models achieve different accuracy depending on certain characteristics (in this case, the type of program), as the example illustrates. In the end, we employed “ensemble models” – essentially different models for different program types – to achieve even better accuracy.
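As a hedged sketch of the idea (the program types, features, and model choices here are hypothetical, not our production setup), this per-segment ensembling amounts to fitting one model per program type and routing each prediction accordingly:

```python
# Sketch: fit a separate model per program type, then dispatch at
# prediction time. Segment names, data, and model choices are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical training data keyed by program type: (features, audience).
segments = {
    "news":   (rng.normal(size=(50, 3)), rng.normal(size=50)),
    "sports": (rng.normal(size=(50, 3)), rng.normal(size=50)),
}

# In practice, the model family per segment would be chosen by
# validation accuracy; here the assignment is fixed for illustration.
model_for = {
    "news": LinearRegression(),
    "sports": GradientBoostingRegressor(random_state=0),
}

fitted = {seg: model_for[seg].fit(X, y) for seg, (X, y) in segments.items()}

def predict(program_type, features):
    """Route a forecast request to the model fitted for its program type."""
    return fitted[program_type].predict(features)

print(predict("news", rng.normal(size=(1, 3))))
```

The routing layer is what makes this an "ensemble" in our sense: each segment gets the algorithm that performed best for it, rather than forcing one model to fit all program types.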
There is no single best solution for all forecasting requirements, even for as narrow an area of study as TV audiences. At clypd, we leverage the benefits of both ML and statistical models as part of a larger modeling toolbox. This allows us to choose and combine the right algorithms and approaches, based upon product objectives, available data, and other resources, to build improved forecasting models.