Interpreting Patterns in Multi-Variate Multi-Horizon Time-Series Forecasts from Google’s Temporal Fusion Transformer Model

Wow, that title is a mouthful… But it’s not as complicated as it seems. Let’s break it down:

  • Multi-Variate Time-Series Forecasts – Single-variate time-series forecasting uses only the historical values of the series whose future values we are attempting to predict. (For example, exponential decay, moving average, auto-regressive moving average.) Multi-variate forecasting allows additional time-series and non-time-series variables to be included in the model, to enhance the model’s predictive capability and give a better understanding of what influences our target predicted value(s). (For example, including the weighted seven-day moving average sentiment of news articles about a company when forecasting its stock price for tomorrow.)
  • Multi-Horizon Time-Series Forecasts – Traditional time-series forecasting is typically optimized for a specified number of periods ahead (for example, a produce department predicting next week’s potato sales to determine inventory). Multi-horizon means we attempt to predict many different future periods within the same model. (For example, predicting daily potato sales for every day over the next four weeks to reduce the number of orders and schedule times for restocking.)
  • Interpreting Patterns – A good model doesn’t only provide an accurate prediction; it also gives insight into which inputs are driving the results. That is, the model is interpretable.
  • Temporal Fusion Transformer – The name of the proposed Multi-Horizon Time-Series Forecasting framework. It combines Long Short-Term Memory (LSTM) recurrent layers with a mechanism called “Attention”, which had early uses in image recognition. (We’ll talk more about attention later.)
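To make “multi-variate” and “multi-horizon” a little more concrete, here is a minimal sketch of how a training table for such a model could be laid out with pandas. Everything in it is an assumption for illustration: the `make_multi_horizon_frame` helper, the toy `sales` series, and the `hour_of_day` input are hypothetical and are not taken from the Google notebook. I also assume that the column `t+0` holds the first step ahead, `t+1` the second, and so on.

```python
import numpy as np
import pandas as pd


def make_multi_horizon_frame(series: pd.Series, horizons: int) -> pd.DataFrame:
    """Build one target column per horizon (hypothetical helper).

    Assumed convention: 't+0' is the next step, 't+1' the step after, ...
    """
    columns = {f"t+{h}": series.shift(-(h + 1)) for h in range(horizons)}
    # The last rows have no future values to predict, so they are dropped.
    return pd.DataFrame(columns).dropna()


# Toy hourly data: the target itself plus one extra (multi-variate) input.
hours = np.arange(48)
sales = pd.Series(hours.astype(float), name="sales")
features = pd.DataFrame({"sales": sales, "hour_of_day": hours % 24})

# Four horizons predicted by the same model (multi-horizon).
targets = make_multi_horizon_frame(sales, horizons=4)
training_rows = features.loc[targets.index]

print(targets.head())
print(training_rows.head())
```

Each row of `training_rows` is paired with several future values in `targets`, which is the multi-horizon part; the extra `hour_of_day` column is the multi-variate part.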

Despite my admitted love for cool new modeling techniques, the most important word in the title is “Interpreting”. Data Science and Machine Learning are beginning to direct Deep Learning methods towards traditional business problems such as forecasting, and specifically towards better understanding what other factors (i.e. “Multi-Variate”) impact longer-term (i.e. “Multi-Horizon”) forecasts.

Unfortunately, Deep Learning models have a tendency to be black boxes: we can’t see how they arrive at a decision, we can only evaluate the accuracy. For people to learn something from the model, we have to be able to interpret what influences the model results. Enter the Google journal article and AIHub notebook on interpreting patterns in multi-horizon time-series predictions.

https://aihub.cloud.google.com/u/0/p/products%2F9f39ad8d-ad81-4fd9-8238-5186d36db2ec

In my world of marketing and finance, we understand there is no such thing as a perfectly predictive model. Both marketing and financial models deal primarily in human behavior, so a perfect model would disprove the existence of free will: we would somehow have an equation that predicted everything people would do before they did it. Instead of diving into an existential crisis, let’s assume people have agency, and that this agency will introduce a level of unpredictability into our models.

The random elements of human behavior may prevent perfect predictions; however, human behavior still has patterns. As Mark Twain (might have) said, “History doesn’t repeat itself, but it often rhymes.” We can use the echoes of the past to find patterns in behavioral noise.

Even better, the Temporal Fusion Transformer (TFT) framework is a Multi-Variate Time-Series Forecast, so we can add other data points to give a more accurate model.

Even better… er… TFT allows us to extract information about the model as well as the predictions. The information about the model, when interpreted correctly, shows what other factors were influencing our predicted values, including recurring “seasonal” patterns like the hour of the day, day of the week, and/or month of the year.

Output Interpretation

Here are some more in-depth links that dive into the very complex nature of what’s happening in the Temporal Fusion Transformer model:

https://www.researchgate.net/publication/329798567_A_Comparison_of_LSTMs_and_Attention_Mechanisms_for_Forecasting_Financial_Time_Series

https://towardsdatascience.com/attention-for-time-series-classification-and-forecasting-261723e0006d

For more mathematical and data science context, check out the pre-prints below:

https://arxiv.org/pdf/1809.04206.pdf

https://arxiv.org/pdf/1912.09363.pdf

But the links are all supplementary information. What is really important are the use cases at the bottom of the notebook.

Use Case One, Variable Importance:

The first use case is for variable importance. The table above tells us that for any predicted value, on average, the hour of the day in which that value was recorded was most important in generating this prediction, while the second most important variable was the previous hour’s value (“Target”).

Be aware that the variable importance value for a given input (i.e. the “Absolute Value”) is not that meaningful when looking at a single variable in isolation. For example, knowing that the mean value of Day of Week is 17,884.56 doesn’t tell me much. The difference in variable importance between inputs (i.e. the “Relative Value”) is what is meaningful. The relative variable importance tells us which variables, and their associated patterns, contribute more to an accurate model.
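One simple way to read importance values relatively is to turn them into shares of the total and rank them. The sketch below assumes the raw values can be treated as non-negative scores. Only the Day of Week figure (17,884.56) comes from the table discussed above; the other three numbers are made-up placeholders chosen to match the ordering described in this post, and `relative_importance` is a hypothetical helper, not a function from the notebook.

```python
def relative_importance(raw_scores):
    """Convert raw importance scores into shares that sum to 1, largest first."""
    total = sum(raw_scores.values())
    shares = {name: score / total for name, score in raw_scores.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


raw = {
    "Hour of Day": 40_000.0,   # placeholder value, not from the notebook
    "Target": 30_000.0,        # placeholder value, not from the notebook
    "Day of Week": 17_884.56,  # the value quoted in the text
    "Time Index": 5_000.0,     # placeholder value, not from the notebook
}

shares = relative_importance(raw)
for name, share in shares.items():
    print(f"{name:<12} {share:6.1%}")
```

The individual numbers carry little meaning on their own, but the ranking and the gaps between the shares are what the relative reading is about.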

An example of associated patterns: in the sample data, Hour of Day has high importance, suggesting a 24-hour cycle is present. By contrast, Time Index has low relative importance, suggesting there is little linear growth or decline. (Time is always increasing, so when time is an important variable, we expect to see a general upward or downward trend in the target variable.)

Use Case Two, Visualizing Attention Patterns:

First we need to understand what “Attention” is. The word “Attention” is used because early attention models were used to accelerate image recognition routines, in a manner that simulated a human’s process of identifying an image. For example, when attempting to identify someone, we would probably look at their hair, eyes, nose, and mouth, while generally ignoring their ears and hands.

Attention modeling does the same thing: it prioritizes the patterns that provide the most predictive power. Patterns over time are no different from patterns in an image, so recent research attention (pun intended) has focused on attention-based models for time-series forecasting.

So while an image attention function may appear like this:

A time series attention function may look more like this:

Where ‘Mean Attention Weight’ is the average importance the model assigns to each point in time, based on the input variables (including the historic behavior of the target value).
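For a rough feel of where a mean attention weight could come from, here is a minimal NumPy sketch of scaled dot-product attention, a common form of the mechanism. It is not the TFT implementation: the random queries and keys stand in for what a trained model would learn, and the sizes (`n_samples`, `n_steps`, `dim`) are arbitrary assumptions.

```python
import numpy as np


def attention_weights(query, keys):
    """Softmax of scaled dot products: one weight per past time step."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    scores = scores - scores.max()  # keeps the exponentials numerically stable
    exp = np.exp(scores)
    return exp / exp.sum()


rng = np.random.default_rng(0)
n_samples, n_steps, dim = 100, 48, 8  # arbitrary illustrative sizes

# One row of weights per sample; each row sums to 1 across the past time steps.
per_sample = np.stack([
    attention_weights(rng.normal(size=dim), rng.normal(size=(n_steps, dim)))
    for _ in range(n_samples)
])

# Averaging over samples gives one mean weight per past time step.
mean_attention_weight = per_sample.mean(axis=0)
print(mean_attention_weight.round(4))
```

With random inputs the averaged curve is close to flat; in a trained model, the steps that matter most for the prediction are the ones that stand out.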

Again, the articles I linked above explain all of these concepts in more depth, but all you really need to know is that the spikes are when the important stuff is happening.

In the case of this model, we see a spike in attention every 24 hours, with little attention paid to the other hours. There is also a small weekly pattern, which is shown in the next section.
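If you would rather not rely on eyeballing the chart, one way to check for a repeating spike is to look for the lag with the strongest autocorrelation. The sketch below uses a synthetic attention profile that I built to spike every 24 hours; it is not output from the notebook, and `dominant_lag` is a hypothetical helper.

```python
import numpy as np


def dominant_lag(weights, max_lag):
    """Return the lag (in steps) with the highest autocorrelation."""
    centered = weights - weights.mean()
    denom = np.dot(centered, centered)
    autocorr = [
        np.dot(centered[:-lag], centered[lag:]) / denom
        for lag in range(1, max_lag + 1)
    ]
    return int(np.argmax(autocorr)) + 1  # +1 because lags start at 1


# Synthetic profile: one week of hourly weights with a spike every 24 hours.
hours = np.arange(168)
weights = np.where(hours % 24 == 0, 1.0, 0.05)
weights = weights / weights.sum()

period = dominant_lag(weights, max_lag=48)
print(period)
```

On real attention weights the peaks will be noisier, so treat the result as a hint to confirm against the chart rather than as proof of a cycle.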

Bringing it all together:

Use case one demonstrated relative variable importance and use case two showed us the time series attention patterns.

In a multi-horizon model we forecast multiple time periods; the example model predicts forward in hourly increments over a 24-hour window. In use case one, relative variable importance suggested the time of day has a strong relationship with the predicted value, while the day of the week was less important:

Including multiple horizons in the attention graph from use case two, we can see the pattern of each horizon. (Horizon 1 is the prediction one hour into the future, horizon 20 is twenty hours into the future, and so on.) There is a difference in attention from one day of the week to the next, but for a given hour it is small compared with the dramatic change in attention from one hour to the next. (Each horizon shows a sharp peak at one hour, followed by a sharp reduction in attention over the next 23 hours, and a jump back up after 24 hours, that is, one day.)

Finally, I added one more visual to help show the model accuracy. The visual overlays the 50th percentile predicted values for each hour on the original values. It’s nothing fancy, but you usually want to make sure your model is able to give a decent prediction.

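import plotly.graph_objects as go
from plotly.offline import iplot

# extract_numerical_data, p50_forecast and act are assumed to be defined in earlier notebook cells.
pred_50 = extract_numerical_data(p50_forecast)

fig = go.Figure()
# Actual values for a 1,000-row window of the data.
fig.add_trace(go.Scatter(x=act.index.values[119000:120000], y=act['t+0'][119000:120000],
                    mode='lines',
                    name='Actual',
                    connectgaps=True))
# 50th percentile predictions over the same window, drawn as a dotted line.
fig.add_trace(go.Scatter(x=pred_50.index.values[119000:120000], y=pred_50['t+0'][119000:120000],
                    mode='lines',
                    name='Pred 50',
                    line=dict(color='firebrick', width=4, dash='dot')))
iplot(fig)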

The predicted line very closely aligns with the actual line, except during outlying spikes, so our model is relatively accurate other than during periods of unusually high activity.
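A chart is a good sanity check, but a number is handy too. Since the forecast here is a 50th percentile (a quantile), one natural score is the quantile loss, sometimes called pinball loss. The sketch below is my own illustration and not the notebook’s evaluation code: `quantile_loss` is a hypothetical helper, and the `actual` and `predicted` values are made up.

```python
import numpy as np


def quantile_loss(actual, predicted, quantile):
    """Average pinball loss; at quantile=0.5 this is half the mean absolute error."""
    error = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.maximum(quantile * error, (quantile - 1) * error)))


# Made-up values: the third point plays the role of an outlying spike.
actual = [10.0, 12.0, 30.0, 11.0]
predicted = [11.0, 12.0, 20.0, 10.0]

p50_loss = quantile_loss(actual, predicted, quantile=0.5)
print(p50_loss)
```

In this toy example most of the loss comes from the single spike, which mirrors what the chart shows: the model tracks ordinary hours closely and misses the extremes.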

Adding more code to the notebook also helped me better understand the data structure and modeling process.

I encourage you to open the AIHub link and try running the notebook yourself on GCP. If you need any help running the notebook, or you would like to see what more you can do with Deep Learning tools on Google Cloud Platform, feel free to reach out. I’m always happy to help! Just visit our contact us page and let me know.