Interpreting Patterns in Multi-Variate Multi-Horizon Time-Series Forecasts from Google’s Temporal Fusion Transformer Model

Wow, that title is a mouthful… But it’s not as complicated as it seems. Let’s break it down:

  • Multi-Variate Time-Series Forecasts – Single-variate time-series forecasting uses only the historical values of the series we are trying to predict. (For example, exponential decay, moving average, auto-regressive moving average.) Multi-variate forecasting allows additional time-series and non-time-series variables to be included in the model to enhance the model’s predictive capability and give a better understanding of what influences our target predicted value(s). (For example, including the weighted seven-day moving average sentiment of news articles about a company when forecasting its stock price for tomorrow.)
  • Multi-Horizon Time-Series Forecasts – Traditional time-series forecasting is typically optimized for a specified number of periods ahead (for example, a produce department predicting next week’s potato sales to determine inventory). Multi-horizon means we attempt to predict many different future periods within the same model. (For example, predicting daily potato sales for every day over the next four weeks to reduce the number of orders and schedule times for restocking.)
  • Interpreting Patterns – A good model doesn’t only provide an accurate prediction; it also gives insight into which inputs are driving the results. That is, the model is interpretable.
  • Temporal Fusion Transformer – The name of the proposed Multi-Horizon Time-Series Forecasting framework. It combines recurrent Long Short-Term Memory (LSTM) layers with a mechanism called “Attention” that was used early on in image recognition. (We’ll talk more about attention later.)
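
To make the multi-variate, multi-horizon idea concrete, here is a minimal sketch in plain Python of how training examples might be cut from hourly data. The function name `make_windows`, the `lookback` and `horizons` parameters, and the toy numbers are all assumptions for illustration; this is not how the TFT notebook prepares its data.

```python
def make_windows(target, covariates, lookback, horizons):
    """Cut (inputs, outputs) pairs from aligned series.

    target:     list of floats, the series we want to forecast
    covariates: dict of name -> list of floats, same length as target
    lookback:   number of past steps the model sees
    horizons:   number of future steps predicted at once
    """
    examples = []
    last_start = len(target) - lookback - horizons
    for start in range(last_start + 1):
        split = start + lookback
        # Multi-variate: past target values plus every extra input series.
        inputs = {"target": target[start:split]}
        for name, series in covariates.items():
            inputs[name] = series[start:split]
        # Multi-horizon: several future target values per example.
        outputs = target[split:split + horizons]
        examples.append((inputs, outputs))
    return examples


# Toy data (hypothetical): 8 hours of sales plus the hour of day.
sales = [10.0, 12.0, 15.0, 11.0, 9.0, 13.0, 16.0, 12.0]
hour_of_day = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

examples = make_windows(sales, {"hour_of_day": hour_of_day}, lookback=4, horizons=2)
first_inputs, first_outputs = examples[0]
```

Each example pairs four hours of history (sales and hour of day) with the next two hours of sales, so one model is asked for several horizons at once.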

Despite my admitted love for cool new modeling techniques, the most important word in the title is “Interpreting”. Data Science and Machine Learning are beginning to direct Deep Learning methods towards traditional business problems, for example, forecasting – specifically, better understanding what other factors (i.e. “Multi-Variate”) impact longer term (i.e. ‘Multi-Horizon’) forecasts.

Unfortunately, Deep Learning models tend to be black boxes: we can’t see how they arrive at a decision, we can only evaluate the accuracy. For people to learn something from the model, we have to be able to interpret what influences the model results. Enter the Google journal article and AIHub notebook on interpreting patterns in multi-horizon time-series predictions.

In my world of marketing and finance, we understand there is no such thing as a perfectly predictive model. Both marketing and financial models deal primarily in human behavior, so if we could build a perfect model, it would disprove the existence of free will: we would somehow have an equation that predicted everything people would do before they did it. Instead of diving into an existential crisis, let’s assume people have agency, and that said agency will introduce a level of unpredictability into our models.

The random elements of human behavior may prevent perfect predictions; however, human behavior still has patterns. As Mark Twain (might have) said, “History doesn’t repeat itself, but it often rhymes.” We can use the echoes of the past to find patterns in behavioral noise.

Even better, the Temporal Fusion Transformer (TFT) framework is a Multi-Variate Time-Series Forecast, so we can add other data points to give a more accurate model.

Even better… er… TFT allows us to extract information about the model as well as the predictions. The information about the model, when interpreted correctly, shows what other factors were influencing our predicted values, including recurring “seasonal” patterns like the hour of the day, day of the week, and/or month of the year.

Output Interpretation

Here are some more in depth links diving into the very complex nature of what’s happening in the Temporal Fusion Transformer Model

For more mathematical and data science context, check out the pre-prints below

But the links are all supplementary information. What is really important are the use cases at the bottom of the notebook.

Use Case One, Variable Importance:

The first use case is for variable importance. The table above tells us that for any predicted value, on average, the hour of the day in which that value was recorded was most important in generating this prediction, while the second most important variable was the previous hour’s value (“Target”).

Be aware that the variable importance value for a given input (i.e. “Absolute Value”) is not that meaningful when only looking at a single variable. For example, knowing that the mean value of Day of Week is 17,884.56 doesn’t tell me much. The difference in variable importance between inputs (i.e. “Relative Value”) is meaningful. The relative variable importance tells us which variables and associated patterns are considered to be more associated with an accurate model.

As an example of associated patterns, in the sample data, hour of day has high importance, suggesting a 24-hour cycle is present. Alternatively, Time Index is of low relative importance, suggesting there is little linear growth or decline in the data. (Time is always consistently increasing, so when time is an important variable, we expect to see a general upward or downward trend in the target variable.)
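
One way to read importance in relative terms is to convert the raw scores into shares of the total. The sketch below does that in plain Python. Only the Day of Week figure comes from this article; the other scores and the helper name `relative_importance` are made-up placeholders, so treat the output as an illustration of the idea rather than the notebook’s results.

```python
def relative_importance(raw_scores):
    """Convert raw importance scores into shares that sum to 1, sorted high to low."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive number")
    shares = {name: score / total for name, score in raw_scores.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Only the Day of Week value appears in the article; the rest are hypothetical.
raw = {
    "Hour of Day": 60000.0,
    "Target": 40000.0,
    "Day of Week": 17884.56,
    "Time Index": 5000.0,
}

ranked = relative_importance(raw)
for name, share in ranked:
    print(f"{name:12s} {share:6.1%}")
```

A single raw score says little on its own, but the ranking and the gaps between shares show which inputs the model leans on most.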

The second use case, visualizing attention patterns:

First we need to understand what “Attention” is. The word is used because attention modeling was used early on to accelerate image recognition routines in a manner that simulated a human’s process of identifying an image. For example, when attempting to identify someone, we would probably look at their hair, eyes, nose, and mouth, while generally ignoring their ears and hands.

Attention modeling does the same thing: it prioritizes the patterns which provide the most predictive power. Patterns over time are no different than patterns in an image, so recent research attention (pun intended) has focused on attention-based models for time-series forecasting.

So while an image attention function may look like this:

A time series attention function may look more like this:

Where ‘Mean Attention Weight’ is the average importance of variables (including the historic behavior of the target value) over time.

Again, the article I linked above explains all of these concepts in more depth, but all you really need to know is that the spikes are when the important stuff is happening.

In the case of this model, we see a spike in attention every 24 hours, with little attention paid in other hours. There is also a small weekly pattern present, which will be shown in the next section.
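
To show what “a spike every 24 hours” means in numbers, here is a small sketch that averages attention weights across samples and measures the gap between peaks. The attention rows are synthetic and the helper names are hypothetical; the notebook’s real attention output has its own structure, which is not reproduced here.

```python
def mean_attention(attention_rows):
    """Average attention weights position by position across samples."""
    n = len(attention_rows)
    width = len(attention_rows[0])
    return [sum(row[i] for row in attention_rows) / n for i in range(width)]


def peak_positions(weights):
    """Return indices that are strictly higher than both neighbours."""
    return [
        i for i in range(1, len(weights) - 1)
        if weights[i] > weights[i - 1] and weights[i] > weights[i + 1]
    ]


def synthetic_row(length, period, spike, base):
    """Hypothetical attention row: one spike per period, normalised to sum to 1."""
    raw = [spike if i % period == period // 2 else base for i in range(length)]
    total = sum(raw)
    return [value / total for value in raw]


# Two made-up samples covering 72 past hours, each with a daily spike.
rows = [
    synthetic_row(length=72, period=24, spike=5.0, base=1.0),
    synthetic_row(length=72, period=24, spike=7.0, base=1.0),
]
avg = mean_attention(rows)
peaks = peak_positions(avg)
gaps = [later - earlier for earlier, later in zip(peaks, peaks[1:])]
```

A constant gap of 24 between peaks is the numeric version of the daily spike visible in the attention chart.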

Bringing it all together:

Use case one demonstrated relative variable importance and use case two showed us the time series attention patterns.

In a multi-horizon model we forecast multiple time periods, and the example model predicts each of the next 24 hours. In use case one, relative variable importance suggested time of day has a strong relationship with the predicted value, while the day of the week was less important:

Including multiple horizons in the attention graph from use case two, we can see the pattern of each horizon. (Horizon 1 is one hour future state prediction, horizon 20 is twenty hour future state, and so on…) While there is a difference in attention from one day of the week to the next, for a given hour attention does not change as much from one day to the next as the dramatic change in attention from one hour to the next. (Each horizon shows a sharp peak at one hour, followed by sharp reduction in attention over the next 23 hours, and a jump back up after 24 hours, that is, one day.)

Finally, I added one more visual to help show the model accuracy. The visual overlays the 50th percentile predicted values for each hour over the original values. It’s nothing fancy but you usually want to make sure your model is able to give a decent prediction.

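import plotly.graph_objects as go

# p50_forecast, act, and extract_numerical_data come from earlier cells in the notebook
pred_50 = extract_numerical_data(p50_forecast)

# Plot a 1,000-row window so individual hours are visible
window = slice(119000, 120000)

fig = go.Figure()

# Actual values (the original trace arguments were lost; the name here is an assumption)
fig.add_trace(go.Scatter(x=act.index.values[window], y=act['t+0'][window],
                    name='Actual'))

# 50th percentile (median) forecast
fig.add_trace(go.Scatter(x=pred_50.index.values[window], y=pred_50['t+0'][window],
                    name='Pred 50',
                    line=dict(color='firebrick', width=4, dash='dot')))

fig.show()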


The predicted line aligns very closely with the actual line, except during outlying spikes, so our model is relatively accurate other than during high-indexing activity.
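
If you want a number to go with the visual check, a simple option is the mean absolute error between the actual values and the 50th percentile forecast, alongside the pinball (quantile) loss that is commonly used to score percentile forecasts. The sketch below uses made-up values and hypothetical helper names; it does not claim to match how the notebook evaluates the model.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def quantile_loss(actual, predicted, quantile):
    """Average pinball loss; at quantile=0.5 it equals half the MAE."""
    total = 0.0
    for a, p in zip(actual, predicted):
        error = a - p
        total += max(quantile * error, (quantile - 1.0) * error)
    return total / len(actual)


# Made-up numbers with one spike that the forecast misses.
actual = [10.0, 12.0, 11.0, 30.0, 12.0]
forecast_p50 = [10.5, 11.5, 11.0, 14.0, 12.5]

mae = mean_absolute_error(actual, forecast_p50)
p50_loss = quantile_loss(actual, forecast_p50, 0.5)
```

In this toy data nearly all of the error comes from the single spike, which mirrors what the chart shows: close tracking in normal hours and misses on outliers.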

Adding more code to the notebook also helped me better understand the data structure and modeling process.

I encourage you to open the AIHub link and try running the notebook yourself on GCP. If you need any help running the notebook, or you would like to see what more you can do with Deep Learning tools on Google Cloud Platform, feel free to reach out. I’m always happy to help! Just visit our contact us page and let me know.

Notes from passing both GCP Cloud Architect and Data Engineer Professional Certifications in 30 days

Within 30 days I passed both the Google Cloud Platform Professional Data Engineer and Architect Certification exams.

However, it took me much longer than 30 days of study and experience to pass the exams.

Fortunately, there was a lot of overlap between the two exams, so if anyone else wants to put their personal life on hold for a few months and attempt something as crazy as passing two of the hardest cloud certifications in a short period of time, here are some tips to help you out.

First, the professional certifications are just as much about critical thinking as they are about technical knowledge, meaning you will not know the correct answer for many questions, but you might know the wrong answers. The test requires process of elimination. When you face a question that does not have an obvious answer, make sure to read the other answer options to see if there are any obvious candidates for elimination.

For example, there was a question about architecting a VM-hosted web application and how best to accommodate a business requirement for HTTP failover. You had to decide whether to point the load balancer to individual VM instances’ IP addresses or to a VM instance GROUP’s IP address and

If you’ve only used deployment templates or worked more with managed services rather than compute – or focused more on development or architecture rather than networking – this is not a situation you’ll come across very often. In the relatively rare case when someone has configured an http load balancer and an instance group *within the GCP console*, they would know you can only point a load balancer to an instance group, not an instance itself; but for the rest of us there is still a way we can figure out the answer.

We should know that HTTP failover means a load balancer, so any answer not mentioning a layer 7 load balancer should be excluded. That leaves two options: pointing the load balancer to the VM instances or to the instance group. I should mention, we are technically pointing the load balancer to the instances in both options, but this is about configuration, not physical architecture.

(Note: Layer 7 load balancing is, somewhat oversimplified, HTTP traffic allocation with some logic, whereas layer 4 is TCP/UDP with little logic.)

Let’s assume we don’t know the right answer, but we do know managed instance groups allow autoscaling of VMs based on usage, and we know enough about IP addresses and load balancers to know that if a new VM instance is created, the load balancer needs to know its IP address; otherwise the load balancer won’t know where to forward traffic. Knowing managed instance groups are often used for scalable web applications, it only makes sense to point the load balancer to the managed instance *group* and not to each individual instance.
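
The reasoning above can be played out with a toy model. The classes below are hypothetical stand-ins written for this example, not GCP client code, and real load balancers also handle health checks and much more. The only point is to show what happens to each configuration when autoscaling adds a VM.

```python
class InstanceGroup:
    """Toy stand-in for a managed instance group (not a real GCP client)."""

    def __init__(self, ips):
        self.ips = list(ips)

    def scale_out(self, new_ip):
        self.ips.append(new_ip)


class LoadBalancer:
    """Toy round-robin balancer whose backends are looked up on every request."""

    def __init__(self, backend_lookup):
        self.backend_lookup = backend_lookup
        self.counter = 0

    def route(self):
        backends = self.backend_lookup()
        ip = backends[self.counter % len(backends)]
        self.counter += 1
        return ip


group = InstanceGroup(["10.0.0.1", "10.0.0.2"])

# Option A: copy the instance IPs once, at configuration time.
fixed_ips = list(group.ips)
lb_to_instances = LoadBalancer(lambda: fixed_ips)

# Option B: point at the group, so membership is resolved per request.
lb_to_group = LoadBalancer(lambda: group.ips)

group.scale_out("10.0.0.3")  # autoscaling adds a VM

seen_by_instances = {lb_to_instances.route() for _ in range(6)}
seen_by_group = {lb_to_group.route() for _ in range(6)}
```

The balancer configured with a fixed list never sends traffic to the new VM, while the one pointed at the group picks it up without any reconfiguration.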

Speaking of networking, you’ll need to study and know a lot of networking. Some examples of terms and concepts to be familiar with (non-exhaustive):

Related to networking, you’ll need to know how data is shared between GCP organizations, on-premises data, and other cloud providers. There are a lot of options for this, and all are situational, so understand the differences between:

On the subject of data, the Data Engineer certification had much more architecture than I expected. You’ll need to understand both application data architecture and analytics data architecture. There’s so much information, but at a high level, you’ll need to know when to use:

You’ll also need to understand the business cases for the different ML and AI Platform services. For example, when is it better to use one of GCP’s pre-trained ML APIs (e.g. Vision API) vs. training your own in AutoML vs. deploying your own custom-built models using a tool like AI Platform Prediction?

Learning Resources

It’s difficult to describe my full experience without turning this article into even more of a study guide, but allow me to give some helpful resources.

My starting point was Earl Gay’s excellent study guide on Medium. It has a lot of helpful links which I will not reproduce in this article, so check out Earl’s guide for more info. If you are able to explain why every decision was made in every single flowchart on this site, then you should be able to pass both GCP Architect and Data Engineer Professional Certifications.

In order to gain that knowledge, the most complete online courses I found were at Linux Academy, with one course for Cloud Architect and one for Data Engineer.

The Linux Academy courses also contain practice tests with different questions than the sample test provided by GCP.

Most people consider Coursera first when they want to study online. In my personal opinion, I found the Coursera options to be lacking, both in practical training and in content, so I would not recommend taking them unless you have a lot of experience in GCP and only need a refresher. Also the course progression through their certificate tracks is confusing as you often just hope you’re taking the correct class for a given certificate (I took an entire course on Kubernetes before I realized I was taking a course in the Application Developer track and not for Cloud Architect.)

The Coursera practice exam questions were almost identical to the sample practice exams provided by GCP – therefore there wasn’t a lot of benefit taking the Coursera practice exams if you already had taken the free GCP practice exam.

(Note: This article is in no way sponsored by Linux Academy, nor at the time of writing does TheoryLane have any form of business relationship with Linux Academy. These opinions are from my experience alone and may not reflect the views of others at TheoryLane.)

Stay in Touch!

I hope this information was helpful, or at least guided you to information that was helpful.

If you have any questions or would just like to connect, feel free to reach out to me on LinkedIn or use the contact form below.

New DataFlow Job Metrics vs. StackDriver

Promoted as new capabilities in “DataFlow observability”, GCP is finally giving us the ability to see time-series graphs of CPU utilization and throughput for a given DataFlow job within the DataFlow console.

Previously, we used StackDriver (which is getting rebranded, by the way) to view the VM CPU utilization from our DataFlow jobs. The new DataFlow capabilities do not replace aggregate monitoring and alerting in StackDriver; StackDriver and observability serve different use cases. Observability functions more for DataFlow job debugging and optimization, while StackDriver is for holistic tracking and monitoring. In other words, the DataFlow UI is for job-specific DataFlow Ops, and StackDriver is for administration.

Job details 

Those of us who have used StackDriver appreciate more visibility into CPU and throughput, as all we had in the console was the resource metrics on the right of the job topology.

I did a quick run of the standard wordcount example to generate some data. The new graphs are simple and to the point. I like them.

[Screenshot: Throughput (elements/sec) graph for the wordcount job, with a line per step such as group/Reify, group/Write, split, and read/Read.]

Now we see specifically which ops are taking the most IO and CPU for a given job, without the overhead of creating a new StackDriver dashboard or filtering to a specific job. In fact, there’s no real way to get this level of visual detail out of the box in StackDriver. (At least not that I’m aware of. Let me know if there is a simple configuration setting I’ve been overlooking!) In StackDriver the minimum alignment period is 1 minute, so the best we can do is see operation counts or vCPUs per minute. In our new DataFlow UI we can see throughput and vCPU per second.

For a StackDriver workflow, per-second detail is way too granular; however, when testing DataFlow jobs prior to a large-scale deployment, lower-level detail is important for introspection before rolling out inefficient, and expensive, DataFlow jobs.
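
To see why per-second detail matters when testing a job, here is a small sketch with made-up throughput numbers. The `align` helper is hypothetical and simply averages samples into fixed periods, which is one common way an alignment period can summarize data; it is not StackDriver’s actual implementation.

```python
def align(per_second, period_seconds):
    """Average per-second samples into fixed-size alignment periods."""
    aligned = []
    for start in range(0, len(per_second), period_seconds):
        chunk = per_second[start:start + period_seconds]
        aligned.append(sum(chunk) / len(chunk))
    return aligned


# Made-up throughput: steady 100 elements/sec, a 5-second stall, then a catch-up burst.
per_second = [100.0] * 120
for second in range(30, 35):
    per_second[second] = 0.0
for second in range(35, 40):
    per_second[second] = 200.0

per_minute = align(per_second, 60)
```

At one-minute alignment both minutes average out to 100 elements/sec and the stall disappears, while the per-second series shows throughput dropping to zero and then doubling.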