Interpreting Patterns in Multi-Variate Multi-Horizon Time-Series Forecasts from Google’s Temporal Fusion Transformer Model

Wow, that title is a mouthful… But it’s not as complicated as it seems. Let’s break it down:

  • Multi-Variate Time-Series Forecasts – Single-variate time-series forecasting uses only the historical values of the series whose future values we are attempting to predict. (For example, exponential smoothing, moving average, auto-regressive moving average.) Multi-Variate forecasting allows additional time-series and non-time-series variables to be included in the model to enhance its predictive capability and give a better understanding of what influences our target predicted value(s). (For example, including the weighted seven-day moving average sentiment of news articles about a company when forecasting its stock price for tomorrow.)
  • Multi-Horizon Time-Series Forecasts – Traditional time-series forecasting is typically optimized for a specified number of periods ahead (for example, a produce department predicting next week’s potato sales to determine inventory). Multi-Horizon means we attempt to predict many different future periods within the same model. (For example, predicting daily potato sales for every day over the next four weeks to reduce the number of orders and scheduled restocking times.) A toy example of both ideas follows this list.
  • Interpreting Patterns – A good model doesn’t only provide an accurate prediction, it also gives insights as to what inputs are driving the results, that is, the model is interpretable.
  • Temporal Fusion Transformer – The name of the proposed Multi-Horizon Time-Series Forecasting framework. It combines elements of Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNNs), and a mechanism popularized in image recognition called “Attention.” (We’ll talk more about attention later.)
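
To make that vocabulary concrete, here is a toy sketch (the column names and values are made up purely for illustration) of what a multi-variate, multi-horizon training frame could look like in pandas:

import pandas as pd
import numpy as np

# Hypothetical hourly data: the target plus two additional ("multi-variate") inputs.
rng = pd.date_range("2020-01-01", periods=24 * 28, freq="H")
df = pd.DataFrame({
    "timestamp": rng,
    "target": np.random.rand(len(rng)),          # the value we want to forecast
    "hour_of_day": rng.hour,                     # known-in-advance covariate
    "news_sentiment": np.random.randn(len(rng)), # observed covariate
})

# "Multi-horizon": each row's labels are the next 24 future target values.
for h in range(1, 25):
    df[f"target_t+{h}"] = df["target"].shift(-h)

df = df.dropna()  # drop the trailing rows that have no complete 24-hour label window
print(df.head())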

Despite my admitted love for cool new modeling techniques, the most important word in the title is “Interpreting”. Data Science and Machine Learning are beginning to direct Deep Learning methods towards traditional business problems, for example, forecasting – specifically, better understanding what other factors (i.e. “Multi-Variate”) impact longer term (i.e. ‘Multi-Horizon’) forecasts.

Unfortunately, Deep Learning models have a tendency to be black boxes: we can’t see how they arrive at a decision, we can only evaluate the accuracy. For people to learn something from the model, we have to be able to interpret what influences the model’s results. Enter the Google journal article and AIHub notebook on interpreting patterns in multi-horizon time-series predictions.

https://aihub.cloud.google.com/u/0/p/products%2F9f39ad8d-ad81-4fd9-8238-5186d36db2ec

In my world of marketing and finance, we understand there is no such thing as a perfectly predictive model. Because both marketing and financial models deal primarily in human behavior, a perfect model would effectively disprove the existence of free will – we would somehow have an equation that predicted everything people would do before they did it. So instead of diving into an existential crisis, let’s assume people have agency, and that said agency will introduce a level of unpredictability into our models.

The random elements of human behavior may prevent perfect predictions; however, human behavior still has patterns. As Mark Twain (might have) said, “History doesn’t repeat itself, but it often rhymes.” We can use the echoes of the past to find patterns in behavioral noise.

Even better, the Temporal Fusion Transformer (TFT) framework is a Multi-Variate Time-Series Forecast – so we can add other data points to build a more accurate model.

Even better… er… even better, TFT allows us to extract information about the model as well as the predictions. That information about the model, when interpreted correctly, shows what other factors were influencing our predicted values, including recurring “seasonal” patterns like the hour of the day, day of the week, and/or month of the year.

Output Interpretation

Here are some more in-depth links diving into the very complex nature of what’s happening inside the Temporal Fusion Transformer model:

https://www.researchgate.net/publication/329798567_A_Comparison_of_LSTMs_and_Attention_Mechanisms_for_Forecasting_Financial_Time_Series

https://towardsdatascience.com/attention-for-time-series-classification-and-forecasting-261723e0006d

For more mathematical and data science context, check out the pre-prints below

https://arxiv.org/pdf/1809.04206.pdf

https://arxiv.org/pdf/1912.09363.pdf

But the links are all supplementary information; what is really important are the use cases at the bottom of the notebook.

Use Case One, Variable Importance:

The first use case is variable importance. The table above tells us that, for any predicted value, on average the hour of the day in which that value was recorded was the most important input in generating the prediction, while the second most important variable was the previous hour’s value (“Target”).

Be aware that the variable importance value for a given input (i.e. the “absolute value”) is not that meaningful when looking at a single variable in isolation. For example, knowing the mean importance of Day of Week is 17,884.56 doesn’t tell me much. The difference in variable importance between inputs (i.e. the “relative value”) is meaningful. Relative variable importance tells us which variables, and their associated patterns, are more strongly associated with an accurate model.

As an example of associated patterns: in the sample data, hour of day has high importance, suggesting a 24-hour cycle is present. Alternatively, Time Index has low relative importance, suggesting a pattern of linear growth or decline is not prominent. (Time is always consistently increasing, so when time is an important variable, we expect to see a general upward or downward trend in the target variable.)
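
As a rough sketch of the absolute-vs-relative distinction (assuming the notebook’s importance scores have been pulled into a pandas Series; the variable names and most of the numbers below are hypothetical), converting raw importances into relative shares might look like this:

import pandas as pd

# Hypothetical raw (absolute) importance scores per input variable.
raw_importance = pd.Series({
    "hour_of_day": 95000.0,
    "target_lag": 61000.0,
    "day_of_week": 17884.56,
    "time_index": 4200.0,
})

# Relative importance: each variable's share of the total,
# which is the number that is actually comparable across inputs.
relative_importance = (raw_importance / raw_importance.sum()).sort_values(ascending=False)
print((relative_importance * 100).round(1))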

Use Case Two, Visualizing Attention Patterns:

First we need to understand what “Attention” is. The word “Attention” is used because attention modeling, popularized in image recognition, simulates a human’s process of identifying an image. For example, when attempting to identify someone, we would probably look at their hair, eyes, nose, and mouth, while generally ignoring their ears and hands.

Attention modeling does the same thing: it prioritizes the patterns which provide the most predictive power. Patterns over time are no different from patterns in an image, so recent research attention (pun intended) has focused on attention-based models for time-series forecasting.

So while an image attention function may look like this:

A time series attention function may look more like this:

Where ‘Mean Attention Weight’ is the average importance of the inputs (including the historic behavior of the target value) at each point in time.

Again, the articles I linked above explain all of these concepts in more depth, but all you really need to know is that the spikes are where the important stuff is happening.

In the case of this model, we see a spike in attention every 24 hours, with little attention paid to the other hours. There is also a small weekly pattern present, which will be shown in the next section.
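
For intuition, a plot with that shape could be produced along these lines (the attention values here are synthetic stand-ins, not the notebook’s actual TFT output):

import numpy as np
import plotly.graph_objects as go
from plotly.offline import iplot

# Synthetic mean attention weights over a one-week (168-hour) lookback,
# spiking every 24 hours the way the model output described above does.
position = np.arange(-168, 0)
mean_attention = 0.01 + 0.2 * (position % 24 == 0)

fig = go.Figure()
fig.add_trace(go.Scatter(x=position, y=mean_attention, mode='lines', name='Mean Attention Weight'))
fig.update_layout(xaxis_title='Hours before prediction', yaxis_title='Mean Attention Weight')
iplot(fig)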

Bringing it all together:

Use case one demonstrated relative variable importance and use case two showed us the time series attention patterns.

In a multi-horizon model we forecast multiple time periods – the example model predicts 24 hourly increments forward. In use case one, relative variable importance suggested the hour of the day has a strong relationship with the predicted value, while the day of the week was less important:

Including multiple horizons in the attention graph from use case two, we can see the pattern of each horizon. (Horizon 1 is the one-hour future-state prediction, horizon 20 is the twenty-hour future state, and so on…) While there is a difference in attention from one day of the week to the next, for a given hour the attention does not change nearly as much from one day to the next as it does from one hour to the next. (Each horizon shows a sharp peak at one hour, followed by a sharp reduction in attention over the next 23 hours, and a jump back up after 24 hours, that is, one day.)

Finally, I added one more visual to help show the model accuracy. The visual overlays the 50th-percentile predicted values for each hour on the original values. It’s nothing fancy, but you usually want to make sure your model is able to give a decent prediction.

import plotly.graph_objects as go
from plotly.offline import iplot

# Pull the numerical values out of the 50th-percentile forecast frame.
pred_50 = extract_numerical_data(p50_forecast)

fig = go.Figure()

# Actual values for a 1,000-hour slice of the data.
fig.add_trace(go.Scatter(x=act.index.values[119000:120000], y=act['t+0'][119000:120000],
                         mode='lines',
                         name='Actual',
                         connectgaps=True))

# Median (P50) predictions for the same slice, overlaid as a dotted line.
fig.add_trace(go.Scatter(x=pred_50.index.values[119000:120000], y=pred_50['t+0'][119000:120000],
                         mode='lines',
                         name='Pred 50',
                         line=dict(color='firebrick', width=4, dash='dot')))

iplot(fig)

The predicted line aligns very closely with the actual line, except during outlying spikes, so our model is relatively accurate other than during high-indexing activity.

Adding more code to the notebook also helped me better understand the data structure and modeling process.

I encourage you to open the AIHub link and try running the notebook yourself on GCP. If you need any help running the notebook, or you would like to see what more you can do with Deep Learning tools on Google Cloud Platform, feel free to reach out – I’m always happy to help! Just visit our contact us page and let me know.

How to configure a local Pystan environment

Despite all the hype around Deep Learning models and AI-as-a-Service APIs, there’s still a need for Data Scientists to explain – in simple terms – what factors influence a given prediction. And even more importantly, sometimes we want to construct a model that represents a real-world process, rather than have input values fed into a programmatically optimized series of neural networks that produces a predicted value.

I am not attempting to argue Deep Learning is not effective. Deep Learning is the best tool we have right now for predicting outcomes. However, prediction alone does not necessarily lead to a better human understanding of what influences said prediction. In many cases a human can’t understand why a Deep Learning model calculates its predictions because by the time we understand the current model, the Deep Learning routine has already updated based on new information. Case in point, Reinforcement Learning routines are always adapting to the decisions of Actors in near real time. We typically do not judge the efficiency of a Reinforcement Learning model based on the decisions it makes, but on the associated effects of the human actors interacting with the model – are players in the battle arena giving up when playing against bots? Are chatbot responses receiving poor reviews? 

Bayesian Analytics, or Bayesian Analysis, uses a Bayesian approach to create human-understandable insights from models. “But Bayesian statistics is so hard! It has lots of weird symbols and probabilities and stuff!” Yes, learning Bayesian statistics can be a challenge, but the notation is no more complex than any other type of statistics; most of us were just taught frequentist stats first, so frequentist methods seem more intuitive.

Regardless, an easy way to learn Bayesian Analysis is using this book: https://www.amazon.com/Doing-Bayesian-Data-Analysis-Tutorial/dp/0124058884/ followed by this book: https://www.amazon.com/Bayesian-Analysis-Chapman-Statistical-Science-dp-1439840954/dp/1439840954/

Between Kruschke’s and Gelman’s books, you can get a strong foundation in using Bayesian statistics for analysis. (See Andrew Gelman’s recent review of the Santa Clara SARS-CoV-2 antibody prevalence study for why a strong foundation in Bayesian statistics is important for analysis: https://statmodeling.stat.columbia.edu/2020/04/19/fatal-flaws-in-stanford-study-of-coronavirus-prevalence/)

Unfortunately, both Kruschke and Gelman use R rather than Python in their examples. Fortunately, the MCMC sampling applications BUGS, JAGS, and Stan are not actually R or Python programs; R and Python merely call their APIs. So setting up Python to use Stan, for example, is no harder than using R.

My process for setting up a local virtual environment is below. Please note, if you just want to get started, Google Colab is much easier. As an example, see the notebook provided by Ethan Steinburg in the comments of Gelman’s article: https://colab.research.google.com/drive/110EIVw8dZ7XHpVK8pcvLDHg0CN7yrS_t

For a local environment, it’s a little more complex, but not too bad.

Pystan’s repo documentation isn’t bad either: https://github.com/stan-dev/pystan; in this article I’m providing a supplement with my typical workflow.

This configuration assumes you have Anaconda installed and are able to set up a virtual environment on your machine.

Once you have Anaconda installed and accessible via command line, simply run the following commands for the first time you use the environment:

conda create -n stan_env python==3.7 numpy scipy matplotlib libpython m2w64-toolchain  -c conda-forge -c msys2

conda activate stan_env

python -m pip install pystan arviz scikit-learn statsmodels plotly seaborn nbformat

The first line creates a python3 environment with the necessary packages required for pystan installation.

The next activates the environment.

The third installs pystan and packages I often use for analysis.

Additionally, you probably want to use Jupyter Lab for development, so here are some additional configurations, again only necessary the first time you activate the environment.

pip install --user ipykernel
python -m ipykernel install --user --name=stan_env
conda install ipywidgets
conda install -c conda-forge nodejs
jupyter labextension install jupyterlab-plotly

These commands install the necessary widgets for visualization, nodejs for rendering the widgets, and the plotly extension for interactive visuals.

Now you should be ready to launch Jupyter in your new pystan environment!

Make sure you have the stan_env virtual environment active by typing…

conda activate stan_env

… in your terminal / command line / powershell

Then type “jupyter lab” (after activating the virtual environment).

Once Jupyter Lab loads, attempt to execute “import pystan”. If there are no errors, congrats! You now have a functional PyStan Jupyter notebook!
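
If you want a slightly more satisfying smoke test than a bare import, a tiny model along these lines should compile and sample (this assumes PyStan 2.x, which is what pip install pystan provided at the time of writing):

import pystan

# A trivial model: estimate the mean and spread of a handful of observations.
model_code = """
data { int<lower=1> N; vector[N] y; }
parameters { real mu; real<lower=0> sigma; }
model { y ~ normal(mu, sigma); }
"""

sm = pystan.StanModel(model_code=model_code)  # compiles the C++ model; takes a minute
fit = sm.sampling(data={"N": 5, "y": [1.2, 0.8, 1.1, 0.9, 1.0]}, iter=1000, chains=2)
print(fit)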

Next time you need to use the notebook, you only need to type

conda activate stan_env

And you are ready to launch your Jupyter Lab or Jupyter Notebook.

Notes from passing both GCP Cloud Architect and Data Engineer Professional Certifications in 30 days

Within 30 days I passed both the Google Cloud Platform Professional Data Engineer and Architect Certification exams.

However, it took me much longer than 30 days of study and experience to pass the exams.

Fortunately, there was a lot of overlap between the two exams, so if anyone else wants to put their personal life on hold for a few months and attempt something as crazy as passing two of the hardest cloud certifications in a short period of time, here are some tips to help you out.

First, the professional certifications are just as much about technical knowledge as they are about critical thinking – meaning you will not know the ‘correct’ answer for many questions, but you might know the wrong answers. The test requires process of elimination. When you face a question that does not have an obvious answer, make sure to read the other answer choices to see if there are any obvious candidates for elimination.

For example, there was a question about architecting a VM-hosted web application and how best to accommodate a business requirement for HTTP failover. You had to decide whether to point the load balancer to individual VM instances’ IP addresses or to a VM instance GROUP’s IP address: https://cloud.google.com/solutions/best-practices-floating-ip-addresses#option_3_failover_using_different_priority_routes and https://cloud.google.com/compute/docs/tutorials/high-availability-load-balancing

If you’ve only used deployment templates, or worked more with managed services than with compute – or focused more on development or architecture than on networking – this is not a situation you’ll come across very often. In the relatively rare case where someone has configured an HTTP load balancer and an instance group *within the GCP console*, they would know you can only point a load balancer to an instance group, not to an instance itself; but for the rest of us, there is still a way to figure out the answer.

We should know that HTTP failover means a load balancer, so any answer not mentioning a layer 7 load balancer should be excluded. That leaves the options of pointing the load balancer either to the VM instances or to the instance group. I should mention, we are technically pointing the load balancer to the instances in both options, but this is about configuration, not physical architecture.

(Note: layer 7 load balancing is, somewhat oversimplified, HTTP traffic allocation with some logic, whereas layer 4 is TCP/UDP traffic allocation with little logic: https://www.nginx.com/resources/glossary/layer-7-load-balancing/)

Let’s assume we don’t know the right answer, but we do know managed instance groups allow autoscaling of VMs based on usage, and we know enough about IP addresses and load balancers to know that if a new VM instance is created, the load balancer needs to know the new instance’s IP address, otherwise it won’t know where to forward traffic. So, knowing managed instance groups are often used for scalable web applications, it only makes sense to point the load balancer to the managed instance *group* rather than to each individual instance.

Speaking of networking, you’ll need to study and know a lot of networking. Some examples of terms and concepts to be familiar with (non-exhaustive):

Related to networking, you’ll need to know how data is shared between GCP organizations, on premise data, and other cloud providers. There are a lot of options for this, and all are situational, so understand the differences between:

On the subject of data, the Data Engineer certification had much more architecture than I expected; you’ll need to understand both application data architecture and analytics data architecture. There’s so much information, but at a high level, you’ll need to know when to use:

As well as understand the different business cases for when to use the different ML and AI Platform services: https://cloud.google.com/ai-platform. For example, when is it better to use one of GCP’s pre-trained ML APIs (e.g. the Vision API) vs. training your own model in AutoML vs. deploying your own custom-built models using a tool like AI Platform Prediction (https://cloud.google.com/ai-platform/prediction/docs/overview)?

Learning Resources

It’s difficult to describe my full experience without turning this article into even more of a study guide, but allow me to give some helpful resources.

My starting point was Earl Gay’s excellent study guide on Medium: https://medium.com/@earlg3/google-cloud-architect-exam-study-materials-updates-for-2019-re-certification-c4894d3a82e7  It has a lot of helpful links which I will not reproduce in this article, so check out Earl’s guide for more info.

https://grumpygrace.dev/posts/gcp-flowcharts/: If you are able to explain why every decision was made in every single flowchart on this site, then you should be able to pass both GCP Architect and Data Engineer Professional Certifications.

In order to gain that knowledge, the most complete online courses I found were at Linux Academy. https://linuxacademy.com/course/google-cloud-certified-professional-cloud-architect/ for Cloud Architect; and https://linuxacademy.com/course/google-cloud-data-engineer/ for Data Engineer.

The Linux Academy courses also contain practice tests with different questions than the sample test provided by GCP.

Most people consider Coursera first when they want to study online. In my personal opinion, I found the Coursera options to be lacking, both in practical training and in content, so I would not recommend them unless you have a lot of experience in GCP and only need a refresher. Also, the course progression through their certificate tracks is confusing – you often just hope you’re taking the correct class for a given certificate. (I took an entire course on Kubernetes before I realized I was taking a course in the Application Developer track and not for Cloud Architect.)

The Coursera practice exam questions were almost identical to the sample practice exams provided by GCP – therefore there wasn’t a lot of benefit taking the Coursera practice exams if you already had taken the free GCP practice exam.

(Note: This article is in no way sponsored by Linux Academy, nor at the time of writing does TheoryLane have any form of business relationship with Linux Academy, these opinions are from my experience alone and may not reflect the views of others at TheoryLane.)

Stay in Touch!

I hope this information was helpful, or at least guided you to information that was helpful.

If you have any questions or would just like to connect, feel free to reach out to me on linkedin: https://www.linkedin.com/in/daniel-smith-data-scientist/  or use the contact form below.

New DataFlow Job Metrics vs. StackDriver

Promoted as new capabilities in “DataFlow observability”, GCP is finally giving us the ability to see a time-series graph of CPU utilization and throughput for a given DataFlow job within the DataFlow console.

https://cloud.google.com/blog/products/data-analytics/better-data-pipeline-observability-for-batch-and-stream-processing

Previously, we used StackDriver (which is getting rebranded, by the way) to view the VM CPU utilization from our DataFlow jobs. The new DataFlow capabilities do not replace aggregate monitoring and alerting in StackDriver; rather, StackDriver and the new observability features serve different use cases – the observability functions are more for DataFlow job debugging and optimization, while StackDriver is for holistic tracking and monitoring. I.e., the DataFlow UI is for job-specific DataFlow Ops, StackDriver is for administration.

[Screenshot: the DataFlow Job details page, with the new JOB METRICS tab alongside JOB GRAPH and a link back to the old job page.]

Those of us who have used StackDriver will appreciate more visibility into CPU and throughput, as all we had in the console before was the resource metrics to the right of the job topology.

I did a quick run of the standard wordcount example to generate some data. The new graphs are simple and to the point. I like them.

[Screenshot: the Throughput (elements/sec) chart for the wordcount job, showing per-step series such as group/Read, group/Reify, group/Write, split, and read/Read, plus a “Create alerting policy” option.]

Now we can see specifically which ops are taking the most IO and CPU for a given job – without the overhead of creating a new StackDriver dashboard or filtering to a specific job. In fact, there’s no real way to get this level of visual detail out of the box in StackDriver. (At least not that I’m aware of – let me know if there is a simple configuration setting I’ve been overlooking!) In StackDriver the minimum alignment period is 1 minute, so the best we can do is see operation counts or vCPUs per minute. In the new DataFlow UI we can see throughput and vCPU per second.

For a StackDriver workflow, per-second detail is way too granular; however, when testing DataFlow jobs prior to a large-scale deployment, lower-level detail is important for introspection before rolling out inefficient – and expensive – DataFlow jobs.

Applying Continuous Delivery Patterns to Data Development

Historically, application development and data pipeline development have been kept separate. We are seeing this pattern begin to change. (See posts on Conway’s Law for reasons why.)

This means ETL/ELT development will almost certainly begin to mirror application development. App dev contains far more controls and business continuity, but more importantly, application development has spent the past two decades refining coding patterns consistent with reusable and extensible code objects.

App DevOps has so refined its ability to quickly modify and deploy changes to app code that the current state of the industry is talking about not only pushing code for testing thousands of times a day, but also about automating pushing code changes to production! Continuously!

That sounds like an impossibility to many ETL/ELT devs, but the truth is – there is nothing stopping continuous deployment patterns in data dev. In fact, there is a movement toward Behavior Driven Development (BDD) as an extension of Test Driven Development (TDD).

BDD and TDD are development patterns which integrate acceptance criteria into the code itself, meaning the first round of quality assurance must happen before any human lays eyes on the data output. App DevOps has found this can help find root causes of code issues, as teams can focus on specific problems (e.g. “Are the acceptance criteria correct? If so, is Dave’s code correctly testing for them?”) rather than general problems (e.g. “Dave is an idiot”).
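
As a toy illustration of acceptance criteria living in the code (pytest here, with a made-up transformation and made-up criteria):

import pandas as pd

def add_revenue(df):
    """The transformation under test: revenue = units * unit_price."""
    out = df.copy()
    out["revenue"] = out["units"] * out["unit_price"]
    return out

def test_revenue_meets_acceptance_criteria():
    # Acceptance criteria written down before anyone eyeballs the output:
    # 1) revenue is never negative, 2) no rows are dropped or duplicated.
    raw = pd.DataFrame({"units": [1, 2, 3], "unit_price": [9.99, 5.00, 0.0]})
    result = add_revenue(raw)
    assert (result["revenue"] >= 0).all()
    assert len(result) == len(raw)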

Databricks has a great article on how to architect dev/prod CI/CD. It shows details such as how each developer has their own ‘development’ environment, but with managed configurations and plugins to ensure all development occurs in the same configuration.

I would personally love to see data science engineering development patterns move into Test/Behavior Driven Development – mostly because it makes things a lot easier on data end users, but also because it forces strict requirement definitions: https://databricks.com/blog/2017/06/02/integrating-apache-spark-cucumber…

It’s a little late to pick it up on HumbleBundle, but Continuous Delivery with Docker and Jenkins by Rafał Leszko is a great read on the topic of Continuous Delivery, even if you are unfamiliar with the technologies.

At TheoryLane our architects help dis-entangle your existing data processes to help operationalize machine learning and data science solutions. Our data development patterns construct reusable, governed information objects.  Combined, innovative data architecture and development patterns provide reusable streaming context to break the barriers to continuous data deployment and create true value added applications!

Contact us for more information.

Exploring Development Patterns in Data Science


Data Scientists are in an interesting position. Data Science is about optimization; true optimization requires automation; so, QED, ‘true’ Data Science means eliminating Data Scientists.

And Data Scientists are happy to optimize themselves out of a job. Unfortunately, the development patterns and architectures sometimes used by data scientists can, ironically, be sub-optimal for optimization. (These are sometimes called ‘anti-patterns’, because the mechanism for developing the solution limits the success of said solution.)

The issue may stem from Data Science as a discipline not having been around for very long. Fortunately we are able to borrow from application and data development operations (i.e. “DevOps”) for some guidance.

First, let’s look at a frequent Data Science development workflow/anti-pattern; then explore a traditional and emerging app dev pattern to solve the Data Science anti-pattern.

The Big Ass Script “Architecture”

Data Science has a problem: we typically put all our data connection, transformation, and visualization code in one big-ass script (“BAS”). The BAS development pattern is great for one superstar data scientist to mess around with a bunch of data and make an awesome model. But there are a few limitations:

  • It’s hard to reuse in future analysis
  • No one else understands how it works
  • It’s difficult to debug

We have workarounds for the limitations, but copy pasting code, knowledge bases, and blind faith in results only get us so far.

Creating a BAS is only made easier by “Notebooks” like Jupyter and Zeppelin, which are practically IDEs for generating BAS solutions.

What about integrating a BAS into a production data pipeline? Simple answer: you can’t. It has to be refactored. Which is why we have a standard data science development pattern of:

  1. Data Science Data Mess-around
  2. Refactor data transformations into a more useful framework (see the sketch after this list)
  3. Build out a thick ETL dumping the scored data into a database
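
As a minimal sketch of step 2 (the function name and cleaning rules are invented for illustration), the logic that started life as loose notebook cells becomes a small, importable function that the notebook, the tests, and the production pipeline can all share:

import pandas as pd

def clean_sales(df):
    """Reusable transformation lifted out of the notebook: parse dates,
    drop obviously bad rows, and add the feature the model expects."""
    out = df.copy()
    out["order_date"] = pd.to_datetime(out["order_date"])
    out = out[out["quantity"] > 0]
    out["day_of_week"] = out["order_date"].dt.dayofweek
    return out

# The notebook, the tests, and the ETL job all import and call clean_sales()
# instead of each keeping its own copy-pasted version of the logic.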

The Decoupled Architecture

“But we solved this problem over 30 years ago” says literally every enterprise data architect. Yes they did, so let’s look at the basics.

Multi-Tiered Architecture

https://en.wikipedia.org/wiki/Multitier_architecture

Who would have thought separating the data, data analysis (filtering, scoring, etc.), and reporting (visuals, dashboards, etc.) was important? 

The second someone downloads a csv to their laptop or starts messing around in Excel, any semblance of data governance is thrown out the window. To make matters worse, updating the model means we have to repeat whatever crazy process was used to get that csv in the first place.

A tightly coupled analysis and reporting stack is just as bad. Now data scientists are forced to rerun analysis for every enhancement or bug fix requested. It can lead to analysts becoming front end developers.

So the multi-tiered architecture is not only good for managing hardware resources, it is good for managing human resources as well.

Is this the same as Model View Control?

Great question, and the answer is debatable, but personally I look at the tiered architecture as a physical separation and MVC as a logical separation. MVC is a code development pattern which isolates the data (“Model”) from any persistent modifications (“Controller”) from an application’s interface (“View”).

https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller

Meaning, in a modern PaaS environment, effective MVC design is equivalent to multi-tier design, as the data storage, processing, and visualization functions are handled by different services.

However, MVC doesn’t solve everything: even after decoupling operations, we may not have decoupled how the operations themselves interact. In other words, we may develop solutions with extremely efficient code execution, but the code is buried in a larger method; or maybe we want to modify a model but it’s used across multiple solutions.

The Microservice Architecture

Microservices are emerging as the data science architecture pattern of choice. To oversimplify, the pattern focuses on creating standard interfaces for all data extraction, manipulation, and visualization operations – without the need for complicated middle-layer service bus applications to handle communication.

A standard interface means we have a consistent format for data going in and data coming out of the “service.”

For a good example of a service, think of the Facebook or GitHub Graph APIs. You typically make several calls to various service endpoints as you get new information on each hop (e.g. userId -> postId -> text). We know the format of data to provide to the endpoint and what data to expect in return, so we can reuse the endpoints in multiple applications. Furthermore, modifications to the underlying code probably happen all the time without anyone ever noticing.

How to Implement Data Science Microservice Architecture

Well, microservices sound all well and good, but how on earth are analysts and most data scientists going to actually develop like this? Am I seriously proposing they learn how to develop and deploy reusable APIs like back-end developers or engineers??

Well, kind of, yeah…

But it’s not as bad as it sounds. While purely code-based solutions exist – for example Flask for Python or plumber in R – they require a lot of administration and development to run in a fully integrated microservices architecture. There is no pre-configured security, no management layer, no high availability, and, not to mention, it’s an entirely new coding paradigm.
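
For reference, a bare-bones Flask scoring endpoint might look something like the sketch below – the route, payload shape, and “model” are all hypothetical, and none of the security, scaling, or management concerns mentioned above are handled:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/score", methods=["POST"])
def score():
    # Standard interface: JSON in, JSON out.
    payload = request.get_json()
    features = payload["features"]              # hypothetical input shape
    prediction = sum(features) / len(features)  # stand-in for a real model call
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)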

So without a lot of IT infrastructure and support around them, having data scientists stand up APIs they coded themselves is probably not going to happen.

Alternatively, PaaS providers like Azure and AWS have other ways for data scientists to begin deploying services.

Azure Machine Learning is a clean-looking GUI for data manipulation and preconfigured statistical operations, as well as a way for raw R and Python code to be placed into modules for reuse within a group.

More importantly, any workflow created in it can be deployed as a RESTful endpoint, for either triggering a model-scoring workflow or making a request for some data produced by the workflow in real time.

Azure ML is probably the fastest and most cost effective way to start producing reusable data science microservices, particularly if your company is already using Azure Cloud Services.

AWS SageMaker is another relatively easy way to build and deploy models as services. However, SageMaker is targeted more toward data science developers, so most of its functionality is only accessible through code. Furthermore, the recommended data manipulation component is a separate service, “AWS Glue”.

As of right now, SageMaker is without a doubt more powerful – you can load custom containers into it, for example. Which is correct for you depends on your existing cloud environment and the coding abilities of your data science team.

These are only two examples of how we are moving into a new world of Microservices. It is likely Data Scientists will soon be part of development teams, and data science architecture will be needed to support this new enterprise data development pattern.

At TheoryLane our architects help dis-entangle your existing data processes to help operationalize machine learning and data science solutions. Our data development patterns construct reusable, governed information objects.  Combined, innovative data architecture and development patterns provide reusable streaming context to break the barriers of the BAS anti-pattern and create true value added applications!

Contact us for more information.

Three Tiered Architecture: Basic Full Stack Analytics Architecture using R

Descriptive, Diagnostic and Predictive Analytics – and their corresponding specializations reporting, business intelligence, and data science – do not operate independently in an organization; at least not if they are to operate effectively.

A relatively new term emerging into our business vocabulary is “Full-Stack Data Science”; describing how all layers of analytics and data must operate in concert to maximize organizational returns on analytical activities. (See our post on emerging data science architecture for more detail.)

Let us break down the phrase “Full-Stack Data Science.” “Data Science” (as we’ve mentioned in previous articles) remains a nebulous term at best. As a quick refresher, Data Scientists tend to fall into two categories: one, PhDs in computer science who create programs leveraging statistics, and two, statisticians who understand how to incorporate programmatic solutions to accelerate data transformations and insights.

“Full-Stack” is a term appropriated from web development. In Web Dev a “Full Stack Developer” creates the user interface, data interface, and the data repository, as well as implements the technologies for users to access their applications via the web.

Full Stack Data Science is very similar to Full Stack Web Development. The Data Science stack requires some way to interface with data, a means to transform the data (i.e. mathematical operations), data storage, and an architecture to support the entire process.

Below we will provide some high level examples of Web Dev and Data Science Full Stack use cases as well as some technologies often used. This is not an exhaustive list by any means, only designed to introduce the terminology and provide a frame of reference for those familiar with the more common web based tools and technology.

Full Stack Terminology

User Interface

  • Definition: How the end user accesses and interacts with the application
  • Examples:
    • Web Dev: User goes to department store website and is able to quickly find desired product
    • Data Science: Analysts wishes to see results of a model or report; user wishes to change parameters of existing model
  • Web Development Technologies:
    • HTML; CSS; JavaScript
  • Data Science Technologies:
    • Shiny; SSRS; Front End of BI Tools (e.g. Tableau; Spotfire)

Data Interface

  • Definition: How the data entered into the user interface is moved to the data storage layer, and how data in the data layer is retrieved given user interface requests and the information required for the website
    • Data may be passed directly from interface to repository or it may be transformed in some capacity (e.g. aggregations, type changes, mathematical operations, modeling)
  • Examples:
    • Web Dev: User purchases good from website, credit card number is transferred to banking system and purchase is saved in payment processing system. Shopping history may be retrieved by user at a later date.
    • Data Science: Support Vector Machine or unsupervised learning processes generate product suggestions for potential online customers; high-value customers identified via a squared-error cost function neural network model
  • Web Development Technologies:
    • PHP; Angular.js; node.js
  • Data Science Technologies:
    • R; Spark; Python; SSIS; T-SQL; Data Modeling aspect of BI Tools(e.g. Tableau VQL; Spotfire TERR and Information Designer)

Data Repository (Web Dev and Data Science)

  • Long term storage of data required for websites, predictive modeling, reporting
  • Example: Long term storage of customer information; ERP; CMS
  • Technologies:
    • SQL; MongoDB; Cassandra; HBASE; Delimited Text; XML; Unstructured

User Interface

The first thing most people think about when “Analytics” is mentioned is analytics visuals. Visualization is critical at all levels of analytical maturity – and one of the two core functions of a good UI is its ability to communicate information succinctly via visuals, that is, its ability to send information to the user.

User interactability is the second core function – i.e. the ability of the user to send information into the stack via the UI. Early in analytics maturity, interactivity is more important to developers than it is to end users; however, just as websites moved from simple communicators of information in Web 1.0 to dynamic communities full of user-generated content in Web 2.0, so too do organizations begin moving analytics into the hands of their business users as analytics needs grow.

Analytics maturity typically starts at the user interface level. Creating interactive visualizations and allowing users to create their own cross-tabs and visualizations is commonly known as “self-service analytics” and acts as a bridge between descriptive and diagnostic analytics. User engagement with the data means that users are able to ask “why” certain results are present and begin to answer the question themselves.

Business Intelligence Platforms

R Integration

Modern Business Intelligence platforms provide all the user interface functionality (and some data interface functionality) required in the modern data science stack. For example, Spotfire and Tableau both have R integration out of the box. R integration means the BI platform has access to all the data interface functionality in R – including, but not limited to, data shaping, statistical modeling, read/write to SQL, etc.

Web Functionality

Business Intelligence platforms typically use internet-browser-based solutions for large-scale deployments (50+ users); as such, they allow developers either to embed a BI visualization into an existing website (See: http://www.tibco.com/blog/2015/01/17/responsive-design-with-bootstrap-spotfire/) or to do website development and formatting directly in the analysis.

The platforms which truly embrace web functionality have JavaScript and HTML capability. With JavaScript, the door is open to the giant library of visualizations created by the JavaScript community; some examples: processing.js, d3.js, raphael.js, ember.js.

Entire books have been written on each one of those libraries so I will leave it as an exercise for the reader to research more about javascript libraries.

Direct Connections to Databases

It may sound like there needs to be a separate application acting as the data interface layer; this is not necessarily true. All BI platforms allow for direct connection to the data layer. Their ability to both interface with and display data makes them a low barrier to entry for advancing analytic maturity.

However, BI platforms are built to read data, not write it. A separate data interface layer needs to be present for the read/write access to data that is required in a true data science full stack.

As an aside, if the organization’s platform of choice does not have some type of data interface integration (e.g. Python, MATLAB, SAS, Spark, R), the statistical data modeling and shaping done by said data interface platforms must happen asynchronously from the user interface – meaning a separate statistical process not controlled by the BI tool must read data and write results to the data layer, which will then be consumed by the BI tool.

Data Interface

R as a Governed Hub

R is awesome for full-stack data science, don’t get me wrong, but there are plenty of other languages and platforms that are better in different ways. For example, Python is better at data manipulation and has easier syntax. Spark and H2O are better for large-scale computation. But – and this is very important – every one of these tools can be called from within an R script.

Want to write your output to a SQL database? There’s sqldf. Have a cool Python script that you need to kick off and listen to the output of? There’s rPython. What about Spark? It has native integration with R in SparkR.

More importantly, we can keep all our code objects in a version-controlled library, allowing developers to update and extend code functionality independent of the BI platform, while UX analysts do not need to have multiple versions of the same code floating around – they don’t even need to learn any programming.

Passing Values to R

Once the data is in R, you have access to almost 50 years of mathematical and data operations libraries. You can execute models and send the results right to the Business Intelligence platform for evaluation and reports.

In fact, the model evaluation can become more robust. Remember, BI platforms now have R integration as well. In many cases users can input or select values in the BI user interface and have them included in an R function, which then gives the user access to all the other functionality available in R!

From hyperparameter tuning to adding notes to an insert SQL statement, manipulating R scripts via the user interface in a BI platform greatly expands the base functionality of BI tools and is a major part of why R is a preferred part of a Full-Stack Data Science Environment.

R can also access data online in websites and RESTful services via tools like RCurl and httr. Imagine a use case where we want to see the conversations, subjects, and sentiment of all users across a set of websites. We can create a full-stack solution where an end user types the name of a website into Spotfire*, R scrapes the data page by page, each page is processed into natural language blocks (verb, subject, direct object, etc.) and fed into a streaming Spark application, and the results are then written to a graph database capturing the relationships between phrases, pages, people, etc. All the while, the sentiment of, and relationships between, websites, pages, users, and products mentioned are kept up to date.

*Note: I am specifically stating Spotfire here because I know from personal experience Spotfire is capable of implementing this solution.