Exploring Development Patterns in Data Science

 

Data Scientists are in an interesting position. Data Science is about optimization; true optimization requires automation; so, QED, ‘true’ Data Science means eliminating Data Scientists.

And Data Scientists are happy to optimize themselves out of a job. Unfortunately, the development patterns and architectures data scientists sometimes use can, ironically, be sub-optimal for optimization. (These are sometimes called ‘anti-patterns’: the mechanism used to develop the solution limits the success of that solution.)

The issue may stem from Data Science as a discipline not having been around for very long. Fortunately, we can borrow from application and data development operations (i.e. “DevOps”) for some guidance.

First, let’s look at a frequent Data Science development workflow/anti-pattern; then we’ll explore a traditional pattern and an emerging app dev pattern that solve it.

The Big Ass Script “Architecture”

Data Science has a problem: we typically put all our data connection, transformation, and visualization code in one big-ass script (“BAS”). The BAS development pattern is great for letting one superstar data scientist mess around with a bunch of data and make an awesome model. But there are a few limitations:

  • It’s hard to reuse in future analyses
  • No one else understands how it works
  • It’s difficult to debug

We have workarounds for the limitations, but copy-pasting code, knowledge bases, and blind faith in results only get us so far.

Creating a BAS is only made easier by notebooks like Jupyter and Zeppelin, which are practically IDEs for generating BAS solutions.

What about integrating a BAS into a production data pipeline? Simple answer: you can’t. It has to be refactored, which is why we have a standard data science development pattern:

  1. Data Science Data Mess-around
  2. Refactor the data transformations into a more reusable framework (sketched below)
  3. Build out a thick ETL dumping the scored data into a database
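
To make step 2 concrete, here is a minimal, hypothetical sketch of what pulling a transformation out of a BAS and into a reusable function can look like in R (the column names and the transformation itself are invented for illustration):

```r
# BAS version: the transformation is inlined and welded to one analysis
# df$score <- (df$revenue - mean(df$revenue)) / sd(df$revenue)

# Refactored version: a named, testable function any analysis or ETL job can call
standardize_column <- function(df, col) {
  stopifnot(col %in% names(df))
  df$score <- (df[[col]] - mean(df[[col]], na.rm = TRUE)) / sd(df[[col]], na.rm = TRUE)
  df
}

# Reuse it anywhere
scored <- standardize_column(data.frame(revenue = c(100, 250, 175)), "revenue")
```

Once the transformation lives in a function (ideally in a version-controlled package), step 3’s ETL can call the exact same code the analyst used.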

The Decoupled Architecture

“But we solved this problem over 30 years ago,” says literally every enterprise data architect. Yes, they did, so let’s look at the basics.

Multi-Tiered Architecture

https://en.wikipedia.org/wiki/Multitier_architecture

Who would have thought separating the data, data analysis (filtering, scoring, etc.), and reporting (visuals, dashboards, etc.) was important? 

The second someone downloads a CSV to their laptop or starts messing around in Excel, any semblance of data governance is thrown out the window. To make matters worse, updating the model means repeating whatever crazy process was used to get that CSV in the first place.

A tightly coupled analysis and reporting stack is just as bad. Now data scientists are forced to rerun the analysis for every enhancement or bug fix requested, and it can lead to analysts becoming de facto front-end developers.

So the multi-tiered architecture is not only good for managing hardware resources, it is good for managing human resources as well.

Is this the same as Model-View-Controller?

Great question, and the answer is debatable, but personally I look at the tiered architecture as a physical separation and MVC as a logical separation. MVC is a code development pattern which isolates the data (“Model”) from the logic that handles input and modifications (“Controller”) and from the application’s interface (“View”).

https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller

Meaning, in a modern PaaS environment, effective MVC design is equivalent to multi-tier design, as the data storage, processing, and visualization functions are handled by different services.
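
As a loose illustration (and it is only loose, since Shiny does not claim to be strict MVC), here is a minimal R/Shiny sketch where data access, interface, and control logic are kept apart; the data source and names are invented:

```r
library(shiny)

# "Model": data access isolated behind one function (hypothetical data source)
get_sales <- function() {
  data.frame(region = c("East", "West", "East"), sales = c(120, 95, 80))
}

# "View": the interface definition
ui <- fluidPage(
  selectInput("region", "Region", choices = c("East", "West")),
  tableOutput("sales_table")
)

# "Controller": reacts to user input and decides what the view displays
server <- function(input, output) {
  output$sales_table <- renderTable({
    sales <- get_sales()
    sales[sales$region == input$region, , drop = FALSE]
  })
}

# shinyApp(ui, server)  # uncomment to run locally
```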

However, MVC doesn’t solve everything: even after decoupling operations, we may not have decoupled how the operations themselves interact. In other words, we may develop solutions with extremely efficient code execution, but the code is buried inside a larger method; or we may want to modify a model, but it’s used across multiple solutions.

The Microservice Architecture

The microservice architecture is emerging as the data science pattern of choice. To oversimplify, it focuses on creating standard interfaces for all data extraction, manipulation, and visualization operations – without the need for complicated middle-layer service bus applications to handle communication.

A standard interface means we have a consistent format for data going in and data coming out of the “service.”

For a good example of a service, think of the Facebook Graph API or the GitHub GraphQL API. You typically make several calls to various service endpoints as you get new information on each hop. (E.g. userId -> postId -> text.) We know the format of data to provide to the endpoint and what data to expect in return, so we can reuse the endpoints in multiple applications. Furthermore, modifications to the underlying code probably happen all the time without anyone ever noticing.
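
In R, those hops look roughly like the sketch below; the base URL and endpoint paths are purely hypothetical, but the point is that the request and response formats stay fixed from one application to the next:

```r
library(httr)
library(jsonlite)

base <- "https://api.example.com"   # hypothetical service, for illustration only

# Hop 1: userId -> posts
user_posts <- fromJSON(content(GET(paste0(base, "/users/42/posts")), as = "text"))

# Hop 2: postId -> text
post_text <- fromJSON(content(GET(paste0(base, "/posts/", user_posts$id[1], "/text")), as = "text"))

# Because the interface is standard, the same two hops can be reused by
# any application that knows a userId.
```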

How to Implement Data Science Microservice Architecture

Microservices sound all well and good, but how on earth are analysts and most data scientists actually going to develop like this? Am I seriously proposing they learn how to develop and deploy reusable APIs like back-end developers or engineers?

Well, kind of, yeah…

But it’s not as bad as it sounds. Purely coded solutions exist (for example, Flask for Python or plumber for R), but they require a lot of administration and development to run in a fully integrated microservices architecture. There is no preconfigured security, no management layer, no high availability, and, not to mention, it’s an entirely new coding paradigm.
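
For a sense of what the “purely coded” route looks like, here is a minimal plumber sketch; a toy lm model on mtcars stands in for a real one, and everything the paragraph above warns about (security, management, high availability) is still left to you:

```r
# plumber.R -- a hand-rolled scoring microservice (sketch only)
library(plumber)

model <- lm(mpg ~ wt + hp, data = mtcars)   # toy model standing in for yours

#* Score a single observation
#* @param wt vehicle weight
#* @param hp horsepower
#* @get /score
function(wt, hp) {
  newdata <- data.frame(wt = as.numeric(wt), hp = as.numeric(hp))
  list(predicted_mpg = unname(predict(model, newdata)))
}

# Run locally with:
#   plumber::plumb("plumber.R")$run(port = 8000)
# Then call: GET http://localhost:8000/score?wt=2.8&hp=110
```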

So, without a lot of IT infrastructure and support around them, data scientists standing up APIs they coded themselves is probably not going to happen.

Alternatively, PaaS providers like Azure and AWS have other ways for data scientists to begin deploying services.

Azure Machine Learning provides a clean-looking GUI for data manipulation and preconfigured statistical operations, as well as a way to place raw R and Python code into modules for reuse within a group.

More importantly, any workflow created in it can be deployed as a RESTful endpoint, for either triggering a model-scoring workflow or making a request for data produced by the workflow in real time.
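
Once deployed, consuming such an endpoint from R is a plain REST call. The URL, API key handling, and payload shape below are placeholders – the exact request schema comes from whatever the service publishes:

```r
library(httr)
library(jsonlite)

endpoint <- "https://example.azureml.example.com/score"  # placeholder URL
api_key  <- Sys.getenv("SCORING_API_KEY")                # placeholder key

payload <- list(inputs = list(list(wt = 2.8, hp = 110)))  # assumed schema

resp <- POST(
  endpoint,
  add_headers(Authorization = paste("Bearer", api_key)),
  body = toJSON(payload, auto_unbox = TRUE),
  content_type_json()
)

scores <- fromJSON(content(resp, as = "text"))
```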

Azure ML is probably the fastest and most cost-effective way to start producing reusable data science microservices, particularly if your company is already using Azure Cloud Services.

AWS SageMaker is another relatively easy way to build and deploy models as services. However, SageMaker is targeted more toward data science developers, so most of its functionality is only accessible through code. Furthermore, the recommended data manipulation component is a separate service, AWS Glue.

As of right now, SageMaker is without a doubt more powerful; you can load custom containers into it, for example. Which is right for you depends on your existing cloud environment and the coding abilities of your data science team.

These are only two examples of how we are moving into a new world of Microservices. It is likely Data Scientists will soon be part of development teams, and data science architecture will be needed to support this new enterprise data development pattern.

At TheoryLane, our architects help disentangle your existing data processes to operationalize machine learning and data science solutions. Our data development patterns construct reusable, governed information objects. Combined, innovative data architecture and development patterns provide reusable streaming context to break the barriers of the BAS anti-pattern and create truly value-added applications!

Contact us for more information.

Three-Tiered Architecture: Basic Full-Stack Analytics Architecture Using R

Descriptive, Diagnostic, and Predictive Analytics – and their corresponding specializations of reporting, business intelligence, and data science – do not operate independently in an organization; at least not if they are to operate effectively.

A relatively new term emerging in our business vocabulary is “Full-Stack Data Science,” describing how all layers of analytics and data must operate in concert to maximize organizational returns on analytical activities. (See our post on emerging data science architecture for more detail.)

Let us break down the phrase “Full-Stack Data Science.” “Data Science” (as we’ve mentioned in previous articles) remains a nebulous term at best. As a quick refresher, Data Scientists tend to fall into two categories: one, PhDs in computer science who create programs leveraging statistics, and two, statisticians who understand how to incorporate programmatic solutions to accelerate data transformations and insights.

“Full-Stack” is a term appropriated from web development. In web dev, a “full-stack developer” creates the user interface, data interface, and data repository, as well as implements the technologies that let users access the application via the web.

Full Stack Data Science is very similar to Full Stack Web Development. The Data Science stack requires some way to interface with data, a means to transform the data (i.e. mathematical operations), data storage, and an architecture to support the entire process.

Below we will provide some high level examples of Web Dev and Data Science Full Stack use cases as well as some technologies often used. This is not an exhaustive list by any means, only designed to introduce the terminology and provide a frame of reference for those familiar with the more common web based tools and technology.

Full Stack Terminology

User Interface

  • Definition: How the end user accesses and interacts with the application
  • Examples:
    • Web Dev: User goes to department store website and is able to quickly find desired product
    • Data Science: An analyst wishes to see the results of a model or report; a user wishes to change the parameters of an existing model
  • Web Development Technologies:
    • HTML; CSS; JavaScript
  • Data Science Technologies:
    • Shiny; SSRS; Front End of BI Tools (e.g. Tableau; Spotfire)

Data Interface

  • Definition: How data entered into the user interface is moved to the data storage layer, and how data in the data layer is retrieved in response to user interface requests and the information required by the website
    • Data may be passed directly from interface to repository or it may be transformed in some capacity (e.g. aggregations, type changes, mathematical operations, modeling)
  • Examples:
    • Web Dev: A user purchases a good from the website; the credit card number is transferred to the banking system and the purchase is saved in the payment processing system. Shopping history may be retrieved by the user at a later date.
    • Data Science: An unsupervised learning process generates product suggestions for potential online customers; high-value customers are identified via a neural network model with a squared-error cost function
  • Web Development Technologies:
    • PHP; Angular.js; node.js
  • Data Science Technologies:
    • R; Spark; Python; SSIS; T-SQL; data modeling aspects of BI tools (e.g. Tableau VQL; Spotfire TERR and Information Designer)

Data Repository (Web Dev and Data Science)

  • Definition: Long-term storage of the data required for websites, predictive modeling, and reporting (a minimal example follows this list)
  • Example: Long-term storage of customer information; ERP; CMS
  • Technologies:
    • SQL; MongoDB; Cassandra; HBase; delimited text; XML; unstructured files
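
As a minimal illustration of this layer from R, the sketch below uses an embedded SQLite file as a stand-in for whatever enterprise database actually holds the data; the table and column names are invented:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "analytics.sqlite")

# Write: persist data the rest of the stack will need later
dbWriteTable(con, "customers",
             data.frame(id = 1:3, segment = c("A", "B", "A")),
             overwrite = TRUE)

# Read: retrieve it for modeling or reporting
segment_a <- dbGetQuery(con, "SELECT * FROM customers WHERE segment = 'A'")

dbDisconnect(con)
```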

User Interface

The first thing most people think about when “analytics” is mentioned is analytics visuals. Visualization is critical at all levels of analytical maturity – and one of the two core functions of a good UI is its ability to communicate information succinctly via visuals, that is, its ability to send information to the user.

User interactivity is the second core function – i.e. the ability of the user to send information into the stack via the UI. Early in analytics maturity, interactivity matters more to developers than to end users; however, just as websites moved from simple communicators of information in Web 1.0 to dynamic communities full of user-generated content in Web 2.0, so too do organizations begin moving analytics into the hands of their business users as analytics needs grow.

Analytics maturity typically starts at the user interface level. Creating interactive visualizations and allowing users to create their own cross-tabs and visualizations is commonly known as “self-service analytics” and acts as a bridge between descriptive and diagnostic analytics. User engagement with the data means that users are able to ask “why” certain results are present and begin to answer the question themselves.

Business Intelligence Platforms

R Integration

Modern Business Intelligence platforms provide all the user interface functionality (and some data interface functionality) required in the modern data science stack. For example, Spotfire and Tableau both have R integration out of the box. R integration means the BI platform has access to all the data interface functionality in R – including, but not limited to, data shaping, statistical modeling, and read/write to SQL.
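
The snippet below sketches the kind of R a BI platform’s R integration might execute on its behalf; assume input_data is the table the platform passes in and output_data is what it takes back for display (the names and the toy model are illustrative, not any vendor’s required convention):

```r
input_data <- mtcars   # stand-in for the table supplied by the BI platform

# Data shaping
by_cyl <- aggregate(mpg ~ cyl, data = input_data, FUN = mean)

# Statistical modeling
fit <- lm(mpg ~ wt + hp, data = input_data)

# Hand results back for the BI layer to visualize
output_data <- transform(input_data, predicted_mpg = predict(fit, input_data))
```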

Web Functionality

Business Intelligence platforms typically use browser-based solutions for large-scale deployments (roughly 50+ users). As such, they allow developers either to embed a BI visualization into an existing website (see: http://www.tibco.com/blog/2015/01/17/responsive-design-with-bootstrap-spotfire/) or to do website development and formatting directly in the analysis.

The platforms which truly embrace web functionality have JavaScript and HTML capability. With JavaScript, the door is open to the giant library of visualizations created by the JavaScript community – some examples: processing.js, d3.js, raphael.js, ember.js.

Entire books have been written on each one of those libraries, so I will leave it as an exercise for the reader to research JavaScript libraries further.

Direct Connections to Databases

It may sound like there needs to be a separate application serving as the data interface layer, but this is not necessarily true. All BI platforms allow direct connection to the data layer. Their ability to both interface with and display data makes them a low-barrier entry point for advancing analytic maturity.

However, BI platforms are built to read data, not write it. A separate data interface layer needs to be present for the read/write access required in a true data science full stack.

As an aside, if the organization’s platform of choice does not have some type of data interface integration (e.g. Python, MATLAB, SAS, Spark, R), the statistical data modeling and shaping done by said data interface platforms must happen asynchronously from the user interface – meaning, a separate statistical process not controlled by the BI tool must read data and write results to the data layer, which will then be consumed by the BI tool.
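
A sketch of that asynchronous pattern: a scheduled R job (cron, Task Scheduler, etc.) reads from the data layer, scores, and writes results back for the BI tool to pick up. The connection details, table names, and saved model file here are placeholders:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "analytics.sqlite")   # placeholder connection

to_score <- dbGetQuery(con, "SELECT * FROM customers_to_score")
model    <- readRDS("churn_model.rds")           # previously trained model (assumed)

to_score$churn_score <- predict(model, newdata = to_score, type = "response")

dbWriteTable(con, "customer_scores", to_score, overwrite = TRUE)
dbDisconnect(con)
```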

Data Interface

R as a Governed Hub

R is awesome for full-stack data science, don’t get me wrong, but there are plenty of other languages and platforms that are better in different ways. For example, Python is better at data manipulation and has easier syntax; Spark and H2O are better for large-scale computation. But, and this is very important, every one of these tools can be called from within an R script.

Want to query or write output to a SQL database? There’s sqldf. Have a cool Python script that you need to kick off and listen to the output of? There’s rPython. What about Spark? It has native R integration in SparkR.
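
A rough sketch of “R as the hub” in one script follows; exact package availability (sqldf, rPython, SparkR) depends on your environment, and reticulate has largely superseded rPython, so treat these calls as illustrative:

```r
# SQL over data frames (sqldf can also talk to SQLite/MySQL/PostgreSQL backends)
library(sqldf)
cyl_summary <- sqldf("SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")

# Kick off Python and capture its output
library(rPython)
python.exec("def double(x): return x * 2")
doubled <- python.call("double", 21)

# Native Spark integration
library(SparkR)
sparkR.session()
sdf    <- as.DataFrame(mtcars)
n_rows <- count(sdf)
```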

More importantly, we can keep all our code objects in a version-controlled library, allowing developers to update and extend functionality independently of the BI platform. Meanwhile, UI analysts don’t have multiple versions of the same code floating around – and they don’t even need to learn any programming.

Passing Values to R

Once that data is in R, you have access to almost 50 years of mathematical and data operations libraries. You can execute models and send the results right to the Business Intelligence platform for evaluation and reports.

In fact, the model evaluation can become more robust. Remember, BI platforms now have R integration as well. In many cases users can input or select values in the BI user interface and have them included in an R function, which then gives the user access to all the other functionality available in R!

From hyperparameter tuning to adding notes to an INSERT SQL statement, manipulating R scripts via the user interface of a BI platform greatly expands the base functionality of BI tools and is a major part of why R is a preferred part of a full-stack data science environment.
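
For instance, the sketch below imagines two values handed in from a BI input control – a cluster count and a free-text note – and uses them to parameterize an R model run (the variable names and the k-means choice are mine, not any platform’s convention):

```r
# Values assumed to arrive from the BI user interface
k_user     <- 4
table_note <- "scored via self-service UI"

# Hyperparameter supplied by the user drives the model
fit <- kmeans(scale(mtcars[, c("mpg", "wt", "hp")]), centers = k_user)

# Results (plus the user's note) go back to the BI layer or the data layer
scored <- transform(mtcars, cluster = fit$cluster, note = table_note)
```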

R can also access data online from websites and RESTful services via tools like RCurl and httr. Imagine a use case where we want to see the conversations, subjects, and sentiment of all users across a set of websites. We can create a full-stack solution where an end user types the name of a website into Spotfire*, and R scrapes the data page by page. Each page is processed into natural language blocks (verb, subject, direct object, etc.) and fed into a streaming Spark application, where it is written to a graph database capturing the relationships between phrases, pages, people, etc. All the while, the sentiment of and relationships between websites, pages, users, and the products mentioned are captured and available back in the user interface.

*Note: I am specifically stating Spotfire here because I know from personal experience Spotfire is capable of implementing this solution.
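
To ground the first step of that use case, here is a hedged sketch of the scraping piece only, using httr; the URL is illustrative, the text handling is deliberately crude (a real solution would use rvest or a proper parser), and the NLP, Spark streaming, and graph database stages would sit downstream:

```r
library(httr)

site <- "https://example.com"        # placeholder for the user-supplied site
resp <- GET(site)
html <- content(resp, as = "text")

# Crude text extraction and tokenization
text_only <- gsub("<[^>]+>", " ", html)
words     <- unlist(strsplit(tolower(text_only), "[^a-z']+"))

# Most frequent terms on the page -- raw material for the downstream NLP stages
head(sort(table(words), decreasing = TRUE), 10)
```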