Data Scientists are in an interesting position. Data Science is about optimization, and true optimization requires automation, so, QED, ‘true’ Data Science means eliminating Data Scientists.
And Data Scientists are happy to optimize themselves out of a job. Unfortunately, the development patterns and architectures data scientists sometimes use can, ironically, be sub-optimal for optimization. (These are sometimes called ‘Anti-Patterns’, because the mechanism for developing the solution limits the success of said solution.)
The issue may stem from Data Science as a discipline not having been around for very long. Fortunately we are able to borrow from application and data development operations (i.e. “DevOps”) for some guidance.
First, let’s look at a frequent Data Science development workflow/anti-pattern; then explore a traditional and emerging app dev pattern to solve the Data Science anti-pattern.
The Big Ass Script “Architecture”
Data Science has a problem: we typically put all our data connection, transformation, and visualization code in one big-ass script (“BAS”). The BAS development pattern is great for one superstar data scientist to mess around with a bunch of data and make an awesome model. But there are a few limitations:
- It’s hard to reuse in future analysis
- No one else understands how it works
- It’s difficult to debug
We have workarounds for these limitations, but copy-pasting code, knowledge bases, and blind faith in results only get us so far.
BAS is only made easier to create with “Notebooks” like Jupyter and Zeppelin, which are practically IDEs for generating BAS solutions.
What about integrating a BAS into a production data pipeline? Simple answer: you can’t. It has to be refactored, which is why we have a standard data science development pattern of:
- Data Science Data Mess-around
- Refactor data transformations into a more useful framework
- Build out a thick ETL dumping the scored data into a database
The Decoupled Architecture
“But we solved this problem over 30 years ago” says literally every enterprise data architect. Yes they did, so let’s look at the basics.
Who would have thought separating the data, data analysis (filtering, scoring, etc.), and reporting (visuals, dashboards, etc.) was important?
The second someone downloads a CSV to their laptop or starts messing around in Excel, any semblance of data governance is thrown out the window. To make matters worse, updating the model means we have to repeat whatever crazy process was used to get that CSV in the first place.
A tightly coupled analysis and reporting stack is just as bad. Now data scientists are forced to rerun analysis for every enhancement or bug fix requested. It can lead to analysts becoming front end developers.
So the multi-tiered architecture is not only good for managing hardware resources, it is good for managing human resources as well.
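As a rough illustration of the separation described above, the sketch below splits a tiny analysis into a data tier, an analysis tier, and a reporting tier. Everything in it is an assumption made for illustration: the function names, the `customer_id` and `monthly_spend` fields, and the in-memory CSV that stands in for a governed data store.

```python
import csv
import io
from statistics import mean

# Data tier: the only code that knows where the data lives.
# An in-memory CSV stands in for a governed database or data lake.
RAW_CSV = """customer_id,monthly_spend
a1,120.0
b2,40.0
c3,310.0
"""


def load_customers(source: str = RAW_CSV) -> list[dict]:
    """Read raw records and convert them to typed dictionaries."""
    reader = csv.DictReader(io.StringIO(source))
    return [
        {"customer_id": row["customer_id"],
         "monthly_spend": float(row["monthly_spend"])}
        for row in reader
    ]


# Analysis tier: pure functions over records, with no I/O and no formatting.
def score_customers(customers: list[dict]) -> list[dict]:
    """Flag customers whose spend is above the group average."""
    average = mean(c["monthly_spend"] for c in customers)
    return [{**c, "above_average": c["monthly_spend"] > average}
            for c in customers]


# Reporting tier: turns scored records into something a person reads.
def render_report(scored: list[dict]) -> str:
    """Format one line per customer; knows nothing about how scores are made."""
    return "\n".join(
        f"{c['customer_id']}: {'HIGH' if c['above_average'] else 'low'}"
        for c in scored
    )
```

Calling `render_report(score_customers(load_customers()))` chains the three tiers. A change to the report wording touches only `render_report`, and a change to the scoring rule touches only `score_customers`, so neither forces a rerun or rewrite of the other.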
Is this the same as Model View Controller?
Great question, and the answer is debatable, but personally I look at the tiered architecture as a physical separation and MVC as a logical separation. MVC is a code development pattern which separates the data and its rules (“Model”) from the application’s interface (“View”) and from the logic that handles input and applies changes to the data (“Controller”).
Meaning, in a modern PaaS environment, effective MVC design is equivalent to multi-tier design, as the data storage, processing, and visualization functions are handled by different services.
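For readers who have not met MVC before, here is a minimal sketch of the logical separation in plain Python. The class and function names are hypothetical, and real MVC frameworks add routing, templates, and persistence on top of this idea.

```python
class ScoreModel:
    """Model: owns the data and nothing else."""

    def __init__(self) -> None:
        self._scores: dict[str, float] = {}

    def set_score(self, customer_id: str, score: float) -> None:
        self._scores[customer_id] = score

    def all_scores(self) -> dict[str, float]:
        # Return a copy so callers cannot mutate the model's state directly.
        return dict(self._scores)


def render_view(scores: dict[str, float]) -> str:
    """View: formats data for display and never changes it."""
    return "\n".join(
        f"{customer_id}: {score:.2f}"
        for customer_id, score in sorted(scores.items())
    )


class ScoreController:
    """Controller: validates input, updates the model, picks the view."""

    def __init__(self, model: ScoreModel) -> None:
        self.model = model

    def submit(self, customer_id: str, raw_score: str) -> str:
        # Raw user input is converted here, before it reaches the model.
        self.model.set_score(customer_id, float(raw_score))
        return render_view(self.model.all_scores())
```

All three pieces can live in one process on one machine, which is why this reads as a logical separation rather than a physical one.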
However, MVC doesn’t solve everything. Even with decoupled operations, we may not decouple how the operations themselves interact. In other words, we may develop solutions with extremely efficient code execution, but the code is buried in a larger method, or maybe we want to modify a model but it’s used across multiple solutions.
The Microservice Architecture
The microservice architecture is emerging as the data science architecture pattern of choice. To oversimplify, it focuses on creating standard interfaces for all data extraction, manipulation, and visualization operations, without the need for complicated middle-layer service bus applications to handle communication.
A standard interface means we have a consistent format for data going in and data coming out of the “service.”
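One way to picture such a contract is as a pair of typed records, one for the request and one for the response. The sketch below is only an illustration: the field names, the `handle` function, and the placeholder scoring rule are all assumptions, not part of any particular platform.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScoreRequest:
    """What callers must send (hypothetical fields)."""
    customer_id: str
    monthly_spend: float


@dataclass(frozen=True)
class ScoreResponse:
    """What callers can rely on getting back (hypothetical fields)."""
    customer_id: str
    score: float
    model_version: str


def handle(raw_request: str) -> str:
    """Parse a JSON request, score it, and return a JSON response."""
    # Unknown or missing fields raise TypeError, so bad input fails loudly.
    request = ScoreRequest(**json.loads(raw_request))
    # Placeholder model: callers depend on the contract, not on this line.
    score = min(request.monthly_spend / 500.0, 1.0)
    response = ScoreResponse(request.customer_id, round(score, 3), "v1")
    return json.dumps(asdict(response))
```

The scoring line can be swapped for a real model without any caller changing, as long as `ScoreRequest` and `ScoreResponse` stay the same.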
For a good example of a service, think of Facebook’s Graph API or GitHub’s GraphQL API. You typically make several calls to various service endpoints as you get new information on each hop. (E.g. userId -> postId -> text.) We know the format of data to provide to the endpoint and what data to expect in return, so we can reuse the endpoints in multiple applications. Furthermore, modifications to the underlying code probably happen all the time without anyone ever noticing.
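The userId -> postId -> text hops can be sketched with two small functions standing in for endpoints. The dictionaries and function names below are hypothetical stand-ins for remote services, not the API of any real platform.

```python
# In-memory stand-ins for two remote endpoints (hypothetical data).
USER_POSTS = {"u1": ["p10", "p11"]}
POST_TEXT = {"p10": "hello", "p11": "world"}


def get_post_ids(user_id: str) -> list[str]:
    """Endpoint 1: user id in, list of post ids out."""
    return USER_POSTS.get(user_id, [])


def get_post_text(post_id: str) -> str:
    """Endpoint 2: post id in, text out."""
    return POST_TEXT[post_id]


def texts_for_user(user_id: str) -> list[str]:
    """Client code: follow each hop using only the documented inputs and outputs."""
    return [get_post_text(post_id) for post_id in get_post_ids(user_id)]
```

Because `texts_for_user` relies only on what goes in and what comes out, either endpoint could move from a dictionary to a database or a model without the client noticing.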
How to Implement Data Science Microservice Architecture
Microservices sound all well and good, but how on earth are analysts and most data scientists going to actually develop like this? Am I seriously proposing they learn how to develop and deploy reusable APIs like back-end developers or engineers?
Well, kind of, yeah…
But it’s not as bad as it sounds. While purely coded solutions exist, for example Flask for Python or plumber for R, they require a lot of administration and development to run in a fully integrated microservices architecture. There is no preconfigured security, no management layer, and no high availability, not to mention it’s an entirely new coding paradigm.
So without a lot of IT infrastructure and support around them, having data scientists start standing up APIs they coded themselves is probably not going to happen.
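To give a sense of what a purely coded service involves, here is a minimal sketch of a scoring endpoint. It uses Python’s built-in WSGI support as a stand-in rather than Flask or R, so that it runs with no extra packages. The `/score` path, the `monthly_spend` field, and the scoring rule are assumptions for illustration. Note what is missing: authentication, logging, monitoring, and redundancy, which is exactly the gap described above.

```python
import json
from wsgiref.simple_server import make_server


def score(features: dict) -> float:
    """Placeholder model; a real service would load a trained model."""
    return min(float(features.get("monthly_spend", 0.0)) / 500.0, 1.0)


def handle_score(raw_body: bytes) -> tuple[str, bytes]:
    """Turn a raw JSON request body into an HTTP status and a JSON body."""
    try:
        features = json.loads(raw_body or b"{}")
        payload, status = {"score": score(features)}, "200 OK"
    except (ValueError, TypeError, AttributeError):
        # Malformed JSON, a non-object body, or a non-numeric value.
        payload, status = {"error": "bad request"}, "400 Bad Request"
    return status, json.dumps(payload).encode("utf-8")


def app(environ, start_response):
    """WSGI application exposing a single POST /score endpoint."""
    if (environ.get("REQUEST_METHOD"), environ.get("PATH_INFO")) != ("POST", "/score"):
        status, body = "404 Not Found", b'{"error": "not found"}'
    else:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        status, body = handle_score(environ["wsgi.input"].read(length))
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Run a single-process development server (not production-grade)."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts the service locally, after which a POST to `http://127.0.0.1:8000/score` with a JSON body returns a score. Getting from here to something an enterprise would trust is where the administration effort goes.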
Alternatively, PaaS providers like Azure and AWS have other ways for data scientists to begin deploying services.
Azure Machine Learning is a clean looking GUI for data manipulation and preconfigured statistical operations, as well as a way for raw R and Python code to be placed into modules for reuse in a group.
More importantly, any workflow created in it can be deployed as a RESTful endpoint for either triggering a model scoring workflow or making a request for some data produced by the workflow in real time.
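From the consumer’s side, calling a deployed scoring endpoint usually comes down to an authenticated HTTP POST with a JSON body. The sketch below shows the general shape only. The URL, the bearer-token header, and the `{"rows": [...]}` payload are assumptions for illustration and are not the documented request schema of Azure Machine Learning or any other provider, so check the provider’s documentation for the real format.

```python
import json
import urllib.request


def build_scoring_request(url: str, api_key: str,
                          rows: list[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST request for a scoring endpoint."""
    # Assumed payload shape: a list of records under a "rows" key.
    body = json.dumps({"rows": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; providers differ.
            "Authorization": f"Bearer {api_key}",
        },
    )


def call_scoring_endpoint(request: urllib.request.Request,
                          timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the client easy to test without touching the live service.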
Azure ML is probably the fastest and most cost effective way to start producing reusable data science microservices, particularly if your company is already using Azure Cloud Services.
AWS SageMaker is another relatively easy way to build and deploy models as services. However, SageMaker is targeted more toward data science developers, so most of its functionality is only accessible through code. Furthermore, the recommended data manipulation component is a separate service, “AWS Glue”.
As of right now SageMaker is without a doubt the more powerful of the two; for example, you can load custom containers into it. Which one is right for you depends on your existing cloud environment and the coding abilities of your data science team.
These are only two examples of how we are moving into a new world of Microservices. It is likely Data Scientists will soon be part of development teams, and data science architecture will be needed to support this new enterprise data development pattern.
At TheoryLane our architects help disentangle your existing data processes to help operationalize machine learning and data science solutions. Our data development patterns construct reusable, governed information objects. Combined, innovative data architecture and development patterns provide reusable streaming context to break the barriers of the BAS anti-pattern and create true value-added applications!
Contact us for more information.
One Reply to “Exploring Development Patterns in Data Science”
Thank you for the article. The first two points are spot-on and the notion that Notebooks only facilitate the creation of BAS nicely highlights the danger.
I believe it is worth really highlighting your point about data scientists requiring an infrastructure and processes in place in order to start tackling microservices for any kind of generalised infrastructure.
I have seen it backfire spectacularly on several occasions. And I do not just mean it took too much of the Data Scientists’ time. It can introduce services of limited quality and resilience as part of what is now production-grade infrastructure. It can create nearly instantly orphaned silos. It requires significant effort to standardise afterwards.
What I did find to be a much easier victory, in the absence of an established process for deploying microservices, is what you hint at in the “messing around in Excel” part: a standard for data management, meaning the introduction of a simple lake-like, immutable, single repository for original datasets, with light-weight governance and a defined scope for progression to products.