Applying Continuous Delivery Patterns to Data Development

Historically, application development and data pipeline development have been kept separate. We are seeing this pattern begin to change. (See posts on Conway’s Law for reasons why.)

This means ETL/ELT development will almost certainly begin to model itself on application development. App dev has far more controls and business-continuity safeguards, but more importantly, application development has spent the past two decades refining coding patterns built around reusable and extensible code objects.

App DevOps teams have so refined their ability to quickly modify and deploy changes to app code that the industry is now talking about not only pushing code for testing thousands of times a day, but also automating the push of code changes to production. Continuously!

That sounds like an impossibility to many ETL/ELT devs, but the truth is that nothing is stopping continuous deployment patterns in data dev. In fact, there is a movement toward Behavior-Driven Development (BDD) as an extension of Test-Driven Development (TDD).

BDD and TDD are development patterns that integrate acceptance criteria into the code itself, meaning the first round of quality assurance happens before any human lays eyes on the data output. App DevOps teams have found this helps trace code issues to their root causes, because teams can focus on specific problems (e.g. “Are the acceptance criteria correct? If so, is Dave’s code correctly testing for them?”) rather than general problems (e.g. “Dave is an idiot”).
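To make the idea concrete, here is a minimal sketch of acceptance criteria living alongside a data transform, written in plain Python in a Given/When/Then shape. Everything in it is hypothetical and chosen only for illustration: the `total_revenue_by_customer` transform, the field names, and the rule that duplicate order IDs are counted once. It is not taken from any particular framework.

```python
from collections import defaultdict


def total_revenue_by_customer(rows):
    """Hypothetical transform: sum order amounts per customer,
    counting each order_id only once."""
    seen_order_ids = set()
    totals = defaultdict(float)
    for row in rows:
        if row["order_id"] in seen_order_ids:
            continue  # duplicate order, already counted
        seen_order_ids.add(row["order_id"])
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def check_acceptance_criteria(transform):
    """Run the (assumed) acceptance criteria against a transform and
    return a list of human-readable failures; empty means all passed."""
    # Given: raw orders in which order 1 was loaded twice
    raw_orders = [
        {"order_id": 1, "customer": "acme", "amount": 100.0},
        {"order_id": 1, "customer": "acme", "amount": 100.0},
        {"order_id": 2, "customer": "acme", "amount": 50.0},
        {"order_id": 3, "customer": "zenith", "amount": 25.0},
    ]

    # When: the transform runs
    result = transform(raw_orders)

    # Then: each criterion is checked and reported by name
    failures = []
    if result.get("acme") != 150.0:
        failures.append("duplicate orders must be counted once")
    if result.get("zenith") != 25.0:
        failures.append("every customer with orders must have a total")
    if set(result) != {"acme", "zenith"}:
        failures.append("output must contain exactly the input customers")
    return failures


if __name__ == "__main__":
    problems = check_acceptance_criteria(total_revenue_by_customer)
    print("all criteria met" if not problems else problems)
```

Because each failure names the criterion it violates, a failing run points the discussion at either the criterion or the code that is supposed to satisfy it, which is the narrowing-down described above.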

Databricks has a great article on how to architect dev/prod CI/CD. It shows details such as how each developer has their own ‘development’ environment, but with managed configurations and plugins to ensure all development occurs in the same configuration.

I would personally love to see data science and data engineering development patterns move toward Test-Driven and Behavior-Driven Development, mostly because it makes things a lot easier on data end users, but also because it forces strict requirement definitions: https://databricks.com/blog/2017/06/02/integrating-apache-spark-cucumber…

It’s a little late to pick it up on Humble Bundle, but Continuous Delivery with Docker and Jenkins by Rafał Leszko is a great read on the topic of Continuous Delivery, even if you are unfamiliar with the technologies.

At TheoryLane, our architects help disentangle your existing data processes to operationalize machine learning and data science solutions. Our data development patterns construct reusable, governed information objects. Combined, innovative data architecture and development patterns provide reusable streaming context to break the barriers to continuous data deployment and create true value-added applications!

Contact us for more information.
