Advanced R Programming (And Why that is Important)

Thanks to Reddit.com/r/programming for directing me to this excellent compilation of R programming information by Hadley Wickham.

Most of us in analytics use R just to get stuff done. It has a large number of packages that let us produce insights at our own pace. What I mean by “at our own pace” is that we typically analyse data either individually or within our team, with the results communicated to our audience independently of the analysis itself.

This creates a situation where programmatic efficiency is not a priority for many R analysts. If a function was sub-optimal, it would only cost us a little extra processing time; our end user would never know the difference. In fact, it made more sense to create a quickly coded, slow-running solution, as the code would only be used a few times. Spending hours optimizing it would be a waste.

However, analysts are now crunching massive amounts of data. What would have caused a few minutes’ delay in processing a few years ago will now cause hours and hours of unnecessary waiting. Additionally, reducing processing time has become more valuable as we move toward elastic cloud computing, where more processing = more cost.

Most importantly, analytics is moving toward machine learning solutions, requiring algorithms to update constantly and influence automated decisions in close to real time. For example, predicting likely online customer behavior based on current site activity. Slow processing makes for a very ineffective modeled solution.

Given these factors, we are rapidly reaching the point where an analyst’s programming abilities will be just as important as their statistical skill set. Over the next few weeks I plan to pick out a few relevant items from the Advanced R Programming review which will help us Analysts make our programs faster and easier to understand.
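As a toy illustration of the kind of optimization at stake (sketched here in Python with NumPy purely for demonstration; R’s own vectorized operations embody the same principle), replacing an element-by-element loop with a single vectorized call gives the same answer with far less interpreter overhead:

```python
import numpy as np

def scale_loop(values, factor):
    """Element-by-element loop: the 'quickly coded' version."""
    out = []
    for v in values:
        out.append(v * factor)
    return out

def scale_vectorized(values, factor):
    """One vectorized operation: same result, much faster on large inputs."""
    return (np.asarray(values) * factor).tolist()

# Both produce identical results; on a million elements the vectorized
# version is typically orders of magnitude faster.
data = list(range(1_000_000))
assert scale_loop(data[:5], 2.0) == scale_vectorized(data[:5], 2.0)
```

For a throwaway script the loop is fine; once the same code runs hourly against millions of rows, that difference is exactly the cost the paragraphs above describe.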

Metaphors used in Data Science

People say data science is difficult, which it is, but explaining it to other people is even harder!

Data Science itself is to blame for this, mostly because we don’t have a concrete definition of it either, which has created a few problems. There are companies promoting ‘Data Science’ tools as ways to enable all your analysts to become “Data Scientists”. The job market is full of people who took a course on Python calling themselves “Data Scientists”. And there are businesses so focused on reporting that they think all Analytics, Data Science included, is just getting data faster and prettier.

But the tools we use are just that, tools. The code we write requires specialized knowledge to apply it effectively. The data pipelines we create exist to monitor the success and failure of our models; that they help with reporting is an added bonus. To mitigate these challenges we have to come up with clever phrases such as:

  • “Buying a hammer does not make you a carpenter”
  • “Knowing how to drive does not make you a mechanic”
  • “Following the recipe on the back of cake mix does not make you a baker”
  • “Owning Quickbooks does not make you an accountant”
  • “Wearing a FitBit doesn’t make you a Doctor”
Continue reading “Metaphors used in Data Science”

FAQs: Getting Started in Data Science

Frequently Asked Questions

Sometimes when I get asked a question I send an extremely well-thought-out response, which may or may not be appreciated by that person, but I feel like there might be other people who would appreciate it – maybe…

Regardless of my delusions of grandeur, I do want to start documenting these responses, both to feed my massive ego and because I am hopelessly lazy. Since the question I get asked the most is “How do I get started in Data Science?” we might as well start there. So here’s my Fall 2017 go-to answer on how to get started quickly in Data Science.

The Fundamentals

No matter what you want to do in Data Science, if you’ve never actually done it professionally, you’ll need a few things.

  1. A LinkedIn page
  2. Your own website and associated email
  3. A GitHub account
  4. A willingness to spend time adding content to all of the above
Continue reading “FAQs: Getting Started in Data Science”

Emerging Data Science Architecture Patterns

This past month I’ve taken two EdX courses to brush up on Enterprise Data Integration Architecture: one on Active Directory Identity Management in Azure, and the other on deploying data application interface services with C#.

What does that have to do with Data Science? Everything. At least that is my strong suspicion. This article explores industry trends toward “Enterprise” data science and how we can build our architectures to support very rapidly evolving Data Science solutions.

Continue reading “Emerging Data Science Architecture Patterns”

Configuring Anaconda for Bayesian Analytics with Stan

It’s not as popular as it once was, but Bayesian Analytics remains a powerful tool for many supervised learning exercises.

Despite all the hype around Deep Learning models and AI-as-a-Service APIs, there’s still a need for Data Scientists to explain – in simple terms – what factors influence a given prediction. And even more importantly, sometimes we want to construct a model that represents a real-world process, rather than have input values feed into a programmatically optimized series of neural networks to produce a predicted value.
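As a hypothetical sketch of that kind of interpretable model (plain Python rather than Stan itself, using a conjugate Beta-Binomial update so no sampler is needed; the conversion-rate scenario is invented for illustration): the posterior parameters map directly onto a real-world quantity in a way a neural network’s weights never do.

```python
# Conjugate Beta-Binomial update: a Bayesian model whose parameters
# describe a real-world process (here, a page's conversion rate) directly.
def beta_binomial_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Return posterior Beta(a, b) after observing `successes` in `trials`."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    return a, b

def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# 120 conversions out of 1000 visits, starting from a flat Beta(1, 1) prior.
a, b = beta_binomial_posterior(120, 1000)
print(round(posterior_mean(a, b), 4))  # prints 0.1208
```

Every number here has a plain-language reading – “our best estimate of the conversion rate is about 12%, given 1000 observed visits” – which is exactly the explainability the paragraph above is after; Stan generalizes this to models far too complex for closed-form updates.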

Continue reading “Configuring Anaconda for Bayesian Analytics with Stan”

Continuous Delivery in Data Science

As I discussed in a previous article, Data Science is in desperate need of DevOps. Fortunately, there are finally some emerging DevOps patterns to support Data Science development. Databricks itself is providing much of it.

Two concepts keep popping up in these DevOps patterns: “Continuous Integration / Continuous Deployment” and “Test-Driven Development” (moving toward “Behavior-Driven Development”, but that’s not yet a widely used term in this space).

Databricks has a great article on how to architect dev/prod CI/CD: https://databricks.com/blog/2017/10/30/continuous-integration-continuous… It shows details such as how each developer gets their own ‘development’ environment, but with managed configurations and plugins to ensure all development occurs in the same configuration.

I would personally love to see data science engineering development patterns move into Test/Behavior-Driven Development – mostly because it makes things a lot easier on data end users, but also because it forces strict requirement definitions: https://databricks.com/blog/2017/06/02/integrating-apache-spark-cucumber…
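As a minimal sketch of what test-driven development looks like for a data pipeline step (plain Python with bare asserts; the function and the deduplication requirement are hypothetical examples, and in practice the asserts would live in a pytest suite run by the CI/CD pipeline discussed above): the test states the behavioral requirement first, and the implementation exists to satisfy it.

```python
# Requirement, written before the implementation: deduplicate records by id,
# keep the row with the latest timestamp, and never return an id twice.
def dedupe_latest(records):
    """records: iterable of (record_id, timestamp, value) tuples."""
    latest = {}
    for rec_id, ts, value in records:
        # Keep only the most recent observation per record_id.
        if rec_id not in latest or ts > latest[rec_id][0]:
            latest[rec_id] = (ts, value)
    return [(rid, ts, v) for rid, (ts, v) in sorted(latest.items())]

# The requirement as an executable test, including the tricky case of
# out-of-order arrivals.
rows = [("a", 1, 10), ("a", 3, 30), ("b", 2, 20), ("a", 2, 25)]
assert dedupe_latest(rows) == [("a", 3, 30), ("b", 2, 20)]
assert dedupe_latest([]) == []
```

Because the expected behavior is pinned down as code, a refactor – say, rewriting this as a Spark job – can be verified against the same requirements, which is precisely what makes CI/CD safe for data work.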

It’s a little late to pick it up on Humble Bundle, but Continuous Delivery with Docker and Jenkins by Rafał Leszko is a great read on the topic of Continuous Delivery, even if you are unfamiliar with the specific technologies.