People say data science is difficult, and it is, but explaining it to other people is even harder!
Data Science itself is to blame for this, mostly because we don’t have a concrete definition of the field either, which has created a few problems. There are companies promoting “Data Science” tools as ways to turn all your analysts into “Data Scientists”. The job market is full of people who took a course on Python and now call themselves “Data Scientists”. And there are businesses so focused on reporting that they think all Analytics, Data Science included, is just getting data faster and prettier.
But the tools we use are just that: tools. The code we write requires specialized knowledge to apply effectively. The data pipelines we create exist to monitor the success and failure of our models; that they help with reporting is an added bonus. To counter these misconceptions we have to come up with clever phrases such as:
- “Buying a hammer does not make you a carpenter”
- “Knowing how to drive does not make you a mechanic”
- “Following the recipe on the back of cake mix does not make you a baker”
- “Owning Quickbooks does not make you an accountant”
- “Wearing a FitBit doesn’t make you a doctor”
Frequently Asked Questions
Sometimes when I get asked a question I send an extremely well-thought-out response, which may or may not be appreciated by that person, but I feel like there might be people who would – maybe…
Regardless of my delusions of grandeur, I do want to start documenting these responses, both to feed my massive ego and because I am hopelessly lazy. Since the question I get asked the most is “How do I get started in Data Science?”, we might as well start there. So here’s my Fall 2017 go-to answer on how to get started quickly in Data Science.
No matter what you want to do in Data Science, if you’ve never actually done it professionally, you’ll need a few things.
- A LinkedIn page
- Your own website and associated email
- A GitHub account
- A willingness to spend time adding content to all the above
Thanks to Reddit.com/r/programming for directing me to this excellent compilation of R programming information by Hadley Wickham.
Most of us in analytics use R just to get stuff done. It has a large number of packages that let us produce insights at our own pace. What I mean by “at our own pace” is that we typically analyse data either individually or within our team, with the results communicated to our audience independently of the analysis itself.
This created a situation where programmatic efficiency was not a priority for many R analysts. If a function was sub-optimal, it only cost us a little extra processing time; our end users would never know the difference. In fact, it made more sense to write code quickly and let it run slowly, since the code would only be used a few times. Spending hours optimizing it would have been a waste.
However, analysts are now crunching massive amounts of data. What would have caused a few minutes’ delay in processing a few years ago can now cause hours and hours of unnecessary waiting. Additionally, reducing processing time has become more valuable as we move toward elastic cloud computing, where more processing = more cost.
Most importantly, analytics is moving toward machine learning solutions that require algorithms to update constantly and drive automated decisions in close to real time; for example, predicting likely online customer behavior based on current site activity. Slow processing makes for a very ineffective modeled solution.
Given these factors, we are rapidly reaching the point where an analyst’s programming abilities will be just as important as their statistical skill set. Over the next few weeks I plan to pick out a few relevant items from Hadley Wickham’s Advanced R material that will help us analysts make our programs faster and easier to understand.
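To make the stakes concrete, here’s a minimal sketch of the kind of speed-up involved. (It’s in Python rather than R, since the pattern is the same in any analytics language; the function names and data are made up for illustration.) Both versions return the same answer, but the vectorized one hands the loop to optimized native code, which is what matters once the input reaches millions of rows.

```python
import numpy as np

def slow_sum_of_squares(values):
    """One element at a time in the interpreter: fine for small data,
    painful once the input has millions of rows."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def fast_sum_of_squares(values):
    """Vectorized: the same arithmetic, but the loop runs in
    optimized native code inside NumPy."""
    arr = np.asarray(values, dtype=float)
    return float(arr @ arr)

data = list(range(10_000))
slow = slow_sum_of_squares(data)  # scales linearly with interpreter overhead
fast = fast_sum_of_squares(data)  # same answer, a fraction of the time at scale
```

The point isn’t to micro-optimize every script; it’s that when code runs repeatedly in a pipeline, the vectorized habit pays for itself.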
Wow, the title “Interpreting Patterns in Multi-Variate Multi-Horizon Time-Series Forecasts from Google’s Temporal Fusion Transformer Model” is a mouthful… But it’s not as complicated as it seems. Let’s break it down:
- Multi-Variate Time-Series Forecasts – Single-variate time-series forecasting uses only the historical values of the series whose future values we are attempting to predict. (For example, exponential smoothing, moving average, or auto-regressive moving average models.) Multi-variate forecasting allows additional time-series and non-time-series variables to be included in the model, enhancing its predictive capability and giving a better understanding of what influences our target value(s). (For example, including the weighted seven-day moving average sentiment of news articles about a company when forecasting its stock price for tomorrow.)
- Multi-Horizon Time-Series Forecasts – Traditional time-series forecasting is typically optimized for a specified number of periods ahead (for example, a produce department predicting next week’s potato sales to determine inventory). Multi-horizon means we attempt to predict many different future periods within the same model. (For example, predicting daily potato sales for every day over the next four weeks to reduce the number of orders and scheduled restocking times.)
- Interpreting Patterns – A good model doesn’t only provide an accurate prediction; it also gives insight into which inputs are driving the results. That is, the model is interpretable.
- Temporal Fusion Transformer – The name of the proposed multi-horizon time-series forecasting framework. It combines elements of Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNNs), and a mechanism called “Attention”, first popularized in neural machine translation. (We’ll talk more about attention later.)
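The TFT itself is a deep network, but the “multi-variate, multi-horizon” idea can be illustrated with something much simpler. Below is a toy Python sketch of the “direct” multi-horizon strategy: one model per forecast horizon, fed lagged target values plus an extra covariate. Plain least squares stands in for the transformer, and all the data and names are made up; this is an illustration of the problem setup, not of TFT’s architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a trending target series plus one extra covariate
# (standing in for something like news-article sentiment).
n = 200
covariate = rng.normal(size=n)
target = 0.5 * np.arange(n) + 3.0 * covariate + rng.normal(size=n)

def direct_multi_horizon_fit(y, x, n_lags=3, horizons=(1, 2, 3)):
    """Fit one least-squares model per forecast horizon (the 'direct'
    multi-horizon strategy), using lagged target values and a covariate
    as inputs -- a toy stand-in for what a model like TFT learns."""
    models = {}
    for h in horizons:
        rows, ys = [], []
        for t in range(n_lags, len(y) - h):
            # features: bias term + the last n_lags target values + covariate
            rows.append(np.r_[1.0, y[t - n_lags:t], x[t]])
            ys.append(y[t + h])
        coef, *_ = np.linalg.lstsq(np.array(rows), np.array(ys), rcond=None)
        models[h] = coef
    return models

models = direct_multi_horizon_fit(target, covariate)

# Forecast all three horizons from the end of the observed series.
t = n - 1
features = np.r_[1.0, target[t - 3:t], covariate[t]]
forecasts = {h: float(features @ coef) for h, coef in models.items()}
```

One model per horizon avoids the error accumulation of recursively feeding predictions back in, which is one reason multi-horizon framings are attractive; TFT goes further by producing all horizons from a single network.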
As I discussed in a previous article, Data Science is in desperate need of DevOps. Fortunately, some DevOps patterns are finally emerging to support Data Science development, and Databricks itself is providing much of the tooling.
Two concepts keep popping up in these DevOps patterns: “Continuous Integration / Continuous Deployment” (CI/CD) and “Test-Driven Design” (which is moving toward “Behavior-Driven Design”, though that’s not yet a widely used term).
Databricks has a great article on how to architect dev/prod CI/CD: https://databricks.com/blog/2017/10/30/continuous-integration-continuous… This shows details such as how each developer gets their own ‘development’ environment, but with managed configurations and plugins to ensure all development occurs in the same configuration.
I would personally love to see data science engineering development patterns move into Test / Behavioral Driven Design – mostly because it makes things a lot easier on data end users; but also because it forces strict requirement definitions: https://databricks.com/blog/2017/06/02/integrating-apache-spark-cucumber…
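To show what test-first development looks like for data work, here’s a plain-Python sketch (not the Spark/Cucumber setup from the linked article; the function and field names are invented for illustration). The assertions at the bottom are the behavioral spec, written before the transformation itself, which is exactly the discipline that forces strict requirement definitions.

```python
def clean_orders(rows):
    """Drop rows with a missing order id and normalize amounts to floats.
    (Hypothetical pipeline step, named for illustration only.)"""
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue  # spec'd behavior: silently drop unidentifiable orders
        cleaned.append({"order_id": row["order_id"],
                        "amount": float(row.get("amount", 0))})
    return cleaned

# Behavioral specs -- written first, then the implementation above is
# made to satisfy them:
assert clean_orders([]) == []                                 # empty in, empty out
assert clean_orders([{"order_id": None, "amount": 5}]) == []  # missing id dropped
assert clean_orders([{"order_id": "a1", "amount": "3.5"}]) == \
       [{"order_id": "a1", "amount": 3.5}]                    # amounts coerced
```

The same specs could be phrased in Gherkin (“Given an order with no id, it is excluded…”) and wired to Cucumber, which is where the behavior-driven variant comes in: the requirements become readable by the data’s end users, not just the engineers.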
It’s a little late to pick it up on HumbleBundle, but Continuous Delivery with Docker and Jenkins by Rafał Leszko is a great read on the topic of Continuous Delivery, even if you are unfamiliar with the technologies.