Metaphors used in Data Science

People say data science is difficult, which it is, but even harder is explaining it to other people!

Data Science itself is to blame for this, mostly because we don’t have a concrete definition of it either, which has created a few problems. There are companies promoting ‘Data Science’ tools as ways to enable all your analysts to become “Data Scientists”. The job market is full of people who took a course on Python calling themselves “Data Scientists”. And businesses are so focused on reporting that they think all Analytics, Data Science included, is just getting data faster and prettier.

But the tools we use are just that: tools. The code we use requires specialized knowledge to apply effectively. The data pipelines we create are there to monitor the success and failure of our models; it’s an added bonus that they help with reporting. To mitigate these challenges we have to come up with clever phrases such as:

  • “Buying a hammer does not make you a carpenter”
  • “Knowing how to drive does not make you a mechanic”
  • “Following the recipe on the back of cake mix does not make you a baker”
  • “Owning Quickbooks does not make you an accountant”
  • “Wearing a FitBit doesn’t make you a Doctor”

Configuring Anacondas for Bayesian Analytics with STAN

It’s not as popular as it once was, but Bayesian Analytics remains a powerful tool for many supervised learning exercises.

Despite all the hype around Deep Learning Models, and AI as a Service APIs, there’s still a need for Data Scientists to explain – in simple terms – what factors influence a given prediction. And even more importantly, sometimes we want to construct a model that represents a real-world process, rather than have input values feed into a programmatically optimized series of neural networks and produce a predicted value.
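To make the idea of a model that mirrors a real-world process a little more concrete, here is a minimal sketch in plain Python. It does not use STAN; it uses the closed-form Beta-Binomial update, which only works for this simple case (STAN is useful for models where no such shortcut exists). The conversion-rate scenario, the counts, and the function names are all hypothetical.

```python
def update_beta(prior_alpha, prior_beta, successes, failures):
    """Posterior Beta parameters after observing binomial data (conjugate update)."""
    return prior_alpha + successes, prior_beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical process: each site visitor either converts or does not.
prior = (1.0, 1.0)  # Beta(1, 1) is a flat prior over the conversion rate
posterior = update_beta(*prior, successes=12, failures=38)  # made-up counts
estimate = beta_mean(*posterior)

print(posterior)  # Beta parameters after seeing the data
print(estimate)   # posterior mean conversion rate
```

Every number in the result can be traced back to either the prior or the observed counts, which is the kind of plain-terms explanation the paragraph above is asking for.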


FAQs: Getting Started in Data Science

Frequently Asked Questions

Sometimes when I get asked a question I send an extremely well thought out response, which may or may not be appreciated by that person, but I feel like there might be people who would – maybe…

Regardless of my delusions of grandeur, I do want to start documenting these responses both to feed my massive ego and because I am hopelessly lazy. Since the one I get asked the most is “How do I get started in Data Science?” we might as well start there. So here’s my Fall 2017 go-to answer on how to get started quickly in Data Science.

The Fundamentals

No matter what you want to do in Data Science, if you’ve never actually done it professionally, you’ll need a few things.

  1. A LinkedIn page
  2. Your own website and associated email
  3. A GitHub account
  4. A willingness to spend time adding content to all the above

A Modified Comparison of Vegetable Costs

An article about the cost of fresh vs frozen or canned vegetables recently made the rounds on Gizmodo. The data was taken from the USDA 2013 Vegetables, cost per cup equivalent data set. The Bite.Gizmodo article did a decent job expounding on the data; however, it may have drawn an erroneous conclusion. The author presented all the data at once and stated there was not that much of a difference, which is a classic visualization mistake. You can see how messy all this data looks in my other article about embedding Excel with OneDrive. The data is so packed together that it’s impossible to draw any conclusions. However, with a quick pivot and some calculations, it becomes abundantly clear that, in the cases where the same vegetable is available in both fresh and frozen or fresh and canned, the fresh cost per cup is generally quite a bit higher. See the visualization below made in Tableau Public. Any positive value indicates the dollar amount per cup by which the fresh version of the vegetable exceeds its frozen or canned equivalent, respectively.
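For readers who want to try the pivot themselves, the calculation can be sketched roughly as follows. This is only an illustration in plain Python: the rows are made-up numbers, not USDA figures, and the function names are my own placeholders rather than anything from the original workbook.

```python
from collections import defaultdict

# Hypothetical (vegetable, form, dollars per cup equivalent) rows.
# These are made-up numbers for illustration, not values from the USDA data set.
rows = [
    ("Green beans", "fresh", 0.90),
    ("Green beans", "frozen", 0.60),
    ("Green beans", "canned", 0.50),
    ("Carrots", "fresh", 0.40),
    ("Carrots", "frozen", 0.55),
    ("Kale", "fresh", 0.70),  # no frozen or canned counterpart, so it drops out
]


def pivot_by_vegetable(rows):
    """Pivot long rows into {vegetable: {form: cost per cup}}."""
    table = defaultdict(dict)
    for vegetable, form, cost in rows:
        table[vegetable][form] = cost
    return dict(table)


def fresh_premium(table, other_form):
    """Fresh cost minus `other_form` cost, only where both forms exist."""
    return {
        vegetable: round(costs["fresh"] - costs[other_form], 2)
        for vegetable, costs in table.items()
        if "fresh" in costs and other_form in costs
    }


table = pivot_by_vegetable(rows)
vs_frozen = fresh_premium(table, "frozen")
vs_canned = fresh_premium(table, "canned")
print(vs_frozen)
print(vs_canned)
```

A positive value means fresh costs more per cup than the other form, matching the sign convention described above.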


Advanced R Programming (And Why that is Important)

Thanks to for directing me to this excellent compilation of R programming information by Hadley Wickham.

Most of us in analytics use R just to get stuff done. It has a large number of packages that let us produce insights at our own pace. What I mean by “at our own pace” is that we typically analyse data either individually or within our team, with the results communicated to our audience independently of the analysis itself.

This created a situation where programmatic efficiency was not considered a priority by many R analysts. If a function was sub-optimal, it would only cost us a little extra processing time; our end user would never know the difference. In fact, it made more sense to create a quick-to-code, slow-to-run solution, as the code would only be used a few times. Spending hours optimizing code would be considered a waste.

However, analysts are now crunching massive amounts of data. What would have caused a few minutes’ delay in processing a few years ago will now cause hours and hours of unnecessary waiting. Additionally, reducing processing time has become more valuable as we move toward elastic cloud computing, where more processing = more cost.
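As a rough illustration of that trade-off, here is a small sketch. It is written in Python rather than R purely for illustration, and the page-view data and function names are made up; the same contrast between a quick-to-write version and a single-pass version applies in R.

```python
from collections import Counter


def count_visits_quick(page_views):
    """Quick to write, slow to run: list.count rescans the whole list
    once for every distinct page."""
    return {page: page_views.count(page) for page in set(page_views)}


def count_visits_fast(page_views):
    """One pass over the data using a hash-based counter."""
    return dict(Counter(page_views))


page_views = ["home", "cart", "home", "checkout", "home", "cart"]  # made-up data

# Both versions give the same answer; only the amount of work differs.
assert count_visits_quick(page_views) == count_visits_fast(page_views)
print(count_visits_fast(page_views))
```

On six rows nobody would notice the difference. The first version does work proportional to the number of rows times the number of distinct pages, so the gap widens as the data grows.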

Most importantly, analytics is moving toward machine learning solutions, requiring the algo to be constantly updating and impacting automated decisions in close to real time. For example, predicting likely online customer behavior based on current site activity. Slow processing would make a very ineffective modeled solution.

Given these factors, we are rapidly reaching the point where an analyst’s programming abilities will be just as important as their statistical skill set. Over the next few weeks I plan to pick out a few relevant items from the Advanced R Programming review which will help us Analysts make our programs faster and easier to understand.

Emerging Data Science Architecture Patterns

The past month I’ve taken two EdX courses to brush up on Enterprise Data Integration Architecture. One on Active Directory Identity Management in Azure and the other on deploying data application interface services with C#.

What does that have to do with Data Science? Everything. At least that is my strong suspicion. This article explores industry trends towards “Enterprise” data science and how we can build our architectures to support very rapidly evolving Data Science Solutions.


When “Good Enough” Really Is Good Enough – Managing Perfectionism in an Imperfect World

Originally Published December 2012 on

How many times has this happened? You’re working on a report. You polished the visuals, showing the insights in elegant detail. You reworked some tables to be more concise, and it’s almost perfect! It’s also a week late.

The truth is, in most cases you don’t need to be perfect to be effective. The modern corporate workplace has layers upon layers of junior directors, directors, vice presidents, account executives, and project managers who will pick apart and revise any project you deliver. Every one of these people will give you feedback, and many of them will do so in a critical manner.

Unfortunately, often your job as a young professional is to start a project. From there, the rest of the team can comment, and you assimilate their feedback into the project and start the cycle all over again. They will criticize you. They may make you feel worthless. In the end, I promise, this process will lead to success. But first, you have to learn how to be comfortable with “good enough.” From my experience, here are a few steps to keep perfectionism in check:

First, Get It Done … Then Make It Better

Are you familiar with the Pareto Principle? It states that 80 percent of the effort comes from 20 percent of the project; or alternatively, 80 percent of the project can be completed with only 20 percent of the effort. Odds are your perfectionism is making you spend long hours on the extra 20 percent of the project, when it’s not really necessary. Regardless of whether it is a flowchart, budget, prospectus, creative, copy or financial statement, something can always be improved. An eleven-point font with nine-point line spacing would be more legible. And who puts borders around images anymore? While these are nice touches, they are not the important details, and in truth, most people won’t notice or care. However, they will care if you do not deliver on time. So, finish whatever it is you’re working on first, do it fast, and then concern yourself with making it better.

Accept Criticism as a Reality

It doesn’t matter what you do, someone isn’t going to like it. Don’t try to fight the inevitable. Be able to accept criticism as an opportunity for improvement. The idea that you will somehow achieve perfection on your own is, at best, delusional, and at worst, an enormous waste of everyone’s time. Perfection is an impossible goal in the modern workplace. Your superiors will want to change something about what you did as soon as they see it. It doesn’t matter how good you think your work was, they will want to change it. Embrace this and look at it as an educational opportunity. Learn what they like, and the next time you deliver a project, mention that you’ve incorporated their feedback. You will have improved yourself and they will feel justified. This is how you can turn “good enough” into “better.”

Manage Your Time, and Don’t Burn Out

Perfectionism can lead to procrastination. Perfectionists often take on attitudes that say, “If I can’t do it perfectly, why do it at all?” or “No one appreciates all my hard work!” For one thing, if it seems like no one appreciates your hard work, you may need to re-evaluate your priorities at work (that’s a topic for another article). But burning out is a huge risk for a budding young professional. To avoid burnout, time management is critical. Set deadlines for yourself and stick to them. More importantly, use techniques and applications like Google Calendar and pomodoro to organize tasks and eliminate stress. You will be amazed how much more enjoyable life is when you’re not trying to remember all the details of five projects at once. In the end, perfectionism has its place. If you don’t work hard, then you won’t be successful. But anyone is capable of working hard. The people who truly succeed are the ones who are smart enough to know that they have worked hard enough.

Notes from passing both GCP Cloud Architect and Data Engineer Professional Certifications in 30 days

Within 30 days I passed both the Google Cloud Platform Professional Data Engineer and Architect Certification exams.

However, it took me much longer than 30 days of study and experience to pass the exams.

Fortunately, there was a lot of overlap between the two exams, so if anyone else wants to put their personal life on hold for a few months and attempt something as crazy as passing two of the hardest cloud certifications in a short period of time, here are some tips to help you out.

First, the professional certifications are just as much about technical knowledge as they are about critical thinking – meaning you will not know the correct answer for many questions, but you might know the wrong answers. The test requires process of elimination. When you face a question that does not have an obvious answer, make sure to read the other answer choices to see if there are any obvious candidates for elimination.


Interpreting Patterns in Multi-Variate Multi-Horizon Time-Series Forecasts from Google’s Temporal Fusion Transformer Model

Wow that title is a mouthful… But it’s not as complicated as it seems. Let’s break it down:

  • Multi-Variate Time-Series Forecasts – Single-variate time-series forecasting uses only the historical values of the data for which we are attempting to predict future values. (For example, exponential decay, moving average, auto-regressive moving average.) Multi-Variate allows additional time-series and non-time-series variables to be included in the model to enhance the model’s predictive capability and give a better understanding of what influences our target predicted value(s). (For example, including the weighted seven-day moving average sentiment of news articles about a company when forecasting its stock price for tomorrow.)
  • Multi-Horizon Time-Series Forecasts – Traditional time-series forecasting is typically optimized for a specified number of periods ahead (for example, a produce department predicting next week’s potato sales to determine inventory). Multi-Horizon means we attempt to predict many different future periods within the same model. (For example, predicting daily potato sales for every day over the next four weeks to reduce the number of orders and schedule times for restocking.)
  • Interpreting Patterns – A good model doesn’t only provide an accurate prediction, it also gives insights as to what inputs are driving the results, that is, the model is interpretable.
  • Temporal Fusion Transformer – The name of the proposed Multi-Horizon Time-Series Forecasting framework. It combines elements of Long Short-Term Memory (LSTM) recurrent networks and a mechanism first used in image recognition called “Attention” (We’ll talk more about attention later).
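To ground the first two terms, here is a minimal sketch in plain Python of a single-variate moving-average forecast, followed by a naive way to stretch it over several horizons. The sales numbers and function names are made up for illustration. Feeding each prediction back in as if it had been observed is only the simplest way to get multiple horizons; the Temporal Fusion Transformer instead produces all horizons from one model, so treat this as a baseline for intuition and not as a picture of how that model works.

```python
def moving_average_forecast(history, window=3):
    """Single-variate, single-horizon: predict the next value as the mean
    of the last `window` observed values."""
    if len(history) < window:
        raise ValueError("need at least `window` observations")
    recent = history[-window:]
    return sum(recent) / window


def multi_horizon_forecast(history, horizons=4, window=3):
    """Naive multi-horizon forecast: roll the moving average forward,
    treating each prediction as if it had been observed."""
    extended = list(history)
    predictions = []
    for _ in range(horizons):
        next_value = moving_average_forecast(extended, window)
        predictions.append(next_value)
        extended.append(next_value)
    return predictions


daily_potato_sales = [10.0, 12.0, 14.0, 16.0, 18.0]  # made-up numbers

next_day = moving_average_forecast(daily_potato_sales)
next_days = multi_horizon_forecast(daily_potato_sales, horizons=3)
print(next_day)
print(next_days)
```

Neither function looks at anything except past sales, which is exactly what makes them single-variate. A multi-variate version would also take inputs such as price or weather.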

Continuous Delivery in Data Science

As I discussed in a previous article, Data Science is in desperate need of DevOps. Fortunately, there are finally some emerging DevOps patterns to support Data Science development. Databricks itself is providing much of it.

Two concepts keep popping up in the DevOps patterns: “Continuous Integration / Continuous Deployment” and “Test Driven Design” (moving toward “Behavioral Driven Design”, but that’s not a widely used term).

Databricks has a great article on how to architect dev/prod CI/CD:… This shows details such as how each developer has their own ‘development’ environment but with managed configurations and plugins to ensure all development occurs in the same configuration.

I would personally love to see data science engineering development patterns move into Test / Behavioral Driven Design – mostly because it makes things a lot easier on data end users; but also because it forces strict requirement definitions:…
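As a rough sketch of what test-driven work on a data transformation can look like, the requirements below are written as tests alongside the function they constrain. Everything here is hypothetical: the `daily_totals` function, the two requirements, and the sample rows are made up for illustration and do not come from the linked article.

```python
def daily_totals(sales):
    """Aggregate (date, amount) pairs into one total per date,
    rejecting negative amounts."""
    totals = {}
    for date, amount in sales:
        if amount < 0:
            raise ValueError(f"negative amount on {date}: {amount}")
        totals[date] = totals.get(date, 0) + amount
    return totals


def test_one_total_per_date():
    # Requirement: end users get exactly one row per date.
    sales = [("2020-01-01", 5), ("2020-01-01", 7), ("2020-01-02", 3)]
    assert daily_totals(sales) == {"2020-01-01": 12, "2020-01-02": 3}


def test_negative_amounts_are_rejected():
    # Requirement: bad input fails loudly instead of skewing the report.
    try:
        daily_totals([("2020-01-01", -1)])
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative amount")


test_one_total_per_date()
test_negative_amounts_are_rejected()
print("all requirements hold")
```

Writing the two tests first forces the requirement conversation (one row per date, no negative amounts) to happen before any pipeline code exists.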

It’s a little late to pick up on Humble Bundle, but Continuous Delivery with Docker and Jenkins by Rafał Leszko is a great read on the topic of Continuous Delivery, even if you are unfamiliar with the technologies.