New DataFlow Job Metrics vs. StackDriver

GCP is promoting new “DataFlow observability” capabilities that finally give us time series graphs of CPU utilization and throughput for a given DataFlow job, right inside the DataFlow console.

https://cloud.google.com/blog/products/data-analytics/better-data-pipeline-observability-for-batch-and-stream-processing

Before, we used StackDriver (which is getting rebranded, by the way) to view the VM CPU utilization of our DataFlow jobs. The new DataFlow capabilities do not replace StackDriver's aggregate monitoring and alerting; rather, StackDriver and the observability features serve different use cases. Observability is geared toward debugging and optimizing individual DataFlow jobs, while StackDriver is for holistic tracking and monitoring. In other words, the DataFlow UI is for job-specific DataFlow ops; StackDriver is for administration.

[Screenshot: the Job details page, showing the JOB GRAPH and JOB METRICS tabs and a BACK TO OLD JOB PAGE link]

Those of us who have used StackDriver appreciate the added visibility into CPU and throughput, as all we had in the console before was the resource metrics panel to the right of the job topology.

I did a quick run of the standard wordcount example to generate some data. The new graphs are simple and to the point. I like them.

[Screenshot: Throughput (elements/sec) chart for the wordcount job on Mar 2, 2020, from about 7:37 to 7:40 PM, with one line per step (group/Read, group/Reify, group/Write, split, read/Read, and more), legend values around 261.9/s and 50.97/s, and a "Create alerting policy" link]

Now we can see specifically which ops are taking the most IO and CPU for a given job, without the overhead of creating a new StackDriver dashboard or filtering down to a specific job. In fact, there’s no real way to get this level of visual detail out of the box in StackDriver. (At least not that I’m aware of; let me know if there is a simple configuration setting I’ve been overlooking!) In StackDriver the minimum alignment period is 1 minute, so the best we can do is see operation counts or vCPUs per minute. In the new DataFlow UI we can see throughput and vCPU per second.
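To make the granularity point concrete, here is a minimal Python sketch of what a 1-minute alignment does to per-second data. The numbers are made up for illustration (they are not from my wordcount run), and `align_mean` is a hypothetical helper that simply averages fixed-size windows; it is only meant to mimic a mean-style alignment, not StackDriver's actual implementation.

```python
from statistics import mean


def align_mean(samples, period):
    """Average consecutive windows of `period` samples (one sample per second).

    Hypothetical helper: a rough stand-in for a mean-style alignment.
    """
    return [mean(samples[i:i + period]) for i in range(0, len(samples), period)]


# Made-up per-second throughput (elements/sec) for two minutes:
# a steady 250/s, with a 5-second stall early in the second minute.
per_second = [250.0] * 120
for t in range(70, 75):
    per_second[t] = 0.0

per_minute = align_mean(per_second, 60)

# At per-second resolution the stall is obvious (throughput drops to zero).
print("lowest per-second value:", min(per_second))

# After 1-minute alignment the same stall is just a modest dip in one point.
print("per-minute values:", per_minute)
```

With these assumed numbers, the stall drops to zero in the per-second view but only shaves a few percent off the second per-minute point, which is easy to miss on an aggregate dashboard.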

For a StackDriver workflow, per-second detail is way too granular. When testing DataFlow jobs ahead of a large-scale deployment, however, that lower-level detail matters: it lets us inspect a job before rolling out something inefficient (and expensive).
