What is Diagnostic Analytics?
Welcome back to Data Science Wednesday! On this week's episode, Decisive Data's Lead Data Scientist Tessa Jones goes over what diagnostic analytics really is.
Hello, and welcome to Data Science Wednesday. My name is Tessa Jones, and I'm a data scientist at Decisive Data. And I'm here today to talk about diagnostic analytics. Diagnostic analytics really fits into this spectrum of analysis, really going from basic to more complex. Descriptive analytics, we've talked about before in a previous episode, so if you're interested in learning more about descriptive analytics then you're more than welcome to watch it. But the important part about descriptive analytics is it really launches us into these more advanced analytics data science phases.
Today we're gonna talk about diagnostic analytics, which is really the most abstract of any of these phases of analysis. And it really answers the question of why. Why are things happening? What's driving things to go up, or down, or anything along those lines? So in order to have this conversation let's imagine that we are grocery store owners, and we wanna know what's causing our revenue for any given product to go up or go down so that we know how to stock our shelves. So first, we're gonna find correlations in our data, because we wanna know what's driving the revenue. What's causing it to go up or down? So we're gonna pick out a bunch of features that we think might impact that revenue, and then for each one we're gonna basically find a number that tells us whether it's highly correlated or weakly correlated.
So in this example, we're gonna take time of year, commercial airtime, Twitter mentions, location, and shelf height, and then we're gonna put them in our big box of statistical methods, and it's gonna spit out a number between zero and one for each of them. One would mean that it's very highly correlated and zero would mean that it's really not correlated at all. In this example, we see that time of year has a really high correlation, as do location and shelf height. And Twitter mentions are also somewhat correlated. So let's think about this graphically and visually.
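To make the "big box of statistical methods" a little more concrete, here is a minimal sketch of one thing that could be inside it. It assumes the Pearson correlation coefficient, which the talk does not name, and it takes the absolute value so the score lands between zero and one as described above. The weekly numbers and the feature names are made up purely for illustration, and only numeric features are shown, since something like location would need to be encoded first.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, a value between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly observations for one product (not real data).
revenue = [120, 150, 170, 210, 260, 300]
features = {
    "twitter_mentions": [5, 9, 8, 14, 18, 21],
    "shelf_height_cm": [30, 60, 60, 90, 120, 150],
    "time_on_market_weeks": [12, 3, 20, 7, 15, 9],
}

# Absolute value: 0 means no linear relationship, 1 means a perfect one.
strength = {name: abs(pearson(values, revenue)) for name, values in features.items()}

# Print the features from most to least correlated with revenue.
for name, score in sorted(strength.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Dropping the sign is a simplification: it tells you how strongly a feature moves with revenue, but not whether it pushes revenue up or down.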
So here's an example. We have Twitter mentions and sales, and we can see that the more Twitter mentions you have the higher the sales go, as you can see from this line here. And these dots out here: the smaller their distance from this line, and the more clustered they are around it, the more confident we are that this correlation is in fact real. In contrast, we go down here and we see time on market, and we plot it relative to sales. And these points are kind of all over the place and the line is flat. This is just clearly not correlated. Time on market really doesn't impact the revenue. And this is all really good to know because we need to know what drives revenue in order to build a good predictive model, and to understand what's really driving all of our data in general.
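The two plots can be sketched in numbers as well. The snippet below assumes an ordinary least-squares line, which is one common way to draw "this line here" but is not confirmed by the talk, and it uses R-squared as a rough measure of how tightly the dots cluster around that line. The data and the helper name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are the vertical distances from each dot to the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical observations (not real data).
sales = [120, 150, 170, 210, 260, 300]
twitter_mentions = [5, 9, 8, 14, 18, 21]
time_on_market = [12, 3, 20, 7, 15, 9]

tw_slope, tw_intercept, tw_r2 = fit_line(twitter_mentions, sales)
tm_slope, tm_intercept, tm_r2 = fit_line(time_on_market, sales)

print(f"Twitter mentions: slope={tw_slope:.2f}, R^2={tw_r2:.2f}")
print(f"Time on market:   slope={tm_slope:.2f}, R^2={tm_r2:.2f}")
```

An R-squared near one means the dots sit close to the line, and an R-squared near zero means they are scattered all over the place, which matches the two pictures described above.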
So let's visualize some of this for a minute. If you're starting to go into diagnostic analytics and data science you probably have a pretty good descriptive analysis base. So you probably have a pretty good dashboard. So here we're gonna go into an example where you have revenue by different departments, like pantry, and dairy, and meat. And if you have a really good visualization you click on it and then it shows you the revenue of all your products in that department. And then let's say you're interested in your top seller. And you click on it and you see that over time in a given year, your sales have this weird dip where it's really high on either side, but then it's really low in between. Which makes sense because we've identified that the time of year is correlated to revenue, which means that for a given time of year you're gonna sell more than at another time of the year.
So let's kind of like drill down into that a little bit more. So let's take an example of cold and hot cereal. So on this axis we have the months of the year, and we see that during the summer months the cold cereal sells really well, and the oatmeal doesn't sell so well. But the oatmeal sells really well in the winter months, which totally makes sense because it's cold out, you want hot cereal, or it's really hot out and you want something cold. So this is really good information to have for our next steps of analysis.
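As a rough sketch of that drill-down, the snippet below compares average summer and winter sales for each product. The monthly figures are invented for illustration, and the season definitions assume Northern Hemisphere months, which the talk does not spell out.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical units sold per month (not real data).
monthly_sales = {
    "cold_cereal": [40, 45, 55, 70, 85, 100, 110, 105, 80, 60, 45, 40],
    "oatmeal": [100, 95, 80, 60, 45, 30, 25, 30, 50, 70, 90, 105],
}

# Assumed season definitions (Northern Hemisphere).
SUMMER = {"Jun", "Jul", "Aug"}
WINTER = {"Dec", "Jan", "Feb"}

def season_average(sales, season):
    """Average of the monthly sales that fall in the given set of months."""
    picked = [value for month, value in zip(MONTHS, sales) if month in season]
    return sum(picked) / len(picked)

summary = {
    product: {
        "summer": season_average(sales, SUMMER),
        "winter": season_average(sales, WINTER),
    }
    for product, sales in monthly_sales.items()
}

for product, averages in summary.items():
    print(f"{product}: summer={averages['summer']:.1f}, winter={averages['winter']:.1f}")
```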
So then we come back to this and we say, "Wow, you know, location was really important too." So let's look at that one. So we dive into that and we break it down into the Northeast and the Southwest. And we see that in the Northeast there is a pretty dramatic curve here. Like, hot cereal sales went really low in the summer and really high in the winter. And, you know, cold cereal had the exact opposite trend, whereas in the Southwest that variation was a lot smaller. It's still there, but a lot smaller. So this tells us that, you know, in the future anything that we model really needs to go down to the level of location, or ideally even a store.
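One hedged way to put a number on "a lot smaller" is to measure the seasonal swing per region: the gap between the best and worst month, relative to the average month. The metric, the helper name `seasonal_swing`, and the monthly hot cereal figures below are all assumptions for illustration, not something taken from the talk.

```python
# Hypothetical hot cereal units sold per month, January through December.
hot_cereal_sales = {
    "Northeast": [110, 100, 80, 55, 35, 20, 15, 20, 45, 70, 95, 115],
    "Southwest": [70, 68, 64, 60, 56, 52, 50, 52, 56, 60, 66, 70],
}

def seasonal_swing(sales):
    """Peak-to-trough range divided by the mean; bigger means more seasonal."""
    mean = sum(sales) / len(sales)
    return (max(sales) - min(sales)) / mean

swing = {region: seasonal_swing(sales) for region, sales in hot_cereal_sales.items()}

for region, value in swing.items():
    print(f"{region}: swing={value:.2f}")
```

If the swing differs this much between regions, a single company-wide seasonal pattern would fit neither of them well, which is the argument for modeling at the level of location.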
So diagnostic analytics. Why are things happening? What's causing things to go on? That's what it's really all about. It's a very explorative phase to look at what's causing things to happen. In this example, we see that the time of year and the location are two really important things that we wanna keep in mind when we're going forward into predictive and prescriptive analysis.
So if we bring it back, we have descriptive analytics as a start where we're really wrangling our data, we clean it, we relate it, we visualize it. We start to get a feel for what's going on in our business and then we start to get to the cool stuff where we're doing some diagnostic and some data science stuff. And so we do some drill downs, some data discovery, we find some correlations, we do things of that nature. So that's diagnostic analytics. Thanks for joining us today, and join us next week for Data Science Wednesday and we're gonna talk about predictive analytics.
Posted by Gage Peake