So there I am at the eMetrics Summit, sitting with WAA President Richard Foley who also has the impressive title of World Wide Product Manager and Strategist for SAS Institute. He asks me what I’m going to talk about for my “Guru” (hate that word) session with Avinash and John Q and I respond with the Accuracy versus Precision thing. You know, that web analytics folks are generally far too obsessed with Accuracy when the data is really too “dirty” to support that obsession.
Well, don’t you know, (and this is 90 minutes before the Guru gig, but I have a Track presentation first), Richard responds, “Web Data isn’t dirty, it’s some of the cleanest data around.”
Hmmm, I think. This has to be another one of those Marketing / Technology Interface things. Clearly a semantic rift of some kind. But he’s a SAS guy, so there must be substance behind this statement!
So we spend the next half hour or so Drilling Down into the meat of the issue. Turns out none of his analysts would call web data “dirty” because it’s created by machines, don’t you know. No mistakes. Data is “clean”. You haven’t seen dirty data until you start looking at human keystroke input, for example. Think large call centers. Or how about botched data integration projects. Millions of records with various fields incomplete or truncated. That’s dirty data.
Dirty, from both an Operational and Marketing perspective, you see. But web server logs, they might be dirty from a Marketing perspective, but they’re not dirty from an Operational perspective. They just are what they are; super-clean records of what the server did or the tag read or the sniffer sniffed.
OK, I’m with Richard on this idea, having seen some horrendously dirty data in my time by his definition. So what do we call web data, if it’s clean? Even a 404 Error isn’t really “dirty”, right? It sure is dirty from a customer / user perspective; but from an already widely-used Operational / BI definition, it’s not dirty, it just “is”.
So how do we get to this idea of all the problems with web data that can lead an analyst down the wrong track if they focus so much on Accuracy they never get Precision? You know, cookie deletion, network serving errors, crashing browsers, multiple users of a single machine, single users of multiple machines, tabbed browsing, etc. etc. etc.? What do we call that kind of data, if not dirty?
We start going through all the lingo, like trying on different sets of clothes, looking for something that fits. What other kinds of data are like web data? What is the precise nature of the “problem” with web data? We finally arrive at the notion of Incomplete that seems to fit pretty well. It’s not that the data is dirty, it simply is often “not there” for the end user or analyst, as in missing a cookie, or serving a page that is never rendered in the browser, or a tag that never gets to execute properly.
But that’s not quite it, we decide, because a solution for “incomplete” data has been around for a long time – modeling. As long as you can get a set of reliable data, you can interpolate or “fill in” the missing data, right? Like is often done with geo-demographic modeling?
There’s a word, we think – “reliable”. Web data is certainly not reliable, but that’s not quite it. Why is it not reliable?
Well, because at a fundamental level, the incompleteness is Random, so it cannot be modeled very well.
And there we have it.
Web data is not dirty, it is Randomly Incomplete. A label that works for both the Marketing and Technology folks at the same time. A beautiful thing, don’t you think? A great example of being a little “less scientific” on the Technical side and a little “more specific” on the Marketing side, I think. We wrastled it to the ground.
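For the BI folks who like to see it in numbers, here is a toy sketch of that reasoning. Everything in it is an assumption made up for illustration (the daily visit counts, the loss rates, the `corrected_errors` helper); none of it comes from the conversation with Richard or from real traffic. It compares two cases: data loss at a steady rate, which a simple model calibrated on a few known days can fill in, and data loss that swings from day to day with no pattern, which the same model cannot fix.

```python
import random


def corrected_errors(true_visits, loss_rates, calibration_days=5):
    """Scale observed counts by a capture rate estimated on the first few days.

    Returns the absolute percentage error of the corrected count for each
    remaining day. Assumes the true counts are known for the calibration days.
    """
    observed = [t * (1 - r) for t, r in zip(true_visits, loss_rates)]
    # Share of visits that survived into the data during calibration.
    capture = sum(observed[:calibration_days]) / sum(true_visits[:calibration_days])
    errors = []
    for t, o in zip(true_visits[calibration_days:], observed[calibration_days:]):
        estimate = o / capture  # "fill in" the missing visits
        errors.append(abs(estimate - t) / t * 100)
    return errors


rng = random.Random(42)  # fixed seed so the run is repeatable
days = 30
true_visits = [rng.randint(8000, 12000) for _ in range(days)]

# Case 1: the same 20% of visits goes missing every day.
stable_loss = [0.20] * days
# Case 2: anywhere from 5% to 45% goes missing, with no pattern.
erratic_loss = [rng.uniform(0.05, 0.45) for _ in range(days)]

stable_err = corrected_errors(true_visits, stable_loss)
erratic_err = corrected_errors(true_visits, erratic_loss)

print(f"steady loss:  mean error {sum(stable_err) / len(stable_err):.1f}%")
print(f"erratic loss: mean error {sum(erratic_err) / len(erratic_err):.1f}%")
```

With steady loss the corrected counts land on the true counts almost exactly. With erratic loss the same correction leaves a sizable error on most days, which is the sense of "cannot be modeled very well" meant here.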
So I rush off to change the phrase “data is dirty” in my Guru presentation to “data is Randomly Incomplete”. The panel is right after my Track presentation, so I rush up on stage with Avinash and John Q. We’re late so Avinash starts right away; we don’t even have time to mention to each other what we will be presenting.
Avinash is riffing on Creating a Data Driven Boss and his Rule #2 is:
Embrace Incompleteness
Yikes. That’s some coincidence, don’t you think?
But more importantly, do you think web data is dirty, Randomly Incomplete, or some other definition? Because if there are no objections, I’m moving from “dirty” to “Randomly Incomplete” – at least when I talk with BI folks!
Great blog.
INCREDIBLE book! Thank you so much!
I respectfully disagree on the “Randomly” part. Rest is spot on.
Why disagree?
Fundamentally the data being incomplete is not random. I’ve had (way more than…) sufficient exposure to crypto to appreciate the horrendous issues with getting truly random data.
As for Analytics? There are actual rules and conditions around why the data is incomplete. Cache. Cookies. Javascript woes and so on. It’s a highly complex problem with lots of unknowns.
Unknowns != Random! :-)
I would find “incomplete”, as a standalone phrase, adequate to the task of explaining the problems. But perhaps another descriptive leader than Randomly may be appropriate? It’s too early in the morning here to engage my brain to come up with something else.
Thoughts? Cheers!
– Steve
You see, with these technology guys, they are always trying to hold you to a specific definition…and I take it unless the data is truly random by the mathematical definition, “random” is not a satisfactory qualifier.
How about Erratically Incomplete?
“And it’s not like I’ve ever seen marketing folk spend weeks arguing over a word or phrase or colour. Heaven forbid! ;-)”
Thanks Steve…
We now have a vote for “Variably” due to “Erratically” being too close to “Randomly” in meaning – at least in Australia!
Variably Incomplete is OK, but “Variability” can sometimes be predicted and dealt with in a model.
How about going back to the original premise of modeling not being an answer to this problem, and going with:
Unpredictably Incomplete
What say you Steve? Anyone else?
Sorry for delay. Been meaning to write. Best intentions etc. :-(
Unpredictably works for me. I remain unconvinced we (at some point in the future?) can’t predict and deal with this in some sort of model. But we certainly can’t right here and now.
As I mentioned in my email you quoted. :-) I stayed away from the double negatives, but … well… best of a bad bunch?
Cheers!
– Steve