Web Data: Randomly Erratically Variably Unpredictably Incomplete?

So there I am at the eMetrics Summit, sitting with WAA President Richard Foley, who also has the impressive title of World Wide Product Manager and Strategist for SAS Institute. He asks me what I’m going to talk about for my “Guru” (hate that word) session with Avinash and John Q, and I respond with the Accuracy versus Precision thing. You know, that web analytics folks are generally far too obsessed with Accuracy when the data is really too “dirty” to support that obsession.

Well, don’t you know (and this is 90 minutes before the Guru gig, but I have a Track presentation first), Richard responds, “Web Data isn’t dirty, it’s some of the cleanest data around.”

Hmmm, I think.  This has to be another one of those Marketing / Technology Interface things.  Clearly a semantic rift of some kind.  But he’s a SAS guy, so there must be substance behind this statement!

So we spend the next half hour or so Drilling Down into the meat of the issue. Turns out none of his analysts would call web data “dirty” because it’s created by machines, don’t you know. No mistakes. Data is “clean”. You haven’t seen dirty data until you start looking at human keystroke input, for example. Think large call centers. Or how about botched data integration projects? Millions of records with various fields incomplete or truncated. That’s dirty data.

Dirty, from both an Operational and Marketing perspective, you see. But web server logs? They might be dirty from a Marketing perspective, but they’re not dirty from an Operational perspective. They just are what they are: super-clean records of what the server did or the tag read or the sniffer sniffed.

OK, I’m with Richard on this idea, having seen some horrendously dirty data in my time by his definition.  So what do we call web data, if it’s clean?  Even a 404 Error isn’t really “dirty”, right?  It sure is dirty from a customer / user perspective; but from an already widely-used Operational / BI definition, it’s not dirty, it just “is”. 

So how do we get to this idea of all the problems with web data that can lead an analyst down the wrong track if they focus so much on Accuracy they never get Precision?  You know, cookie deletion, network serving errors, crashing browsers, multiple users of a single machine, single users of multiple machines, tabbed browsing, etc. etc. etc.? What do we call that kind of data, if not dirty?

We start going through all the lingo, like trying on different sets of clothes, looking for something that fits.  What other kinds of data are like web data?  What is the precise nature of the “problem” with web data?  We finally arrive at the notion of Incomplete that seems to fit pretty well.  It’s not that the data is dirty, it simply is often “not there” for the end user or analyst, as in missing a cookie, or serving a page that is never rendered in the browser, or a tag that never gets to execute properly.

But that’s not quite it, we decide, because there has been a solution for “incomplete” data around a long time – modeling. As long as you can get a set of reliable data, you can interpolate or “fill in” the missing data, right? As is often done with geo-demographic modeling?
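(For the curious, here’s a minimal sketch of that “fill in from the reliable records” idea – toy data, invented column names, just the shape of the technique, not anyone’s production method.)

```python
import numpy as np
import pandas as pd

# Toy visit data; "segment" and "pages" are invented column names.
# NaN stands in for a record that never arrived (lost tag, blocked cookie).
visits = pd.DataFrame({
    "segment": ["search", "search", "email", "email", "email"],
    "pages":   [4, np.nan, 7, np.nan, 6],
})

# "Model" the gap from the records we trust: here, just a per-segment mean.
fill_values = visits.groupby("segment")["pages"].transform("mean")
visits["pages_filled"] = visits["pages"].fillna(fill_values)

print(visits)
```

The catch, as we’re about to stumble onto, is that this kind of fill-in only works when the holes follow a pattern the reliable records can speak to.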

There’s a word, we think – “reliable”.  Web data is certainly not reliable, but that’s not quite it.  Why is it not reliable?

Well, because at a fundamental level, the incompleteness is Random, so it cannot be modeled very well.

And there we have it. 

Web data is not dirty, it is Randomly Incomplete.  A label that works for both the Marketing and Technology folks at the same time.  A beautiful thing, don’t you think?  A great example of being a little “less scientific” on the Technical side and a little “more specific” on the Marketing side, I think.  We wrastled it to the ground.

So I rush off to change the phrase “data is dirty” in my Guru presentation to “data is Randomly Incomplete”.  The panel is right after my Track presentation, so I rush up on stage with Avinash and John Q. We’re late so Avinash starts right away; we don’t even have time to mention to each other what we will be presenting.

Avinash is riffing on Creating a Data Driven Boss and his Rule #2 is:

Embrace Incompleteness

Yikes.  That’s some coincidence, don’t you think?

But more importantly, do you think web data is dirty, Randomly Incomplete, or some other definition?  Because if there are no objections, I’m moving from “dirty” to “Randomly Incomplete” – at least when I talk with BI folks!

On the eMetrics / Marketing Optimization Summit

I had to bolt the Summit a day early to speak at the Direct Marketing Association annual conference in Chicago. Too bad – the conference was humming and there was a ton of great content, along with the usual great people.

The most interesting trend going on (for me, remember I favor a behavioral approach to marketing, online and off) is the killing off of e-mail subs once they become unresponsive.  The most excellent Jay Allen from Cutter and Buck kills them off at 6 months because he simply gets more pain than gain from mailing them – basically zero response and lots of spam complaints after 6 months dormant.  Reputation management, don’t you know. 

Hard to figure out why more people don’t do this, but I have a good guess – folks simply can’t (or don’t) segment behaviorally so they can’t really see where the sales come from.  If they could, they’d kill off the “haven’t opened in 6 months” subs too.  These e-mail “purge” practices are simply a manifestation of the reality of Engagement – there is a time-based predictive element that tells you when it is over. 

The smartest marketers will realize they can predict this degradation of the relationship and take action before it is too late – in other words, before 6 months of no opens. Check with your (offline?) BI folks for any patterns that might be useful in managing these LifeCycles; hopefully they have seen these patterns before. Use segmentation; source of customer is highly predictive of these patterns, as is entry / first content and first purchase product.
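Here’s a rough sketch of both rules – the blunt 6-month purge and a per-source early warning – in pandas. The field names, the toy dates, and the 75th-percentile trigger are my inventions for illustration, not Jay’s actual implementation.

```python
import pandas as pd

# Toy subscriber table; "source" and "last_open" are invented field names.
subs = pd.DataFrame({
    "email":     ["a@x.com", "b@x.com", "c@x.com", "d@x.com"],
    "source":    ["search", "search", "affiliate", "affiliate"],
    "last_open": pd.to_datetime(["2007-09-01", "2007-03-15",
                                 "2007-08-20", "2007-05-01"]),
})

today = pd.Timestamp("2007-10-01")
subs["days_dormant"] = (today - subs["last_open"]).dt.days

# The blunt rule: suppress anyone with no opens in 6 months (~180 days).
subs["suppress"] = subs["days_dormant"] > 180

# The smarter rule: learn a per-source warning threshold from your own
# data and act before the 180-day cliff; the 75th percentile of dormancy
# within each acquisition source is one plausible trigger.
warn_at = subs.groupby("source")["days_dormant"].transform(
    lambda s: s.quantile(0.75))
subs["act_now"] = (subs["days_dormant"] > warn_at) & ~subs["suppress"]

print(subs)
```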

Beware: the average LifeCycle of an interactive relationship is typically quite short compared with offline. For example, catalogs can get decent ROI mailing all the way out to customers who have been dormant for 2 years. In TV shopping, we considered folks dormant at about 6 months. Online, the majority of the value is generated in the first 3 months. Put another way, in catalog you get an 80 / 20 Pareto – 80% of the value from 20% of the customers. In TV shopping, more like 90 / 10. Online, 95 / 5.
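If you want to know which Pareto you’re living in, the arithmetic is straightforward: rank customers by revenue and read off the cumulative share. A quick sketch with simulated data – swap in your real per-customer revenue where the toy Series is.

```python
import numpy as np
import pandas as pd

# Toy per-customer revenue with a long tail, standing in for real orders.
rng = np.random.default_rng(0)
revenue = pd.Series(rng.pareto(1.5, size=10_000) + 1)

# Rank customers best-first and compute the cumulative share of revenue.
ranked = revenue.sort_values(ascending=False)
cum_share = ranked.cumsum() / ranked.sum()

for top in (0.05, 0.10, 0.20):
    n = int(len(ranked) * top)
    print(f"top {top:.0%} of customers -> {cum_share.iloc[n - 1]:.0%} of revenue")
```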

In the end, this behavioral knowledge ties directly to the “customer experience” idea so many people comment about in vague prose but never quantify.  You have sales people, products, procedures, and business rules that create customers likely to defect.

Sure, you have online customers that stick.  But the percentage of those that stick is smaller, and since they generate huge sales volume, it’s incredibly important to pay attention to what they are doing behaviorally.  You can predict when they will defect by the parameters mentioned above; isn’t it your responsibility to take action on this knowledge?

For the Brand folks out there, Rachel Scotto from Sony Pictures also kills off her e-mail subs after 6 months of no opens, a rule that varies a bit with the type of list and topic (movie, TV show, etc.). For her, Brand is everything and she simply does not want the negative experience of unwanted e-mails to tarnish the Brand. If someone demonstrates through their behavior that they are no longer interested, then why continue to send them e-mails? Good question. Brand folks, please respond.

Jay also had a great shopping cart recovery example. They e-mail folks who abandon carts with a simple, subtle message featuring the product and no discount – and get a fabulous response. The folks sending discounts in this kind of program really need to do some controlled testing – they are giving away the store.

I’ve had a lot of positive feedback on my Summit presentations and I thank you for that.  Feel free to leave any comments or questions.

That’s it on the eMetrics / Marketing Optimization Summit from me.  Between WAA stuff and speaking / travel logistics I did not get to see many presentations, but the ones I did see demonstrated significant progress in grasping and leveraging visitor behavior.

On Engagement

I’ve had some bad luck with connecting to the web lately, trying to catch up on blog posts as the latest trip winds down.

The panel on Engagement at the WebTrends customer meeting was a lot of fun – probably best described as “productive friction”, if I’m forced to pick a phrase.

Based on comments from the audience, the panel was quite useful in terms of vetting some of the ideas floating around out there and answering their burning question, “Am I missing something here?  Why should I care about this engagement thing?”

This in itself is an interesting issue: generally, the audience perceives “engagement” as yet another buzzword of the week that, like most buzzwords, is simply another word for stuff most of the audience deals with all the time – namely customer service and retention, or customer “experience” if you prefer last week’s buzzword. This was the insight I gained from the well-lubricated crowd at the party after the panel, so take it with the appropriate grain of salt. Do people tend to say what they really think after a few drinks? Or were they just tired of talking about web analytics the whole day?

Some of the more interesting discussion among the panelists actually took place right before and after the panel, when we first had a chance to really explain our positions and then challenge each other to defend them. Great conversation.

For what it’s worth, here’s a breakdown of what I thought I heard being said.  My perception and reality may of course be different and I encourage participants to correct any misperceptions I may have had!

Andy Beal – as the only “generalist” on the panel, I think Andy was a bit steamrolled by the hard core “get the facts” thing web analytics folks do.  He maintained web analytics could measure only one area of customer engagement with a company (the web), and that you would never get the full picture of engagement because some of it is unmeasurable.  Probably true in a strict sense, though I bet there’s a lot that can be measured on the web through customer conversations and so forth.  However, we left this “can’t be measured” question to simmer, because the rest of the panel and the audience wanted to talk about web analytics so that was what we were going to do.

Anil Batra / Myself – I’ll go out on a limb and say our positions were very similar; I’m sure Anil will chime in. Basically, the formula is this:

The difference between Measuring Activity and Measuring Engagement is Prediction.

In other words, when you start using the word Engagement, you are implying “expected” activity in the future, with this expectation or likelihood being valued or scored with a prediction of some kind. Activity without an implication of continuity is simply Activity; it’s history and stands alone. Same stuff web analytics has always done, nothing new.
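To make the distinction concrete in code: counting visits is Activity; scoring the likelihood of a future visit is Engagement. A minimal sketch with toy data and invented features (recency and frequency) – one plausible way to build such a score, not the definition of it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training set: one row per visitor. Features are recency (days since
# last visit) and frequency (visits in the last 90 days); the label is
# whether the visitor came back in the following month.
X = np.array([[3, 12], [10, 6], [45, 2], [80, 1], [5, 9], [60, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Activity is the tally (e.g. 8 visits); Engagement is the forward-looking
# score: the predicted probability of future activity.
print(model.predict_proba([[7, 8]])[0, 1])
```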

Jim Sterne – Jim was a bit more global in his thinking as you might expect, and seemed to be concerned more about how Engagement fits into the greater Marketing picture rather than looking to hang parameters on it.  How Engagement is related to Customer experience and Brand, how it does or does not turn into Loyalty, and so forth.

Gary Angel / Manoj Jasra – not sure either of these fine folks fully buy into the “prediction” requirement Anil and I support, though they might be talked into it. Gary and I had a long conversation, which included June Dershewitz, after the panel, where we traded examples and generally wrestled over what I would call the “advertising / duration conundrum”.

I maintain advertising is an outlier in this discussion, which is strange since those folks basically started this whole engagement thing and stoked the fire hard with the Duration variable that got web analytics folks in general so pissed off. Not sure Gary or Manoj will ever accept Duration in any form as a measure of Engagement, whereas I maintain that if you isolate Advertising as a unique conversation, it makes a lot of sense. The reality of buying online display ads is that you need an absolute standard or the networks and the buying process absolutely fall apart; you simply cannot look at a unique Engagement metric for every site or the buy would never get done. So you hold your nose, say Duration is important to advertising as a metric, and do the deal.

In other words, there is a huge difference between being Engaged with a site and being Engaged with an ad on the same site. These are two completely different ideas, and unless you believe that Engagement with a site always spills over to Engagement with the ads on the site (I do not), these two ideas deserve two different treatments.

June wanted to get into it all over again at the eMetrics Summit… feel free to post your comments here, June!