Old Fashioned Web Analytics in a Newfangled Serverless World

In case you didn’t know, I started a podcast this year: No Manifestos.

One of the interesting things about podcasting is that it’s difficult to know who’s listening. This has even been suggested as the reason for the genre’s success, as it prevents the aggressive tracking and reductionist analytics that have made such a cesspool of the rest of the web.

But occasionally I am curious. At the least, I want to know, roughly, how many people are listening. How can I find out?

Podcasting has remained largely decentralized, even as the web has become increasingly centralized (see reason for success, above). Apple is probably the biggest single provider, but the subscription numbers I see in iTunes Connect are (hopefully) only a tiny fraction of actual listeners.

I don’t use one of the dedicated podcast-hosting services, so we’re back to old-fashioned analytics. You know, server logs. No cookies, no trackers, just raw HTTP requests.

As it turns out, server logging hasn’t changed much from when I started tinkering with websites in the late ’90s. I’m using an Amazon S3 bucket to host the MP3 files for my podcast. The Amazon S3 Access Log Format is basically an Apache-style log with a few S3-specific fields tacked on.

S3 stores its access logs in … S3. Which means I can use Athena to query them. So that’s convenient.

But what do I query? It’s surprisingly difficult to find information on how to analyze server logs. Web searches mostly yield ads for products that want to do the analysis for you (see the web is a cesspool, above). But some of those products are open-source, so I can examine them to see what they do.

Goaccess is one such tool. It defines a “unique visitor” as the combination of IP address, User Agent string, and date. After far too many hours tinkering with date/time parsing, I managed to express that to Athena as:

SELECT key, COUNT(DISTINCT(remoteip, useragent,
  date_trunc('day',
    date_parse(requestdatetime, '%d/%b/%Y:%H:%i:%S +0000'))))
FROM "MY_ATHENA_DB"."MY_LOGS_TABLE"
WHERE operation = 'WEBSITE.GET.OBJECT'
  AND key LIKE 'media/%'
GROUP BY key

The answer? Several hundred for each of the last few episodes. Not bad for something that I’ve only released sporadically.