Processing Web Server Logs

How I process my web server log files. Tips for removing referral spam and how to track which pages are printed.

Introduction

When I started web publishing thirteen years ago, it was a highlight to get the monthly web trends reports emailed to you by the hosting company. They would process the web server log files for you, and it was great to see the activity reports, even though they were a month behind. Later on I learned how to use Analog to process the log files myself and produce reports on demand.

Then in November 2005 Google released Google Analytics and nearly every webmaster switched over to using this free web service. Web server log analysis fell out of fashion.

Now, in 2016, as more people use ad/tracking blockers like uBlock Origin and Ghostery -- both of which block the Google Analytics code -- web site owners are returning to web log analysis tools because Google Analytics under-reports site visitors.

Below I outline my preferred tool and how I track which web pages have been printed. I also cover how I remove referral spam from the log files before I process them.

GoAccess

My current tool of choice is a Linux command line program called GoAccess. GoAccess has minimal server requirements: it's written in C and only requires ncurses, with a couple of optional dependencies if you want GeoIP support.

The main idea behind GoAccess is being able to quickly analyze and view web server statistics in real time without having to generate an HTML report. But you can generate a nice looking report if you wish, and I do this using a cron job to generate the reports in the early hours of the morning.
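As an illustration, a cron entry along these lines could generate the report overnight. This is only a sketch: the log path, output path, schedule and log format are assumptions, so adjust them to match your own server.

# Run at 03:30 every morning; paths and log format are examples only
30 3 * * * goaccess -f /var/log/apache2/access.log --log-format=COMBINED -o /var/www/html/report.html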


Restricting the reporting to a given month

At the moment this web site doesn’t get that many visitors. Hopefully this will change over time. Due to the relatively low traffic I don’t need to rotate my log files very often and the log file usually contains entries that span multiple months. I do however want to restrict the reporting to the month I’m interested in.

To do this, I take a working copy of the access log file and use the Linux sed command to filter out just the lines containing the month I'm interested in. For example, to extract just the entries for March 2016 I would use:

sed -n '/Mar\/2016/p' access.log > march.log
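If you do this every month, it is easy to wrap in a small script. The script below is just a sketch; the script name, the access.log path and the assumption that the log uses the standard Apache date format (e.g. 10/Mar/2016) are mine:

#!/bin/sh
# Usage: ./month.sh Mar 2016
# Extracts the entries for the given month and year into e.g. mar-2016.log
MONTH=$1
YEAR=$2
sed -n "/${MONTH}\/${YEAR}/p" access.log > "$(echo "$MONTH" | tr 'A-Z' 'a-z')-${YEAR}.log"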

Removing referral spam

There is so much referral spam these days that if you don't remove it, your referrals report gets cluttered up with junk, making it harder to review which sites have legitimately sent you visitors.

So I use a series of sed commands to delete the referral spam from the log file I'm working on. The --in-place option tells sed to edit the log file and write the changes back to the same file. The d command at the end of the expression deletes any line that contains the text string between the two / characters.

sed --in-place '/buttons-for-website.com/d' march.log
sed --in-place '/success-seo.com/d' march.log
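As the list grows it can be easier to keep the spam domains in a plain text file and loop over it. This is only a sketch, and the spam-referrers.txt file name is something I've made up for illustration:

# spam-referrers.txt contains one domain per line, e.g.:
#   buttons-for-website.com
#   success-seo.com
while read -r domain; do
  sed --in-place "/$domain/d" march.log
done < spam-referrers.txt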

Every time I see a new spammy referral show up in the reports I add a new sed command to the list so that it is removed from future reports. I sometimes include Google and Bing in the list as well, to reduce the size of the report and see just the actual web sites which link to my site.

Tracking which pages have been printed

Most of my writing is how-to tutorials, so I like to see how many times a given article is printed. If someone thinks an article is worth printing, especially my teach your kids to code page, then it's validation to me that it was worth spending the time writing it.

To be able to track when a web page is printed you need to somehow create a request entry in your web server log file which you can later search for. I use the following print CSS in my web pages. The code loads a small single-pixel GIF image when a page is printed, and I pass a query string parameter indicating which page the image was requested from.

@media print {
  body:after {
    content: url(https://nhs.io/printed/p.gif?p=logs);
  }
}

I use sed again to search for all the lines in the log file which contain the text string p.gif and filter those lines out to a new file called printed.log.

sed -n '/p.gif/p' march.log > printed.log
cut -d ' ' -f6 printed.log | sort | uniq --count

I use the cut command to grab field six (-f6) from the log entries, which in my case is the actual file request. The -d option tells cut which character to use as the field delimiter; in this example it's the space character.

The output from the cut command is piped through the sort command to sort the requests alphabetically. The sorted output is then piped into the uniq command, which collapses the duplicate lines, and the --count option prefixes each line with the number of occurrences, giving me the total number of times a given page has been printed.
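If you want the most-printed pages listed first, the counts from uniq can be sorted numerically in reverse. This is just a small variation on the pipeline above:

cut -d ' ' -f6 printed.log | sort | uniq --count | sort -rn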