My Home Grown Sans-JavaScript Tracker Blog Stat Tool

2021-03-09 in SOFTWARE ENGINEERING

software engineering linux open source kotlin sqlite

14 min read

The one thing I missed about my old WordPress blog when I switched to a static Jekyll site is having statistics about my blog. I could have solved that by using Google Analytics or other tracking tools but a big part of what I was trying to do was get rid of all the trackers, JavaScript injection, and what I may overly aggressively label “spyware.” Looking at the Nginx log I thought there was enough in there to let me recreate a lot of those statistics once I worked through all the bot traffic. This also gave me the opportunity to create a Kotlin Multiplatform project that I could potentially one day migrate to a pure Kotlin Native application. I’m sure there are existing log processing tools out there that may have hit my requirements but I decided to do the usual programmer thing and just write my own. It’s running live on my blog now and generating the annual and monthly statistics I was looking for. The source code is up on GitLab for others that may want to use it as well. Now on to the details of the project.

Blogging Background

So let’s take a step back to the beginning of my blogging. I like sharing my thoughts and opinions on the internet. Before the internet I did that on AOL and BBSs as well. That would be in the form of newsgroups, forum discussions, chat rooms etc. In the early 2000s I tried doing it on LiveJournal. However it was when I began doing personal health experiments that I decided I should actually create my own blog. That’s where the title “N=1” comes from. I’m an experiment with one sample. It’s not so much scientific as methodical, if that makes sense. It’s not coming up with a control group and an experimental group and then studying the relative benefits of one over the other for a particular thing. That would be actual science. This is more experimentation in the colloquial sense, and exploring health. I had big plans to try to cycle through a series of eating lifestyles for a couple months at a time with blood and fitness tests before/after to see if I could feel and measure a big difference. That experiment never came to fruition for various reasons but I kept blogging. I then added just general writing, program writing, and the like so now it’s just a generic personal blog. It’d be shit for monetization and influencing if I cared about such things because it lacks focus and regular consistent updates but that’s not the purpose.

So for a simple blog I went with WordPress. I wanted to host it myself so rather than use WordPress.com I followed instructions to set up a nice small Linux machine and was on my way. Then there was the noise of all the bots in the comments section. So I installed a plugin for that. Then there were some other things I was looking for so I installed Jetpack, which is where those stats came from. Updates were a bit of a bitch but not too bad. Still there would be one attack after another against WordPress sites announced. I was good with my security and I literally only had three core plugins installed so I was mostly safe but I still found it a bit unsettling. Besides, whenever I’d post to Diaspora or Mastodon the flood of fediverse servers would cause hundreds of requests to come in over a very short period of time. That would make my server’s response too slow for a while. I could add caching or put a CDN in front but the whole point was to keep it simple. So I was teetering on the edge of making the jump to a static site system like Jekyll or Hugo. When I first started using DuckDuckGo’s Privacy Essentials plugin and saw how many trackers were embedded on my simple website despite me not setting them up (probably a bunch from Jetpack) I decided to finally make the leap.

Project Background

For two years I’ve had a simple blog where I can just type up my comment in Markdown, commit to my git repository and then the blog updates. It’s been perfect. Still, I liked seeing which posts people were going to. I liked seeing how many hits my blog was getting each month, even if it was just a couple dozen to a couple hundred. When I read articles about seeing which browsers or operating systems hit a site I’d like to have that sort of information too. But I don’t want to get it by installing tracker JavaScript files on my website. I like the fact my site renders perfectly in Lynx and non-JavaScript browsers. I like that it’s fast because of that too. There is a privacy aspect to that as well which I’m proud of. I also didn’t want to potentially increase the complexity of my setup. Setting up external tracking sites is just one more attack surface potentially. Setting up a local tool which talks to a database engine is again just more potential attack surfaces. It’s not that big of a deal but again if all I’m looking for is a little statistics work then why do I need that? So these were the requirements I set out for myself:

Able to generate a history of real blog hits to the blog posts themselves
Able to tell me which referrers sent visitors to the articles
Able to tell me which site navigation may have been used (like from another post or the Archive)
Able to tell me which browsers and OS’s visitors are from
No JavaScript tracking of any kind
No additional infrastructure systems to support this
No additional network connections to support this
No additional long running services to support this

Implementation

I chose Kotlin as the language because it’s my favorite and it would allow me to potentially write a small self-contained native application eventually. In the short term though I was more interested in getting things working so the current deployment is using a lot of JVM-specific Kotlin features. Where I can though it’s using pure Kotlin libraries. Where I can’t I use Kotlin Multiplatform’s expect/actual declarations to create the JVM-specific implementation, which will allow me to write a pure Kotlin or native platform version in the future without changing the rest of the code. Most of the code already is Kotlin common code. The only things requiring the JVM are the database wrapper, which uses JetBrains’ Exposed library and the JDBC SQLite adapter, and the SHA-256 string hashing algorithm. Lastly the command line program that drives this also uses the Java File walker. All of these could potentially be replaced with native components right now but it’d be more complicated and I wanted to get this up and running. So the main technologies used for this are:

Kotlin Multiplatform
SQLite database

That’s it. The first pass is pretty straightforward. The program parses each line of the Nginx log file into some processable constituents and stores them. I’m doing a SHA-256 hash of each line, which gives me an effectively unique key for each entry without storing the line itself. All of that gets stored in a SQLite file database for querying and processing. That’s the easy part. The first hard part is filtering out irrelevant information. First, I want statistics on my articles so I only want it tracking accesses to those posts, not any place on the website. If I had thought of this long ago I could have had a relative path like /posts/yyyy/mm/dd or /blog/yyyy/mm/dd but I didn’t so it’s just a bare path of /yyyy/mm/dd. That’s not too hard to parse around so that was quickly overcome. It’s the bot traffic and malicious actor requests that need some real filtering.
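To make the parse-and-hash step concrete, here is a minimal JVM sketch. It is not the project’s actual code: it assumes Nginx’s default “combined” log format, and the names ParsedLine, combinedFormat, parseLine and sha256Hex are made up for illustration.

```kotlin
import java.security.MessageDigest

// Hypothetical holder for the pieces of one log line; the project's real
// entry class may carry different fields.
data class ParsedLine(
    val ip: String,
    val timestamp: String,
    val method: String,
    val path: String,
    val status: Int,
    val referrer: String,
    val userAgent: String,
    val lineHash: String,
)

// Assumes the default Nginx "combined" format:
// ip - user [time] "METHOD path protocol" status bytes "referrer" "user agent"
val combinedFormat = Regex(
    """(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+ "([^"]*)" "([^"]*)".*"""
)

// Hex-encoded SHA-256 of the raw line, used as a key instead of the line itself.
fun sha256Hex(text: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(text.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

// Returns null for lines that do not match the assumed format (for example junk requests).
fun parseLine(line: String): ParsedLine? {
    val match = combinedFormat.matchEntire(line) ?: return null
    val (ip, time, method, path, status, referrer, agent) = match.destructured
    return ParsedLine(ip, time, method, path, status.toInt(), referrer, agent, sha256Hex(line))
}
```

Calling parseLine on each line of the log and skipping the null results would give the list of entries to filter and store.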

Going through the log manually I could see that 90% or more of the traffic hitting my site was bots and malicious actors. Those are not the same thing. Lots of bots are just indexing the site. When I post to the fediverse all those servers ask for a copy of the post too, just like Facebook and Twitter do. I only want statistics when a human is reading the post though. That’s where most statistics trackers inject some JavaScript code to work their magic, which I’m not going to do. However in there were tons of site attacks as well. Most of them were, ironically, WordPress attacks. Seeing if my admin page was wide open. Trying different backdoors. There were some SQL injection style attack attempts too. Talk about a good reminder of why I took the approach I did to simplifying my blog several years ago!

To distill out the relevant human accesses to my articles I created a series of Filter classes that can be used in succession to perform the distillation. It just takes some trial and error in configuration:

AccessLogEntryFilter is the extension class which takes a list of these filters and applies them against the AccessLogEntry to decide whether it is okay or not.
AllowedResponseCodeFilter looks at the response code in the Nginx log entry and only passes those of interest, for my blog simply 200 and 301.
ArticleYearPathFilter takes care of my blog structure problem by looking at request paths that begin with select blog years, like /2020 etc. I could have restructured my site to have /posts or /blogs but decided to just keep things status quo for now and write this simple filter. It’d also allow me to run statistics later on subsets of the blog if I chose.
BotFilter is used to check the referrer string and/or the user agent string for keywords that are used by bots. Using an older dataset and my current logs I created my own dictionary of these terms. I also used this Nginx Bad Bot configuration to seed my own configuration files with known bad bot keywords.
FilterBadIPs is used to block known spammy IP addresses. Even with all the potential bot filtering some sites aren’t nice enough to clue me in to their nature but it can be pretty clear looking at the stats who is a bot and who is not. For example, an IP address with 20 hits over 1 second, that’s a bot.
RelPathStartFilter is a generic form of the ArticleYearPathFilter, which the ArticleYearPathFilter in fact uses under the hood.
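The filter chain can be sketched roughly as follows. The class names and the checkIsOkay method come from this post, but the bodies, the constructor parameters, the trimmed-down AccessLogEntry fields and the passes extension are assumptions for illustration, not the project’s actual implementation.

```kotlin
// Hypothetical, trimmed-down entry; the project's AccessLogEntry likely has more fields.
data class AccessLogEntry(
    val ip: String,
    val path: String,
    val status: Int,
    val referrer: String,
    val userAgent: String,
)

// Every filter answers one question: should this entry be kept?
interface Filter {
    fun checkIsOkay(entry: AccessLogEntry): Boolean
}

class AllowedResponseCodeFilter(private val allowed: Set<Int>) : Filter {
    override fun checkIsOkay(entry: AccessLogEntry) = entry.status in allowed
}

class RelPathStartFilter(private val prefixes: List<String>) : Filter {
    override fun checkIsOkay(entry: AccessLogEntry) =
        prefixes.any { entry.path.startsWith(it) }
}

// The year filter expressed through the generic prefix filter.
class ArticleYearPathFilter(years: List<Int>) : Filter {
    private val delegate = RelPathStartFilter(years.map { "/$it/" })
    override fun checkIsOkay(entry: AccessLogEntry) = delegate.checkIsOkay(entry)
}

// Rejects entries whose referrer or user agent contains a known bot keyword.
class BotFilter(keywords: List<String>) : Filter {
    private val needles = keywords.map { it.lowercase() }
    override fun checkIsOkay(entry: AccessLogEntry): Boolean {
        val haystack = "${entry.referrer} ${entry.userAgent}".lowercase()
        return needles.none { it in haystack }
    }
}

class FilterBadIPs(private val badIps: Set<String>) : Filter {
    override fun checkIsOkay(entry: AccessLogEntry) = entry.ip !in badIps
}

// Extension in the spirit of AccessLogEntryFilter: an entry passes only if every filter agrees.
fun AccessLogEntry.passes(filters: List<Filter>): Boolean =
    filters.all { it.checkIsOkay(this) }
```

Because each filter is independent, tuning the configuration amounts to adding keywords, IP addresses or path prefixes to the relevant list and re-running the chain.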

Every filter implements the Filter interface which has one simple method, checkIsOkay, for a given AccessLogEntry item. If it’s okay then it’ll get processed and stored. If it’s not then the entry doesn’t get into the dataset. As I wrote above there is a ton of noise. A test log history with over 50,000 items distilled down to about 1,000 actual log entries of interest. Configuring this took trial and error, which I’ll get into in a minute, but it was a grand total of less than an hour of such tweaking. With the data stored it’s time to generate statistics. The link statistics I’m looking to track are:

The number of views of a post
The unique IP addresses that went to it and the number of times that IP address went to it
The unique external referrers and the number of times each referrer sent people to the post
The unique “internal referrers”, basically site navigation, that sent people to the post
The unique browsers and versions to the post
The unique operating systems and versions to the post

All of these I then bin up into monthly bins which can then generate “Top 10” style metrics. It was pretty interesting but not entirely surprising what sorts of systems visited my site and which links were the most popular. My thoughts on my site’s metrics are a topic for another post. While I originally wanted to generate some graphs and HTML I decided to keep it simple for now and generate a flat text file. It has all the essential data so it met my “minimum viable product” definition.
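The monthly binning and “Top 10” ranking can be illustrated with a few lines of common Kotlin. This is only a sketch: the Hit class, its ISO-style date string and the topPostsByMonth function are hypothetical, and the real tool works from its SQLite dataset rather than an in-memory list.

```kotlin
// Hypothetical record of one accepted hit; date is assumed to be "yyyy-MM-dd".
data class Hit(val date: String, val path: String)

// Bins hits by month ("yyyy-MM"), counts views per post inside each bin,
// and keeps the n most viewed posts (ties broken alphabetically by path).
fun topPostsByMonth(hits: List<Hit>, n: Int = 10): Map<String, List<Pair<String, Int>>> =
    hits.groupBy { it.date.take(7) }
        .mapValues { (_, monthHits) ->
            monthHits.groupingBy { it.path }
                .eachCount()
                .toList()
                .sortedWith(compareByDescending<Pair<String, Int>> { it.second }.thenBy { it.first })
                .take(n)
        }
```

The same grouping works for referrers, browsers or operating systems by swapping the key passed to groupingBy.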

Usage

All of this capability is wrapped up in a single command line tool which provides a pretty simple interface to use:

Usage: run-analysis [OPTIONS] SETTINGSFILE

Options:
  -b, --batch INT      Size of each batch insert to database (default = 100)
  -i, --ingest         Whether to try to ingest new log data
  -s, --summarize      Whether to generate summary statistics from the dataset
  --find-badip         Look for potentially bad IPs in the data
  --list-ip TEXT       List all entries for the IP address
  -f, --apply-filters  Applies the current filter configuration to the
                       existing dataset, toggling hidden field accordingly not
                       deleting data
  -h, --help           Show this message and exit

Arguments:
  SETTINGSFILE  Location of run settings configuration

I violated the Unix principle a bit so I could have just one giant project. I thought of refactoring it into a common library shared across multiple command line tools but nixed that for the time being. Using flags you can define which operations to perform at execution time:

The settings file has all of the configurations for the filters, file locations, etc.
--ingest causes the application to read the log files in the folder specified in the settings looking for new entries. It then processes them according to the filter and storage settings.
--summarize generates the summary data from the existing database stored dataset and outputs it to a local file
--find-badip is a tool I use to look through the dataset to find potentially unflagged bad IPs. These are IP addresses that are making too many requests over too short a window of time (configurable in the settings file).
--list-ip will list all valid entries for a given IP address and print out those entries with human readable time tags. This can be used by the user to investigate potentially spammy IPs flagged by the system with the bad IP tool or when looking at the summary statistics and seeing an especially chatty IP address.
--apply-filters will run the current filter configuration against the current dataset and mark whether those entries are visible in the summary statistics. It never deletes anything; it simply toggles a hidden flag which is used during the queries.
--batch is used to define a batch size used when adding entries to the database. This is really only important during bootstrapping so that it combines several inserts into one transaction. This was way more important early on in development when I was batch inserting the whole set rather than only the fraction that survives filtering.
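The “too many requests over too short a window” check behind --find-badip could look something like the sketch below. It is an assumption about the approach, not the tool’s actual code: the Request class, the findSuspectIps function and its maxHits and windowSeconds parameters are hypothetical stand-ins for whatever the settings file configures.

```kotlin
// Hypothetical minimal request record: client IP plus Unix time in seconds.
data class Request(val ip: String, val epochSeconds: Long)

// Flags every IP that made at least maxHits requests within any span of windowSeconds.
fun findSuspectIps(requests: List<Request>, maxHits: Int, windowSeconds: Long): Set<String> {
    require(maxHits >= 1) { "maxHits must be at least 1" }
    return requests.groupBy({ it.ip }, { it.epochSeconds })
        .filterValues { times ->
            val sorted = times.sorted()
            // Slide over runs of maxHits consecutive requests and measure the time they span.
            (0..sorted.size - maxHits).any { i ->
                sorted[i + maxHits - 1] - sorted[i] <= windowSeconds
            }
        }
        .keys
}
```

With maxHits = 20 and windowSeconds = 1 this catches the “20 hits over 1 second” case mentioned earlier while leaving a reader who returns a few times an hour alone.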

For regular updating I have a normal user account’s cron job set up to run this tool with the ingest and summarize options every 30 minutes. The only destructive operation is the overwriting of the latest statistics summary text file that gets generated with each execution. I can look at those statistics from time to time and see if I need to tweak the filters. For example I saw that an IP address went to my site 71 times in one month. That seemed unlikely to me. I ran the “find-badip” tool which didn’t flag it. I then used the “list-ip” tool which listed all the entries. Sure enough it was legitimate traffic. And so it stays in. If it had turned out to be an IP address I don’t want statistics on I could add it to the filter configuration JSON file and run the tool with the --apply-filters flag to hide it from future statistics generation.

Conclusion

What I’ve developed is a blog statistics generation tool that can be used to generate statistics on a blog or simple website without the use of tracking JavaScript files, database servers, extra network connections, system services, etc. I went as far as setting up a cron job that runs every 30 minutes but that’s as close as I get to a system service. Even when processing two months from scratch it only takes the tool a few seconds to run. For my blog’s traffic levels it’s adding between 500 and 1000 KB to the database per month. That makes the overhead of the tool very small as well. Getting to the data is as simple as logging into the server to view the summary file or scp’ing the file down to your local machine. Right now the only system requirement is that a Java Runtime Environment has to be installed. It suits my purposes for now but things I’d like to do moving forward are:

Convert the whole thing to a Kotlin Native application so that it does not require any runtime environment at all
Change the output to be formatted output instead of raw text. The first step would be generation of Markdown. The second step would be adding graphs, especially if there was a Kotlin Native graph generation library, and lastly perhaps making HTML files similar to the ones I generated from my Apple Silicon Benchmark tool.

As I wrote at the beginning, I wrote this for myself but it may be useful to others looking for a compact, self-contained, totally server-side statistics system. I’ve open sourced it as AGPLv3 as well so feel free to look at the code, contribute back, or make changes as one desires.