So let’s take a step back to the beginning of my blogging. I like sharing my thoughts and opinions on the internet. Before the internet I did that on AOL and BBSs as well, in the form of newsgroups, forum discussions, chat rooms, etc. In the early 2000s I gave LiveJournal a try. However, it was when I began doing personal health experiments that I decided I should actually create my own blog. That’s where the title “N=1” comes from: I’m an experiment with one sample. It’s not so much scientific as methodical, if that makes sense. It’s not coming up with a control group and an experimental group and then studying the relative benefits of one over the other for a particular thing. That would be actual science. This is more experimentation in the colloquial sense, and exploring health. I had big plans to cycle through a series of eating lifestyles for a couple of months at a time, with blood and fitness tests before and after, to see if I could feel and measure a big difference. That experiment never came to fruition for various reasons, but I kept blogging. I then added general writing, writing about programming, and the like, so now it’s just a generic personal blog. It’d be shit for monetization and influencing, if I cared about such things, because it lacks focus and regular, consistent updates, but that’s not the purpose.
So for a simple blog I went with WordPress. I wanted to host it myself, so rather than use WordPress.com I followed instructions to set up a nice small Linux machine and was on my way. Then there was the noise of all the bots in the comments section, so I installed a plugin for that. Then there were some other things I was looking for, so I installed Jetpack, which is where those stats came from. Updates were a bit of a bitch but not too bad. Still, there would be one attack after another against WordPress sites announced. I was good with my security, and I literally only had three core plugins installed, so I was mostly safe, but I still found it a bit unsettling. Besides, whenever I’d post to Diaspora or Mastodon the flood of fediverse servers would cause hundreds of requests to come in over a very short period of time. That would make my server’s response too slow for a while. I could add caching or put a CDN in front, but the whole point was to keep it simple. So I was teetering on the edge of making the jump to a static site system like Jekyll or Hugo. When I first started using DuckDuckGo’s Privacy Essentials plugin and saw how many trackers were embedded on my simple website despite my not setting them up (probably a bunch from Jetpack), I decided to finally make the leap.
- Able to generate a history of real blog hits to the blog posts themselves
- Able to tell me which referrers visits to the articles came from
- Able to tell me which site navigation may have been used (like coming from another post or the Archive)
- Able to tell me which browsers and OSes visitors are using
- No additional infrastructure systems to support this
- No additional network connections to support this
- No additional long running services to support this
I chose Kotlin as the language because it’s my favorite and it would allow me to potentially write a small, self-contained native application eventually. In the short term though I was more interested in getting things working, so the current deployment uses a lot of JVM-specific Kotlin features. Where I can, it uses pure Kotlin libraries. Where I can’t, I use Kotlin Multiplatform’s expected/actual class methodology to create the JVM-specific implementation, which will allow me to write a pure Kotlin or native platform version in the future without changing the rest of the code. Most of the code is already Kotlin common code; the only things requiring the JVM are the database wrapper, which uses JetBrains’ Exposed library and the JDBC SQLite adapter, and the SHA-256 string hashing algorithm. Lastly, the command line program that drives this also uses the Java file walker. All of these could potentially be replaced with native components right now, but it’d be more complicated and I wanted to get this up and running. (A small sketch of the expect/actual pattern follows the list below.) So the main technologies used for this are:
- Kotlin Multiplatform
- SQLite database
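To give a sense of how the expect/actual split works, here is a minimal sketch of how the SHA-256 hashing might be declared in common code and implemented on the JVM. The names are illustrative guesses, not the project’s actual identifiers:

```kotlin
// commonMain: declare the capability without committing to an implementation.
expect fun sha256Hex(input: String): String

// jvmMain: the JVM "actual", backed by java.security.MessageDigest.
actual fun sha256Hex(input: String): String =
    java.security.MessageDigest.getInstance("SHA-256")
        .digest(input.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it.toInt() and 0xff) }
```

A Kotlin/Native target could later supply its own `actual` without touching any of the common code that calls `sha256Hex`.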
That’s the entire stack. The first pass is pretty straightforward. The program parses each line of the Nginx log file into some processable constituents and stores them. I’m doing a SHA-256 hash of each line, which should guarantee uniqueness without storing the line itself. All of that gets stored in a SQLite file database for querying and processing. That’s the easy part. The first hard part is filtering out irrelevant information. First, I want statistics on my articles, so I only want it tracking accesses to those posts, not any place on the website. If I had thought of this long ago I could have had a relative path like `/blog/yyyy/mm/dd`, but I didn’t, so it’s just a bare path of `/yyyy/mm/dd`. That’s not too hard to parse around, so that was quickly overcome. It’s the bot traffic and malicious actor requests that need some real filtering.
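As an illustration of that first pass, here is roughly what the line parsing could look like. The regular expression, the data class fields, and the reuse of the hypothetical `sha256Hex` from the earlier sketch are all my own illustration of a combined-format Nginx log line, not the project’s actual code:

```kotlin
// A sketch of one parsed access-log entry (field names are illustrative).
data class AccessLogEntry(
    val remoteIp: String,
    val timestamp: String,
    val method: String,
    val path: String,
    val status: Int,
    val referrer: String,
    val userAgent: String,
    val lineHash: String   // SHA-256 of the raw line, used for de-duplication
)

// Nginx "combined" format:
// $remote_addr - $remote_user [$time_local] "$request" $status $bytes "$http_referer" "$http_user_agent"
private val LOG_LINE = Regex(
    """(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+ "([^"]*)" "([^"]*)"$"""
)

fun parseLine(line: String): AccessLogEntry? {
    val m = LOG_LINE.find(line) ?: return null   // skip lines that don't match
    val (ip, time, method, path, status, referrer, agent) = m.destructured
    return AccessLogEntry(ip, time, method, path, status.toInt(), referrer, agent, sha256Hex(line))
}
```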
To distill out the relevant human accesses to my articles I created a series of Filter classes that can be used in succession to perform the distillation. It just takes some trial and error in configuration:
- `AccessLogEntryFilter` is the extension class which takes a list of these filters and applies them against the `AccessLogEntry` to decide whether it is okay or not.
- `AllowedResponseCodeFilter` looks at the response code in the Nginx log entry and only passes those of interest; for my blog, simply 200 and 301.
- `ArticleYearPathFilter` takes care of my blog structure problem by looking at request paths that begin with select blog years, like `/2020`, etc. I could have restructured my site to have `/blogs` but decided to just keep things status quo for now and write this simple filter. It’d also allow me to run statistics later on subsets of the blog if I chose.
- `BotFilter` is used to check the referrer string and/or the user agent string for keywords that are used by bots. Using an older dataset and my current logs I created my own dictionary of these terms. I also used this Nginx Bad Bot configuration to seed my own configuration files with known bad bot keywords.
- `FilterBadIPs` is used to block known spammy IP addresses. Even with all the potential bot filtering, some sites aren’t nice enough to clue me in to their nature, but it can be pretty clear looking at the stats who is a bot and who is not. For example, an IP address with 20 hits over 1 second is a bot.
- `RelPathStartFilter` is a generic form of the `ArticleYearPathFilter`, which in fact uses it under the hood.
Every filter implements the `Filter` interface, which has one simple method, `checkIsOkay`, for a given `AccessLogEntry` item.
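As a rough sketch of that shape (reconstructed from the description above, using the illustrative `AccessLogEntry` from earlier, not the actual source), it might look something like this:

```kotlin
// Hypothetical shape of the filtering pieces described above.
interface Filter {
    // Returns true when the entry should be kept in the dataset.
    fun checkIsOkay(entry: AccessLogEntry): Boolean
}

// Example: only keep entries whose HTTP status code is in an allowed set.
class AllowedResponseCodeFilter(
    private val allowed: Set<Int> = setOf(200, 301)
) : Filter {
    override fun checkIsOkay(entry: AccessLogEntry): Boolean = entry.status in allowed
}

// One possible way to model the "apply a list of filters" step:
// a single rejection keeps the entry out of the statistics.
fun List<Filter>.checkIsOkay(entry: AccessLogEntry): Boolean =
    all { it.checkIsOkay(entry) }
```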
If an entry is okay it gets processed and stored; if it’s not, it doesn’t get into the dataset. As I wrote above, there is a ton of noise: my test log history of over 50,000 items distilled down to about 1,000 actual log entries of interest. Configuring this took trial and error, which I’ll get into in a minute, but it was a grand total of less than an hour of such tweaking. With the data stored it’s time to generate statistics. The link statistics I’m looking to track are:
- The number of views of a post
- The unique IP addresses that went to it and the number of times that IP address went to it
- The unique external referrers and the number of times each referrer sent people to the post
- The unique “internal referrers”, basically site navigation, that sent people to the post
- The unique browsers and versions that visited the post
- The unique operating systems and versions that visited the post
All of these I then bin up into monthly bins, which can then generate “Top 10”-style metrics. It was pretty interesting, but not entirely surprising, what sorts of systems visited my site and which links were the most popular. My thoughts on my site’s metrics are a topic for another post. While I originally wanted to generate some graphs and HTML, I decided to keep it simple for now and generate a flat text file. It has all the essential data, so it met my “minimum viable product” definition.
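The monthly binning itself is simple enough that a few lines of Kotlin convey the idea. This is only an illustrative, in-memory sketch over the hypothetical `AccessLogEntry` from earlier, not the queries the tool actually runs against its SQLite database:

```kotlin
import java.time.YearMonth
import java.time.format.DateTimeFormatter
import java.util.Locale

// Nginx $time_local looks like "10/Oct/2020:13:55:36 -0700".
private val TIME_LOCAL = DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

// Bin entries by month, count views per path, and keep the top 10 for each month.
fun topPostsByMonth(entries: List<AccessLogEntry>): Map<YearMonth, List<Pair<String, Int>>> =
    entries.groupBy { YearMonth.from(TIME_LOCAL.parse(it.timestamp)) }
        .mapValues { (_, monthEntries) ->
            monthEntries.groupingBy { it.path }.eachCount()
                .entries.sortedByDescending { it.value }
                .take(10)
                .map { it.key to it.value }
        }
```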
All of this capability is wrapped up in a single command line tool which provides a pretty simple interface to use:
```
Usage: run-analysis [OPTIONS] SETTINGSFILE

Options:
  -b, --batch INT      Size of each batch insert to database (default = 100)
  -i, --ingest         Whether to try to ingest new log data
  -s, --summarize      Whether to summarize the stored data
  --find-badip         Look for potentially bad IPs in the data
  --list-ip TEXT       List all entries for the IP address
  -f, --apply-filters  Applies the current filter configuration to the
                       existing dataset, toggling the hidden field
                       accordingly, not deleting data
  -h, --help           Show this message and exit

Arguments:
  SETTINGSFILE  Location of run settings configuration
```
I violated the Unix principle a bit so I could have just one giant project. I thought of refactoring it into multiple command line tools sharing a common library but nixed that for the time being. Using flags you can define which operations to perform at execution time:
- The settings file has all of the configurations for the filters, file locations, etc.
- `--ingest` causes the application to read the log files in the folder specified in the settings, looking for new entries. It then processes them according to the filter and storage settings.
- `--summarize` generates the summary data from the existing database-stored dataset and outputs it to a local file.
- `--find-badip` is a tool to look through the dataset to find potentially unflagged bad IPs. These are IP addresses that are making too many requests over too short a window of time (configurable in the settings file); a sketch of this check follows the list.
- `--list-ip` will list all valid entries for a given IP address and print out those entries with human-readable time tags. This can be used to investigate potentially spammy IPs flagged by the bad IP tool, or when the summary statistics show an especially chatty IP address.
- `--apply-filters` will run the current filter configuration against the current dataset and mark whether those entries are visible in the summary statistics. It never deletes anything; it simply toggles a `hidden` flag which is used during the queries.
- `--batch` is used to define the batch size used when adding entries to the database. This is really only important during bootstrapping, so that it combines several inserts into one transaction. It mattered far more early in development, when I was batch inserting the whole set rather than just the fraction that survives filtering.
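To make the bad-IP idea concrete, here is a hedged sketch of the kind of check `--find-badip` performs: a per-IP sliding window over request timestamps. The threshold names, defaults, and reuse of `TIME_LOCAL` and `AccessLogEntry` from the earlier sketches are my own illustration, not the tool’s actual configuration keys:

```kotlin
import java.time.Duration
import java.time.OffsetDateTime

// Flag IPs that make more than maxRequests within any window of windowSeconds.
fun findSuspiciousIps(
    entries: List<AccessLogEntry>,
    maxRequests: Int = 20,
    windowSeconds: Long = 1
): Set<String> {
    val window = Duration.ofSeconds(windowSeconds)
    return entries.groupBy { it.remoteIp }
        .filterValues { hits ->
            val times = hits.map { OffsetDateTime.parse(it.timestamp, TIME_LOCAL) }.sorted()
            // Slide across the sorted timestamps looking for a dense burst of requests.
            times.indices.any { i ->
                val j = i + maxRequests
                j < times.size && Duration.between(times[i], times[j]) <= window
            }
        }
        .keys
}
```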
For regular updating I have a normal user account’s cron job set up to run this tool with the ingest and summarize options every 30 minutes. The only destructive operation is the overwriting of the latest statistics summary text file that gets generated with each execution. I can look at those statistics from time to time and see if I need to tweak the filters. For example, I saw that an IP address went to my site 71 times in one month. That seemed unlikely to me. I ran the `--find-badip` tool, which didn’t flag it. I then used the `--list-ip` tool, which listed all the entries. Sure enough it was legitimate traffic, and so it stays in. If it had turned out to be an IP address I don’t want statistics on, I could add it to the filter configuration JSON file and run the tool with the `--apply-filters` command to hide it from future statistics generation.
- Convert the whole thing to a Kotlin Native application so that it does not require any runtime environment at all
- Change the output to be formatted output instead of raw text. The first step would be generating Markdown. The second step would be adding graphs, especially if there were a Kotlin Native graph generation library, and lastly perhaps making HTML files similar to the ones I generated from my Apple Silicon Benchmark tool.
As I wrote at the beginning, I wrote this for myself, but it may be useful to others looking for a compact, self-contained, totally server-side statistics system. I’ve open sourced it under the AGPLv3 as well, so feel free to look at the code, contribute back, or make changes as you desire.