Goodbye cookies, hello logs

2022-07-20 18:00

Currently (as of 2022-07-19) I'm using Matomo for self-hosted web analytics. In particular, I'm interested in knowing which posts people find interesting over time (monthly). However, I'm not too fond of having to enable cookies on my website, so I have been looking for a way to process the server logs from Nginx and get that information without bothering visitors with cookies. That's where GoAccess comes in.

What is GoAccess?

As stated on the official GoAccess website:

GoAccess is an open source real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.

It provides fast and valuable HTTP statistics for system administrators that require a visual server report on the fly.

That is exactly what I'm looking for. In addition, it can be used as a TUI to browse the logs or generate a report in HTML, which I could then send to my email weekly or monthly.

Install GoAccess

Initially, I tried to install GoAccess from the Debian package repository, but the packaged version lags behind the latest stable release at the time of writing (1.6.2). Therefore, I followed the instructions in the GoAccess documentation to use GoAccess' official Debian/Ubuntu repository, which provides the latest stable version of the software.
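From memory, the steps boil down to adding GoAccess' signing key and repository and then installing the package as usual (double-check the key URL and keyring path against the official documentation, as they may change):

wget -O - https://deb.goaccess.io/gnugpg.key | gpg --dearmor | sudo tee /usr/share/keyrings/goaccess.gpg >/dev/null
echo "deb [signed-by=/usr/share/keyrings/goaccess.gpg arch=$(dpkg --print-architecture)] https://deb.goaccess.io/ $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/goaccess.list
sudo apt-get update
sudo apt-get install goaccess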

After installation, I was eager to try parsing the access logs of my site:

goaccess /var/log/nginx/website_access.log

The terminal UI shows a screen asking me to specify the log format my files use. At first I chose the Common Log Format (CLF), but GoAccess was not correctly identifying clients and crawlers. Since I'm using the default Nginx configuration, the NCSA Combined Log Format option is the right one; I chose it and, lo and behold, I get a terminal dashboard with my statistics.
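For reference, the combined format that Nginx ships with (and that the NCSA Combined Log Format option matches) is predefined as:

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';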

Now, my intent, as I'll show later, is to automate this, so I went ahead and uncommented the log-format COMBINED entry in /etc/goaccess/goaccess.conf to default to this format whenever goaccess is executed.
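In other words, /etc/goaccess/goaccess.conf now contains this single active log format line:

log-format COMBINED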

Parsing all the logs!

The previous command parsed the latest access log from Nginx, but depending on the log rotation configured it may not be too helpful (e.g. it could just contain the entries from the previous day). Luckily, my log rotation is configured to keep a certain number of old logs compressed with filenames such as website_access.log.10.gz, meaning that I can ingest them as well to get a more complete dataset.
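For context, my Nginx log rotation is essentially Debian's default logrotate policy; the values below are illustrative rather than a copy of my configuration:

/var/log/nginx/*.log {
        daily
        rotate 14
        missingok
        notifempty
        compress
        # delaycompress leaves the most recent rotated log (.1) uncompressed
        delaycompress
}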

To ingest both the compressed and the current logs, again following the documentation, it is enough to execute the following:

zcat /var/log/nginx/website_access.log.*.gz | goaccess /var/log/nginx/website_access.log -

Now I get a complete overview from all the access logs of the site, with some interesting sections for web analytics:

  • Unique visitors per day - Including spiders: as its name implies, registers all the clients that have accessed a URL. This includes crawlers, but we'll take care of that later
  • Requested Files (URLs): in my case, pages accessed by visitors, ordered from most visited to least visited
  • Time distribution: when the visitors are most active during the day
  • Geo location: summary of countries that visitors are from

But there is also other data that I may find interesting for other reasons:

  • Static Requests: static assets downloaded
  • Not Found URLs (404s): Come on, people, stop trying to access a non-existent /wp-login.php

The navigation takes a bit of getting used to, but once you get the hang of it, it feels quite Vim-like (arrows to move up/down in the dashboard, numbers to jump to a specific section, return to open the section, j/k to move up/down within the section).

Incremental log processing

For sites with a lot of traffic, log files can grow quickly, making them slower to parse and process. For that reason, I find the persistence capabilities of GoAccess really cool: they let you save an initial snapshot of the logs in an internal database and then only add new entries when parsing subsequent logs.

To do this, GoAccess keeps track of the timestamp of the latest entries and will only add log lines to the database if their timestamp is more recent.

On to business! I created the snapshot using only the compressed logs:

zcat /var/log/nginx/website_access.log.*.gz | goaccess --persist --db-path logs/www.rmedgar.com -

Note that --db-path points to a directory that I created beforehand.

Afterwards, I append the data from the uncompressed logs:

goaccess --restore --persist --db-path logs/www.rmedgar.com /var/log/nginx/website_access.log.1 /var/log/nginx/website_access.log

Note that in order to append data to an existing database, both --restore and --persist must be used.

Finally, the dataset can be read directly from the database using the following command:

goaccess --restore --db-path logs/www.rmedgar.com

If I wanted to periodically add logs to this database, I could set up a cronjob running the following command:

goaccess --restore --persist --db-path logs/www.rmedgar.com /var/log/nginx/website_access.log --process-and-exit

With --process-and-exit the TUI will not be displayed and goaccess will terminate after adding the new entries.
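For example, a crontab entry running it every hour could look like this (the schedule is arbitrary, and the database path needs to be reachable from cron, hence the absolute $HOME prefix):

0 * * * * goaccess --restore --persist --db-path "$HOME/logs/www.rmedgar.com" /var/log/nginx/website_access.log --process-and-exit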

Filtering and reporting

The current version of GoAccess (1.6.2) has no way to filter a dataset after it has been stored in the internal database; you can only browse it as-is. The drawback is that I'd have to ingest the data once per report I want to generate for myself. For example, I'm interested in the following:

  • Report of visits of humans per month: for some statistics regarding visitors
  • Report of visits of crawlers per month: again, it's always funny to see crawlers trying to access a non-existent WordPress instance

That means I would have to follow this process each month:

  1. At the beginning of the month, ingest all logs into two different databases (with --ignore-crawlers and with --crawlers-only respectively), using the --persist option and filtering with grep or similar to get only entries from the current month. These databases should not overwrite the ones from the previous month
  2. Hourly (for example) read latest log files applying the same filters and using the --restore --persist options to update the respective database

Although it's a bit inconvenient to have to process the same dataset twice, this can be easily scripted, so it's not much of an issue. Still, there is an open issue in the GoAccess repository that covers this sort of use case, so I imagine the feature may arrive at some point.

Periodic ingest

Taking into account the previous information, I devised the following script to be executed as part of a cronjob:

#!/usr/bin/env bash

# Directory in which the databases will be stored
STORAGE_PATH="/srv/web-analytics"


# Show help message
show_help() {
    echo "Ingest log files for web analytics"
    echo ""
    echo "Syntax: nginx-analytics.sh [-f <filter>] [-p <params>] [-h] <sitename> <variant> <log_path>"
    echo ""
    echo "Options:"
    echo ""
    echo "  -f <filter>     Additional filter to use before ingesting the logs (grep)"
    echo "  -p <params>     Additional parameters to provide to goaccess"
    echo "  -h              Show this help message and exit"
    echo ""
    echo "Arguments:"
    echo ""
    echo "  sitename        Name of the site being analyzed"
    echo "  variant         Name of the report variant for this site"
    echo "  log_path        Absolute path to the base log file"
}


# Ingest the data, creating the initial database if needed
ingest() {
    full_date=$(date '+%Y-%m-%d %H:%M:%S')
    dir_name=$(date '+%Y-%m' -d "$full_date")
    date_filter="/$(date +'%b/%Y'):"

    # Database directory
    db_path="$STORAGE_PATH/$SITENAME/$dir_name/$VARIANT"

    # Create directory and database if it does not exist already
    if [ ! -d "$db_path" ]
    then
        mkdir -p "$db_path"

        # Begin with compressed logs
        if [ -z "$ADDITIONAL_FILTERS" ]
        then
            zcat "$LOG_PATH".*.gz | grep "$date_filter" | goaccess --persist --db-path "$db_path" --process-and-exit $ADDITIONAL_PARAMS -
        else
            zcat "$LOG_PATH".*.gz | grep "$date_filter" | grep "$ADDITIONAL_FILTERS" | goaccess --persist --db-path "$db_path" --process-and-exit $ADDITIONAL_PARAMS -
        fi
    fi

    # Incremental logs (the base file plus rotated, uncompressed ones such as .1)
    for LOG in $(find "$(dirname "$LOG_PATH")" -name "$(basename "$LOG_PATH")*" -not -name "*.gz" -type f)
    do
        echo "Ingesting $LOG"

        if [ -z "$ADDITIONAL_FILTERS" ]
        then
            cat "$LOG" | grep "$date_filter" | goaccess --restore --persist --db-path "$db_path" --process-and-exit $ADDITIONAL_PARAMS -
        else
            cat "$LOG" | grep "$date_filter" | grep "$ADDITIONAL_FILTERS" | goaccess --restore --persist --db-path "$db_path" --process-and-exit $ADDITIONAL_PARAMS -
        fi
    done
}


# Main
while getopts f:p:h flag
do
    case "${flag}" in
        f) ADDITIONAL_FILTERS=${OPTARG};;
        p) ADDITIONAL_PARAMS=${OPTARG};;
        h)
            show_help
            exit;;
    esac
done

# Name of the site being analyzed
SITENAME=${@:$OPTIND:1}

# Site variant
VARIANT=${@:$OPTIND+1:1}

# Absolute path to the base log file.
#
# For example:
#
# /var/log/nginx/mysite.log
LOG_PATH=${@:$OPTIND+2:1}

if [ -z "$SITENAME" ] || [ -z "$VARIANT" ] || [ -z "$LOG_PATH" ]
then
    show_help
    exit
fi


ingest

The usage is really simple. Supposing I wanted to have the following reports for my site:

  • Complete report: logs parsed as is, nginx-analytics.sh mysite complete /var/log/nginx/mysite_access.log
  • Crawler report: only extract information regarding crawler visits, nginx-analytics.sh -p "--crawlers-only" mysite crawlers /var/log/nginx/mysite_access.log
  • Client report: only extract information regarding normal client visits, nginx-analytics.sh -p "--ignore-crawlers" mysite clients /var/log/nginx/mysite_access.log

This will generate the appropriate databases:

  • /srv/web-analytics/mysite/YYYY-mm/complete
  • /srv/web-analytics/mysite/YYYY-mm/crawlers
  • /srv/web-analytics/mysite/YYYY-mm/clients

Subsequent executions of the script with the same parameters will just append new data as long as we are in the same month; once a new month starts, the databases for that month will be created.

Moreover, if I wanted to add more filtering (e.g. to get only hits on the blog pages of the site), I could use the -f <filter> parameter to have an additional grep filter before ingesting the logs.
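Putting it all together, the three commands above can be run hourly from cron, along these lines (the schedule and the script location in /usr/local/bin are just an example):

0 * * * * /usr/local/bin/nginx-analytics.sh mysite complete /var/log/nginx/mysite_access.log
5 * * * * /usr/local/bin/nginx-analytics.sh -p "--crawlers-only" mysite crawlers /var/log/nginx/mysite_access.log
10 * * * * /usr/local/bin/nginx-analytics.sh -p "--ignore-crawlers" mysite clients /var/log/nginx/mysite_access.log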

Conclusion

Even with the aforementioned drawback, I think I will set up my web analytics using my Nginx logs and shut down the Matomo instance, removing cookies from my website in the process. What's more, I think GoAccess could prove very useful in the future in the eventuality that I have to analyze other web logs in bulk.

One thing I didn't really explore is the option to generate an HTML report, nor the built-in server for checking statistics in real time, but both seem to be really powerful options for gaining further insight.
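For the record, generating a static report from one of the stored databases should be a matter of something like the following (untested on my side; the output path is just an example):

goaccess --restore --db-path /srv/web-analytics/mysite/2022-07/clients -o /var/www/html/clients-report.html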