regina - website analytics

Ruling Empress Generating In-depth Nginx Analytics

Description

regina is a Python program that generates analytics for a static webpage served with nginx. regina is easy to deploy and privacy-respecting:

  • it collects the data from the nginx logs: no javascript/changes to your website required
  • data is stored on your device in a sqlite database, nothing goes to any cloud

It parses the log and stores the important data in an sqlite database. It can then create an analytics html page that has lots of useful plots and numbers.

Capabilities

Statistics

regina can generate the following statistics:

  • visitor count history
  • request count history
  • referrer ranking (from which site people visit)
  • route ranking (accessed files)
  • browser ranking
  • platform ranking (operating systems)
  • city ranking (where your site visitors are from)
  • country ranking
  • mobile visitor percentage
  • detect if a visitor is likely to be human or a bot

All of those plots and numbers can be generated for the last x days (you can set x yourself) and for all time.

Visualization

regina can use the data above to generate a static analytics page in a single html file. The visitor and ranking histories are included as plots.
You can view an example page here
If that is not enough for you, you can write your own script and use data exported by regina or access the database directly.

Getting started

Dependencies

  • nginx: You need an nginx webserver that outputs the access log in the combined format, which is the default
  • sqlite >= 3.37
  • python >= 3.10
  • python-matplotlib
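To give an idea of what the combined format mentioned above contains, here is a minimal sketch that splits one log line into its fields. It is not regina's own parser: the regular expression, the helper name parse_line and the sample line are assumptions made for illustration, and the month name is parsed with the default (English) locale.

```python
import re
from datetime import datetime
from typing import Optional

# nginx "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent
# "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one combined-format line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["body_bytes_sent"] = int(fields["body_bytes_sent"])
    # time_local looks like 17/May/2023:18:04:53 +0200
    fields["time_local"] = datetime.strptime(fields["time_local"], "%d/%b/%Y:%H:%M:%S %z")
    return fields


# Made-up sample line using a documentation IP address
sample = ('192.0.2.10 - - [17/May/2023:18:04:53 +0200] "GET /index.html HTTP/1.1" '
          '200 1234 "-" "Mozilla/5.0 (X11; Linux x86_64)"')
entry = parse_line(sample)
print(entry["remote_addr"], entry["request"], entry["status"])
```

If your nginx uses a custom log_format, a line will most likely not match this pattern and parse_line returns None.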

Installation

You can install regina with python-pip:

    git clone https://github.com/MatthiasQuintern/regina.git
    cd regina
    python3 -m pip install .

You can also install it system-wide using sudo python3 -m pip install .

If you also want to install the man-page and the zsh completion script:

    sudo cp regina.1.man /usr/share/man/man1/regina.1
    sudo gzip /usr/share/man/man1/regina.1
    sudo cp regina/package-data/_regina.compdef.zsh /usr/local/share/zsh/site-functions/_regina
    sudo chmod +x /usr/local/share/zsh/site-functions/_regina

Configuration

The following instructions assume you have an nginx webserver configured for a website like this, with /www as root (/):

    /www
    |-- resources
    |   |-- image.jpg
    |-- index.html

By default, nginx will generate logs in the combined format with the name access.log in /var/log/nginx/ and rotate them daily.

Copy the default configuration and template from the git directory to a directory of your choice, in this case ~/.config/regina. If you did not clone the git repo, the files should be in /usr/local/lib/python3.11/site-packages/regina/package-data/.

    mkdir ~/.config/regina
    cp regina/package-data/default.cfg ~/.config/regina/regina.cfg
    cp regina/package-data/template.html ~/.config/regina/template.html

Now edit the configuration to fit your needs. For our example:

    [regina]
    server_name = my-server.com
    access_log = /var/log/nginx/access.log.1
    ...
    [html-generation]
    html_out_path = /www/analytics/analytics.html
    img_location = /img

    [plot-generation]
    img_out_dir = /www/analytics/img

Most defaults should be fine. The default configuration should also be documented well enough for you to know what to do. It is strongly recommended to only use absolute paths.
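If you want to check the absolute-path recommendation before running regina, a small script can help. This is only a sketch: it assumes the configuration file can be read with Python's configparser, it only looks at the path options from the example above, and the helper name relative_paths is made up and not part of regina.

```python
import configparser
import os

# Only the path options shown in the example above, the real file has more.
EXAMPLE_CFG = """
[regina]
access_log = /var/log/nginx/access.log.1

[html-generation]
html_out_path = /www/analytics/analytics.html

[plot-generation]
img_out_dir = /www/analytics/img
"""

# (section, option) pairs that hold file system paths in the example
PATH_OPTIONS = [
    ("regina", "access_log"),
    ("html-generation", "html_out_path"),
    ("plot-generation", "img_out_dir"),
]


def relative_paths(cfg_text):
    """Return the 'section.option' names whose value is not an absolute path."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    found = []
    for section, option in PATH_OPTIONS:
        value = parser.get(section, option, fallback="")
        if value and not os.path.isabs(value):
            found.append(f"{section}.{option}")
    return found


print(relative_paths(EXAMPLE_CFG))
```

To check your own file, pass its contents instead of EXAMPLE_CFG, for example the text read from ~/.config/regina/regina.cfg.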

Now you will collect the data from the nginx log specified as access_log in the configuration into the database at the configured database location (or ~/.local/share/regina/my-server.com.db if left blank):

    regina --config ~/.config/regina/regina.cfg --collect

To visualize the data, run:

    regina --config ~/.config/regina/regina.cfg --visualize

This will generate plots and statistics and replace all variables in template_html and output the result to html_out_path. If html_out_path is in your webroot, you should now be able to access the generated site.
In our example, /www will look like this:

    /www
    |-- analytics
    |   |-- analytics.html
    |   |-- img
    |       |-- ranking_referer_total.svg
    |       |-- ranking_referer_last_x_days.svg
    |       ...
    |-- resources
    |   |-- image.jpg
    |-- index.html

Automation

You will probably run regina once per day, after nginx has filled the daily access log. The easiest way to do that is using a cronjob. Run crontab -e and enter:

    10 0 * * * /usr/bin/regina --config /home/myuser/.config/regina/regina.cfg --collect --visualize

This assumes you installed regina system-wide.
Now the regina command will be run every day, ten minutes after midnight. At the end of each day, the logs are rotated, so access.log becomes access.log.1. Since regina is run after the log rotation, you will probably want to run it on access.log.1.

Logfile permissions

By default, nginx logs are -rw-r----- root root, so you cannot access them as a regular user. You could either run regina as root, which I strongly do not recommend, or make a root cronjob that changes ownership of the log after midnight. Run sudo crontab -e and enter:

    9 0 * * * chown your-username /var/log/nginx/access.log.1

This will make you the owner of the log 9 minutes after midnight, just before regina needs read access.
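To find out whether your user can already read the log, you can run a quick check like the following. It is a minimal sketch and not part of regina; the helper name log_is_readable is made up, and the path is the one from the example configuration.

```python
import os


def log_is_readable(path):
    """True if path is an existing file that the current user may read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


# Path from the example configuration above
print(log_is_readable("/var/log/nginx/access.log.1"))
```

Run it as the same user that runs the regina cronjob, since the result depends on who executes it.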

GeoIP

regina can show you which country or city a visitor is from, but you will need an ip2location database. You can acquire such a database for free at ip2location.com (and probably some other sites as well!). After creating an account you can download several different databases in different formats.
For regina, download the IP-COUNTRY-REGION-CITY for IPv4 as csv.

To configure regina to use the GeoIP database, edit get_visitor_location and get_cities_for_contries in section data-collection.
By default, regina only tells you which country a user is from. Append the two-letter country codes for countries you are interested in to the get_cities_for_contries option.
After that, add the GeoIP-data into your database:

    regina --config regina.cfg --update-geoip path-to-csv

Depending on how many countries you specified, this might take a long time. You can delete the csv afterwards.
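If you are curious how such a csv maps an address to a place, here is a rough sketch of a range lookup. It assumes the column order ip_from, ip_to, country_code, country_name, region_name, city_name with the addresses stored as integers; check the csv you downloaded, because this layout is an assumption. The rows below are made up, and regina's own import may work differently.

```python
import bisect
import csv
import io
import ipaddress

# Made-up sample rows (documentation IP ranges, arbitrary locations).
# Assumed columns: ip_from, ip_to, country_code, country_name, region_name, city_name
SAMPLE_CSV = (
    '"3221225984","3221226239","DE","Germany","Bayern","Munich"\n'
    '"3325256704","3325256959","FR","France","Ile-de-France","Paris"\n'
)


def load_ranges(csv_file):
    """Read (ip_from, ip_to, country_code, city_name) tuples, sorted by ip_from."""
    ranges = []
    for ip_from, ip_to, code, _country, _region, city in csv.reader(csv_file):
        ranges.append((int(ip_from), int(ip_to), code, city))
    ranges.sort()
    return ranges


def locate(ranges, ip):
    """Return (country_code, city_name) for an IPv4 address, or None if no range matches."""
    number = int(ipaddress.IPv4Address(ip))
    starts = [entry[0] for entry in ranges]
    # Index of the last range that starts at or before the address
    index = bisect.bisect_right(starts, number) - 1
    if index >= 0 and ranges[index][0] <= number <= ranges[index][1]:
        return ranges[index][2], ranges[index][3]
    return None


ranges = load_ranges(io.StringIO(SAMPLE_CSV))
print(locate(ranges, "192.0.2.55"))
print(locate(ranges, "203.0.113.1"))
```

For a real csv file, open it with open(path, newline="") and pass the file object to load_ranges.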

Customization

Generated html

The generated file does not need to be an html. The template can be any text file.
regina will only replace certain words starting with a %. You can see all supported variables and their values by running --visualize with debug_level = 1.
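The idea of replacing %-prefixed words can be sketched as follows. This is not regina's implementation: the function fill_template and the variable name visitor_count are hypothetical, so use the debug output mentioned above to see the real variable names.

```python
import re


def fill_template(template, variables):
    """Replace every %name that is a key of variables and leave all other text untouched."""
    def substitute(match):
        name = match.group(1)
        # Unknown names are kept as they are, including the % sign
        return variables.get(name, match.group(0))

    return re.sub(r"%(\w+)", substitute, template)


# Works for any text file, here a plain-text template with a hypothetical variable
template = "Visitors today: %visitor_count (100% static, %unknown stays)"
print(fill_template(template, {"visitor_count": "42"}))
```

Because unknown words are left alone, a literal percent sign in the template does not need to be escaped in this sketch.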

Data export

If you want to further process the data generated by regina, you can export the data by setting the data_out_dir in the data-export section. The data can be exported as csv or pkl.
If you choose pkl as file type, all rankings will be exported as python type list[tuple[int, str]].
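Reading such a pkl file from your own script could look like the sketch below. The file name example_ranking.pkl and the sample ranking are made up, and it is an assumption that the int is the count and the str is the ranked item; the sketch writes its own file first so that it runs on its own.

```python
import pickle
import tempfile
from pathlib import Path


def load_ranking(path):
    """Load a ranking exported as pkl and return it as a list of (int, str) tuples."""
    with open(path, "rb") as file:
        data = pickle.load(file)  # only unpickle files you created yourself
    return [(int(number), str(name)) for number, name in data]


# Made-up ranking in the documented shape list[tuple[int, str]]
example = [(120, "/index.html"), (45, "/resources/image.jpg")]

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, use whatever regina wrote to your data_out_dir
    path = Path(tmp) / "example_ranking.pkl"
    path.write_bytes(pickle.dumps(example))
    for number, name in load_ranking(path):
        print(f"{number:>6}  {name}")
```

Keep in mind that unpickling can execute arbitrary code, so only load pkl files that come from your own regina installation.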

Database

You can of course work directly with the database, as long as it is not altered. Editing, adding or deleting entries might make the database incompatible with regina, so only do that if you know what you are doing. Just querying entries will be fine though.
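One way to make sure you only query and never alter the database is to open it read-only. The following sketch lists the tables of a sqlite file without assuming anything about regina's schema; the helper name list_tables and the demo table are made up for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def list_tables(db_path):
    """Return the table names of a sqlite database, opened read-only."""
    # mode=ro makes sqlite refuse any write on this connection
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Demo on a throwaway database with a made-up table
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.db"
    connection = sqlite3.connect(str(demo))
    connection.execute("CREATE TABLE example (id INTEGER)")
    connection.commit()
    connection.close()
    print(list_tables(demo))
```

For your own data, pass the path of the regina database instead, for example the default location mentioned in the configuration section.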

Troubleshooting

General

If you are having problems, try setting the debug_level in section debug of the configuration file to a non-zero value.

sqlite3.OperationalError: near "STRICT": syntax error

Your sqlite3 version is probably too old. Check with sqlite3 --version. regina requires 3.37 or higher.
Hotfix: Remove all STRICTs from <python-dir>/site-packages/regina/sql/create_db.sql.
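Note that the sqlite3 command line tool and the sqlite library used by Python can be different versions. Since regina runs in Python, it can help to check the library version from Python as well. A minimal sketch, where the helper name supports_strict is made up:

```python
import sqlite3

REQUIRED = (3, 37, 0)  # the minimum version stated in the dependencies


def supports_strict(version=sqlite3.sqlite_version_info):
    """True if the given sqlite library version is at least 3.37."""
    return tuple(version) >= REQUIRED


# sqlite_version is the version of the sqlite library Python is using
print(sqlite3.sqlite_version, supports_strict())
```

If this prints False, the STRICT error above is expected with your Python installation, even if sqlite3 --version reports a newer version.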

Changelog

1.1 (2023-05-17)

  • Improved database format:
    • put referrer, browser and platform in own table to reduce size of the database
    • route groups now part of visualization, not data collection
  • Data visualization now uses more sql for improved performance
  • Refactored codebase
  • Bug fixes
  • Changed setup.py to pyproject.toml

1.0 (2022-12-14)

  • Initial release

Copyright

Copyright © 2022 Matthias Quintern. License GPLv3+: GNU GPL version 3 https://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.