Updated readme

matthias@arch 2023-05-17 18:02:21 +02:00
parent 281c766cbd
commit f3ea297570
3 changed files with 465 additions and 69 deletions

191
README.md

@ -1,32 +1,50 @@
# regina - nginx analytics tool
**R**uling **E**mpress **G**enerating **I**n-depth **N**ginx **A**nalytics (obviously)
## Overview
Regina is an analytics tool for nginx.
It collects information from the nginx access.log and stores it in a sqlite3 database.
Regina supports several data visualization configurations and can generate an admin-analytics page from an html template file.
# regina - website analytics
**R**uling **E**mpress **G**enerating **I**n-depth **N**ginx **A**nalytics
## Command line options
**-h**, **--help**
: Show the possible command line arguments
## Description
`regina` is a **python** <!-- ![python-logo](/resources/img/logos/python.svg "snek make analytics go brr") --> program that generates ***analytics*** for a static webpage served with **nginx**.
`regina` is easy to deploy and privacy respecting:
- it collects the data from the nginx logs: no javascript/changes to your website required
- data is stored on your device in a **sqlite** database, nothing goes to any cloud
It parses the log and **stores** the important data in an *sqlite* <!-- ![sqlite-logo](/resources/img/logos/sqlite.svg) --> database.
It can then create an analytics html page that has lots of useful **plots** and **numbers**.
**-c**, **--config** config-file
: Retrieve settings from the config-file
## Capabilities
### Statistics
`regina` can generate the following statistics:
**--access-log** log-file
: Overrides the access_log from the configuration
- visitor count history
- request count history
- referrer ranking *(from which site people visit)*
- route ranking *(accessed files)*
- browser ranking
- platform ranking *(operating systems)*
- city ranking *(where your site visitors are from)*
- country ranking
- mobile visitor percentage
- detect if a visitor is likely to be human or a bot
**--collect**
: Collect information from the access_log and store it in the database
All of those plots and numbers can be generated for the **last x days** (you can set *x* yourself) and for **all times**.
**--visualize**
: Visualize the data from the database
### Visualization
`regina` can use the data above to generate a static analytics page in a single html file.
The visitor and ranking histories are included as plots.
You can view an example page [here](https://quintern.xyz/en/software/regina-example.html)
If that is not enough for you, you can write your own script and use data exported by regina or access the database directly.
**--update-geoip** geoip-db
: Recreate the geoip part of the database from the geoip-db csv. The csv must have this form: lower, upper, country-code, country-name, region, city
# Getting started
# Installation with pip
You can also install regina with python-pip:
## Dependencies
- **nginx**: You need a nginx webserver that outputs the access log in the `combined` format, which is the default
- **sqlite >= 3.37**
- **python >= 3.10**
- **python-matplotlib**
## Installation
You can install regina with python-pip:
```shell
git clone https://github.com/MatthiasQuintern/regina.git
cd regina
@ -36,13 +54,138 @@ You can also install it system-wide using `sudo python3 -m pip install .`
If you also want to install the man-page and the zsh completion script:
```shell
sudo cp regina.1.man /usr/share/man/man1/regina.1
sudo gzip /usr/share/man/man1/regina.1
sudo cp _regina.compdef.zsh /usr/share/zsh/site-functions/_regina
sudo chmod +x /usr/share/zsh/site-functions/_regina
sudo cp regina.1.man /usr/share/man/man1/regina.1
sudo gzip /usr/share/man/man1/regina.1
sudo cp regina/package-data/_regina.compdef.zsh /usr/local/share/zsh/site-functions/_regina
sudo chmod +x /usr/local/share/zsh/site-functions/_regina
```
## 1.0
## Configuration
The following instructions assume you have an nginx webserver configured for a website like this, with `/www` as root (`/`):
```
/www
|-- resources
| |-- image.jpg
|-- index.html
```
By default, nginx will generate logs in the `combined` format with the name `access.log` in `/var/log/nginx/` and rotate them daily.
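To get a feeling for what one line of a `combined` log contains, here is a minimal sketch that splits a line into its fields. It is not the parser `regina` uses; the regular expression, the function name `parse_line` and the example line are assumptions made for illustration, and it does not handle escaped quotes inside a field.

```python
import re
from typing import Optional

# nginx `combined` format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) - (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


# Made-up example line, not taken from a real log
example = (
    '203.0.113.7 - - [17/May/2023:18:02:21 +0200] '
    '"GET /index.html HTTP/1.1" 200 1234 '
    '"https://example.org/" "Mozilla/5.0 (X11; Linux x86_64)"'
)
fields = parse_line(example)
print(fields["ip"], fields["request"], fields["status"])
```

The referer and user agent fields are the raw material for the referrer, browser and platform rankings listed under Capabilities.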
Copy the default configuration and template from the git directory to a directory of your choice, in this case `~/.config/regina`.
If you did not clone the git repo, the files should be in `/usr/local/lib/python3.11/site-packages/regina/package-data/`.
```shell
mkdir ~/.config/regina
cp regina/package-data/default.cfg ~/.config/regina/regina.cfg
cp regina/package-data/template.html ~/.config/regina/template.html
```
Now edit the configuration to fit your needs.
For our example:
```
[regina]
server_name = my_server.com
access_log = /var/log/nginx/access.log.1
...
[html-generation]
html_out_path = /www/analytics/analytics.html
img_location = /img
[plot-generation]
img_out_dir = /www/analytics/img
```
Most defaults should be fine. The default configuration should also be documented well enough for you to know what to do.
It is strongly recommended to only use absolute paths.
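If you want to check a configuration for relative paths before running `regina`, a small script can do it. The sketch below assumes the configuration can be read with Python's `configparser`, which matches the example above but is not confirmed for every option in the default file. The helper name `relative_paths` and the list of checked options are made up for this example.

```python
import configparser
from pathlib import PurePosixPath

# Section/option pairs taken from the example configuration above
PATH_OPTIONS = [
    ("regina", "access_log"),
    ("html-generation", "html_out_path"),
    ("plot-generation", "img_out_dir"),
]


def relative_paths(cfg_text: str) -> list:
    """Return "section.option" names whose value is set but not absolute."""
    # interpolation=None keeps a literal % in a value from raising an error
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    found = []
    for section, option in PATH_OPTIONS:
        value = parser.get(section, option, fallback="")
        if value and not PurePosixPath(value).is_absolute():
            found.append(f"{section}.{option}")
    return found


sample = """
[regina]
server_name = my_server.com
access_log = /var/log/nginx/access.log.1
[html-generation]
html_out_path = analytics/analytics.html
[plot-generation]
img_out_dir = /www/analytics/img
"""
print(relative_paths(sample))
```

Here only `html_out_path` is reported, because it was deliberately written as a relative path in the sample.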
Now you will collect the data from the nginx log specified as `access_log` in the configuration into the database specified at the `database` location (or `~/.local/share/regina/my-server.com.db` if left blank):
```
regina --config ~/.config/regina/regina.cfg --collect
```
To visualize the data, run:
```
regina --config ~/.config/regina/regina.cfg --visualize
```
This will generate plots and statistics and replace all variables in `template_html` and output the result to `html_out_path`.
If `html_out_path` is in your webroot, you should now be able to access the generated site.
In our example, `/www` will look like this:
```
/www
|-- analytics
| |-- analytics.html
| |-- img
| |-- ranking_referer_total.svg
| |-- ranking_referer_last_x_days.svg
| ...
|-- resources
| |-- image.jpg
|-- index.html
```
### Automation
You will probably run `regina` once per day, after `nginx` has filled the daily access log. The easiest way to do that is using a *cronjob*.
Run `crontab -e` and enter:
`10 0 * * * /usr/bin/regina --config /home/myuser/.config/regina/regina.cfg --collect --visualize`
This assumes you installed `regina` system-wide.
Now the `regina` command will be run every day, ten minutes after midnight.
After each day, the logs are rotated, so `access.log` becomes `access.log.1`.
Since `regina` is run after the log rotation, you will probably want to run it on `access.log.1`.
#### Logfile permissions
By default, `nginx` logs are `-rw-r----- root root`, so you cannot access them as a regular user.
You could either run regina as root, which I **strongly do not recommend**, or make a root-cronjob that changes ownership of the log after midnight.
Run `sudo crontab -e` and enter:
`9 0 * * * chown your-username /var/log/nginx/access.log.1`
This will make you the owner of the log 9 minutes after midnight, just before `regina` needs read access.
## GeoIP
`regina` can show you which country or city a visitor is from, but you will need an *ip2location* database.
You can acquire such a database for free at [ip2location.com](https://lite.ip2location.com/) (and probably some other sites as well!).
After creating an account you can download several different databases in different formats.
For `regina`, download the `IP-COUNTRY-REGION-CITY` for IPv4 as *csv*.
To configure regina to use the GeoIP database, edit `get_visitor_location` and `get_cities_for_contries` in section `data-collection`.
By default, `regina` only tells you which country a user is from.
Append the two-letter country codes for countries you are interested in to the `get_cities_for_contries` option.
After that, add the GeoIP-data into your database:
```
regina --config regina.cfg --update-geoip path-to-csv
```
Depending on how many countries you specified, this might take a long time. You can delete the `csv` afterwards.
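Since the import can take a while, it may be worth checking the first rows of the csv before starting it. This is a minimal sketch, assuming that `lower` and `upper` are integer-encoded IPv4 range bounds, as in the ip2location csv files. The function name `check_geoip_csv` and the sample rows are made up.

```python
import csv
import io

EXPECTED_COLUMNS = 6  # lower, upper, country-code, country-name, region, city


def check_geoip_csv(handle, max_rows: int = 1000) -> list:
    """Return the 1-based numbers of the inspected rows that do not look right."""
    bad = []
    for number, row in enumerate(csv.reader(handle), start=1):
        if number > max_rows:
            break
        ok = (
            len(row) == EXPECTED_COLUMNS
            and row[0].isdigit()
            and row[1].isdigit()
            and int(row[0]) <= int(row[1])
        )
        if not ok:
            bad.append(number)
    return bad


# Illustrative rows only; the second one is missing the city column
sample = io.StringIO(
    '"100","199","AA","Country A","Region A","City A"\n'
    '"200","299","BB","Country B","Region B"\n'
)
print(check_geoip_csv(sample))
```

For a real file, pass `open("path-to-csv", newline="")` instead of the `io.StringIO` object.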
# CUSTOMIZATION
## Generated html
The generated file does not need to be an html file. The template can be any text file.
`regina` will only replace certain words starting with a `%`.
You can see all supported variables and their values by running `--visualize` with `debug_level = 1`.
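The idea of replacing `%`-words in an arbitrary text file can be sketched as follows. This is not `regina`'s implementation: the function `fill_template` and the variable names `server_name` and `visitor_count` are hypothetical, and the real variable names are the ones printed with `debug_level = 1`.

```python
import re


def fill_template(template: str, values: dict) -> str:
    """Replace %name with values[name]; unknown %words are left untouched."""

    def substitute(match):
        # match.group(1) is the word after %, match.group(0) the whole %word
        return values.get(match.group(1), match.group(0))

    return re.sub(r"%(\w+)", substitute, template)


# Hypothetical variable names and values
values = {"server_name": "my_server.com", "visitor_count": "42"}
template = "visitors of %server_name: %visitor_count (100% static, %unknown)"
print(fill_template(template, values))
```

Because only known words are replaced, the same approach works for a markdown or plain-text template just as well as for html.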
## Data export
If you want to further process the data generated by regina, you can export the data by setting the `data_out_dir` in the `data-export` section.
The data can be exported as `csv` or `pkl`.
If you choose `pkl` as filetype, all rankings will be exported as python type `list[tuple[int, str]]`.
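An exported `pkl` ranking can be loaded in your own script with the standard `pickle` module. The sketch below assumes that the `int` in each tuple is the count and the `str` is the ranked item. The file name and the ranking contents are made up, and a temporary file stands in for a real export.

```python
import pickle
import tempfile
from pathlib import Path


def top_entries(pkl_path: Path, n: int = 3) -> list:
    """Load a ranking and return the n entries with the highest count."""
    with pkl_path.open("rb") as handle:
        ranking = pickle.load(handle)
    return sorted(ranking, key=lambda entry: entry[0], reverse=True)[:n]


# Stand-in for a file exported by regina
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ranking_example.pkl"
    with path.open("wb") as handle:
        pickle.dump([(3, "/index.html"), (12, "/resources/image.jpg"), (7, "/")], handle)
    print(top_entries(path, n=2))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.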
## Database
You can of course work directly with the database, as long as it is not altered.
Editing, adding or deleting entries might make the database incompatible with regina, so only do that if you know what you are doing.
Just querying entries will be fine though.
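One way to make sure a script only queries the database is to open it read-only. The sketch below uses Python's `sqlite3` module with a `mode=ro` URI and lists the tables from `sqlite_master`, so it does not assume any particular table name. The function name `list_tables` is made up.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: Path) -> list:
    """Open the database read-only and return the names of its tables."""
    # mode=ro makes SQLite reject every write on this connection
    uri = db_path.resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Example call, using the default database location mentioned above:
# print(list_tables(Path.home() / ".local/share/regina/my-server.com.db"))
```

Any `INSERT`, `UPDATE` or `DELETE` on such a connection fails with an `sqlite3.OperationalError` instead of changing the file.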
# TROUBLESHOOTING
## General
If you are having problems, try setting the `debug_level` in section `debug` of the configuration file to a non-zero value.
## sqlite3.OperationalError: near "STRICT": syntax error
Your sqlite3 version is probably too old. Check with `sqlite3 --version`. `regina` requires 3.37 or higher.
Hotfix: Remove all `STRICT`s from `<python-dir>/site-packages/regina/sql/create_db.sql`.
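Note that the `sqlite3` command line tool and the SQLite library used by Python can be different versions. A quick way to check the one Python uses is the following sketch, which relies only on the standard `sqlite3` module. The helper name `sqlite_is_recent_enough` is made up.

```python
import sqlite3

REQUIRED = (3, 37, 0)  # STRICT tables were added in SQLite 3.37.0


def sqlite_is_recent_enough(version=sqlite3.sqlite_version_info) -> bool:
    """Compare a version tuple against the minimum needed for STRICT tables."""
    return tuple(version) >= REQUIRED


print("SQLite library used by Python:", sqlite3.sqlite_version)
print("new enough for STRICT tables:", sqlite_is_recent_enough())
```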
# Changelog
## 1.1 (2023-05-17)
- Improved database format:
    - put referrer, browser and platform in their own tables to reduce the size of the database
    - route groups are now part of visualization, not data collection
- Data visualization now uses more sql for improved performance
- Refactored codebase
- Bug fixes
- Changed setup.py to pyproject.toml
## 1.0 (2022-12-14)
- Initial release
# Copyright


@ -1,4 +1,4 @@
.\" Automatically generated by Pandoc 2.19.2
.\" Automatically generated by Pandoc 3.0.1
.\"
.\" Define V font for inline verbatim, using C font in formats
.\" that render this, and otherwise B font.
@ -14,23 +14,28 @@
. ftr VB CB
. ftr VBI CBI
.\}
.TH "NICOLE" "1" "April 2022" "nicole 2.0" ""
.TH "REGINA" "1" "May 2023" "regina 1.1" ""
.hy
.SH NAME
.PP
\f[B]R\f[R]uling \f[B]E\f[R]mpress \f[B]G\f[R]enerating
regina - \f[B]R\f[R]uling \f[B]E\f[R]mpress \f[B]G\f[R]enerating
\f[B]I\f[R]n-depth \f[B]N\f[R]ginx \f[B]A\f[R]nalytics (obviously)
Regina is an analytics tool for nginx.
.SS Description
.PP
\f[V]regina\f[R] is a \f[B]python\f[R] program that generates
\f[B]\f[BI]analytics\f[B]\f[R] for a static webpage served with
\f[B]nginx\f[R].
\f[V]regina\f[R] is easy to deploy and privacy respecting:
.IP \[bu] 2
it collects the data from the nginx logs: no javascript/changes to your
website required
.IP \[bu] 2
data is stored on your device in a \f[B]sqlite\f[R] database, nothing
goes to any cloud
.PP
It parses the log and \f[B]stores\f[R] the important data in an
\f[I]sqlite\f[R] database.
It can then create an analytics html page that has lots of useful
\f[B]plots\f[R] and \f[B]numbers\f[R].
.SH SYNOPSIS
.PP
\f[B]regina\f[R] --config CONFIG_FILE [OPTION\&...]
.SH DESCRIPTION
.PP
It collects information from the nginx access.log and stores it in a
sqlite3 database.
Regina supports several data visualization configurations and can
generate an admin-analytics page from an html template file.
.SS Command line options
.SH COMMAND LINE OPTIONS
.TP
\f[B]-h\f[R], \f[B]--help\f[R]
Show the possible command line arguments
@ -51,24 +56,20 @@ Visualize the data from the database
Recreate the geoip part of the database from the geoip-db csv.
The csv must have this form: lower, upper, country-code, country-name,
region, city
.SH INSTALLATION AND UPDATING
.SH GETTING STARTED
.SS Dependencies
.IP \[bu] 2
\f[B]nginx\f[R]: You need a nginx webserver that outputs the access log
in the \f[V]combined\f[R] format, which is the default
.IP \[bu] 2
\f[B]sqlite >= 3.37\f[R]
.IP \[bu] 2
\f[B]python >= 3.10\f[R]
.IP \[bu] 2
\f[B]python-matplotlib\f[R]
.SS Installation
.PP
To update regina, simply follow the installation instructions.
.SS pacman (Arch Linux)
.PP
Installing regina using the Arch Build System also installs the man-page
and a zsh completion script, if you have zsh installed.
.IP
.nf
\f[C]
git clone https://github.com/MatthiasQuintern/regina.git
cd regina
makepkg -si
\f[R]
.fi
.SS pip
.PP
You can also install regina with python-pip:
You can install regina with python-pip:
.IP
.nf
\f[C]
@ -85,19 +86,245 @@ If you also want to install the man-page and the zsh completion script:
.IP
.nf
\f[C]
sudo cp regina.1.man /usr/share/man/man1/regina.1
sudo gzip /usr/share/man/man1/regina.1
sudo cp _regina.compdef.zsh /usr/share/zsh/site-functions/_regina
sudo chmod +x /usr/share/zsh/site-functions/_regina
sudo cp regina.1.man /usr/share/man/man1/regina.1
sudo gzip /usr/share/man/man1/regina.1
sudo cp regina/package-data/_regina.compdef.zsh /usr/local/share/zsh/site-functions/_regina
sudo chmod +x /usr/local/share/zsh/site-functions/_regina
\f[R]
.fi
.SS Configuration
.PP
The following instructions assume you have an nginx webserver configured
for a website like this, with \f[V]/www\f[R] as root (\f[V]/\f[R]):
.IP
.nf
\f[C]
/www
|-- resources
| |-- image.jpg
|-- index.html
\f[R]
.fi
.PP
By default, nginx will generate logs in the \f[V]combined\f[R] format
with the name \f[V]access.log\f[R] in \f[V]/var/log/nginx/\f[R] and
rotate them daily.
.PP
Copy the default configuration and template from the git directory to a
directory of your choice, in this case \f[V]\[ti]/.config/regina\f[R].
If you did not clone the git repo, the files should be in
\f[V]/usr/local/lib/python3.11/site-packages/regina/package-data/\f[R].
.IP
.nf
\f[C]
mkdir \[ti]/.config/regina
cp regina/package-data/default.cfg \[ti]/.config/regina/regina.cfg
cp regina/package-data/template.html \[ti]/.config/regina/template.html
\f[R]
.fi
.PP
Now edit the configuration to fit your needs.
For our example:
.IP
.nf
\f[C]
[regina]
server_name = my_server.com
access_log = /var/log/nginx/access.log.1
...
[html-generation]
html_out_path = /www/analytics/analytics.html
img_location = /img
[plot-generation]
img_out_dir = /www/analytics/img
\f[R]
.fi
.PP
Most defaults should be fine.
The default configuration should also be documented well enough for you
to know what to do.
It is strongly recommended to only use absolute paths.
.PP
Now you will collect the data from the nginx log specified as
\f[V]access_log\f[R] in the configuration into the database specified at
the \f[V]database\f[R] location (or
\f[V]\[ti]/.local/share/regina/my-server.com.db\f[R] if left blank):
.IP
.nf
\f[C]
regina --config \[ti]/.config/regina/regina.cfg --collect
\f[R]
.fi
.PP
To visualize the data, run:
.IP
.nf
\f[C]
regina --config \[ti]/.config/regina/regina.cfg --visualize
\f[R]
.fi
.PP
This will generate plots and statistics and replace all variables in
\f[V]template_html\f[R] and output the result to
\f[V]html_out_path\f[R].
If \f[V]html_out_path\f[R] is in your webroot, you should now be able to
access the generated site.
.PD 0
.P
.PD
In our example, \f[V]/www\f[R] will look like this:
.IP
.nf
\f[C]
/www
|-- analytics
| |-- analytics.html
| |-- img
| |-- ranking_referer_total.svg
| |-- ranking_referer_last_x_days.svg
| ...
|-- resources
| |-- image.jpg
|-- index.html
\f[R]
.fi
.SS Automation
.PP
You will probably run \f[V]regina\f[R] once per day, after
\f[V]nginx\f[R] has filled the daily access log.
The easiest way to do that is using a \f[I]cronjob\f[R].
Run \f[V]crontab -e\f[R] and enter:
\f[V]10 0 * * * /usr/bin/regina --config /home/myuser/.config/regina/regina.cfg --collect --visualize\f[R]
This assumes you installed \f[V]regina\f[R] system-wide.
.PD 0
.P
.PD
Now the \f[V]regina\f[R] command will be run every day, ten minutes
after midnight.
After each day, the logs are rotated, so \f[V]access.log\f[R] becomes
\f[V]access.log.1\f[R].
Since \f[V]regina\f[R] is run after the log rotation, you will probably
want to run it on \f[V]access.log.1\f[R].
.SS Logfile permissions
.PP
By default, \f[V]nginx\f[R] logs are \f[V]-rw-r----- root root\f[R], so
you cannot access them as a regular user.
You could either run regina as root, which I \f[B]strongly do not
recommend\f[R] or make a root-cronjob that changes ownership of the log
after midnight.
Run \f[V]sudo crontab -e\f[R] and enter:
\f[V]9 0 * * * chown your-username /var/log/nginx/access.log.1\f[R]
This will make you the owner of the log 9 minutes after midnight, just
before \f[V]regina\f[R] needs read access.
.SS GeoIP
.PP
\f[V]regina\f[R] can show you which country or city a visitor is
from, but you will need an \f[I]ip2location\f[R] database.
You can acquire such a database for free at
ip2location.com (https://lite.ip2location.com/) (and probably some other
sites as well!).
After creating an account you can download several different
databases in different formats.
.PD 0
.P
.PD
For \f[V]regina\f[R], download the \f[V]IP-COUNTRY-REGION-CITY\f[R] for
IPv4 as \f[I]csv\f[R].
.PP
To configure regina to use the GeoIP database, edit
\f[V]get_visitor_location\f[R] and \f[V]get_cities_for_contries\f[R] in
section \f[V]data-collection\f[R].
.PD 0
.P
.PD
By default, \f[V]regina\f[R] only tells you which country a user is
from.
Append the two-letter country codes for countries you are interested in
to the \f[V]get_cities_for_contries\f[R] option.
.PD 0
.P
.PD
After that, add the GeoIP-data into your database:
.IP
.nf
\f[C]
regina --config regina.cfg --update-geoip path-to-csv
\f[R]
.fi
.PP
Depending on how many countries you specified, this might take a long
time.
You can delete the \f[V]csv\f[R] afterwards.
.SH CUSTOMIZATION
.SS Generated html
.PP
The generated file does not need to be an html file.
The template can be any text file.
.PD 0
.P
.PD
\f[V]regina\f[R] will only replace certain words starting with a
\f[V]%\f[R].
You can see all supported variables and their values by running
\f[V]--visualize\f[R] with \f[V]debug_level = 1\f[R].
.SS Data export
.PP
If you want to further process the data generated by regina, you can
export the data by setting the \f[V]data_out_dir\f[R] in the
\f[V]data-export\f[R] section.
The data can be exported as \f[V]csv\f[R] or \f[V]pkl\f[R].
.PD 0
.P
.PD
If you choose \f[V]pkl\f[R] as filetype, all rankings will be exported
as python type \f[V]list[tuple[int, str]]\f[R].
.SS Database
.PP
You can of course work directly with the database, as long as it is not
altered.
Editing, adding or deleting entries might make the database incompatible
with regina, so only do that if you know what you are doing.
Just querying entries will be fine though.
.SH TROUBLESHOOTING
.SS General
.PP
If you are having problems, try setting the \f[V]debug_level\f[R] in
section \f[V]debug\f[R] of the configuration file to a non-zero value.
.SS sqlite3.OperationalError: near \[lq]STRICT\[rq]: syntax error
.PP
Your sqlite3 version is probably too old.
Check with \f[V]sqlite3 --version\f[R].
\f[V]regina\f[R] requires 3.37 or higher.
.PD 0
.P
.PD
Hotfix: Remove all \f[V]STRICT\f[R]s from
\f[V]<python-dir>/site-packages/regina/sql/create_db.sql\f[R].
.SH CHANGELOG
.SS 1.0
.SS 1.1
.IP \[bu] 2
Improved database format:
.RS 2
.IP \[bu] 2
put referrer, browser and platform in own table to reduce size of the
database
.IP \[bu] 2
route groups now part of visualization, not data collection
.RE
.IP \[bu] 2
Data visualization now uses more sql for improved performance
.IP \[bu] 2
Refactored codebase
.IP \[bu] 2
Bug fixes
.IP \[bu] 2
Changed setup.py to pyproject.toml
.SS 1.0
.IP \[bu] 2
Initial release
.SH COPYRIGHT
.PP
Copyright \[co] 2022 Matthias Quintern.
Copyright © 2022 Matthias Quintern.
License GPLv3+: GNU GPL version 3 <https://gnu.org/licenses/gpl.html>.
.PD 0
.P


@ -1,19 +1,45 @@
% REGINA(1) regina 1.1
% Matthias Quintern
% April 2022
% May 2023
# NAME
regina - **R**uling **E**mpress **G**enerating **I**n-depth **N**ginx **A**nalytics (obviously)
## Description
`regina` is a **python** <!-- ![python-logo](/resources/img/logos/python.svg "snek make analytics go brr") --> program that generates ***analytics*** for a static webpage served with **nginx**.
`regina` is easy to deploy and privacy respecting:
- it collects the data from the nginx logs: no javascript/changes to your website required
- data is stored on your device in a **sqlite** database, nothing goes to any cloud
It parses the log and **stores** the important data in an *sqlite* <!-- ![sqlite-logo](/resources/img/logos/sqlite.svg) --> database.
It can then create an analytics html page that has lots of useful **plots** and **numbers**.
<!-- ## Capabilities -->
<!-- ### Statistics -->
<!-- `regina` can generate the following statistics: -->
<!-- - visitor count history -->
<!-- - request count history -->
<!-- - referrer ranking *(from which site people visit)* -->
<!-- - route ranking *(accessed files)* -->
<!-- - browser ranking -->
<!-- - platform ranking *(operating systems)* -->
<!-- - city ranking *(where your site visitors are from)* -->
<!-- - country ranking -->
<!-- - mobile visitor percentage -->
<!-- - detect if a visitor is likely to be human or a bot -->
<!-- All of those plots and numbers can be generated for the **last x days** (you can set *x* yourself) and for **all times**. -->
<!-- ### Visualization -->
<!-- `regina` can use the data above to generate a static analytics page in a single html file. -->
<!-- The visitor and ranking histories are included as plots. -->
<!-- You can view an example page [here](https://quintern.xyz/en/software/regina-example.html) -->
<!-- If that is not enough for you, you can write your own script and use data exported by regina or access the database directly. -->
# SYNOPSIS
| **regina** --config CONFIG_FILE [OPTION...]
# DESCRIPTION
Regina is an analytics tool for nginx.
It collects information from the nginx access.log and stores it in a sqlite3 database.
Regina supports several data visualization configurations and can generate an admin-analytics page from an html template file.
## Command line options
# COMMAND LINE OPTIONS
**-h**, **--help**
: Show the possible command line arguments
@ -37,8 +63,8 @@ Regina supports several data visualization configurations and can generate an ad
## Dependencies
- **nginx**: You need a nginx webserver that outputs the access log in the `combined` format, which is the default
- **sqlite >= 3.37**
- **Python >= 3.10**
- **Python/matplotlib**
- **python >= 3.10**
- **python-matplotlib**
## Installation
You can install regina with python-pip:
@ -119,7 +145,7 @@ In our example, `/www` will look like this:
### Automation
You will probably run `regina` once per day, after `nginx` has filled the daily access log. The easiest way to do that is using a *cronjob*.
Run `crontab -e` and enter:
`10 0 * * * /usr/bin/regina --config /home/myuser/.config/regina/regina.conf --collect --visualize`
`10 0 * * * /usr/bin/regina --config /home/myuser/.config/regina/regina.cfg --collect --visualize`
This assumes you installed `regina` system-wide.
Now the `regina` command will be run every day, ten minutes after midnight.
After each day, the logs are rotated, so `access.log` becomes `access.log.1`.
@ -144,7 +170,7 @@ By default, `regina` only tells you which country a user is from.
Append the two-letter country codes for countries you are interested in to the `get_cities_for_contries` option.
After that, add the GeoIP-data into your database:
```
regina --config regina.conf --update-geoip path-to-csv
regina --config regina.cfg --update-geoip path-to-csv
```
Depending on how many countries you specified, this might take a long time. You can delete the `csv` afterwards.