Copyright 2020 OpenStack Foundation

This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode

Website Activity Stats

https://storyboard.openstack.org/#!/story/2007387

Basic website activity stats, such as which pages are hit most often, which requests result in 404s, and the total number of visitors, help in running a site properly. With this information you can fix broken links or redirect users to appropriate locations. Popular pages can be given more attention because they are read most often. Visitor numbers help show whether changes are having the intended effect.

Unfortunately, for a long time we have not published any of this useful data.

Problem Description

One of the major reasons we have not published this data historically is that many of the tools that work with it overshare. We are particularly concerned about publishing information that might be attributed to specific users. Ideally, we would publish the bare minimum of information that allows web admins to manage sites properly without leaking personal information.

In particular, we don’t want to leak IP addresses or subnets: IPs are considered PII, and without significant traffic a subnet typically identifies specific users. We also want to avoid publishing referrer information, as it can be used to infer who users are. This can happen when users follow links from internal company wikis, bug trackers, or code hosting systems.

Out of an abundance of caution we will also avoid publishing operating system, web browser, and Google search term data. This data is likely safe to share, particularly if we avoid making it cross-referenceable with other fields, so we may add these stats in the future.

Proposed Change

We can use goaccess, an open source tool, to produce conservative website stats reports from Apache access logs. The key here is that newer goaccess releases (as packaged since Ubuntu Bionic) allow you to remove data from the resulting report files. This allows us to tell goaccess to produce reports containing only the data we feel is safe for public consumption.

We would run periodic Zuul jobs that connect to static.opendev.org, uncompress Apache log files as necessary, then feed them through goaccess. The resulting report.html output file could then be written into AFS as well as hosted directly from the Zuul logs system. This would give us reports that update roughly daily, covering the period for which logs are available.

To make this possible we will use Zuul’s per-project SSH keys. This will allow the jobs to add static.opendev.org to the running Ansible inventory, then run Ansible to perform the above steps.

If publishing into AFS, we would write the reports to a known location for each site:

https://example.website.org/goaccess.html

To do this we need a configuration file that excludes the panels we do not want:

log-format COMBINED

ignore-panel VISITORS
ignore-panel REQUESTS
ignore-panel REQUESTS_STATIC
ignore-panel NOT_FOUND
ignore-panel HOSTS
ignore-panel OS
ignore-panel BROWSERS
ignore-panel VISIT_TIMES
ignore-panel VIRTUAL_HOSTS
ignore-panel REFERRERS
ignore-panel REFERRING_SITES
ignore-panel KEYPHRASES
ignore-panel STATUS_CODES
ignore-panel REMOTE_USER
ignore-panel GEO_LOCATION

enable-panel VISITORS
enable-panel REQUESTS
enable-panel REQUESTS_STATIC
enable-panel NOT_FOUND
enable-panel STATUS_CODES

Then we can run (roughly) this command in the Zuul jobs:

goaccess /var/log/apache2/example.site.org_access.log* -o example-site-report.html -p ./goaccess.conf
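The "uncompress as necessary" step could be sketched as follows. This is only an illustration, not the job's actual implementation: the helper names merge_logs and goaccess_command are hypothetical, rotated logs are assumed to carry a .gz suffix, and the command uses only the -o and -p flags shown above.

```python
import gzip
import shutil
from pathlib import Path


def merge_logs(log_paths, merged_path):
    """Concatenate plain and gzip-compressed logs into one plain-text file."""
    with open(merged_path, "wb") as out:
        for path in log_paths:
            path = Path(path)
            # Assumption: rotated logs end in .gz; anything else is
            # treated as an uncompressed log.
            opener = gzip.open if path.suffix == ".gz" else open
            with opener(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return merged_path


def goaccess_command(merged_path, report_path, config_path):
    """Build the goaccess invocation as an argument list for subprocess."""
    return [
        "goaccess", str(merged_path),
        "-o", str(report_path),
        "-p", str(config_path),
    ]
```

A job could pass the list from goaccess_command to subprocess.run once the merged file exists. Merging into a temporary file keeps the original logs on the server untouched.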

Alternatives

We can use trackers that run in the browser, like goatcounter. One downside to this approach is that we would need to run custom 404 pages in order to collect data on 404s. This is more complicated than the web server logs approach. One upside is that we could track referrers to 404s, enabling us to more easily fix our own broken links.

If we wanted to collect a rich set of data, these tools would provide much more information, but because we’ve decided that we do not want to collect that information, the server logs should be sufficient.

Implementation

Assignee(s)

Primary assignee:

Clark Boylan (clarkb)

Gerrit Topic

Use Gerrit topic “website-stats” for all patches related to this spec.

git-review -t website-stats

Work Items

  • Write Zuul jobs to produce and publish the goaccess reports.

  • Document goaccess tooling for web admins.

Repositories

None

Servers

static.opendev.org would be updated to implement this for the sites it hosts.

DNS Entries

None

Documentation

We will need to document where the stats can be retrieved once available. We should also document the choices we made around which data is collected.

Security

We could potentially leak sensitive client information unintentionally. The example config file above tries to avoid that by explicitly disabling all available goaccess panels and then enabling only the few we know are safe.
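The "disable everything, then enable the safe few" rule can be checked mechanically before a config change is merged. The sketch below is a hypothetical lint, not part of goaccess. The panel names are taken from the example config, while SAFE_PANELS, SENSITIVE_PANELS, and check_config are names invented here for illustration.

```python
# Panels the spec considers safe to publish.
SAFE_PANELS = {
    "VISITORS", "REQUESTS", "REQUESTS_STATIC", "NOT_FOUND", "STATUS_CODES",
}
# Panels from the example config that must stay disabled.
SENSITIVE_PANELS = {
    "HOSTS", "OS", "BROWSERS", "VISIT_TIMES", "VIRTUAL_HOSTS", "REFERRERS",
    "REFERRING_SITES", "KEYPHRASES", "REMOTE_USER", "GEO_LOCATION",
}


def parse_panels(config_text):
    """Return (ignored, enabled) sets of panel names from config text."""
    ignored, enabled = set(), set()
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        if parts[0] == "ignore-panel":
            ignored.add(parts[1])
        elif parts[0] == "enable-panel":
            enabled.add(parts[1])
    return ignored, enabled


def check_config(config_text):
    """Return a list of problems; an empty list means the config passes."""
    ignored, enabled = parse_panels(config_text)
    problems = []
    for panel in sorted(SENSITIVE_PANELS - ignored):
        problems.append(f"{panel} is not ignored")
    for panel in sorted(enabled - SAFE_PANELS):
        problems.append(f"{panel} is enabled but not on the safe list")
    return problems
```

This only inspects the text of the config file. It makes no claim about how goaccess itself resolves a panel that is both ignored and enabled.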

Testing

We can run the new job against test data to ensure it works as expected without disclosing unwanted info.
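Test data for such a run could be generated rather than copied from production logs. The following is a minimal sketch that assumes the standard Apache combined log format. The function names, the user agent string, and the referrer URL are made up for illustration, and the client addresses come from the 192.0.2.0/24 documentation range (RFC 5737).

```python
from datetime import datetime, timedelta, timezone


def combined_log_line(ip, when, path, status, size, referer="-", agent="-"):
    """Format one GET request in Apache combined log format."""
    timestamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    return (
        f'{ip} - - [{timestamp}] "GET {path} HTTP/1.1" '
        f'{status} {size} "{referer}" "{agent}"'
    )


def sample_log(count=6):
    """Return synthetic log lines mixing successful requests and 404s."""
    start = datetime(2020, 3, 1, tzinfo=timezone.utc)
    lines = []
    for i in range(count):
        # Documentation-range addresses only, so no real client appears.
        ip = f"192.0.2.{i + 1}"
        # Every third request hits a missing page to exercise 404 reporting.
        missing = i % 3 == 2
        lines.append(
            combined_log_line(
                ip,
                start + timedelta(minutes=i),
                "/missing.html" if missing else "/index.html",
                404 if missing else 200,
                512,
                referer="https://wiki.internal.example/page",
                agent="ExampleBrowser/1.0",
            )
        )
    return lines
```

After feeding these lines through the job, searching the generated report for 192.0.2. or wiki.internal.example is one way to confirm that neither client addresses nor referrers were published.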

Dependencies

None