In this post I’ll describe some of the steps I’ve taken to spin up a copy of Plausible Analytics on an existing Kubernetes cluster. I’ll be using Google Kubernetes Engine, but in most regards this should work fine on any provider.
This is actually an article I first penned a few months ago, but held off writing on Medium until I’d given things a little time to bed in. Well, it’s been ticking away quite happily since then — and if you want a more detailed version of the steps involved then that article can be found here.
Why Plausible?
First up: why Plausible? And why self-host it? I’ll keep this brief, I promise, as it’s not the main point of the article 😇
I was drawn to it after reading a few articles on the privacy concerns of using Google Analytics, the solution I was using previously to understand who was using my website and how they were using it. Firefox outright blocks GA, for example, and many people run ad-blockers in other browsers that do the same.
So partly it’s a moral thing, partly it’s an accurate data thing. The lack of any cookies also appealed — no need for annoying permissions banners on my websites! Tangentially I also found it simpler to use for my really basic use-case. Some folks have told me that GA v4 is less invasive but I’ve made the switch now and I’m happy.
If you’d like to read more about this, then Plausible have their own articles comparing the two and more on their philosophy. There’s also a demo.
As for why self-host it? Partly confidence in where the data is being held, but mostly because I wanted to learn about how it worked, to be honest 👷‍♂️. If you work for a company looking to use Plausible, I’d personally consider their hosted option first if you don’t need to keep the data on your own servers.
Running It On Kubernetes
Okay, back on topic now I promise! Plausible provide guidance in their docs and their hosting repo has some examples, but as I mentioned above I found I needed to make some tweaks.
My working code can be found on my GitHub. I’m still using this tool today so that repo should be appropriately fed and watered too. The pipeline is put together using GitLab (where it’s mirrored from), but can be adapted quite easily to your chosen CI/CD tool, I’m sure.
Broadly, we need to solve for the following:
- Get the basic components running — the main app, its Postgres database and its Clickhouse database.
- Come up with a way of generating and storing the relevant secrets for it to work. There are quite a few of them: the credentials for Plausible itself, Postgres credentials, Clickhouse credentials, email settings, and your Twitter token if you use that.
- Get the emailed reports working.
- Deal with the fact that your copy of Plausible is going to be behind some load balancers.
- Set up data backups and recovery.
- (Optional) Proxy the tracking JavaScript (depending on how comfortable you are doing this).
I’ll give a brief overview of each step below. For a fuller version (including code samples etc.), see the post on my personal blog: https://alexos.dev/2022/03/26/hosting-plausible-analytics-on-kubernetes/
The Basic Components & Secret Handling
I began with their Kubernetes manifests but adapted them using kustomize. This is important for controlling resource usage (especially disk space for a tool like this). You can see the structure I adopted in this part of my repo. This included dropping the mail server completely. I doubt that would work well on GCP anyway, and I chose to use Sendgrid instead (they have a free tier that’s sufficient for my needs).
Note the use of a secretGenerator in the base to ensure that a new secret is created on any change, so that the pods automatically restart to use it. The secret values themselves are held in Google Secret Manager and patched in at deploy time (other clouds have similar, or you could use Vault).
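A minimal sketch of this pattern might look like the following. The resource file names, the plausible-config secret name and the plausible-conf.env file are assumptions for illustration rather than the exact contents of the repo. The env file would be written by the pipeline from the secret store at deploy time and never committed.

```yaml
# kustomization.yaml (illustrative sketch; all names below are assumptions)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: plausible
resources:
  - plausible.yaml
  - postgres.yaml
  - clickhouse.yaml
secretGenerator:
  # kustomize appends a hash of the contents to the generated Secret's name
  # and rewrites references to it, so any change to the values produces a
  # new Secret and a new rollout of the pods that consume it.
  - name: plausible-config
    envs:
      - plausible-conf.env   # KEY=value lines, populated at deploy time
```

The useful property here is that nothing has to remember to restart the pods: changing a value changes the generated name, which changes the pod template.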
I also swapped out the latest tags on the container images for pinned versions. Relying on latest is rarely a good plan in Production! Here’s a good explanation why.
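With kustomize, one way to pin a version is the images transformer, sketched below. The tag value is a placeholder, not a recommendation; substitute whichever release you have actually tested.

```yaml
# kustomization.yaml fragment (illustrative sketch)
images:
  - name: plausible/analytics
    # Placeholder: replace with the specific release tag you have tested
    newTag: THE_RELEASE_TAG_YOU_HAVE_TESTED
```

The same approach applies to the Postgres and Clickhouse images, each with its own entry in the list.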
Email Reports
As mentioned above, given the constraints on GCP I opted not to muck about trying to persuade a mail server to work within GKE, and instead signed up for Sendgrid. At the scale I’m working with, I’m well within the free tier usage. I’m not going to go through setting up your Sendgrid account itself in detail as writing that down is unlikely to age well and I found it to be pretty straightforward. The important part is to generate yourself an API Key in there, which we then inject into the Plausible config.
The following represents the particular combination of variables that I found did the trick:
MAILER_EMAIL=THE_EMAIL_ADDRESS_YOU_CONFIGURE_IN_SENDGRID
SMTP_HOST_ADDR=smtp.sendgrid.net
SMTP_HOST_PORT=465
SMTP_HOST_SSL_ENABLED=true
SMTP_USER_NAME=apikey
SMTP_USER_PWD=$SENDGRID_KEY
SMTP_RETRIES=2
Take care to substitute the SMTP_USER_PWD securely, as described earlier.
I found the easiest way to test this was creating an extra user in Plausible and using the forgotten password link.
X-Forwarded-For Behind a Load Balancer
For Plausible to work behind a reverse proxy load balancer like the Kubernetes nginx-ingress-controller, some further tweaks are needed. You’ll know this is needed if the visitor countries / unique visitor tracking is not working. I found that the following configuration for the nginx-ingress-controller did the trick:
hsts: "true"
ssl-redirect: "true"
use-forwarded-headers: "false" # not needed as not behind L7 GCLB, but YMMV
enable-real-ip: "true"
compute-full-forwarded-for: "true"
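For context, these keys live in the data section of the ConfigMap that the ingress controller reads at startup. A sketch of the full resource is below; the name and namespace are assumptions and depend on how the controller was installed in your cluster.

```yaml
# ConfigMap consumed by the ingress-nginx controller.
# The name and namespace are assumptions; check your own installation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  hsts: "true"
  ssl-redirect: "true"
  use-forwarded-headers: "false"   # only trust incoming X-Forwarded-* if an L7 proxy sits in front
  enable-real-ip: "true"
  compute-full-forwarded-for: "true"
```

Note that every value is a quoted string, since ConfigMap data only accepts strings.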
I also needed to edit my Service of type: LoadBalancer to have spec.externalTrafficPolicy: Local. This affects evenness of load balancing a little, but was required for this to work and I didn’t particularly mind this downside at my scale.
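As a sketch, the relevant part of the Service looks something like this, written as a partial manifest suitable for a kustomize patch. The name and namespace are again assumptions about how the ingress controller was installed.

```yaml
# Partial Service manifest (illustrative patch; name and namespace are assumptions)
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  # Local keeps the original client source IP by only routing to pods on the
  # node that received the traffic, at the cost of less even load spreading.
  externalTrafficPolicy: Local
```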
Data Backups
I wanted something in place here to allow me to recover historic data in the two databases in the event of needing to recreate my cluster (or something going wrong with it). I use this cluster to experiment with Kubernetes features fairly regularly, so there’s a certain inevitability in this for me at least 🙈
The Clickhouse database rules out a cloud-hosted PaaS option like Cloud SQL. A further issue was that, as per this GitHub discussion, an implementation choice in Plausible means the “standard” utility within Clickhouse can’t be used either ... 💩
Instead, I ran both databases inside Kubernetes, and used Velero to snapshot the disks periodically. This is far from perfect but sufficient for my needs — I can live with a gap in data in the event of catastrophe. Velero is neat in that it is k8s-native but can make use of object storage to snapshot the persistent disks (it’s great at backing up other k8s resources too, should you need that).
Velero is taking a snapshot of block storage whilst the disks are in use. This carries risk! At the small scale I’m using Plausible at, this is not as likely to be an issue, but if you’re expecting high load then it might be — you’ll definitely want to set up alerts for failed backups, and test this frequently … 😬
The setup of Velero is not part of my public GitHub, but it is pretty easy to do to be honest, and their docs are good. You’ll need to create a storage bucket and configure the VolumeSnapshotLocation and BackupStorageLocation for that, as well as ensuring the credentials are available to the runtime (if on GKE, Workload Identity is a fantastic way to do this securely and with minimal configuration hassle).
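For orientation, a minimal sketch of those two resources on GCP might look like the following. It assumes the Velero GCP plugin is installed and Velero runs in the velero namespace; the bucket name is a placeholder.

```yaml
# Illustrative sketch, assuming the Velero GCP plugin and the "velero" namespace
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: gcp
  objectStorage:
    bucket: YOUR_VELERO_BUCKET   # placeholder: the storage bucket you created
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: gcp
```

The Velero installer can generate equivalents of these for you, so treat this as a picture of the end state rather than something you necessarily write by hand.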
My backup schedule specifically for Plausible looks like this:
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: plausible-pv-backup
spec:
  schedule: "0 */1 * * *"
  template:
    includedResources:
      - persistentvolumeclaims
      - persistentvolumes
    includeClusterResources: true
    includedNamespaces:
      - 'plausible'
    ttl: 168h0m0s
As you can see, I’m not very sensitive about the freshness of the data here — snapshotting hourly is fine by me.
Restores rely on you setting up the Velero client and using a combination of the velero backup get and velero restore create commands to select and then restore the data, replacing the old disks (or creating new ones on your new cluster, if applicable).
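The same restore can also be expressed declaratively as a Velero Restore resource, which is the kind of object the velero restore create command submits to the cluster. The sketch below is illustrative: the resource name is hypothetical, the backup name is a placeholder for one listed by velero backup get, and it assumes Velero runs in the velero namespace.

```yaml
# Illustrative sketch of a declarative restore (names are assumptions)
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: plausible-pv-restore
  namespace: velero
spec:
  # Placeholder: a backup name produced by the schedule
  backupName: NAME_OF_THE_BACKUP_TO_RESTORE
  includedNamespaces:
    - plausible
  restorePVs: true   # recreate the persistent volumes from their snapshots
```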
Proxying the Request
Now then, decision time. Whether you do this or not is a bit of a moral choice. By disguising the Plausible tracking javascript like this, you are being a bit disingenuous with your users — although keep in mind that their code is respectful. Despite their approach, some browsers / browser extensions are sensitive enough to block the Plausible tracker, assuming it is just as naughty as the Google one. This technique helps you avoid that for more accurate analytics capture, if you’re ok with that.
In my case, the majority of my sites are served via NGINX, so the guidance here covers what I need. You can see one of my examples of this customisation here.
You then update your tracking code to point to /js/visits.js instead, something like this:
<script defer data-api="/api/event" data-domain="yourwebsite.com" src="/js/visits.js"></script>
You could of course host the JavaScript itself within your site code too (skipping the first location block), although you still need to make sure the /api/event calls are proxied on to your self-hosted Plausible instance to capture the visits. You’d also not pick up upgrades to the tracking code automatically if you did this.
Summary
I hope you found that run-through useful — I’ve been using it like this for around 6 months now without issue, for around half a dozen sites. I’ll be keeping my repo up to date with any tweaks I make along the way, so hopefully that will continue to prove a useful resource for anyone attempting to do what I have.
Oh, and if you work for a company looking to use Plausible, do consider their hosted option if you don’t need to keep the data on your own servers. It looks pretty sensibly priced to me, and helps them to keep improving their product!