In this post I’ll describe some of the steps I’ve taken to spin up a copy of Plausible Analytics on an existing Kubernetes cluster. I’ll be using Google Kubernetes Engine, but in most regards this should work fine on any provider.
This is actually an article I first penned a few months ago, but held off writing on Medium until I’d given things a little time to bed in. Well, it’s been ticking away quite happily since then — and if you want a more detailed version of the steps involved then that article can be found here.
First up: why Plausible? And why self-host it? I'll keep this brief, I promise, as it's not the main point of the article 😇
I was drawn to it after reading a few articles on the privacy concerns of using Google Analytics, the solution I was previously using to understand who was using my website and how they were using it. Firefox outright blocks GA, for example, and many people run ad-blockers in other browsers that do the same.
So partly it's a moral thing, partly it's an accurate-data thing. The lack of any cookies also appealed: no need for annoying permissions banners on my websites! Tangentially, I also found it simpler to use for my really basic use-case. Some folks have told me that GA v4 is less invasive, but I've made the switch now and I'm happy.
As for why self-hosting it? Partly confidence in where the data is being held, but mostly because I wanted to learn about how it worked, to be honest 👷♂️. If you work for a company looking to use Plausible, I’d personally consider their hosted option first if you don’t need to keep the data on your own servers.
Running It On Kubernetes
My working code can be found on my GitHub. I'm still using this tool today, so that repo should be appropriately fed and watered too. The pipeline is put together using GitLab (where it's mirrored from), but I'm sure it can be adapted quite easily to your chosen CI/CD tool.
Broadly, we need to solve for the following:
- Get the basic components running: the main app, its Postgres database and its ClickHouse database.
- Come up with a way of generating and storing the relevant secrets for it to work. There are quite a few of them: the credentials for Plausible itself, Postgres credentials, ClickHouse credentials, email settings, and your Twitter token if you use that.
- Get the emailed reports working.
- Deal with the fact that your copy of Plausible is going to be behind some load balancers.
- Set up data backups and recovery.
I’ll give a brief overview of each step below. As I mentioned, for a more detailed version (including code samples etc) see my more detailed post on my personal blog: https://alexos.dev/2022/03/26/hosting-plausible-analytics-on-kubernetes/
The Basic Components & Secret Handling
I began with their Kubernetes manifests but adapted them using kustomize. This is important for controlling resource usage (especially disk space for a tool like this). You can see the structure I adopted in this part of my repo. This included dropping the mail server completely: I doubt that would work well on GCP anyway, and I chose to use SendGrid instead (they have a free tier that's sufficient for my needs).
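As an illustration of the resource-usage point, an overlay can patch the persistent volume claims from the upstream manifests down (or up) to a size you're comfortable with. A sketch, where the overlay path, PVC name and size are all assumptions:

```yaml
# overlays/prod/kustomization.yaml (sketch; names and sizes are assumptions)
resources:
  - ../../base
patches:
  - target:
      kind: PersistentVolumeClaim
      name: plausible-events-db   # the ClickHouse PVC; your name will vary
    patch: |-
      - op: replace
        path: /spec/resources/requests/storage
        value: 20Gi
```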
Note the use of a secretGenerator in the base to ensure that a new Secret is created on any change, so that the pods automatically restart to use it. The secret values themselves are held in Google Secret Manager and patched in at deploy time (other clouds have similar offerings, or you could use Vault).
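In kustomize terms, that looks something like the following. The generator appends a hash of the contents to the Secret's name, so any change produces a new name and rolls the Deployments that reference it (the file and Secret names here are assumptions):

```yaml
# base/kustomization.yaml (sketch)
secretGenerator:
  - name: plausible-config
    envs:
      - plausible-conf.env   # placeholder values, patched in from Secret Manager at deploy time
generatorOptions:
  disableNameSuffixHash: false   # keep the hash suffix so secret changes roll the pods
```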
I also subbed out the latest tags in the container images: pinning to :latest is rarely a good plan in production! Here's a good explanation of why.
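With kustomize, the pinning can be done declaratively via the images field. The image names match the upstream manifests, but the tags below are placeholders, not recommendations; pick versions you've actually tested:

```yaml
# kustomization.yaml (sketch; tags are illustrative)
images:
  - name: plausible/analytics
    newTag: v1.4.4
  - name: clickhouse/clickhouse-server
    newTag: 22.6-alpine
  - name: postgres
    newTag: 14-alpine
```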
Getting Emailed Reports Working
As mentioned above, given the constraints on GCP I opted not to muck about trying to persuade a mail server to work within GKE, and instead signed up for SendGrid. At the scale I'm working with, I'm well within the free tier. I'm not going to go through setting up your SendGrid account itself in detail, as writing that down is unlikely to age well and I found it to be pretty straightforward. The important part is to generate yourself an API key in there, which we then inject into the Plausible config.
The following represents the particular combination of variables that I found did the trick:
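Something along these lines, assuming SendGrid's SMTP relay (the variable names are from Plausible's configuration docs; the values are placeholders to adapt):

```ini
# plausible-conf.env (sketch)
MAILER_EMAIL=plausible@yourdomain.com
SMTP_HOST_ADDR=smtp.sendgrid.net
SMTP_HOST_PORT=587
SMTP_HOST_SSL_ENABLED=false
SMTP_USER_NAME=apikey
SMTP_USER_PWD=<your SendGrid API key>
```

Note that for SendGrid the SMTP username is the literal string apikey, with the API key itself as the password.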
Take care to substitute the SMTP_USER_PWD securely, as described earlier.
I found the easiest way to test this was creating an extra user in Plausible and using the forgotten password link.
X-Forwarded-For Behind a Load Balancer
For Plausible to work behind a reverse-proxying load balancer like the Kubernetes nginx-ingress-controller, some further tweaks are needed. You'll know they're needed if visitor countries or unique-visitor tracking aren't working. I found that the following configuration for the nginx-ingress-controller did the trick:
use-forwarded-headers: "false" # not needed as not behind L7 GCLB, but YMMV
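For context, that key lives in the controller's ConfigMap. A minimal sketch, assuming the chart-default name and namespace (adjust to your install):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  use-forwarded-headers: "false"   # set "true" if an L7 load balancer sits in front
```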
I also needed to edit my Type: LoadBalancer Service to set spec.externalTrafficPolicy: Local. This affects the evenness of load balancing a little, but was required for this to work, and I didn't particularly mind that downside at my scale.
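The relevant part of the Service might look like this (the metadata is assumed; ports and selector stay as they are in your existing Service):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve client source IPs for geolocation
  # ports and selector as per your existing Service
```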
Backups & Recovery
I wanted something in place here to allow me to recover historic data in the two databases in the event of needing to recreate my cluster (or something going wrong with it). I use this cluster to experiment with Kubernetes features fairly regularly, so there's a certain inevitability in this for me at least 🙈
The ClickHouse database rules out a cloud-hosted PaaS option like Cloud SQL. A further issue was that, as per this GitHub discussion, an implementation choice in Plausible means the standard backup utility within ClickHouse can't be used either ... 💩
Instead, I ran both databases inside Kubernetes, and used Velero to snapshot the disks periodically. This is far from perfect but sufficient for my needs — I can live with a gap in data in the event of catastrophe. Velero is neat in that it is k8s-native but can make use of object storage to snapshot the persistent disks (it’s great at backing up other k8s resources too, should you need that).
Velero is taking a snapshot of block storage whilst the disks are in use. This carries risk! At the small scale I'm running Plausible at, this is unlikely to be an issue, but if you're expecting high load then it might be: you'll definitely want to set up alerts for failed backups, and test this frequently … 😬
The setup of Velero is not part of my public GitHub, but it is pretty easy to do, to be honest, and their docs are good. You'll need to create a storage bucket and configure the BackupStorageLocation for that, as well as ensuring the credentials are available at runtime (if on GKE, Workload Identity is a fantastic way to do this securely and with minimal configuration hassle).
My backup schedule specifically for Plausible looks like this:
schedule: 0 */1 * * *
As you can see, I’m not very sensitive about the freshness of the data here — snapshotting hourly is fine by me.
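Fleshed out into a full Velero Schedule object, that might look like the following; the object name, namespaces and retention are all assumptions to adapt:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: plausible-hourly   # name is an assumption
  namespace: velero
spec:
  schedule: "0 */1 * * *"    # top of every hour
  template:
    includedNamespaces:
      - plausible            # namespace is an assumption
    snapshotVolumes: true    # snapshot the Postgres and ClickHouse disks
    ttl: 168h0m0s            # retain a week of backups
```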
Restores rely on you setting up the Velero client and using a combination of the velero backup get and velero restore create commands to select and then restore the data, replacing the old disks (or creating new ones on your new cluster, if applicable).
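In practice, a restore session looks roughly like this (the backup name shown is illustrative):

```shell
velero backup get                                    # list available backups
velero restore create --from-backup plausible-hourly-20240101020000
velero restore get                                   # check restore status
```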
Proxying the Request
To stop ad-blockers filtering out calls to a known analytics domain, you can serve the tracking script and event API from your own website's domain and proxy them through to Plausible. You then update your tracking code to point to /js/visits.js, something like this:
<script defer data-api="/api/event" data-domain="yourwebsite.com" src="/js/visits.js"></script>
Calls to /api/event are then proxied on to your self-hosted Plausible instance to capture the visits. (If you instead served a static copy of the tracking script, you'd also not pick up upgrades to the tracking code automatically.)
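On the ingress side, the proxying of the script might be sketched like this. The Service name, port and host are assumptions, and the rewrite target reflects that Plausible serves its script at /js/script.js by default; /api/event would need a matching rule in a separate Ingress without the rewrite annotation:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: plausible-script-proxy
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /js/script.js
spec:
  ingressClassName: nginx
  rules:
    - host: yourwebsite.com
      http:
        paths:
          - path: /js/visits.js
            pathType: Exact
            backend:
              service:
                name: plausible
                port:
                  number: 8000
```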
I hope you found that run-through useful. I've been using it like this for around six months now without issue, across half a dozen sites. I'll be keeping my repo up to date with any tweaks I make along the way, so hopefully that will continue to prove a useful resource for anyone attempting to do what I have.
Oh, and if you work for a company looking to use Plausible, do consider their hosted option if you don’t need to keep the data on your own servers. It looks pretty sensibly priced to me, and helps them to keep improving their product!