The original post: /r/programming by /u/sgielen on 2024-12-30 13:44:28.
TLDR: Run pglocklogger on your prod databases. Whenever there’s an outage due to blocking in PostgreSQL, it may give you essential observability.
Recently, I experienced a prod outage caused by a blocked migration.
The migration itself was almost instantaneous, but it picks up a lock before it actually runs. That may take a while, but that’s no problem, right? The migration runs a bit later, no worries.
That’s when I learned that an ALTER TABLE waiting on a lock, itself also blocks benign work that would otherwise run concurrently. So having your blocked ALTER TABLE can spiral into an application-wide outage quite easily.
For example, a SELECT WITH UPDATE can run simultaneously with a SELECT, but not if an ALTER TABLE is executed in between. This is to prevent livelocks. A simple solution is to set a lock_timeout in your migrations, which causes the ALTER to fail early and allow all other work to continue. Then, retry the migration later.
I figured this out during the post-mortem analysis, but during the outage, this was hard to infer, even from PostgreSQL’s own slow query log. There’s a number of resources on how to discover this during an outage, but afterwards, all that information is gone.
Not with pglocklogger. It inspects PostgreSQL’s in-memory state (using in-memory tables intended for this) and logs the output when it crosses particular thresholds. In an outage, or during post-mortem investigations, this may turn out to be an essential observability tool.