Reliability & Observability

The frontend isn't done when it renders

Observability, blue-green delivery, and rollback as first-class ideas: the enterprise version I can only describe, and the version running on my own hardware that you can poke at right now.

Context: Enterprise delivery systems with firm-wide observability, plus a self-hosted proof point I fully own.
Core technologies: OpenTelemetry, Jenkins, Spinnaker, Grafana Loki, Caddy, blue-green deployment

A lot of frontend engineers think “done” means “it renders and the tests are green.” On a regulated, high-scale platform that definition is too small by half. The same code shows up in the deploy, in the incident channel at 3am, and in the traces an on-call engineer is squinting at while production is on fire. If you can’t see what your surface is doing in production, and you can’t pull a bad release back out cleanly, you didn’t finish. You just stopped typing.

On the enterprise side, the machinery was real and I worked at the seam of it. Delivery ran through Jenkins and Spinnaker with blue-green deploys, so a new version came up next to the old one and traffic cut over, which also means it could cut back without a destructive in-place replacement. Services were instrumented with OpenTelemetry as part of a firm-wide push, so frontend behavior could be correlated with backend traces instead of guessed at from a screenshot and a vibe. Underneath, the platform was modernizing toward container-first delivery and had moved its gateway off Zuul onto Spring Cloud Gateway.

That’s about as far as I’ll go on the employer’s systems, on purpose. So let me show you the part I can actually hand you.

The proof you can poke at

This site is the demonstration. rosewire.net is a static Astro build that I deploy to a Caddy file server on hardware I own and run, and the deploy is built on the same instincts the enterprise work is about, just shrunk to one person’s scale:

  • The deploy script does nothing until you pass --apply. Run it bare and it prints exactly what it would do. No surprise live changes. I’ve been burned by “oops that was prod” enough times to make that the default.
  • Before the web root gets replaced, the current live site is copied to a timestamped archive. Rolling back is a single mv of a directory, not a panicked git-archaeology session while the site is down.
  • There’s no server-side rendering, no production Node process, no extra container. The artifact is static HTML and nothing about a portfolio needs more. Less running in production is less that can break at 3am, and that’s a reliability decision, not laziness.

Blue-green at a bank and “mv the old directory back” on my homelab are the same idea wearing different budgets: never destroy the thing that’s currently working.

The whole serving and rollback model is written up on the colophon, and the broader stack it runs on (Proxmox, Cloudflare, and a pile of services I actually use) gets its own writeup in “Running my own infrastructure.”

The reason I lead the reliability story with my own infra instead of the bank’s is simple: I can prove every word of it, you can inspect it, and there’s no NDA between us. That’s the honest version, and honest is more convincing than impressive anyway.