jamiesonbecker · 2026-04-16T22:37:06 1776379026

Rotating keys is easy with the right software. (I work @ Userify.) Agree with the auditing point.

The problem with token-based keys, to tptacek's point, is that they can be a giant pain once you start scripting across fleets.

jamiesonbecker · 2026-04-16T21:52:57 1776376377

One key per device is exactly what we recommend too. Private keys should always be protected as much as possible within that device and should never leave that device.

Just paste all of your devices' public keys into your authorized_keys file and leave a comment at the end of each line saying which device it's for. In Userify, it literally goes right into your nodes' authorized_keys file almost verbatim. (disclaimer: I work at https://Userify.com)

And then, if you leave your token or laptop at the airport or whatever, just remove that key right from your phone and it'll take effect in seconds across all the nodes/instances (if you're using Userify), or you can just write a quick for loop with an inline sed to remove it from your authorized_keys everywhere.
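
A minimal sketch of that kind of removal, assuming every key line ends with a single-token device comment as described above. The function name `remove_device_key`, the label `lost-laptop` and the host names are placeholders, and it uses `awk` plus a temp file instead of in-place `sed` so it behaves the same on GNU and BSD systems:

```shell
#!/bin/sh
# Remove every key whose trailing comment equals the given device label.
# Assumes the comment is one token (e.g. "lost-laptop") at the end of the line.
remove_device_key() {
    file=$1
    label=$2
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    # Keep only the lines whose last field is NOT the label.
    awk -v d="$label" '$NF != d' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    chmod 600 "$tmp"
    # Rename over the original so readers never see a half-written file.
    mv "$tmp" "$file"
}

# Fleet-wide variant with inline sed (assumes GNU sed on the remote side;
# host names are placeholders):
#   for h in web1 web2 db1; do
#       ssh "$h" "sed -i '/ lost-laptop\$/d' ~/.ssh/authorized_keys"
#   done
```

Matching on the whole last field rather than a substring avoids deleting `laptop-2` when you meant `laptop`.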

jamiesonbecker · 2026-04-07T17:00:32 1775581232

The next one linked at the bottom, https://jonno.nz/posts/stealing-nanoclaw-patterns-for-webapp... has this bold and frankly unbelievable claim:

"70% of startups fail due to premature scaling"

... which is a link to another blog post somewhere else that says nothing even slightly related.

jamiesonbecker · 2026-04-04T00:58:48 1775264328

The experience might be better right up until you're running it in prod and someone happens to ask about:

   Cert revocation (or even expiration)

   Sudo roles

   User removal and process termination

   Is the cert server HA and locked down

   How you log in when the cert server is down or under attack (rich target!)

   How to easily add Alice to server group A, Bob to B, and Carlos to both A and B, and then to remove them...

(Disclaimer: we're celebrating our 15th anniversary at https://Userify.com, but those are actually legit concerns and not only a sales pitch. You certainly can build a solid and secure ssh cert infra, but doing it in production is just not an easy set-it-and-forget-it sort of thing.)

waynesonfire · 2026-04-04T01:59:33 1775267973

Sorry to pop your bubble, but a SaaS is the worst possible option.

jamiesonbecker · 2026-04-04T03:03:03 1775271783

Then install your own:

    curl i.userify.com | sudo -sE

jamiesonbecker · 2026-04-03T16:26:46 1775233606

SSH certs quietly hurt in prod. Short-lived creds + centralized CA just moves complexity upward without solving the core problem: user management.

The system shifts from many small local states to one highly coupled control point. That control point has to be correct and reachable all the time. When it isn’t, failures go wide instead of narrow.

Example: a few boxes get popped and start hammering the CA. Now what? Access is broken everywhere at once.

Common friction points:

     1. your signer that has to be up and correct all the time
     2. trust roots everywhere (and drifting)
     3. TTL tuning nonsense (too short = random lockouts, too long = what was the point)
     4. limited on-box state makes debugging harder than it should be
     5. failures tend to fan out instead of staying contained

Revocation is also kind of a lie. In practice it means just waiting for expiry and hoping that’s good enough.

What actually happens is people reintroduce state anyway: sidecars, caches, agents… because you need it.

We went the opposite direction:

     1. nodes pull over outbound HTTPS
     2. local authorized_keys is the source of truth locally
     3. users/roles are visible on the box
     4. drift fixes itself quickly
     5. no inbound ports, no CA signatures (WELL, not strictly true*!)

You still get central control, but operation and failure modes are local instead of "everyone is locked out right now."

That’s basically what we do at Userify (https://userify.com). Less elegant than certs, more survivable at 2am. Also actually handles authz, not just part of authn.
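
To make the pull-and-converge idea concrete, here is a rough sketch. It is not Userify's actual agent: the function name `converge_keys`, the file paths and the control-plane URL in the comment are all made up for illustration. The point it shows is that a failed or empty fetch leaves the local file alone, so access keeps working when the control plane is unreachable.

```shell
#!/bin/sh
# Converge a local authorized_keys file toward a fetched "desired" copy.
# Prints "kept", "ok" or "fixed"; returns 1 only on write errors.
converge_keys() {
    desired=$1      # file the agent fetched over outbound HTTPS (hypothetical)
    local_file=$2   # e.g. /home/alice/.ssh/authorized_keys

    # Fetch failed or came back empty: keep whatever is already on the box.
    [ -s "$desired" ] || { echo "kept"; return 0; }

    # Already in the desired state: nothing to do.
    if [ -f "$local_file" ] && cmp -s "$desired" "$local_file"; then
        echo "ok"
        return 0
    fi

    # Drifted: write a temp copy next to the target, then rename over it.
    tmp=$(mktemp "${local_file}.XXXXXX") || return 1
    cat "$desired" > "$tmp" && chmod 600 "$tmp" && mv "$tmp" "$local_file" || {
        rm -f "$tmp"
        return 1
    }
    echo "fixed"
}

# Hypothetical polling loop (URL and interval are placeholders):
#   while :; do
#       curl -fsS https://control.example/keys/alice -o /tmp/desired || : > /tmp/desired
#       converge_keys /tmp/desired /home/alice/.ssh/authorized_keys
#       sleep 10
#   done
```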

And the part that usually gets hand-waved with SSH CAs:

     1. creating the user account
     2. managing sudo roles
     3. deciding what happens to home directories on removal
     4. cleanup vs retention for compliance/forensics

Those don’t go away - they're just not part of the certificate solution.

* (TLS still exists here, just at the transport layer using the system trust store. That channel delivers users, keys, and roles. The rest is handled explicitly instead of implied.)

ngrilly · 2026-04-03T16:52:59 1775235179

How do you solve TOFU?

jamiesonbecker · 2026-04-03T21:10:43 1775250643

Well, TOFU is really just the model for how the chain of trust is established.

In practice there isn’t really trust on first use: you either verify that the key matches what’s expected, or distribute keys out-of-band (including certs).

If that verification step isn’t happening, then it’s not TOFU, it’s just blind trust.
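
The out-of-band option can be as simple as pre-seeding known_hosts before the first connection. A sketch, assuming you already got the host's public key through a channel you trust (console, image build, config management); `pin_host_key` and the host name are hypothetical:

```shell
#!/bin/sh
# Pre-seed a known_hosts file with a host key obtained out-of-band,
# so the first connection is verified instead of blindly trusted.
pin_host_key() {
    known_hosts=$1
    host=$2
    pubkey=$3          # e.g. "ssh-ed25519 AAAA..."
    entry="$host $pubkey"
    touch "$known_hosts" || return 1
    # Append only if this exact line is not already present.
    grep -qxF -- "$entry" "$known_hosts" || printf '%s\n' "$entry" >> "$known_hosts"
}

# Then refuse anything that is not pinned:
#   ssh -o StrictHostKeyChecking=yes -o UserKnownHostsFile=./known_hosts user@web1.example.com
```

With `StrictHostKeyChecking=yes`, ssh refuses hosts that are missing from the file instead of prompting, which is what turns "trust on first use" into "verify on first use".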

From an automation/autoscaling angle, the same thing shows up again:

1. either keys are pre-baked / distributed

2. or, something signs them at boot

Signing an instance key is just another way of distributing trust. It doesn’t remove the need for a root of trust, it moves it.

Certificates just add extra steps around the same underlying task.

ngrilly · 2026-04-03T21:39:30 1775252370

I agree. I was just wondering if Userify had a solution for distributing the server signatures to the users.

jamiesonbecker · 2026-04-04T00:31:53 1775262713

Great question. Not yet ;)

ngrilly · 2026-04-04T07:31:16 1775287876

Fair enough :)

jamiesonbecker · 2026-04-03T16:16:05 1775232965

But then you can't log in if your box goes offline for any reason.

blipvert · 2026-04-03T16:22:04 1775233324

Hmm. For user certs you can have the service sign them for, say, an hour; so long as you can ssh to your server in that time, there’s no need for any other interaction.
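
For reference, signing a user key for one hour with plain OpenSSH looks roughly like this. A sketch only: the function name, the key-ID scheme and the file names are invented here; the `ssh-keygen` flags (`-s`, `-I`, `-n`, `-V`) are the standard certificate-signing ones.

```shell
#!/bin/sh
# Sign a user's public key with a CA key for a limited validity window.
# Writes <name>-cert.pub next to the given <name>.pub file.
sign_user_key() {
    ca_key=$1          # CA private key (keep it off general-purpose hosts)
    pubkey=$2          # user's public key, e.g. id_ed25519.pub
    principal=$3       # account name the cert is valid for
    ttl=${4:-+1h}      # validity interval; defaults to one hour from now
    ssh-keygen -q -s "$ca_key" \
        -I "$principal-$(date +%s)" \
        -n "$principal" \
        -V "$ttl" \
        "$pubkey"
}

# Example (file names are placeholders):
#   sign_user_key ./ca ~/.ssh/id_ed25519.pub alice
#   ssh-keygen -L -f ~/.ssh/id_ed25519-cert.pub   # inspect principals and validity
```

The server side still needs `TrustedUserCAKeys` pointing at the CA's public key in sshd_config.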

Sure you need your signing service to be reasonably available, but that’s easily accomplished.

Maybe I misunderstand?

jamiesonbecker · 2026-04-03T16:40:47 1775234447

That works for authn in the happy path: short-lived cert, grab it, connect, done.

Except for everything around that:

* user lifecycle (create/remove/rename accounts)

* authz (who gets sudo, what groups, per-host differences)

* cleanup (what happens when someone leaves)

* visibility (what state is this box actually in right now?)

SSH certs don’t really touch any of that. They answer "can this key log in right now?", not "what should exist on this machine?"

So in practice, something else ends up managing users, groups, sudoers, home dirs, etc. Now there are two systems that both have to be correct.

On the availability point: "reasonably available" is doing a lot of work ;)

Even with 1-hour certs:

* new sessions depend on the signer

* fleet-wide issues hit everything at once

* incident response gets awkward if the signer is part of the blast radius

The failure mode shifts from "a few boxes don't work" to "nobody can get in anywhere."

The pull model just leans the other way:

* nodes converge to desired state

* access continues even if control plane hiccups

* authn and authz live together on the box

Both models can work - it’s more about which failure mode is tolerable to you.

blipvert · 2026-04-03T16:51:04 1775235064

Well, yes, pick your poison.

But for just getting access to role accounts, I find it a lot nicer than distributing public keys around.

And for everything else, a periodic Ansible :-)

gnufx · 2026-04-03T19:02:15 1775242935

Public keys (for OpenSSH) can be in DNS (VerifyHostKeyDNS) or in, say, LDAP via KnownHostsCommand and AuthorizedKeysCommand.
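
As a rough illustration of the `AuthorizedKeysCommand` route, here is a stand-in helper that reads keys from a local directory instead of LDAP. The function name `lookup_keys`, the script path and the key directory are hypothetical; the two sshd_config directives in the comment are the real ones.

```shell
#!/bin/sh
# Hypothetical AuthorizedKeysCommand helper: print the authorized keys for
# one user from a directory of per-user key files.
lookup_keys() {
    keydir=$1
    user=$2
    # Accept plain account names only: no empty names, no slashes,
    # no leading dot, so the lookup cannot escape keydir.
    case $user in
        ''|.*|*[!A-Za-z0-9_.-]*) return 1 ;;
    esac
    # No file means no output, and sshd then finds no keys from this source.
    [ -r "$keydir/$user" ] && cat "$keydir/$user"
}

# sshd_config wiring (script path and key directory are placeholders):
#   AuthorizedKeysCommand /usr/local/bin/lookup-keys %u
#   AuthorizedKeysCommandUser nobody
# where /usr/local/bin/lookup-keys ends with:
#   lookup_keys /etc/ssh/keys.d "$1"
```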

moviuro · 2026-04-03T16:51:07 1775235067

That sounds like a lot of extra steps. How do I validate the authenticity of a signing request? Should my signing machine be able to challenge the requester? (This means that the CA key is on a machine with network access!!)

Replacing the distribution of a revocation list with short-lived certificates just creates other problems that are not easier to solve. (Also, 1h is bonkers, even letsencrypt doesn't do it)

toast0 · 2026-04-03T17:50:47 1775238647

1h is bonkers for certs in https, but it's not unreasonable for authorized user certs, if your issuance path is available enough.

IMHO, if you're pushing revocation lists at low latency, you could also push authorized keys updates at low latency.

jamiesonbecker · 2026-04-04T00:40:02 1775263202

Honestly, we used to replace a lot of pam_ldap and similar sorts of awful solutions. With those, if your LDAP went down even for a heartbeat, you couldn't log in at all.

So I totally agree: if I had to do certificates and didn't have something like Userify, a 1 hour (or even shorter if possible) expiration seems quite worth chasing, especially with suitable highly available configuration. (Of course, TFA doesn't even bother mentioning revocation and expiration, which should give you a clue as to how much fun those are lol)

And for more normal, lower-security requirements or non-HA, 6 or 8 hours or so would probably work and give you plenty of time for even serious system outages before the certs expired.

Not to hard shill or anything (apologies in advance, just skip if you're not interested), but there are two significant security and reliability differences between standard SSH (with or without certificates) and Userify:

1. Userify Cloud updates by default every three minutes, and on-premise Userify Express/Enterprise updates every ten seconds, but it doesn't have to update at all; even if your Userify server goes offline forever, you can still log in because the accounts are standard UNIX accounts (literally created with `useradd`)

2. When accounts are removed, Userify also completely nukes the user account, removes its sudo perms, and totally kill -9 's any tmux/screen/etc sessions (all processes owned by the user are terminated across the entire enterprise within seconds), which is also not something that a certificate expiration would ever do.
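
To show what that kind of removal involves on a single box, here is a dry-run sketch that only prints the commands. This is not Userify's implementation: `offboard_plan` is a made-up name, and the `/home/<user>` and `/etc/sudoers.d/<user>` paths are assumptions about the layout.

```shell
#!/bin/sh
# Print (do not run) the steps for removing a local account, in order:
# cut off key logins, drop sudo, kill running processes, delete the account.
offboard_plan() {
    user=$1
    case $user in
        ''|root|*[!A-Za-z0-9_.-]*)
            echo "refusing: bad user name" >&2
            return 1 ;;
    esac
    printf '%s\n' \
        "rm -f /home/$user/.ssh/authorized_keys" \
        "rm -f /etc/sudoers.d/$user" \
        "pkill -KILL -u $user" \
        "userdel -r $user"
}

# Review the plan first, then run it as root if it looks right:
#   offboard_plan bob
#   offboard_plan bob | sudo sh
```

Removing the key file before the `pkill` matters: otherwise the user can log straight back in between the kill and the `userdel`. Note that `userdel -r` deletes the home directory, so skip `-r` if you need to retain it for compliance or forensics.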

jamiesonbecker · 2026-04-03T16:13:36 1775232816

We're in the process of updating the experience to this century! ;)

We've always taken the stance that crusty is better than vulnerable, but it turns out that not having a modern experience after 15 years is starting to feel like maybe we need to step up the features and shininess :)

jamiesonbecker · 2026-04-03T16:10:57 1775232657

Exactly. We'd had discussions about building https://Userify.com (plug!) around SSH certificates, but elected to go with keys instead, because Userify delivers most of the good things around certificates without the jank and insecurity.

It's not that certificates themselves are insecure, it's that the workflows (as the parent points out) are awful. We might still add some automation around that (and I think I saw some competitor tooling out there if you're committed to that path) but I personally feel like it's an answer to the wrong question.

jamiesonbecker · 2026-03-03T23:17:34 1772579854

Classic OpenSSH safety check: if /home/$user (or ~/.ssh) is too open, or ownership/modes are off, sshd will refuse pubkey auth. Annoying, but correct.

If you still have some access (console, password login, another sudo user), this usually fixes it:

    username=bob
    # home dir must be owned by the user and not group/world-writable
    sudo chown "$username:$username" "/home/$username"
    sudo chmod 700 "/home/$username"

    sudo install -d -m 700 -o "$username" -g "$username" "/home/$username/.ssh"
    # note: tee overwrites any existing authorized_keys (use tee -a to append)
    echo "ssh-ed25519 AAAA....insertyourpubkeyhere" | sudo tee "/home/$username/.ssh/authorized_keys" >/dev/null
    sudo chown "$username:$username" "/home/$username/.ssh/authorized_keys"
    sudo chmod 600 "/home/$username/.ssh/authorized_keys"

(optional, if the user needs sudo)

    echo "$username ALL=(ALL) NOPASSWD: ALL" | sudo tee "/etc/sudoers.d/$username" >/dev/null
    sudo chmod 440 "/etc/sudoers.d/$username"
    sudo visudo -cf "/etc/sudoers.d/$username"   # syntax-check the new file
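
If you would rather see which path sshd is objecting to before changing anything, a check along these lines can help. It is a sketch that only looks at group/world write bits; sshd's StrictModes check also looks at ownership, which this does not. `check_ssh_modes` is a made-up name.

```shell
#!/bin/sh
# Report any of the three paths sshd cares about that are group- or
# world-writable. Returns 0 if all are fine, 1 if any is too open.
check_ssh_modes() {
    home=$1
    bad=0
    for p in "$home" "$home/.ssh" "$home/.ssh/authorized_keys"; do
        [ -e "$p" ] || continue
        # -prune keeps find from descending; -perm -g+w / -o+w match the write bits.
        if [ -n "$(find "$p" -prune \( -perm -g+w -o -perm -o+w \) -print)" ]; then
            echo "too open: $p"
            bad=1
        fi
    done
    return "$bad"
}

# Example:
#   check_ssh_modes /home/bob || echo "fix the paths listed above"
```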

Not to shill too hard, but this exact "keys/perms/sudo drift" failure mode is why Userify exists (est. 2011): local accounts on every box + a tiny outbound-only agent that polls and overwrites desired state (keys, perms, sudo role). If scp/rsync/deploy steps clobber stuff, the next poll re-converges it (cloud default ~90s; self-host default ~10s; configurable). Removing a user also kills their sessions. No inbound ports to nodes, no PAM/NSS hooks, auditable.

Shim (old but readable): https://github.com/userify/shim/blob/master/shim.py#L308 (obligatory): https://userify.com

jamiesonbecker · 2025-12-23T19:02:53 1766516573

at least it had a minimum of Clause. Clause. Punchline.