One key per device is exactly what we recommend too. Private keys should always be protected as much as possible within that device and should never leave that device.
Just paste all of your devices' public keys into your authorized_keys file and leave a comment at the end for what device it's for. in Userify, it literally goes right into your nodes' authorized_keys file almost verbatim. (disclaimer: I work at https://Userify.com)
And then, if you leave your token or laptop at the airport or whatever, just remove that key right from your phone and it'll take effect in seconds across all the nodes/instances (if you're using Userify) or you can just write a quick for-inline-sed loop to remove it from your authorized keys everywhere.
The experience might be better right up until you're running it in prod and someone happens to ask about:
Cert revocation (or even expiration)
Sudo roles
User removal and process termination
Is the cert server HA and locked down
How you log in when the cert server is down or under attack (rich target!)
How to easily add Alice to server group A, Bob to B, and Carlos to both A and B, and then to remove them..
(disclaimer we're celebrating our 15th anniversary at https://Userify.com, but those are actually legit concerns and not only a sales pitch. You certainly can build a solid and secure ssh cert infra, but doing it in production is just not an easy set-it-and-forget-it sort of thing.)
SSH certs quietly hurt in prod. Short-lived creds + centralized CA just moves complexity upward without solving the core problem: user management.
The system shifts from many small local states to one highly coupled control point. That control point has to be correct and reachable all the time. When it isn’t, failures go wide instead of narrow.
Example: a few boxes get popped and start hammering the CA. Now what? Access is broken everywhere at once.
Common friction points:
1. your signer that has to be up and correct all the time
2. trust roots everywhere (and drifting)
3. TTL tuning nonsense (too short = random lockouts, too long = what was the point)
4. limited on-box state makes debugging harder than it should be
5. failures tend to fan out instead of staying contained
Revocation is also kind of a lie. Just waiting for expiry and hoping that’s good enough.
What actually happens is people reintroduce state anyway: sidecars, caches, agents… because you need it.
We went the opposite direction:
1. nodes pull over outbound HTTPS
2. local authorized_keys is the source of truth locally
3. users/roles are visible on the box
4. drift fixes itself quickly
5. no inbound ports, no CA signatures (WELL, not strictly true*!)
You still get central control, but operation and failure modes are local instead of "everyone is locked out right now."
That’s basically what we do at Userify (https://userify.com). Less elegant than certs, more survivable at 2am. Also actually handles authz, not just part of authn.
And the part that usually gets hand-waved with SSH CAs:
1. creating the user account
2. managing sudo roles
3. deciding what happens to home directories on removal
4. cleanup vs retention for compliance/forensics
Those don’t go away - they're just not part of the certificate solution.
* (TLS still exists here, just at the transport layer using the system trust store. That channel delivers users, keys, and roles. The rest is handled explicitly instead of implied.)
Hmm. For user certs you can have the service sign them for, say an hour, so long as you can ssh to your server in that time then there’s no need for any other interaction.
Sure you need your signing service to be reasonably available, but that’s easily accomplished.
That sounds like a lot of extra steps. How do I validate the authenticity of a signing request? Should my signing machine be able to challenge the requester? (This means that the CA key is on a machine with network access!!)
Replacing the distribution of a revocation list with short-lived certificates just creates other problems that are not easier to solve. (Also, 1h is bonkers, even letsencrypt doesn't do it)
Honestly, we used to replace a lot of pam_ldap and similar sorts of awful solutions. With those, if your LDAP went down even for a heartbeat, you couldn't log in at all.
So I totally agree: if I had to do certificates and didn't have something like Userify, a 1 hour (or even shorter if possible) expiration seems quite worth chasing, especially with suitable highly available configuration. (Of course, TFA doesn't even bother mentioning revocation and expiration, which should give you a clue as to how much fun those are lol)
And for more normal, lower-security requirements or non-HA, 6 or 8 hours or so would probably work and give you plenty of time for even serious system outages before the certs expired.
Not to hard shill or anything (apologies in advance, just skip if you're not interested), but there are two significant security and reliability differences between standard SSH (with or without certificates) and Userify:
1. Userify Cloud updates by default every three minutes, and on-premise Userify Express/Enterprise updates every ten seconds, but it doesn't have to update at all; even if your Userify server goes offline forever, you can still log in because the accounts are standard UNIX accounts (literally created with `useradd`)
2. When accounts are removed, Userify also completely nukes the user account, removes its sudo perms, and totally kill -9 's any tmux/screen/etc sessions (all processes owned by the user are terminated across the entire enterprise within seconds), which is also not something that a certificate expiration would ever do.
We're in the process of updating the experience to this century! ;)
We've always taken the stance that crusty is better than vulnerable, but it turns out that not having a modern experience after 15 years is starting to feel like maybe we need to step up the features and shininess :)
Exactly. We'd had discussions about building https://Userify.com (plug!) around SSH certificates, but elected to go with keys instead, because Userify delivers most of the good things around certificates without the jank and insecurity.
It's not that certificates themselves are insecure themselves, it's that the workflows (as the parent points out) are awful. We might still add some automation around that (and I think I saw some competitor tooling out there if you're committed to that path) but I personally feel like it's an answer to the wrong question.
Not to shill too hard, but this exact "keys/perms/sudo drift" failure mode is why Userify exists (est. 2011): local accounts on every box + a tiny outbound-only agent that polls and overwrites desired state (keys, perms, sudo role). If scp/rsync/deploy steps clobber stuff, the next poll re-converges it (cloud default ~90s; self-host default ~10s; configurable). Removing a user also kills their sessions. No inbound ports to nodes, no PAM/NSS hooks, auditable.
Token-based keys, to tptacek's point, is that they can be a giant pain once you start scripting across fleets.
reply