PKI tokens in Keystone suffered from many problems, most of all the trials caused by the various forms of revocation. I never wanted revocation in the first place. What could we have done differently? It just (I mean moments ago) came to me.
A PKI token is a signed document that says “at this point in time, these things are true” where “these things” have to do with users' roles in projects. Revocation means “these things are no longer true.” But long-running tasks need long-running authentication. PKI tokens seem built for that.
What we should distinguish between is kicking off a new job and continued authorization for an old job. When a user requests something from Nova, the only identity that comes into play is the user's own identity. Nova needs to confirm this but, in a PKI token world, there is no need to go and ask Keystone.
In a complex operation like launching a VM, Nova needs to ask Glance to do something. Today, Nova passes on the token it received, and all is well. This makes tokens into true bearer tokens, and they are passed around far too much for my comfort.
Let's say that, to start, when Nova calls Glance, Nova’s own identity should be confirmed. Tokens are really poor for this; a much better way would be to use X509. While Glance would need to do a mapping transform, the identity of Nova would not be transferable. Put another way, Nova would not be handing off a bearer token to Glance. Bearer tokens from powerful systems like Nova are a really scary thing.
If we had this combination of user-confirmed-data and service-identity, we would have a really powerful delegation system. Why could this not be done today, with UUID/Fernet tokens? If we only ever had to deal with a max of two hops (Nova to Glance, Nova to Neutron), we could.
Enter Trove, Heat, Sahara, and any other process that does work on behalf of a user. Let's make it really fun and say that we have the following chain of operations:
If any one link in this chain is untrusted, we cannot pass tokens along.
What if, however, each step had a rule that said “I can accept tokens for users from Endpoint E” and passed a PKI token along? The user submits a PKI token to Heat. Heat passes this, plus its own identity, on to Sahara, which trusts Heat. And so on down the line.
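The per-step rule above can be sketched roughly as follows. This is only an illustration, not anything Keystone implements: the `TRUSTED_CALLERS` table, the `accepts` and `chain_ok` helpers, and the hop from Sahara to Nova are all made up for the example, and the calling service's identity is assumed to have been confirmed already (for instance with X509, as suggested earlier).

```python
from typing import Dict, List, Optional, Set

# Hypothetical policy table: for each endpoint, the set of calling services
# it trusts to forward a user's token. The Heat -> Sahara entry follows the
# example in the text; the Sahara -> Nova hop is an assumption.
TRUSTED_CALLERS: Dict[str, Set[str]] = {
    "heat": set(),          # in this sketch Heat is only called by users
    "sahara": {"heat"},
    "nova": {"sahara"},
}


def accepts(endpoint: str, caller: Optional[str]) -> bool:
    """Return True if `endpoint` will accept a user token from `caller`.

    `caller` is None when the user calls the endpoint directly. Otherwise
    it is the already-authenticated identity of the calling service.
    """
    if caller is None:
        return True
    return caller in TRUSTED_CALLERS.get(endpoint, set())


def chain_ok(chain: List[str]) -> bool:
    """Check every hop of a call chain, starting from the user."""
    previous: Optional[str] = None
    for service in chain:
        if not accepts(service, previous):
            return False
        previous = service
    return True
```

With this table, `chain_ok(["heat", "sahara", "nova"])` holds, while `chain_ok(["heat", "nova"])` does not, because Nova has no rule that trusts Heat directly.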
OK…revocations. We say here that a PKI token is never revoked. We make it valid for the length of long-running operations…say a day.
But we add an additional rule: A user can only use a PKI token within 5 minutes of issue.
Service to Service calls can use PKI tokens to say “here is when it was authorized, and it was good then.”
For example, a user holds on to a PKI token for 10 minutes, tries to call Nova, and the token is rejected as “too old.”
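A rough sketch of that age rule in Python, under some loud assumptions: the `Token` class, the `accept` function, and the constant names are hypothetical, signature checking is assumed to happen elsewhere, and the service identity is assumed to have been confirmed before this check runs. The 5 minute and one day values are the ones from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

USER_FRESHNESS = timedelta(minutes=5)  # window for direct user calls
TOKEN_LIFETIME = timedelta(days=1)     # window for service-to-service calls


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for an already signature-verified token."""
    user: str
    issued_at: datetime


def accept(token: Token, now: datetime,
           service_identity: Optional[str] = None) -> bool:
    """Apply the age rule only; there is no revocation check at all.

    With no service identity the caller is the user, so the token must be
    fresh. With a confirmed service identity, the token only has to be
    within its overall lifetime.
    """
    age = now - token.issued_at
    if age < timedelta(0):
        return False  # issued in the future: reject
    if service_identity is None:
        return age <= USER_FRESHNESS
    return age <= TOKEN_LIFETIME


issued = datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc)
token = Token(user="alice", issued_at=issued)
ten_minutes_later = issued + timedelta(minutes=10)

user_call = accept(token, ten_minutes_later)             # too old for a user
service_call = accept(token, ten_minutes_later, "nova")  # fine for a service
```

The point of splitting the two windows is that the only thing a stolen or stale token buys a user is five minutes, while a service that can prove its own identity can still say "this was authorized then" for the rest of the day.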
This same structure would work with Fernet tokens, assuming a couple things:
- We get rid of revocation checks for tokens validated with service tokens.
- If a user loses a role, we are OK with having a long term operation depending on that role failing.
I think this general structure would make OpenStack a hell of a lot more scalably secure than it is today.
Huge thanks to Jamie Lennox for proposing a mechanism along these lines.