The term Liveness here refers to the need to ensure that the data used to make an authorization check is valid at the time of the check.
The mistake I made with PKI tokens was in not realizing how important Liveness was. That mistake was rooted in the age-old error of confusing authentication with authorization. Since a Keystone token is used for both, I assumed that authentication was its primary purpose, but in reality the most important thing a token tells you is the information essential to making an authorization decision.
Who you are does not change often. What you can do changes much more often. What OpenStack needs in the token protocol is confirmation that the user is authorized to perform this action right now. PKI tokens, without revocation checks, lost that liveness check. The revocation check, in turn, undermined the primary value of PKI.
That is the frustration most people have with certificate revocation lists (CRLs). Since certificates are so long-lived, there is very little “freshness” to the data. A CRL is only a way to say “not invalidated yet,” and since a certificate may carry data beyond just “who you are,” certificates often have to be invalidated. Thus, any active system built on X509 for authorization (not just authentication) is going to have many, many revocations. Keystone tokens fit that same profile. The return to server-validated tokens (UUID or Fernet) restores that freshness check.
However, bearer tokens have a different way of going stale. If I get a token and use it immediately, the server knows it is very highly probable that the token came from me. If I wait, that probability drops. The more I use the same token, and the longer I use it, the greater the probability that someone other than me will get access to that token. And that means the probability that it will be misused has also increased.
I’ve long said that what I want is a token that lasts roughly five minutes. That means that it is issued, used, and discarded, with a little wiggle room for latency and clock skew across the network. The problem with this is that a token is often used for a long running task. If a task takes 3 hours, but a token is good for only five minutes, there is no way to perform the task with just that token.
One possible approach to restoring this freshness check is to always have some fresh token on a call, just not necessarily the one that the user originally requested. This is the idea behind the Trust API. A Trust is kind of like a long-term token, but one that is only valid when paired with a short-term token for the trustee. But creating a trust every time a user wants to create a new virtual machine is too onerous, too much overhead. What we want, instead, is a rule that says:
> When Nova calls Glance on behalf of a user, Nova passes a freshly issued token for itself along with the original user’s token. The original user’s token will be validated based on when it was issued. Authorization requires the combination of a fresh token for the Nova service user and a not-so-fresh-but-with-the-right-roles token for the end user.
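To make the wire format concrete, here is a rough sketch of what such a request might look like. The URL, image ID, and token values are placeholders; X-Service-Token is the header keystonemiddleware already uses to carry a service token alongside the user’s X-Auth-Token.

```python
import requests

# Placeholder values; in a real deployment these come from Keystone.
USER_TOKEN = "gAAAA...user"      # the original, possibly hours-old, user token
NOVA_TOKEN = "gAAAA...service"   # a freshly issued token for the Nova service user

# Nova fetching an image from Glance on the user's behalf, passing both tokens.
resp = requests.get(
    "http://glance.example.com:9292/v2/images/IMAGE_ID",
    headers={
        "X-Auth-Token": USER_TOKEN,
        "X-Service-Token": NOVA_TOKEN,
    },
)
```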
This could be done with no changes to the existing token format. Set the token expiration to 12 hours. The only change would be inside python-keystonemiddleware, which would enforce a pair of rules (sketched in code after this list):
- If a single token is passed in, it must have been issued within five minutes. Otherwise, the operation returns a 401.
- If a service token is passed in along with the user’s token, the service token must have been issued within five minutes. The user’s token is validated normally.
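Here is a minimal sketch of those two rules, assuming the middleware can see each token’s issue time as a timezone-aware datetime. The helper names (`is_fresh`, `check_tokens`) are hypothetical, not actual python-keystonemiddleware APIs.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESHNESS_WINDOW = timedelta(minutes=5)  # plus wiggle room for latency and clock skew


class Unauthorized(Exception):
    """Stand-in for returning a 401 to the caller."""


def is_fresh(issued_at: datetime, now: Optional[datetime] = None) -> bool:
    """True if the token was issued within the freshness window."""
    now = now or datetime.now(timezone.utc)
    return (now - issued_at) <= FRESHNESS_WINDOW


def check_tokens(user_issued_at: datetime,
                 service_issued_at: Optional[datetime] = None) -> None:
    if service_issued_at is None:
        # Rule 1: a lone user token must itself be fresh.
        if not is_fresh(user_issued_at):
            raise Unauthorized("single token older than five minutes")
    else:
        # Rule 2: only the service token must be fresh; the user token is
        # validated normally (signature, expiry, roles) elsewhere.
        if not is_fresh(service_issued_at):
            raise Unauthorized("service token older than five minutes")
```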
An additional scope-limiting mechanism would further reduce the possibility of abuse (see the sketch after this list). For example,
- Glance could limit the service-token scoped operations from Nova to fetching an image and saving a snapshot.
- Nova might only allow service-scoped tokens from a service like Trove within a 15-minute window.
- A user might have to ask for an explicit “redelegation” role on a token before handing it off to some untrusted service run off site.
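One way to picture these limits is as a per-service table of allowed operations and freshness windows. The service names, operation names, and window values below are purely illustrative, not real policy targets:

```python
from datetime import timedelta

# Hypothetical per-service scope limits of the kind described above.
SERVICE_LIMITS = {
    # Glance: Nova's service token may only fetch images and save snapshots.
    "nova": {
        "allowed_operations": {"image:get", "image:snapshot"},
        "freshness_window": timedelta(minutes=5),
    },
    # Nova: service tokens from Trove get a wider, 15-minute window.
    "trove": {
        "allowed_operations": {"server:create", "server:delete"},
        "freshness_window": timedelta(minutes=15),
    },
}


def service_call_allowed(service_name: str, operation: str) -> bool:
    """Check whether a service token from service_name may perform operation."""
    limits = SERVICE_LIMITS.get(service_name)
    return limits is not None and operation in limits["allowed_operations"]
```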
With Horizon, we already have a mechanism that says it has to fetch an unscoped token first, and then use that to fetch a scoped token. Horizon can be smart enough to fetch a scoped token before each batch of calls to a remote server, cache it for only a minute, and use the unscoped token only in communication with Keystone. The unscoped token, being validated by Keystone, is sufficient for maintaining “Liveness” of the rest of the data for a particular workflow.
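A rough sketch of that caching flow follows. The `keystone` client object and its `rescope()` method are assumptions standing in for Keystone’s real token exchange (POST /v3/auth/tokens using the “token” auth method); only the shape of the flow is the point here.

```python
import time


class ScopedTokenCache:
    """Keep the unscoped token for talking to Keystone, exchange it for a
    scoped token before each batch of calls, and cache that only briefly."""

    TTL_SECONDS = 60  # cache the scoped token for only a minute

    def __init__(self, keystone, unscoped_token):
        self.keystone = keystone          # assumed client with a rescope() method
        self.unscoped_token = unscoped_token
        self._scoped = None
        self._fetched_at = 0.0

    def scoped_token(self, project_id):
        stale = (time.monotonic() - self._fetched_at) > self.TTL_SECONDS
        if self._scoped is None or stale:
            self._scoped = self.keystone.rescope(self.unscoped_token,
                                                 project_id)
            self._fetched_at = time.monotonic()
        return self._scoped
```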
It’s funny how little change this would require to OpenStack, and how big an impact it would make on security. It is also funny how long it took for this concept to coalesce.
One issue with this proposal is that some operations (e.g. a 2TB Cinder backup) can take more than 12 hours, indeed sometimes more than 24. (Yes, that code does need some optimisation: it is CPU-bound SSL and compression, with encryption maybe being added. If somebody is familiar with multi-process work in Python and has free time, please shout.)
I would be reluctant to tell people to set up tokens that live more than 12 hours in a public cloud, but in a private cloud where this was known and expected behaviour, you could potentially extend their life. So, no, this proposal does not really solve that case. I would expect Trusts or OAuth1 to be a better approach for long-running operations like this. I also wonder if the backup job could be reworked to do all of the authentication up front, and then perform the SSL/compression/encryption afterwards.