Kolla has become the primary source of containers for running OpenStack services. Since it has been a while since I tried deliberately running just the Keystone container, I decided to build the Kolla version from scratch and run it.
Many applications have a data directory, usually due to having an embedded database. For the set I work with, this includes Red Hat IdM/FreeIPA, CloudForms/ManageIQ, Ansible Tower/AWX, and OpenShift/Kubernetes. It's enough of a pattern that I have Ansible code for pairing a set of newly allocated partitions with a set of previously built virtual machines.
Part of my job is making sure our customers can run our software in public clouds. Recently, I was able to get CloudForms Management Engine (CFME) to deploy to Azure. Once I got it done manually, I wanted to automate the deployment, and that means Ansible. Turns out that launching custom images from Ansible is not supported in the current GA version of the Azure modules, but has been implemented upstream.
Today I tried to use our local OpenStack instance to deploy CloudForms Management Engine (CFME). Our OpenStack deployment has a set of flavors that are all defined with 20 GB disks. The CFME image is larger than this, and will not deploy on the set of flavors. Here is how I worked around it.
I was monitoring my system, so I knew that /dev/sdb was the new iSCSI target I was trying to turn into a file system. To prove it, I ran:
iscsiadm -m session --print=3
...
scsi4 Channel 00 Id 0 Lun: 0
scsi4 Channel 00 Id 0 Lun: 1
        Attached scsi disk sdb State: running
But what did that do? Using strace helped me sort it out a little. I worked backwards.
Keystone has supported identity federation for several releases. I have been working on a proof-of-concept integration of identity federation in a TripleO deployment. I was able to successfully login to Horizon via WebSSO, and want to share my notes.
A federation deployment requires changes to the network topology, Keystone, the HTTPD service, and Horizon. The various OpenStack deployment tools will have their own ways of applying these changes. While this proof-of-concept can’t be called production-ready, it does demonstrate that TripleO can support federation using SAML. From this proof-of-concept, we should be able to deduce the steps needed for a production deployment.
Keystone Tokens are bearer tokens, and bearer tokens are vulnerable to replay attacks. What if we wanted to get rid of them?
The maximum header size between an HTTPD and a WSGI process is fixed at 8 kilobytes. With a sufficiently large catalog, a token in PKI format won’t fit. Compression seems like it would be such an easy solution. But there is a hobgoblin or two hiding in the shadows.
As OpenStack evolves, its requirements for identity management evolve with it. In the early days, there was a single Nova server, and it stored user-id and password. Once OpenStack evolved into a body of servers, copying passwords around posed too big a security risk. Keystone was first implemented as a central repository for those passwords.
Keystone tokens were originally implemented as a unique identifier. A user went to Keystone, submitted a request with their user-id and password, and received a UUID. That UUID was passed to a remote service such as the Nova API web service in place of authentication data. The remote service would then make a call to Keystone to verify the token. Thus, each remote call required, at the absolute minimum, an additional round trip to Keystone. The network traffic was exacerbated by the fact that it was driven by the command line clients, which had no way of storing the ephemeral tokens. Thus, one remote call required three round trips.
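The three round trips described above can be sketched with a toy in-memory model. The class and function names here are illustrative stand-ins, not the real Keystone or Nova APIs:

```python
import uuid

# Toy model of the original UUID-token flow.  Names are illustrative;
# this is not the real Keystone API, just the shape of the round trips.
class MiniKeystone:
    def __init__(self):
        self._users = {"alice": "secret"}   # user-id -> password
        self._tokens = {}                   # token id -> user

    def authenticate(self, user, password):
        # Round trip 1: submit credentials, receive a UUID token.
        if self._users.get(user) != password:
            raise PermissionError("bad credentials")
        token = uuid.uuid4().hex
        self._tokens[token] = user
        return token

    def validate(self, token):
        # The extra round trip: a remote service verifies the token.
        return self._tokens.get(token)

keystone = MiniKeystone()

def nova_api_call(keystone, token):
    # Round trip 2: the client passes the token to the Nova API; Nova
    # then makes round trip 3 back to Keystone to validate it.
    user = keystone.validate(token)
    if user is None:
        raise PermissionError("invalid token")
    return "booted VM for " + user

token = keystone.authenticate("alice", "secret")
result = nova_api_call(keystone, token)
```

Because the command line clients could not cache the token, every invocation repeated the authenticate step as well, which is where the traffic added up.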
The call to validate a Keystone token returns much information. In addition to the return code, which indicates the validity of the token, the response contains a collection of role assignments for the project specified by the token. These role assignments are later matched against rules assigned to each of the web remote APIs. An example rule might state that in order to perform a given action, a user must be an administrator for the associated project. While OpenStack calls this Role Based Access Control (RBAC) there is nothing in the mechanism that specifies that only roles can be used for these decisions. Any attribute in the token response could reasonably be used to provide/deny access. Thus, we speak of the token as containing authorization attributes.
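The rule matching described above can be illustrated with a small sketch. The rule table mimics the spirit of OpenStack policy files, not their actual grammar, and the action names are made up:

```python
# A token validation response carries role assignments; each remote API
# matches them against a rule for the requested action.  This table
# mimics the idea of OpenStack policy, not its actual grammar.
POLICY = {
    "compute:start": "member",
    "compute:delete": "admin",   # only project admins may delete
}

def enforce(action, token_roles):
    required = POLICY[action]
    if required not in token_roles:
        raise PermissionError(action + " requires role " + required)
    return True

# Role assignments as they would appear in a validated token response:
token_roles = ["member"]
enforce("compute:start", token_roles)   # allowed; compute:delete would raise
```

Since the whole token response is available at enforcement time, a deployer could just as easily write a rule against the project or any other attribute rather than a role.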
Two technologies helped to decrease the load on Keystone. The first was client side caching, implemented using the Python-Keyring library. This allows the reuse of a token. The second was the use of public key document signing to allow in-process validation of Keystone tokens. This new mechanism, termed PKI tokens, used the Cryptographic Message Syntax (CMS) that is the basis for Secure MIME (S/MIME). The size of the tokens increased dramatically, but now a single web request could be performed with a single round trip.
There are multiple systems that provide a comparable document with authorization attributes that is validated using cryptography. Security Assertion Markup Language (SAML) is probably the most widely deployed. JSON Web Tokens are the equivalent for people that prefer JSON to XML. The major authentication mechanisms have their own approach to reliably distributing authorization attributes linked to the authentication of the user. For Kerberos, it is the Privilege Attribute Certificate (PAC) mechanism. For X509, it is attribute certificates or proxy certificates.
Both UUID and PKI tokens are what is termed “Bearer Tokens.” “Any party in possession of a bearer token (a “bearer”) can use it to get access to the associated resources (without demonstrating possession of a cryptographic key).” A bearer token is a token that cannot be verifiably linked to the person presenting it. It is a bit like using a credit card number to establish your age on a website: if you steal someone’s credit card, you can make a false assertion of your identity. Just so, if a malicious user steals a bearer token, they can impersonate the token’s user. The complex workflows for OpenStack span multiple services. In the current system, a bearer token is attached to a request that will flow through the entire system.
The current token system has an additional drawback. A Keystone token is valid anywhere in the OpenStack system. That means that a token stolen from one system can be used on another system. To limit the damage done with tokens, many other systems limit them to a single target system. For example, in Kerberos, a user gets a service token that, while it is a bearer token, is only usable on a single remote system. To talk to an additional system requires an additional service token.
To move beyond bearer tokens requires multiple steps. In order to link the token to a user, the user needs to use a secure authentication mechanism, and then link the token to that mechanism. A mechanism for that will be present in the Havana release. Its use will be optional to start; once we disable bearer tokens, we risk breaking the entire OpenStack system. If tokens must be bound to the user that initially requested them, how can a system call a second and third system to do work on behalf of the user? If a token can only be used for a specific system, how can a workflow progress across multiple systems?
Token Revocation and Lifespan
There are other problems with the current token system. Tokens are long lived, in order to survive for the entire duration of long workflows. However, we want to be able to quickly remove privileges from a user, which means that tokens must be revocable. Token revocation places a heavy burden on the system, as remote systems must periodically fetch token revocation lists from Keystone, or return to the original scheme of online verification. In addition, since tokens can create other tokens, Keystone now has complex rules to track revocation events and properly revoke the correct set of tokens. To track these changes, the tokens are stored in a database attached to Keystone. This database cannot be ephemeral, or tokens are improperly reported as invalid, but it must be flushed periodically to remove expired tokens, or risk filling up its storage.
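The polling burden can be sketched as follows. The `fetch_list` callable stands in for the real request to Keystone; the interval and token ids are placeholders:

```python
import time

# Sketch of the periodic revocation-list polling described above.  The
# fetch_list callable stands in for the real request to Keystone.
class RevocationCache:
    def __init__(self, fetch_list, interval=30.0):
        self._fetch = fetch_list
        self._interval = interval
        self._revoked = set()
        self._last = None

    def is_revoked(self, token_id, now=None):
        now = time.time() if now is None else now
        if self._last is None or now - self._last > self._interval:
            # The heavy part: every consuming service repeats this poll,
            # and a stale list means revocation lags by the interval.
            self._revoked = set(self._fetch())
            self._last = now
        return token_id in self._revoked

cache = RevocationCache(lambda: ["token-1"], interval=30.0)
```

The trade-off is visible in the interval parameter: shorten it and the load on Keystone grows; lengthen it and a revoked token stays usable for longer.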
Heat is the orchestration engine for OpenStack. Heat has a requirement that it must be able to perform an operation on behalf of a user even if that user is no longer available. To support heat, Keystone now has a mechanism for delegation of authorization data. This mechanism is called Trusts, and it uses the same language as is used in the legal world: the user that creates a trust is a trustor, the one that executes the trust is a trustee. The user is the trustor and Heat is the trustee. A user creates a trust with exactly the set of role assignments they want to give to the trustee. This set of role assignments is validated when the trustee uses the trust id to fetch a token. If the user is missing the role assignment, the trust is invalid and no token is returned. Upon success, the trustee receives a valid token. The trustee can use this token to perform work for the trustor.
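A trust is created by POSTing to Keystone's `/v3/OS-TRUST/trusts` endpoint. The sketch below builds such a request body; the field names follow my reading of the v3 trusts API and the ids are placeholders, so verify against your Keystone version before relying on it:

```python
# Illustrative request body for creating a trust via Keystone's
# /v3/OS-TRust/trusts endpoint -- field names per my reading of the v3
# trusts API; all ids below are placeholders.
def build_trust_request(trustor_id, trustee_id, project_id, role_names):
    return {
        "trust": {
            "trustor_user_id": trustor_id,    # the delegating user
            "trustee_user_id": trustee_id,    # e.g. the Heat service user
            "project_id": project_id,
            "impersonation": False,
            # exactly the role assignments being delegated:
            "roles": [{"name": name} for name in role_names],
        }
    }

body = build_trust_request("alice-id", "heat-id", "demo-project-id", ["member"])
```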
Trusts are related to OAuth, with some significant differences. Probably the most important difference is that only users of the system can be trustees. In OAuth, a Consumer can be, and is expected to be, an external system. Trusts specify the format of the data that is delegated. OAuth does not. A user must specify the data necessary to create a trust, whereas in OAuth, the remote system crafts the request for the user, but then allows the user to inspect and verify the request. However, whether it is trusts, OAuth, or another comparable mechanism, the Keystone service has and will continue to provide a delegation mechanism.
Building upon the current state of Keystone, we can project forward to a system that deals with these shortcomings. Instead of using long lived tokens, long-running workflows can instead use a mechanism for delegation of authority. As an example, take a workflow for launching a virtual machine. This workflow needs to perform several operations across several services. It needs to fetch an image from Glance, deploy it to the compute node, communicate with Cinder to get access to the remote disk partition, start the virtual machine, mount the remote partition, and connect the virtual machine's interface to a network in Neutron. The user specifies this workload via the Nova API server, but the services that actually perform it pull the operations out of a queue. They are scheduled by the scheduler, and mostly performed by the nova-compute service. In addition, while requests are posted to the other API servers, they are also performed by worker processes. Whenever one service calls another API service, it will need a delegation. For the example above, the delegation would be from the user requesting the new virtual machine to the nova-compute user. The nova-compute user would use the trust to request a token on behalf of the end user. That token would contain the role assignments necessary to perform the operations on the other remote services.
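The last step, a service user exchanging a trust id for a token, goes through the v3 token API with a trust-scoped authentication request. This sketch builds such a request; the field layout follows my reading of the v3 spec, and the user id, password, and trust id are placeholders:

```python
# How a service user such as nova-compute might request a trust-scoped
# token (v3 token API; all credentials and ids below are placeholders,
# and the field layout follows my reading of the v3 spec).
def build_trust_scoped_auth(user_id, password, trust_id):
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {"user": {"id": user_id, "password": password}},
            },
            # Scoping to the trust yields a token carrying the trustor's
            # delegated role assignments:
            "scope": {"OS-TRUST:trust": {"id": trust_id}},
        }
    }

auth = build_trust_scoped_auth("nova-compute-id", "service-pass", "trust-123")
```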
Once delegations replace long lived tokens, we can shorten the lifespan of the tokens such that they cover multiple web round trips, but are as short as the acceptable window for processing revocation events. As a working value, assume a token will live around five minutes. Tokens that live this short a time will not require a revocation list. Without the need for a revocation list, we can drop recording the tokens in the backend system. Keystone can use the same cryptographic approach to validating tokens that the rest of the systems use.
Developing New Policy
When a user makes a call on a web service, they do not know the policy that will allow or deny them access. They only know the end state of success or failure. If a deployer wishes to deploy new policy, they will also need to establish a set of roles that users must have in order to satisfy that policy. For example, a deployer may wish to split access to Glance images into a reader role and a writer role. Most workflows will only require the writer role. When performing the deploy instance workflow, the user should delegate the reader role to the Nova compute service user, but not the writer role.
In order to create the delegations for these complex workflows, we are going to need a tool that will tell us what role assignments are required. To generate this information, we take a page from the book of SELinux: permissive policy enforcement. To test out new policy, the deployer will set up a test-bed OpenStack deployment. They will set the policy enforcement on the various services to “permissive,” which will allow all actions, but will record the operations that would not have been permitted, and the rules that would have denied access to those operations. The deployer can then perform the various workflows and generate a series of logs. From those logs, the deployer can deduce what role assignments to grant to users, and generate a work plan for delegating those roles.
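A permissive enforcement hook might look like the sketch below. This is not an existing oslo.policy feature; the names and the single-role rule shape are illustrative:

```python
import logging

# Sketch of "permissive" policy enforcement: allow everything, but log
# each operation a rule would have denied, and which role was missing.
# Names here are illustrative, not an existing oslo.policy feature.
log = logging.getLogger("policy.permissive")

def enforce(action, required_role, token_roles, permissive=True):
    if required_role in token_roles:
        return True
    if permissive:
        # Record what would have been denied so the deployer can later
        # mine the log for the role assignments a workflow needs.
        log.warning("would deny %s: missing role %r", action, required_role)
        return True
    return False

# Deployer runs the workflow; the log shows the reader role is missing:
allowed = enforce("image:download", "reader", ["writer"])
```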
A workplan is the set of delegated role assignments required to perform a workflow. If a user wishes to perform a workflow, they will take the workplan and generate a series of trusts. The trust ids can be collected up into a single document and attached to a routing slip, or some other message decoration that will follow the workflow throughout the system. Whenever a system needs to perform a remote operation on behalf of the user, they will get the trust ID out of the workflow and use it to fetch a token from Keystone.
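A workplan and its routing slip could be modeled as simple data. The shape below, and the trustee and role names in it, are illustrative assumptions, not a defined OpenStack format:

```python
# A workplan pairs each delegation step in a workflow with the role
# assignments it requires.  The data shape and all names are
# illustrative, not a defined OpenStack format.
WORKPLAN = [
    {"trustee": "nova-compute", "roles": ["image_reader", "network_member"]},
    {"trustee": "cinder-volume", "roles": ["volume_member"]},
]

def workplan_to_trusts(workplan, create_trust):
    """Create one trust per step; create_trust stands in for the real
    Keystone call and returns a trust id."""
    return [create_trust(step["trustee"], step["roles"]) for step in workplan]

# The resulting trust ids form the routing slip that follows the
# workflow through the system:
routing_slip = workplan_to_trusts(WORKPLAN, lambda t, r: "trust-for-" + t)
```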
OpenStack is in wide deployment today. The solutions proposed in this document cannot disrupt ongoing operations. Instead, we need a phased approach that adds features incrementally. Many of the required mechanisms are in place, but not required for operations. All mechanisms need to be optional, not only to facilitate development, but to allow for uninterrupted operations should a policy or mechanism block a critical operation. Ideally, the workplans and workflows will be implemented alongside the current use of bearer tokens. Systems will execute the delegation-based operations in conjunction with the bearer tokens, recording the failures but allowing existing workflows to continue. After a “breaking in” period where the deployers and administrators deal with the failing cases, they can switch over to enforcing the token binding and workplans.