When building a strategy for computing, we need to think at large scale. I’ve been trying to frame the discussion in terms of a million nodes in a dozen data centers. How is OpenStack going to be able to handle this?
Let’s get a few more of the requirements recorded before tackling the solutions.
In practice today, TripleO tops out at about 600 nodes, and a Keystone instance is limited to a single deployment, so people are running a separate Keystone server for each distinct TripleO cluster.
The Cells abstraction, which should be GA in the Train release of TripleO, should raise those numbers. Assuming a single Nova cell can manage about 500 nodes, and an OpenStack cluster is composed of, say, 20 cells, we have about 10,000 nodes per OpenStack cluster. To get to a million nodes, we need 100 such clusters. Even if we up the number of nodes managed by a single OpenStack deployment to 100,000, we are still going to need to synchronize 10 of them.
Keystone is involved in all operations in OpenStack; if Keystone is not available, OpenStack is not available. Galera scales to roughly 15 nodes maximum. If there are 3 nodes at a given site, that scales to 5 sites. So it is likely that we are going to have to run more than one Keystone server.
We know that upgrades are painful, essential, and inevitable. We need to be able to upgrade systems without impacting overall behavior and reliability. Often this means green/blue deployments of new features, or a canary cluster that is used before a feature is rolled out everywhere. If we need to upgrade all of the Keystone servers in lock step, we can’t do that. The fact that we are going to have two or more Keystone servers running at different versions means that a database-level sync will not work; N and N+1 versions of Keystone are going to have different schemas.
Each resource in Keystone has a unique identifier. For the majority of resources, the identifiers are currently generated as UUIDs. In addition, the identifiers are assigned by the system, and are not something an end user can specify when creating the resource. The theory is that this prevents identifier squatting, where a user creates a resource with a specified ID in order to deny that ID to another user, or to hijack the use of the identifier for some other reason. In practice it means that two Keystone deployments will have different identifiers for resources that should be common, such as roles, projects, or user groups.
This identifier skew means that to track something for audit purposes you can only correlate on the resource name. But resource names are modifiable.
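The skew follows directly from random ID generation; a minimal sketch (the variable names are illustrative, but Keystone IDs really are UUID4 hex strings):

```python
import uuid

# Two deployments each create a project named "web-tier".  Each
# Keystone assigns a random UUID4 hex string as the ID, so the
# "same" logical resource ends up with unrelated identifiers.
hub_id = uuid.uuid4().hex
spoke_id = uuid.uuid4().hex

# The IDs cannot be correlated; only the mutable name matches.
assert hub_id != spoke_id
```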
We’ve been batting around a few ideas that would have been wonderful if Keystone had implemented them early on. The biggest is to generate identifiers based on the unique data fed into them. For example, a Federated User has its ID generated from a hash of the username and the domain ID. This works if the data fed in is immutable. If, however, the string is mutable, and is changed, then the hash no longer matches the string and the approach is not usable for synchronization. This is the case for projects, roles, groups, and non-Federated users.
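A minimal sketch of the data-derived approach and where it breaks down (the `make_id` helper is hypothetical, not Keystone’s actual generator):

```python
import hashlib

def make_id(name: str, domain_id: str) -> str:
    # Hypothetical deterministic ID: hash the supposedly immutable inputs.
    return hashlib.sha256(f"{name}:{domain_id}".encode()).hexdigest()[:32]

domain = "default"
id_before = make_id("alice", domain)

# Two deployments derive the same ID from the same inputs...
assert make_id("alice", domain) == id_before

# ...but if the name is mutable and gets changed, the stored ID no
# longer matches the hash of the current data, and the scheme can
# no longer be used to match records for synchronization.
id_after_rename = make_id("alice.smith", domain)
assert id_after_rename != id_before
```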
The limiting factor for using the API to duplicate data from one Keystone system to another is the generation of the identifier. Since a new record always gets a new identifier, and the value for the identifier can only be generated by the system, the API does not allow matching of records.
However, allowing all users to specify the identifiers when creating records would create the potential for squatting on the identifier, which could in turn be used to block synchronization.
Thus, for normal record generation, the identifiers should be generated by the system, and explicit identifier specification should be reserved for the synchronization use case only.
With the advent of system-scoped roles, we can split the RBAC enforcement on the creation APIs. Normal creation should require a project- or domain-scoped token. Synchronization should require a system-scoped token.
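As a sketch, assuming oslo.policy-style check strings, the split might look like the following; the `identity:create_project_sync` rule name is hypothetical, invented here for the synchronization path:

```yaml
# policy.yaml sketch -- rule names for the sync path are hypothetical.
# Normal creation: requires a domain-scoped (or project-scoped) token.
"identity:create_project": "role:admin and domain_id:%(project.domain_id)s"
# Synchronization path, where the ID is specified explicitly:
# requires a system-scoped token.
"identity:create_project_sync": "role:admin and system_scope:all"
```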
Assuming we are using a hub-and-spoke model, all changes that go into the hub need to be reflected in the spokes. This is a non-trivial problem to solve.
The first approach might be to use a notification listener that would then generate an API call to each of the spokes upon creation. However, if implemented naively, messages will get dropped when remote systems are unavailable. In order to guarantee delivery, notifications must be kept in a queue prior to execution of the remote creates, and only removed once the creates have successfully completed. Since notifications are ephemeral, they will not be replayed if the notification listener is not available at creation time either. Thus, a process needs to sweep through the hub to regenerate any missed notifications. Since most resources do not have a creation timestamp, there is no way to filter this list for events that happened after a certain point in time.
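The queue-before-create idea can be sketched as a durable outbox; the table and helpers below are hypothetical, not an existing OpenStack mechanism, with SQLite standing in for the durable store:

```python
import json
import sqlite3

# Hypothetical durable outbox: record the event before attempting the
# remote creates; delete it only after every spoke has succeeded.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT)")

def enqueue(event: dict) -> int:
    # Persist the event first, so a crash or outage cannot lose it.
    cur = db.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps(event),))
    db.commit()
    return cur.lastrowid

def replicate(event_id: int, spokes, create) -> bool:
    # Remove the event only once every spoke create has completed;
    # otherwise it stays queued for a later retry sweep.
    ok = all(create(spoke) for spoke in spokes)
    if ok:
        db.execute("DELETE FROM outbox WHERE id = ?", (event_id,))
        db.commit()
    return ok

eid = enqueue({"resource": "project", "name": "web-tier"})
# One spoke is unreachable: the event remains queued for retry.
replicate(eid, ["spoke1", "spoke2"], lambda s: s != "spoke2")
remaining = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```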
Two Way Synchronization
When a user wants to get work done, they are going to need to make changes. Assuming that role assignments are done centrally, the only changes a user should be making in the spokes are specific to workflows inside the services at the spoke. For most cases, the user can be content with changes made at the center. However, for delegation, the user is going to want to make changes and see the effect of those changes within the scope of a given workflow; waiting for eventually consistent semantics is going to impact productivity.
Assuming the user needs to create an application credential in a spoke Keystone, the question then becomes “should we synchronize this credential with the central one.” While the immediate reaction would seem to be “no” we often find use cases that are not obvious but nevertheless essential to a broad class of users. In those cases, synchronization from spoke to hub is going to follow the same pattern as hub to spoke, with a later “trickle down” to the other spokes.
The issue is what to do when two different systems have different definitions of a resource. This could happen due to a network partition, or due to rogue automation creating resources in two different spokes. With mutable resources, it could also happen with creation in one spoke and modification in another spoke or in the hub.
There are two pieces to conflict resolution: which agent is responsible for resolving the conflict, and which data should be considered the correct version. An example of a resolution strategy: a single agent performs all synchronization, driven from the hub, and the hub version is considered definitive unless a spoke has been designated definitive for a specific subset of data. This strategy has the advantage of allowing local edits, but makes it more difficult to orchestrate them from the center. Contrast this with a strategy, still driven by a central agent, that always considers the hub version definitive: any changes made at a spoke will be overwritten by changes from the center.
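The hub-wins-with-overrides strategy can be sketched in a few lines; the function and resource shapes here are hypothetical:

```python
# Hypothetical hub-wins resolution, with an optional override set
# marking the spoke as authoritative for certain resource types.
def resolve(resource_type, hub_version, spoke_version,
            spoke_authoritative=frozenset()):
    if resource_type in spoke_authoritative:
        return spoke_version
    return hub_version

hub = {"name": "web-tier", "description": "set centrally"}
spoke = {"name": "web-tier", "description": "edited locally"}

# Default strategy: the local spoke edit is overwritten from the hub.
assert resolve("project", hub, spoke) == hub

# With the spoke designated authoritative for projects, the local
# edit survives synchronization.
assert resolve("project", hub, spoke,
               spoke_authoritative={"project"}) == spoke
```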
Keystone does not perform soft deletes; once a resource is gone, it is gone forever. Thus, if a resource is deleted from one Keystone datasource, and no record of that deletion has been captured, there is no way to reconcile that Keystone server with a separate one short of a complete list/diff of all instances of the resource. This is, obviously, a very expensive operation. It also requires applying the conflict resolution strategy to determine whether the resource should be deleted in one server, or created in the other.
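A sketch of that list/diff, with made-up IDs: absent any deletion records, a resource present on one side and missing on the other is ambiguous, and only the conflict resolution strategy can decide what the difference means.

```python
# Full list/diff reconciliation: compare the complete ID sets.
hub_ids = {"p1", "p2", "p3"}
spoke_ids = {"p2", "p3", "p4"}

# Present in the hub, absent in the spoke: create it in the spoke,
# or was it deliberately deleted there?
missing_in_spoke = hub_ids - spoke_ids

# Present in the spoke, absent in the hub: create it in the hub,
# or delete it from the spoke?
missing_in_hub = spoke_ids - hub_ids
```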
Synchronization between Keystones brings in all of the issues of an eventually consistent database model. Building mechanisms into the Keystone API to support synchronization is going to be necessary, but not sufficient, for building out a scalable cloud infrastructure.