Learning about the Overcloud Deploy Process

The process of deploying the overcloud goes through several technologies. Here’s what I’ve learned about tracing it.

I am not a Heat or Tripleo developer. I’ve just started working with Tripleo, and I’m trying to understand this based on what I can gather, and the documentation out there. And also from the little bit of experience I’ve had working with Tripleo. Anything I say here might be wrong. If someone that knows better can point out my errors, please do so.

[UPDATE]: Steve Hardy has corrected many points, and his comments have been noted inline.

To kick the whole thing off in the simplest case, you would run the command openstack overcloud deploy .
VM-config-changes-via-Heat-1

Roughly speaking, here is the sequence (as best as I can tell)

Â User typesÂ openstack overcloud deploy on the command line
This calls up the common cli, which parses the command, and matches the tripleo client with the overcloud deploy subcommand.
tripleo client is a thin wrapper around the Heat client, and calls the equivalent of heat stack-create overcloud
python-heatclient (after Keystone token stuff) calls the Heat API server with the URL and data to do a stack create
Heat makes the appropriate calls to Nova (running the Ironic driver) to activate a baremetal node and deploy the appropriate instance on it.
Before the node is up and running, Heat has posted Hiera data to the metadata server.
The newly provisioned machine will run cloud-init ~~which in turn runs os-collect-config~~.
[update] Steve Hardy’s response:

This isn’t strictly accurate – cloud-init is used to deliver some data that os-collect-config consumes (via the heat-local collector), but cloud-init isn’t involved with actually running os-collect-config (it’s just configured to start in the image).
os-collect-config will start polling for changes to the metadata.
[update] os-collect-config will start calling ~~Puppet Apply based on the hiera data~~ [UPDATE] os-refresh-config only, which then invokes a script that runs puppet. .
Steve’s note:

os-collect-config never runs puppet, it runs os-refresh-config only, which then invokes a script that runs puppet.
The Keystone Puppet module will set values in the Keystone config file, httpd/conf.d files, and perform other configuration work.

Here is a diagram of how os-collect-config is designed

When a controller image is built for Tripleo, Some portion of the Hiera data is stored in /etc/puppet/. There is a file /etc/puppet/hiera.yaml (which looks a lot like /etc/hiera.yaml, an RPM controlled file) and sub file in /etc/puppet/hieradata such as
/etc/puppet/hieradata/heat_config_ControllerOvercloudServicesDeployment_Step4.json

UPDATE: Response from Steve Hardy

This is kind-of correct – we wait for the server to become ACTIVE, which means the OS::Nova::Server resource is declared CREATE_COMPLETE. Then we do some network configuration, and *then* we post the hieradata via a heat software deployment.

So, we post the hieradata to the heat metadata API only after the node is up and running, and has it’s network configured (not before).

https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/controller.yaml#L610

Note the depends_on – we use that to control the ordering of configuration performed via heat.

However, the dynamic data seems to be stored in /var/lib/os-collect-config/

$ ls -la  /var/lib/os-collect-config/*json
-rw-------. 1 root root   2929 Jun 16 02:55 /var/lib/os-collect-config/ControllerAllNodesDeployment.json
-rw-------. 1 root root    187 Jun 16 02:55 /var/lib/os-collect-config/ControllerBootstrapNodeDeployment.json
-rw-------. 1 root root   1608 Jun 16 02:55 /var/lib/os-collect-config/ControllerCephDeployment.json
-rw-------. 1 root root    435 Jun 16 02:55 /var/lib/os-collect-config/ControllerClusterDeployment.json
-rw-------. 1 root root  36481 Jun 16 02:55 /var/lib/os-collect-config/ControllerDeployment.json
-rw-------. 1 root root    242 Jun 16 02:55 /var/lib/os-collect-config/ControllerSwiftDeployment.json
-rw-------. 1 root root   1071 Jun 16 02:55 /var/lib/os-collect-config/ec2.json
-rw-------. 1 root root    388 Jun 15 18:38 /var/lib/os-collect-config/heat_local.json
-rw-------. 1 root root   1325 Jun 16 02:55 /var/lib/os-collect-config/NetworkDeployment.json
-rw-------. 1 root root    557 Jun 15 19:56 /var/lib/os-collect-config/os_config_files.json
-rw-------. 1 root root 263313 Jun 16 02:55 /var/lib/os-collect-config/request.json
-rw-------. 1 root root   1187 Jun 16 02:55 /var/lib/os-collect-config/VipDeployment.json

For each of these files there are two older copies that end in .last and .orig as well.

In my previous post, I wrote about setting Keystone configuration options such as ‘identity/domain_specific_drivers_enabled’: value => ‘True’;. I can see this value set in /var/lib/os-collect-config/request.json with a large block keyed “config”.

When I ran the openstack overcloud deploy, one way that I was able to track what was happening on the node was to tail the journal like this:

 sudo journalctl -f | grep collect-config

Looking through the journal output, I can see the line that triggered the change:

... /Stage[main]/Main/Keystone_config[identity/domain_specific_drivers_enabled]/ensure: ...

Hi,

Thanks a for such a nice write up. It’s really helpful.

I’m trying to deploy RHOSP12 on 9 servers(3 controllers, 2 compute with ovsdpdk , 2 compute with sriov and 2 ceph storage nodes). Overcloud nodes are provisioned.
But overcloud stack is failed. Analysed the nested stacks it shows reason just as deployment cancelled. Also the log file /var/log/messages inside the provisoned overcloud nodes shows the below message related t os-collect-config:

Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4166] device (p2p2): state change: ip-config -> failed (reason ‘ip-config-unavailable’, sys-iface-state: ‘managed’)
Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4168] device (p2p2): Activation: failed for connection ‘System p2p2’
Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4175] device (em1): state change: failed -> disconnected (reason ‘none’, sys-iface-state: ‘managed’)
Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4200] device (em2): state change: failed -> disconnected (reason ‘none’, sys-iface-state: ‘managed’)
Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4222] device (p2p1): state change: failed -> disconnected (reason ‘none’, sys-iface-state: ‘managed’)
Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4247] device (p2p2): state change: failed -> disconnected (reason ‘none’, sys-iface-state: ‘managed’)
Jun 1 23:53:19 controller-2 ntpd[17497]: Deleting interface #422 em1, fe80::266e:96ff:fe5f:dcd0#123, interface stats: received=0, sent=0, dropped=0, active_time=43 secs
Jun 1 23:53:19 controller-2 ntpd[17497]: Deleting interface #421 em2, fe80::266e:96ff:fe5f:dcd2#123, interface stats: received=0, sent=0, dropped=0, active_time=43 secs
Jun 1 23:53:19 controller-2 ntpd[17497]: Deleting interface #420 p2p1, fe80::a236:9fff:feef:f72c#123, interface stats: received=0, sent=0, dropped=0, active_time=43 secs
Jun 1 23:53:19 controller-2 ntpd[17497]: Deleting interface #419 p2p2, fe80::a236:9fff:feef:f72e#123, interface stats: received=0, sent=0, dropped=0, active_time=43 secs
Jun 1 23:53:29 controller-2 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jun 1 23:53:29 controller-2 os-collect-config: No local metadata found ([‘/var/lib/os-collect-config/local-data’])
Jun 1 23:53:59 controller-2 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jun 1 23:53:59 controller-2 os-collect-config: No local metadata found ([‘/var/lib/os-collect-config/local-data’])
Jun 1 23:54:29 controller-2 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jun 1 23:54:29 controller-2 os-collect-config: No local metadata found ([‘/var/lib/os-collect-config/local-data’])
Jun 1 23:54:59 controller-2 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jun 1 23:54:59 controller-2 os-collect-config: No local metadata found ([‘/var/lib/os-collect-config/local-data’])

ANY sort of help or inputs for this? Please suggest

2 thoughts on “Learning about the Overcloud Deploy Process”

Praveen on June 1, 2018 at 9:04 pm said:

Hi,

Thanks a for such a nice write up. It’s really helpful.

I’m trying to deploy RHOSP12 on 9 servers(3 controllers, 2 compute with ovsdpdk , 2 compute with sriov and 2 ceph storage nodes). Overcloud nodes are provisioned.
But overcloud stack is failed. Analysed the nested stacks it shows reason just as deployment cancelled. Also the log file /var/log/messages inside the provisoned overcloud nodes shows the below message related t os-collect-config:

Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4166] device (p2p2): state change: ip-config -> failed (reason ‘ip-config-unavailable’, sys-iface-state: ‘managed’)
Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4168] device (p2p2): Activation: failed for connection ‘System p2p2’
Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4175] device (em1): state change: failed -> disconnected (reason ‘none’, sys-iface-state: ‘managed’)
Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4200] device (em2): state change: failed -> disconnected (reason ‘none’, sys-iface-state: ‘managed’)
Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4222] device (p2p1): state change: failed -> disconnected (reason ‘none’, sys-iface-state: ‘managed’)
Jun 1 23:53:18 controller-2 NetworkManager[1145]: [1527911598.4247] device (p2p2): state change: failed -> disconnected (reason ‘none’, sys-iface-state: ‘managed’)
Jun 1 23:53:19 controller-2 ntpd[17497]: Deleting interface #422 em1, fe80::266e:96ff:fe5f:dcd0#123, interface stats: received=0, sent=0, dropped=0, active_time=43 secs
Jun 1 23:53:19 controller-2 ntpd[17497]: Deleting interface #421 em2, fe80::266e:96ff:fe5f:dcd2#123, interface stats: received=0, sent=0, dropped=0, active_time=43 secs
Jun 1 23:53:19 controller-2 ntpd[17497]: Deleting interface #420 p2p1, fe80::a236:9fff:feef:f72c#123, interface stats: received=0, sent=0, dropped=0, active_time=43 secs
Jun 1 23:53:19 controller-2 ntpd[17497]: Deleting interface #419 p2p2, fe80::a236:9fff:feef:f72e#123, interface stats: received=0, sent=0, dropped=0, active_time=43 secs
Jun 1 23:53:29 controller-2 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jun 1 23:53:29 controller-2 os-collect-config: No local metadata found ([‘/var/lib/os-collect-config/local-data’])
Jun 1 23:53:59 controller-2 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jun 1 23:53:59 controller-2 os-collect-config: No local metadata found ([‘/var/lib/os-collect-config/local-data’])
Jun 1 23:54:29 controller-2 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jun 1 23:54:29 controller-2 os-collect-config: No local metadata found ([‘/var/lib/os-collect-config/local-data’])
Jun 1 23:54:59 controller-2 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jun 1 23:54:59 controller-2 os-collect-config: No local metadata found ([‘/var/lib/os-collect-config/local-data’])

ANY sort of help or inputs for this? Please suggest
Adam Young on June 2, 2018 at 5:45 pm said:

I’m not familiar with that error. I’d ask in #tripleo in Freenode IRC.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Adam Young's Web Log

The Notebook of a Programmer Climber Musician Ex-Soldier Woodworker and a few other things

Learning about the Overcloud Deploy Process

2 thoughts on “Learning about the Overcloud Deploy Process”

Leave a Reply