Hardy Steven has provided an invaluable reference with his troubleshooting blog post. However, I recently had a problem that didn’t quite match what he was showing. Zane Bitter got me oriented.
Upon a redeploy, I got a failure.
$ openstack stack list
+--------------------------------------+------------+---------------+---------------------+---------------------+
| ID                                   | Stack Name | Stack Status  | Creation Time       | Updated Time        |
+--------------------------------------+------------+---------------+---------------------+---------------------+
| 816c67ab-d360-4f9b-8811-ed2a346dde01 | overcloud  | UPDATE_FAILED | 2016-08-16T13:38:46 | 2016-08-16T14:41:54 |
+--------------------------------------+------------+---------------+---------------------+---------------------+
Listing the failed resources:
$ heat resource-list --nested-depth 5 overcloud | grep FAILED
| ControllerNodesPostDeployment | 7ae99682-597f-4562-9e58-4acffaf7aaac | OS::TripleO::ControllerPostDeployment | UPDATE_FAILED | 2016-08-16T14:44:42 | overcloud
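If you would rather not eyeball that wide row, a small filter can trim it down to the resource name, type, and owning stack. This is only a sketch: `failed_resources` is a name I made up, and the column positions are assumed from the output above, so they may differ with other client versions.

```shell
# Hypothetical helper: reads `heat resource-list --nested-depth N` table
# output on stdin and prints "name<TAB>type<TAB>stack" for FAILED rows.
# Assumes the column order shown above: name, id, type, status, time, stack.
failed_resources() {
    awk -F'|' '$5 ~ /FAILED/ {
        # Trim the padding spaces around each cell.
        for (i = 2; i <= 7; i++) gsub(/^ +| +$/, "", $i)
        print $2 "\t" $4 "\t" $7
    }'
}

# Usage (on the undercloud, with credentials sourced):
#   heat resource-list --nested-depth 5 overcloud | failed_resources
```

Matching on the status column rather than grepping the whole line avoids false hits on resources that merely have FAILED in their name.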
No deployment is listed among the failures. So how do we display the error? Show the resource named ControllerNodesPostDeployment, which belongs to the overcloud stack:
$ heat resource-show overcloud ControllerNodesPostDeployment
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property               | Value                                                                                                                                                               |
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes             | {}                                                                                                                                                                  |
| creation_time          | 2016-08-16T13:38:46                                                                                                                                                 |
| description            |                                                                                                                                                                     |
| links                  | http://192.0.2.1:8004/v1/7ec16202298c41f696b3f326790bebd3/stacks/overcloud/816c67ab-d360-4f9b-8811-ed2a346dde01/resources/ControllerNodesPostDeployment (self)      |
|                        | http://192.0.2.1:8004/v1/7ec16202298c41f696b3f326790bebd3/stacks/overcloud/816c67ab-d360-4f9b-8811-ed2a346dde01 (stack)                                             |
|                        | http://192.0.2.1:8004/v1/7ec16202298c41f696b3f326790bebd3/stacks/overcloud-ControllerNodesPostDeployment-qelkqyung4xr/7ae99682-597f-4562-9e58-4acffaf7aaac (nested) |
| logical_resource_id    | ControllerNodesPostDeployment                                                                                                                                       |
| physical_resource_id   | 7ae99682-597f-4562-9e58-4acffaf7aaac                                                                                                                                |
| required_by            | BlockStorageNodesPostDeployment                                                                                                                                     |
|                        | CephStorageNodesPostDeployment                                                                                                                                      |
| resource_name          | ControllerNodesPostDeployment                                                                                                                                       |
| resource_status        | UPDATE_FAILED                                                                                                                                                       |
| resource_status_reason | Engine went down during resource UPDATE                                                                                                                             |
| resource_type          | OS::TripleO::ControllerPostDeployment                                                                                                                               |
| updated_time           | 2016-08-16T14:44:42                                                                                                                                                 |
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Note this message:
Engine went down during resource UPDATE
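To pull out just that one field instead of scanning the whole table, something along these lines might do. It is a sketch, not an official client feature: `status_reason` is a hypothetical helper name, and it assumes the two-column Property/Value table layout that `heat resource-show` printed above.

```shell
# Hypothetical helper: reads `heat resource-show` table output on stdin
# and prints only the value of the resource_status_reason row.
status_reason() {
    awk -F'|' '$2 ~ /^ *resource_status_reason *$/ {
        # Strip the padding around the value cell before printing it.
        gsub(/^ +| +$/, "", $3)
        print $3
    }'
}

# Usage (on the undercloud, with credentials sourced):
#   heat resource-show overcloud ControllerNodesPostDeployment | status_reason
```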
Looking in the journal:
Aug 16 15:16:15 undercloud kernel: Out of memory: Kill process 17127 (heat-engine) score 60 or sacrifice child
Aug 16 15:16:15 undercloud kernel: Killed process 17127 (heat-engine) total-vm:834052kB, anon-rss:480936kB, file-rss:1384kB
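Finding those two lines by hand in a busy journal is tedious, so here is a rough filter for them. Treat it as a sketch: `oom_victims` is a name I invented, and the pattern assumes the "Killed process PID (name)" wording shown above, which can vary between kernel versions.

```shell
# Hypothetical helper: reads kernel log lines on stdin and prints
# "PID NAME" for every process the OOM killer reports as killed.
# Only the "Killed process ..." line matches, so each kill is listed once.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage on a systemd host (may need root to read kernel messages):
#   journalctl -k --no-pager | oom_victims
```

If heat-engine shows up in that list around the time of the failed update, the "Engine went down" message is explained.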
The kernel's OOM killer took out heat-engine, so the undercloud needs more memory. Just like Brody said, we are going to need a bigger boat.