Jg ironic rebalance #5

Open
wants to merge 5 commits into base: stackhpc/train

Commits on Sep 10, 2020

  1. Add functional regression test for bug 1853009

    Bug 1853009 describes a race condition involving multiple nova-compute
    services with ironic. As the compute services start up, the hash ring
    rebalances, and the compute services have an inconsistent view of which
    is responsible for a compute node.
    
    The sequence of actions here is adapted from a real world log [1], where
    multiple nova-compute services were started simultaneously. In some
    cases mocks are used to simulate race conditions.
    
    There are three main issues with the behaviour:
    
    * host2 deletes the orphan compute node after host1 has taken
      ownership of it.
    
    * host1 assumes that another compute service will not delete its nodes.
      Once a node is in rt.compute_nodes, it is not removed again unless the
      node is orphaned. This prevents host1 from recreating the compute
      node.
    
    * host1 assumes that another compute service will not delete its
      resource providers. Once an RP is in the provider tree, it is not
      removed.
    
    This functional test documents the current behaviour, with the idea that
    it can be updated as this behaviour is fixed.
    
    [1] http://paste.openstack.org/show/786272/
    
    Co-Authored-By: Matt Riedemann <[email protected]>
    
    Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
    Related-Bug: #1853009
    Related-Bug: #1853159
    markgoddard authored and JohnGarbutt committed Sep 10, 2020
    5cfa700

Commits on Sep 11, 2020

  1. Clear rebalanced compute nodes from resource tracker

    There is a race condition in nova-compute with the ironic virt driver as
    nodes get rebalanced. It can lead to compute nodes being removed in the
    DB and not repopulated. Ultimately this prevents these nodes from being
    scheduled to.
    
    The issue being addressed here is that if a compute node is deleted by a host
    which thinks it is an orphan, then the compute host that actually owns the node
    might not recreate it if the node is already in its resource tracker cache.
    
    This change fixes the issue by clearing nodes from the resource tracker cache
    for which a compute node entry does not exist. Then, when the available
    resource for the node is updated, the compute node object is not found in the
    cache and gets recreated.
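
A minimal sketch of the cache-clearing idea described above. This is a toy model, not nova's resource tracker: `ToyResourceTracker`, its methods, and the dict standing in for the database are all hypothetical names introduced for illustration.

```python
class ToyResourceTracker:
    """Toy stand-in for a resource tracker (hypothetical, not nova's class)."""

    def __init__(self, db):
        self.db = db              # dict: nodename -> record, stands in for the DB
        self.compute_nodes = {}   # local cache: nodename -> record

    def clean_compute_node_cache(self, nodes_in_db):
        """Drop cached nodes that no longer have a compute node entry."""
        stale = set(self.compute_nodes) - set(nodes_in_db)
        for nodename in stale:
            del self.compute_nodes[nodename]
        return stale

    def update_available_resource(self, nodename, host):
        """Create the DB record only if the node is missing from the cache."""
        if nodename not in self.compute_nodes:
            record = {"host": host, "hypervisor_hostname": nodename}
            self.db[nodename] = record
            self.compute_nodes[nodename] = record
        return self.compute_nodes[nodename]


db = {}
rt = ToyResourceTracker(db)
rt.update_available_resource("node-1", "host1")

# Another host deletes the record, believing it is an orphan.
del db["node-1"]

# Without cleaning, the cached entry masks the missing DB record.
rt.update_available_resource("node-1", "host1")
recreated_without_clean = "node-1" in db

# After cleaning, the next update recreates the record.
rt.clean_compute_node_cache(nodes_in_db=db.keys())
rt.update_available_resource("node-1", "host1")
recreated_with_clean = "node-1" in db
```

In this model the record only comes back once the stale cache entry has been dropped, which is the behaviour the change aims for.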
    
    Change-Id: I39241223b447fcc671161c370dbf16e1773b684a
    Partial-Bug: #1853009
    markgoddard authored and JohnGarbutt committed Sep 11, 2020
    e2c3c94
  2. Invalidate provider tree when compute node disappears

    There is a race condition in nova-compute with the ironic virt driver
    as nodes get rebalanced. It can lead to compute nodes being removed in
    the DB and not repopulated. Ultimately this prevents these nodes from
    being scheduled to.
    
    The issue being addressed here is that if a compute node is deleted by a
    host which thinks it is an orphan, then the resource provider for that
    node might also be deleted. The compute host that owns the node might
    not recreate the resource provider if it exists in the provider tree
    cache.
    
    This change fixes the issue by clearing resource providers from the
    provider tree cache for which a compute node entry does not exist. Then,
    when the available resource for the node is updated, the resource
    providers are not found in the cache and get recreated in placement.
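
The same pattern applied to resource providers can be sketched as follows. This is a toy model under stated assumptions: `ToyReportClient`, its methods, and the set standing in for placement are hypothetical and do not reproduce nova's report client.

```python
class ToyReportClient:
    """Toy client with a local provider cache (hypothetical names)."""

    def __init__(self, placement):
        self.placement = placement   # set of provider UUIDs, stands in for placement
        self.provider_tree = set()   # locally cached provider UUIDs

    def ensure_resource_provider(self, uuid):
        """Create the provider in placement unless the cache says it exists."""
        if uuid in self.provider_tree:
            return False             # cache hit: nothing is sent to placement
        self.placement.add(uuid)
        self.provider_tree.add(uuid)
        return True

    def invalidate_resource_provider(self, uuid):
        """Forget a provider so the next update recreates it."""
        self.provider_tree.discard(uuid)


placement = set()
client = ToyReportClient(placement)
client.ensure_resource_provider("rp-1")

# Another host deletes the provider along with the "orphan" compute node.
placement.discard("rp-1")

# The stale cache entry prevents recreation.
created_with_stale_cache = client.ensure_resource_provider("rp-1")

# Invalidating the cache entry lets the next update recreate the provider.
client.invalidate_resource_provider("rp-1")
created_after_invalidate = client.ensure_resource_provider("rp-1")
```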
    
    Change-Id: Ia53ff43e6964963cdf295604ba0fb7171389606e
    Related-Bug: #1853009
    Related-Bug: #1841481
    markgoddard authored and JohnGarbutt committed Sep 11, 2020
    905ee15
  3. Prevent deletion of a compute node belonging to another host

    There is a race condition in nova-compute with the ironic virt driver as
    nodes get rebalanced. It can lead to compute nodes being removed in the
    DB and not repopulated. Ultimately this prevents these nodes from being
    scheduled to.
    
    The main race condition involved is in update_available_resource in
    the compute manager. When the list of compute nodes is queried, there is
    a compute node belonging to the host that it does not expect to be
    managing, i.e. it is an orphan. Between that time and deleting the
    orphan, the real owner of the compute node takes ownership of it (in
    the resource tracker). However, the node is still deleted as the first
    host is unaware of the ownership change.
    
    This change prevents this from occurring by filtering on the host when
    deleting a compute node. If another compute host has taken ownership of
    a node, it will have updated the host field and this will prevent
    deletion from occurring. The first host sees this has happened via the
    ComputeHostNotFound exception, and avoids deleting its resource
    provider.
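
A minimal sketch of a host-filtered delete, using the standard library `sqlite3` module. The table layout, the `compute_node_delete` signature, and the local `ComputeHostNotFound` class are assumptions for illustration; nova's DB API soft-deletes rows through SQLAlchemy rather than issuing a plain DELETE.

```python
import sqlite3


class ComputeHostNotFound(Exception):
    """Local stand-in for the exception named in the commit message."""


def compute_node_delete(conn, node_id, host):
    """Delete the node only if it still belongs to `host`."""
    cur = conn.execute(
        "DELETE FROM compute_nodes WHERE id = ? AND host = ?",
        (node_id, host),
    )
    conn.commit()
    if cur.rowcount != 1:
        # No row matched: the node is gone or another host now owns it.
        raise ComputeHostNotFound(host)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compute_nodes (id INTEGER PRIMARY KEY, host TEXT)")
conn.execute("INSERT INTO compute_nodes (id, host) VALUES (1, 'host2')")
conn.commit()

# host1 takes ownership after host2 has already listed the node as an orphan.
conn.execute("UPDATE compute_nodes SET host = 'host1' WHERE id = 1")
conn.commit()

# host2 now attempts the delete it had planned.
try:
    compute_node_delete(conn, 1, "host2")
    delete_blocked = False
except ComputeHostNotFound:
    delete_blocked = True   # host2 should also skip deleting the provider

remaining = conn.execute("SELECT COUNT(*) FROM compute_nodes").fetchone()[0]
```

Putting the host check into the WHERE clause makes the ownership test and the delete a single statement, so no window remains between them.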
    
    Closes-Bug: #1853009
    Related-Bug: #1841481
    Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
    markgoddard authored and JohnGarbutt committed Sep 11, 2020
    56a44f3
  4. Fix inactive session error in compute node creation

    In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
    restored, to ensure we can reuse ironic node UUIDs as compute node
    UUIDs. While this seems to largely work, it results in some nasty errors
    being generated [3]:
    
        InvalidRequestError This session is in 'inactive' state, due to the
        SQL transaction being rolled back; no further SQL can be emitted
        within this transaction.
    
    This happens because compute_node_create is decorated with
    pick_context_manager_writer, which begins a transaction. While
    _compute_node_get_and_update_deleted claims that calling a second
    pick_context_manager_writer decorated function will begin a new
    subtransaction, this does not appear to be the case.
    
    This change removes pick_context_manager_writer from the
    compute_node_create function, and adds a new _compute_node_create
    function which ensures the transaction is finished if
    _compute_node_get_and_update_deleted is called.
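
The restructuring described above can be sketched with a toy model. Everything here is hypothetical: `ToySession`, the `writer` decorator, and the function names only imitate the shape of the problem, on the assumption that a nested writer call joins the open transaction and that a failed statement leaves that transaction unusable. It is not oslo.db or SQLAlchemy behaviour in detail.

```python
import functools


class DuplicateEntry(Exception):
    pass


class InactiveSessionError(Exception):
    pass


class ToySession:
    def __init__(self):
        self.depth = 0         # nesting level of writer-decorated calls
        self.inactive = False  # True once the open transaction has failed

    def execute(self, fail=False):
        if self.inactive:
            raise InactiveSessionError("session is in 'inactive' state")
        if fail:
            self.inactive = True   # failed statement poisons the transaction
            raise DuplicateEntry()


SESSION = ToySession()


def writer(func):
    """Toy writer decorator: nested calls join the open transaction."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        SESSION.depth += 1
        try:
            return func(*args, **kwargs)
        finally:
            SESSION.depth -= 1
            if SESSION.depth == 0:
                SESSION.inactive = False   # outermost exit ends the transaction
    return wrapper


@writer
def get_and_update_deleted():
    SESSION.execute()
    return "restored"


@writer
def create_buggy():
    try:
        SESSION.execute(fail=True)
    except DuplicateEntry:
        return get_and_update_deleted()   # still inside the failed transaction


@writer
def _create():
    SESSION.execute(fail=True)


def create_fixed():
    # Not decorated: the transaction opened by _create() has finished
    # before get_and_update_deleted() starts its own.
    try:
        return _create()
    except DuplicateEntry:
        return get_and_update_deleted()


try:
    create_buggy()
    buggy_failed = False
except InactiveSessionError:
    buggy_failed = True

fixed_result = create_fixed()
```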
    
    The new unit test added here fails without this change.
    
    This change marks the removal of the final FIXME from the functional
    test added in [4].
    
    [1] https://bugs.launchpad.net/nova/+bug/1839560
    [2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
    [3] http://paste.openstack.org/show/786350/
    [4] https://review.opendev.org/#/c/695012/
    
    Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
    Closes-Bug: #1853159
    Related-Bug: #1853009
    markgoddard authored and JohnGarbutt committed Sep 11, 2020
    4fe087c