Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ACM in C++ #2067

Merged
merged 2 commits into from
Apr 24, 2024
Merged

Remove ACM in C++ #2067

merged 2 commits into from
Apr 24, 2024

Conversation

bernardnormier
Copy link
Member

@bernardnormier bernardnormier commented Apr 19, 2024

This PR removes ACM from C++ and the ACM JS test (since it relies on the now removed C++ test).

It's layered over #2059.

Note: this PR does not remove the various SessionTimeout properties and related implementation code (keep alive and the like) from Glacier2 and IceGrid. This is for a follow-up PR.

@bernardnormier bernardnormier marked this pull request as draft April 19, 2024 21:46
@@ -15,7 +15,6 @@
<property name="Glacier2.Client.Endpoints" value="${client-endpoints}"/>
<property name="Glacier2.Server.Endpoints" value="${server-endpoints}"/>
<property name="Glacier2.InstanceName" value="${instance-name}"/>
<property name="Glacier2.SessionTimeout" value="${session-timeout}"/>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the IceGrid SessionTimeout properties in a follow-up PR.

{
return current.con->getACM().timeout;
// TODO: better way to retrieve idle timeout
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a minimum, we should have a single spot for the default (60s).

}

// TODO: verify remote idle timeout is compatible with local idle timeout
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related question: what are compatible idle timeouts?
Do we take into account the setting for the local EnableIdleCheck too?

@@ -495,14 +497,9 @@ SessionHelperI::connected(const Glacier2::RouterPrx& router, const optional<Glac
_session = session;
_connected = true;

if (acmTimeout > 0)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why the close callback was sent only for acmTimeout > 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't know why we do this.

{
// Ensure all the connections are finished and reapable at this point.
vector<Ice::ConnectionIPtr> cons;
_monitor->swapReapedConnections(cons);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the reaping logic in the ACM code.

Presumably when the new idle check code aborts a connection, this connection gets cleaned up. It doesn't need to be "reaped".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection doesn't get removed from the factory because the connection doesn't have a reference to the factory.

When a connection is done, it's added to a "reap connection" collection. This collection is held by the ACM code and it's indeed not correct since it has nothing to do with ACM.

Each time a new connection is added to the outgoing/incoming connection factories, the "reap connection" collection is checked and connections from this collection are removed from the factory.

Removing this code implies that connections will leak until the factory is destroyed.

Either you add another class to hold the "reap connection" collection or you fix the connection class to have a circular reference to the connection factory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems simpler for the connection to hold a weak_ptr on the connection factory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can probably also just be a shared_ptr given that the connections are closed on factory destruction and the connections maps are cleared in the factories.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to remove the connection from the factory in "reap" (see the latest code) but this doesn't work.

For outgoing connections, this results in a lock acquisition order deadlock because reap is always called with the ConnectionI mutex lock. It also hangs for incoming connections but I didn't debug it to identify the issue.

Any suggestion?
I'd rather not start a thread just to cleanup these "finished" connections from time to time in the background.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to hold the connection lock while calling reap, can we just set a flag and called after we release the connection lock?
Instead of

if (completedUpcallCount > 0)
{
    std::lock_guard lock(_mutex);
    ....
        else if (_state == StateFinished)
        {
            reap();
        }
        _conditionVariable.notify_all();
    }
}

Do something like:

bool reapConnection;
if (completedUpcallCount > 0)
{
    std::lock_guard lock(_mutex);
    ....
        else if (_state == StateFinished)
        {
            reapConnection = true;
        }
        _conditionVariable.notify_all();
    }
}

if (reapConnection)
{
    reap();
}

// Close the connection if we didn't receive a heartbeat in
// the last period.
//
setState(StateClosed, make_exception_ptr(ConnectionTimeoutException(__FILE__, __LINE__)));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of this PR, ConnectionTimeoutException is not longer created in C++. I will remove it in a follow-up PR.

{
remove(_connections, p->connector(), p);
remove(_connectionsByEndpoint, p->endpoint(), p);
remove(_connectionsByEndpoint, p->endpoint()->compress(true), p);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer remove closed connections, it seems this is only clear when the factory is destroyed.

Copy link
Member

@bentoi bentoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the IceGrid failure, it's most likely because the admin session got destroyed by IceGrid because it wasn't kept alive. Is the client still sending heartbeats on the session's connection?

@@ -495,14 +497,9 @@ SessionHelperI::connected(const Glacier2::RouterPrx& router, const optional<Glac
_session = session;
_connected = true;

if (acmTimeout > 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't know why we do this.

{
// Ensure all the connections are finished and reapable at this point.
vector<Ice::ConnectionIPtr> cons;
_monitor->swapReapedConnections(cons);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection doesn't get removed from the factory because the connection doesn't have a reference to the factory.

When a connection is done, it's added to a "reap connection" collection. This collection is held by the ACM code and it's indeed not correct since it has nothing to do with ACM.

Each time a new connection is added to the outgoing/incoming connection factories, the "reap connection" collection is checked and connections from this collection are removed from the factory.

Removing this code implies that connections will leak until the factory is destroyed.

Either you add another class to hold the "reap connection" collection or you fix the connection class to have a circular reference to the connection factory.

@bernardnormier
Copy link
Member Author

On the IceGrid failure, it's most likely because the admin session got destroyed by IceGrid because it wasn't kept alive.
Is the client still sending heartbeats on the session's connection?

The issue could be indeed due to a difference between the old ACM heartbeat "always mode" (removed by this PR) and the new idle timeout heartbeats, where heartbeats are not sent when some other write occurs.

I already updated the IceGrid code to synchronize the default for its 3 SessionTimeout to 60 seconds (just like the default idle timeout). But if IceGrid relies on frequent heartbeats (not just any incoming message), this could indeed be the reason.

@bernardnormier bernardnormier changed the title (Draft) Remove ACM in C++ Remove ACM in C++ Apr 23, 2024
@bernardnormier
Copy link
Member Author

On the IceGrid failure, it's most likely because the admin session got destroyed by IceGrid because it wasn't kept alive.
Is the client still sending heartbeats on the session's connection?

The issue could be indeed due to a difference between the old ACM heartbeat "always mode" (removed by this PR) and the new idle timeout heartbeats, where heartbeats are not sent when some other write occurs.

I commented out the reaping code in IceGrid and it fixes this failure on Windows. It does not appear we have tests for this IceGrid reaping code (to be removed in a follow-up PR).

@bernardnormier bernardnormier marked this pull request as ready for review April 23, 2024 17:04
Copy link
Member

@externl externl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. It would be good to open issues for some of the TODOs unless you plan on getting to them soon.

try
{
acmTimeout = router->getACMTimeout();
remoteIdleTimeout = router->getACMTimeout();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we plan on renaming/replacing this function getACMTimeout?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are unfortunately Slice-defined operations. As a result, we can't rename them without breaking on-the-wire compatibility.

For Glacier2, we want to keep on the wire compatibility in both directions:

  • an Ice 3.7 client using an Ice 3.8 server
  • an Ice 3.8 client using an Ice 3.7 server

For IceGrid, I am not sure. Do we want to support deployments with a mix of IceGrid 3.7 and 3.8 for the registry replicas and nodes? If yes, it gets complicated given the way "keep alive" are implemented in IceGrid.

{
// Ensure all the connections are finished and reapable at this point.
vector<Ice::ConnectionIPtr> cons;
_monitor->swapReapedConnections(cons);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to hold the connection lock while calling reap, can we just set a flag and called after we release the connection lock?
Instead of

if (completedUpcallCount > 0)
{
    std::lock_guard lock(_mutex);
    ....
        else if (_state == StateFinished)
        {
            reap();
        }
        _conditionVariable.notify_all();
    }
}

Do something like:

bool reapConnection;
if (completedUpcallCount > 0)
{
    std::lock_guard lock(_mutex);
    ....
        else if (_state == StateFinished)
        {
            reapConnection = true;
        }
        _conditionVariable.notify_all();
    }
}

if (reapConnection)
{
    reap();
}

Copy link
Member

@bentoi bentoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling reap() outside the connection lock as suggested by José could indeed be the solution.

@bernardnormier
Copy link
Member Author

Fixed as suggested by Jose and Benoit.

@bernardnormier
Copy link
Member Author

Several tests now display the following message without failing:

*** [1/1] Running cpp/Ice/info tests ***
[ running client/server test - 04/24/24 20:53:24 ]
testing proxy endpoint information... ok
test object adapter endpoint information... !! 04/24/24 20:53:24597 /Users/bernard/builds/ice/cpp/test/Ice/info/build/macosx/shared/client: error: exception in `Ice.ThreadPool.Server':
   std::exception
   event handler: work queue
ok
test connection endpoint information... ok
testing connection information... ok
!! 04/24/24 20:53:24600 /Users/bernard/builds/ice/cpp/test/Ice/info/build/macosx/shared/server: error: exception in `Ice.ThreadPool.Server':
   std::exception
   event handler: work queue

Ran 1 tests in 0.55 seconds
1 succeeded

This "error" was introduced by this PR but I don't know yet how. Commenting out the removeConnection code doesn't fix it.

@bernardnormier
Copy link
Member Author

Several tests now display the following message without failing:

Fixed. I was occasionally calling a null std::function.

@pepone
Copy link
Member

pepone commented Apr 24, 2024

Several tests now display the following message without failing:

We need to fix the tracing code, why is it showing just std::exception, it seems like a missing call to what()

// From ThreadPool.cpp
        catch (const exception& ex)
        {
            Error out(_instance->initializationData().logger);
            out << "exception in `" << _prefix << "':\n" << ex << "\nevent handler: " << current._handler->toString();
        }

@bernardnormier
Copy link
Member Author

Several tests now display the following message without failing:

We need to fix the tracing code, why is it showing just std::exception, it seems like a missing call to what()

The code appears correct - we overload << for std::exception for the logger. Not sure why we get std::exception here.

@bernardnormier bernardnormier merged commit cb549b2 into zeroc-ice:main Apr 24, 2024
17 checks passed
@bernardnormier bernardnormier deleted the acm6 branch May 10, 2024 23:42
@bernardnormier bernardnormier mentioned this pull request May 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants