Remove ACM in C++ #2067

bernardnormier · 2024-04-19T21:45:57Z

This PR removes ACM from C++ and the ACM JS test (since it relies on the now removed C++ test).

It's layered over #2059.

Note: this PR does not remove the various SessionTimeout properties and related implementation code (keep alive and the like) from Glacier2 and IceGrid. This is for a follow-up PR.

bernardnormier · 2024-04-21T13:26:35Z

cpp/config/templates.xml

@@ -15,7 +15,6 @@
            <property name="Glacier2.Client.Endpoints" value="${client-endpoints}"/>
            <property name="Glacier2.Server.Endpoints" value="${server-endpoints}"/>
            <property name="Glacier2.InstanceName" value="${instance-name}"/>
-            <property name="Glacier2.SessionTimeout" value="${session-timeout}"/>


I will remove the IceGrid SessionTimeout properties in a follow-up PR.

bernardnormier · 2024-04-21T13:28:38Z

cpp/src/Glacier2/SessionRouterI.cpp

 {
-    return current.con->getACM().timeout;
+    // TODO: better way to retrieve idle timeout


At a minimum, we should have a single spot for the default (60s).
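One way to keep the 60-second default in a single spot is a shared constant that every caller falls back to. This is only a sketch: `defaultIdleTimeout` and `effectiveIdleTimeout` are hypothetical names, not existing Ice APIs, and the optional argument stands in for whatever configured value the caller has.

```cpp
#include <chrono>
#include <optional>

// Hypothetical single definition of the default idle timeout (60s, per the comment above).
inline constexpr std::chrono::seconds defaultIdleTimeout{60};

// Returns the configured idle timeout when one is set, otherwise the shared default.
inline std::chrono::seconds
effectiveIdleTimeout(std::optional<std::chrono::seconds> configured)
{
    return configured.value_or(defaultIdleTimeout);
}
```

With this shape, the Glacier2 router code and the connection code would both read the same constant instead of repeating the literal.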

bernardnormier · 2024-04-21T13:31:44Z

cpp/src/Glacier2Lib/SessionHelper.cpp

    }

+    // TODO: verify remote idle timeout is compatible with local idle timeout


Related question: what are compatible idle timeouts?
Do we take into account the setting for the local EnableIdleCheck too?

bernardnormier · 2024-04-21T13:33:33Z

cpp/src/Glacier2Lib/SessionHelper.cpp

@@ -495,14 +497,9 @@ SessionHelperI::connected(const Glacier2::RouterPrx& router, const optional<Glac
            _session = session;
            _connected = true;

-            if (acmTimeout > 0)


It's not clear to me why the close callback was sent only for acmTimeout > 0.

I also don't know why we do this.

bernardnormier · 2024-04-21T13:35:03Z

cpp/src/Ice/ConnectionFactory.cpp

-        {
-            // Ensure all the connections are finished and reapable at this point.
-            vector<Ice::ConnectionIPtr> cons;
-            _monitor->swapReapedConnections(cons);


I don't understand the reaping logic in the ACM code.

Presumably when the new idle check code aborts a connection, this connection gets cleaned up. It doesn't need to be "reaped".

The connection doesn't get removed from the factory because the connection doesn't have a reference to the factory.

When a connection is done, it's added to a "reap connection" collection. This collection is held by the ACM code and it's indeed not correct since it has nothing to do with ACM.

Each time a new connection is added to the outgoing/incoming connection factories, the "reap connection" collection is checked and connections from this collection are removed from the factory.

Removing this code implies that connections will leak until the factory is destroyed.

Either you add another class to hold the "reap connection" collection or you fix the connection class to have a circular reference to the connection factory.
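The first option, a separate class that holds the "reap connection" collection, could look roughly like the sketch below. All names (`ReapedConnections`, `Factory`, `Connection`) are simplified stand-ins invented for illustration, not the actual Ice classes; the factory drains the collection each time a connection is added, as described above.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct Connection {};
using ConnectionPtr = std::shared_ptr<Connection>;

// Holds finished connections until a factory collects them (independent of ACM).
class ReapedConnections
{
public:
    void add(ConnectionPtr connection)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _reaped.push_back(std::move(connection));
    }

    // Moves the pending connections out, leaving the collection empty.
    std::vector<ConnectionPtr> swap()
    {
        std::lock_guard<std::mutex> lock(_mutex);
        std::vector<ConnectionPtr> result;
        result.swap(_reaped);
        return result;
    }

private:
    std::mutex _mutex;
    std::vector<ConnectionPtr> _reaped;
};

class Factory
{
public:
    explicit Factory(ReapedConnections& reaped) : _reaped(reaped) {}

    // Adding a connection first drops the connections reaped since the last call.
    void add(ConnectionPtr connection)
    {
        for (const auto& finished : _reaped.swap())
        {
            _connections.erase(
                std::remove(_connections.begin(), _connections.end(), finished),
                _connections.end());
        }
        _connections.push_back(std::move(connection));
    }

    std::size_t size() const { return _connections.size(); }

private:
    ReapedConnections& _reaped;
    std::vector<ConnectionPtr> _connections;
};
```

A finished connection stays in the factory until the next `add`, which matches the "checked each time a new connection is added" behavior rather than an immediate removal.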

It seems simpler for the connection to hold a weak_ptr on the connection factory.

Yes, it can probably also just be a shared_ptr given that the connections are closed on factory destruction and the connections maps are cleared in the factories.
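A minimal sketch of the back-reference idea, assuming a connection that removes itself from its factory once it is finished. `Factory`, `Connection`, `create`, `remove` and `finished` are hypothetical names for illustration only; the real ConnectionI and factory classes are considerably more involved.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

class Factory;

class Connection : public std::enable_shared_from_this<Connection>
{
public:
    explicit Connection(std::weak_ptr<Factory> factory) : _factory(std::move(factory)) {}

    // Called once the connection is finished; a no-op if the factory is already gone.
    void finished();

private:
    std::weak_ptr<Factory> _factory;
};

class Factory : public std::enable_shared_from_this<Factory>
{
public:
    // The factory must itself be owned by a shared_ptr for weak_from_this to work.
    std::shared_ptr<Connection> create()
    {
        auto connection = std::make_shared<Connection>(weak_from_this());
        std::lock_guard<std::mutex> lock(_mutex);
        _connections.push_back(connection);
        return connection;
    }

    void remove(const std::shared_ptr<Connection>& connection)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _connections.erase(
            std::remove(_connections.begin(), _connections.end(), connection),
            _connections.end());
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(_mutex);
        return _connections.size();
    }

private:
    mutable std::mutex _mutex;
    std::vector<std::shared_ptr<Connection>> _connections;
};

inline void Connection::finished()
{
    // lock() yields nullptr when the factory has already been destroyed.
    if (auto factory = _factory.lock())
    {
        factory->remove(shared_from_this());
    }
}
```

The weak_ptr avoids an ownership cycle between factory and connection; a shared_ptr would also work only if the factory reliably clears its maps on destruction, as noted above.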

I tried to remove the connection from the factory in "reap" (see the latest code) but this doesn't work.

For outgoing connections, this results in a lock acquisition order deadlock because reap is always called with the ConnectionI mutex lock. It also hangs for incoming connections but I didn't debug it to identify the issue.

Any suggestion?
I'd rather not start a thread just to cleanup these "finished" connections from time to time in the background.

Why do we need to hold the connection lock while calling reap? Can we just set a flag and call reap after we release the connection lock?
Instead of:

    if (completedUpcallCount > 0)
    {
        std::lock_guard lock(_mutex);
        ....
            else if (_state == StateFinished)
            {
                reap();
            }
            _conditionVariable.notify_all();
        }
    }

Do something like:

    bool reapConnection = false;
    if (completedUpcallCount > 0)
    {
        std::lock_guard lock(_mutex);
        ....
            else if (_state == StateFinished)
            {
                reapConnection = true;
            }
            _conditionVariable.notify_all();
        }
    }
    if (reapConnection)
    {
        reap();
    }

bernardnormier · 2024-04-21T13:38:13Z

cpp/src/Ice/ConnectionI.cpp

-            // Close the connection if we didn't receive a heartbeat in
-            // the last period.
-            //
-            setState(StateClosed, make_exception_ptr(ConnectionTimeoutException(__FILE__, __LINE__)));


As of this PR, ConnectionTimeoutException is no longer created in C++. I will remove it in a follow-up PR.

pepone · 2024-04-22T09:01:46Z

cpp/src/Ice/ConnectionFactory.cpp

-        {
-            remove(_connections, p->connector(), p);
-            remove(_connectionsByEndpoint, p->endpoint(), p);
-            remove(_connectionsByEndpoint, p->endpoint()->compress(true), p);


We no longer remove closed connections; it seems they are only cleared when the factory is destroyed.

bentoi

On the IceGrid failure, it's most likely because the admin session got destroyed by IceGrid because it wasn't kept alive. Is the client still sending heartbeats on the session's connection?

bentoi · 2024-04-22T10:27:52Z

cpp/src/Glacier2Lib/SessionHelper.cpp

@@ -495,14 +497,9 @@ SessionHelperI::connected(const Glacier2::RouterPrx& router, const optional<Glac
            _session = session;
            _connected = true;

-            if (acmTimeout > 0)


I also don't know why we do this.

bentoi · 2024-04-22T10:38:23Z

cpp/src/Ice/ConnectionFactory.cpp

-        {
-            // Ensure all the connections are finished and reapable at this point.
-            vector<Ice::ConnectionIPtr> cons;
-            _monitor->swapReapedConnections(cons);


The connection doesn't get removed from the factory because the connection doesn't have a reference to the factory.

When a connection is done, it's added to a "reap connection" collection. This collection is held by the ACM code and it's indeed not correct since it has nothing to do with ACM.

Each time a new connection is added to the outgoing/incoming connection factories, the "reap connection" collection is checked and connections from this collection are removed from the factory.

Removing this code implies that connections will leak until the factory is destroyed.

Either you add another class to hold the "reap connection" collection or you fix the connection class to have a circular reference to the connection factory.

bernardnormier · 2024-04-22T14:56:58Z

On the IceGrid failure, it's most likely because the admin session got destroyed by IceGrid because it wasn't kept alive.
Is the client still sending heartbeats on the session's connection?

The issue could indeed be due to a difference between the old ACM heartbeat "always mode" (removed by this PR) and the new idle timeout heartbeats, where heartbeats are not sent when some other write occurs.
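The difference can be pictured with a small decision function. This is a hedged sketch, not the Ice implementation: `shouldSendHeartbeat` is a hypothetical name, and the heartbeat interval of half the idle timeout is an assumption chosen for illustration. The point is only that a recent write of any kind suppresses the heartbeat, whereas the old "always" mode sent one on every timer tick.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Idle-timeout style: skip the heartbeat when some other write happened recently.
// Assumption: a heartbeat is due once half the idle timeout has passed since the last write.
inline bool shouldSendHeartbeat(
    Clock::time_point lastWrite,
    Clock::time_point now,
    std::chrono::seconds idleTimeout)
{
    return now - lastWrite >= idleTimeout / 2;
}
```

A server that counts only heartbeats, rather than any incoming message, as proof of life would therefore see fewer of them from a busy client.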

I already updated the IceGrid code to synchronize the default for its three SessionTimeout properties to 60 seconds (just like the default idle timeout). But if IceGrid relies on frequent heartbeats (not just any incoming message), this could indeed be the reason.

bernardnormier · 2024-04-23T17:03:52Z

On the IceGrid failure, it's most likely because the admin session got destroyed by IceGrid because it wasn't kept alive.
Is the client still sending heartbeats on the session's connection?

The issue could indeed be due to a difference between the old ACM heartbeat "always mode" (removed by this PR) and the new idle timeout heartbeats, where heartbeats are not sent when some other write occurs.

I commented out the reaping code in IceGrid and it fixes this failure on Windows. It does not appear we have tests for this IceGrid reaping code (to be removed in a follow-up PR).

externl

Looks good. It would be good to open issues for some of the TODOs unless you plan on getting to them soon.

externl · 2024-04-23T18:05:07Z

cpp/src/Glacier2Lib/SessionHelper.cpp

    try
    {
-        acmTimeout = router->getACMTimeout();
+        remoteIdleTimeout = router->getACMTimeout();


Do we plan on renaming/replacing this function getACMTimeout?

These are unfortunately Slice-defined operations. As a result, we can't rename them without breaking on-the-wire compatibility.

For Glacier2, we want to keep on the wire compatibility in both directions:

an Ice 3.7 client using an Ice 3.8 server

an Ice 3.8 client using an Ice 3.7 server

For IceGrid, I am not sure. Do we want to support deployments with a mix of IceGrid 3.7 and 3.8 for the registry replicas and nodes? If yes, it gets complicated given the way keep-alives are implemented in IceGrid.

pepone · 2024-04-24T09:40:19Z

cpp/src/Ice/ConnectionFactory.cpp

-        {
-            // Ensure all the connections are finished and reapable at this point.
-            vector<Ice::ConnectionIPtr> cons;
-            _monitor->swapReapedConnections(cons);


Why do we need to hold the connection lock while calling reap? Can we just set a flag and call reap after we release the connection lock?
Instead of:

    if (completedUpcallCount > 0)
    {
        std::lock_guard lock(_mutex);
        ....
            else if (_state == StateFinished)
            {
                reap();
            }
            _conditionVariable.notify_all();
        }
    }

Do something like:

    bool reapConnection = false;
    if (completedUpcallCount > 0)
    {
        std::lock_guard lock(_mutex);
        ....
            else if (_state == StateFinished)
            {
                reapConnection = true;
            }
            _conditionVariable.notify_all();
        }
    }
    if (reapConnection)
    {
        reap();
    }

bentoi

Calling reap() outside the connection lock as suggested by José could indeed be the solution.
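The pattern of recording the decision under the lock and acting on it after release can be shown in isolation. The sketch below is self-contained and uses hypothetical names: this `Connection` is a toy class, `_finished` stands in for `_state == StateFinished`, and the `reap` callback stands in for whatever removes the connection from its factory.

```cpp
#include <functional>
#include <mutex>
#include <utility>

class Connection
{
public:
    Connection(int pendingUpcalls, std::function<void()> reap)
        : _upcallCount(pendingUpcalls), _reap(std::move(reap))
    {
    }

    void finish()
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _finished = true;
    }

    int pendingUpcalls() const
    {
        std::lock_guard<std::mutex> lock(_mutex);
        return _upcallCount;
    }

    void upcallsCompleted(int completedUpcallCount)
    {
        bool reapConnection = false;
        if (completedUpcallCount > 0)
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _upcallCount -= completedUpcallCount;
            if (_upcallCount == 0 && _finished)
            {
                reapConnection = true; // only record the decision while locked
            }
        }
        // The connection mutex is released here, so the callback may take other
        // locks (or call back into this object) without a lock-order problem.
        if (reapConnection && _reap)
        {
            _reap();
        }
    }

private:
    mutable std::mutex _mutex;
    int _upcallCount;
    bool _finished = false;
    std::function<void()> _reap;
};
```

Because the callback runs after the lock_guard goes out of scope, it can safely call `pendingUpcalls()` or acquire a factory mutex, which is the situation that deadlocked when reap was called with the connection mutex held.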

bernardnormier · 2024-04-24T18:01:17Z

Fixed as suggested by Jose and Benoit.

bernardnormier · 2024-04-24T18:55:13Z

Several tests now display the following message without failing:

*** [1/1] Running cpp/Ice/info tests ***
[ running client/server test - 04/24/24 20:53:24 ]
testing proxy endpoint information... ok
test object adapter endpoint information... !! 04/24/24 20:53:24597 /Users/bernard/builds/ice/cpp/test/Ice/info/build/macosx/shared/client: error: exception in `Ice.ThreadPool.Server':
   std::exception
   event handler: work queue
ok
test connection endpoint information... ok
testing connection information... ok
!! 04/24/24 20:53:24600 /Users/bernard/builds/ice/cpp/test/Ice/info/build/macosx/shared/server: error: exception in `Ice.ThreadPool.Server':
   std::exception
   event handler: work queue

Ran 1 tests in 0.55 seconds
1 succeeded

This "error" was introduced by this PR but I don't know yet how. Commenting out the removeConnection code doesn't fix it.

bernardnormier · 2024-04-24T19:02:10Z

Several tests now display the following message without failing:

Fixed. I was occasionally calling a null std::function.
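For reference, invoking an empty std::function throws std::bad_function_call, which derives from std::exception; the text returned by its what() is implementation-defined. The sketch below shows the failure and a guarded call. `invokeUnchecked` and `invokeIfSet` are hypothetical helper names, not code from this PR.

```cpp
#include <functional>
#include <string>

// Invokes the callback and reports what happened, without letting the exception escape.
inline std::string invokeUnchecked(const std::function<void()>& callback)
{
    try
    {
        callback(); // throws std::bad_function_call when callback is empty
        return "called";
    }
    catch (const std::bad_function_call&)
    {
        return "bad_function_call";
    }
}

// Guarded variant: an empty std::function converts to false, so the call is skipped.
inline bool invokeIfSet(const std::function<void()>& callback)
{
    if (callback)
    {
        callback();
        return true;
    }
    return false;
}
```

Checking the function object before the call is the usual fix when a callback is optional.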

pepone · 2024-04-24T19:06:40Z

Several tests now display the following message without failing:

We need to fix the tracing code. Why is it showing just std::exception? It seems like a missing call to what().

// From ThreadPool.cpp
        catch (const exception& ex)
        {
            Error out(_instance->initializationData().logger);
            out << "exception in `" << _prefix << "':\n" << ex << "\nevent handler: " << current._handler->toString();
        }

bernardnormier · 2024-04-24T20:03:50Z

Several tests now display the following message without failing:

We need to fix the tracing code. Why is it showing just std::exception? It seems like a missing call to what().

The code appears correct - we overload << for std::exception for the logger. Not sure why we get std::exception here.

bernardnormier marked this pull request as draft April 19, 2024 21:46

bernardnormier requested review from bentoi, pepone and externl April 21, 2024 13:24

bernardnormier commented Apr 21, 2024

pepone reviewed Apr 22, 2024

bentoi reviewed Apr 22, 2024

bernardnormier requested review from bentoi and pepone April 23, 2024 17:01

bernardnormier changed the title ~~(Draft) Remove ACM in C++~~ Remove ACM in C++ Apr 23, 2024

bernardnormier marked this pull request as ready for review April 23, 2024 17:04

externl approved these changes Apr 23, 2024

pepone approved these changes Apr 24, 2024

bentoi approved these changes Apr 24, 2024

Remove ACM in C++ and derived languages

17740a9

bernardnormier force-pushed the acm6 branch from 4536d9e to 17740a9 Compare April 24, 2024 18:00

Bug fix

bfc6652

bernardnormier merged commit cb549b2 into zeroc-ice:main Apr 24, 2024
17 checks passed

bernardnormier deleted the acm6 branch May 10, 2024 23:42

bernardnormier mentioned this pull request May 20, 2024

Remove ACM in C# #2202

Merged
