Remove ACM in C++ #2067
Conversation
@@ -15,7 +15,6 @@
<property name="Glacier2.Client.Endpoints" value="${client-endpoints}"/>
<property name="Glacier2.Server.Endpoints" value="${server-endpoints}"/>
<property name="Glacier2.InstanceName" value="${instance-name}"/>
<property name="Glacier2.SessionTimeout" value="${session-timeout}"/>
I will remove the IceGrid SessionTimeout properties in a follow-up PR.
{
    return current.con->getACM().timeout;
    // TODO: better way to retrieve idle timeout
At a minimum, we should have a single spot for the default (60s).
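One way to get a single spot for the default could look like the following. This is a minimal sketch; `IceInternal::defaultIdleTimeout` is a hypothetical name, not the actual layout of the Ice source.

```cpp
#include <chrono>

namespace IceInternal
{
    // Hypothetical single source of truth for the default idle timeout (60s).
    // The property parsing code and the Glacier2/IceGrid session code would
    // both read this constant instead of hard-coding 60 in several places.
    constexpr std::chrono::seconds defaultIdleTimeout{60};
}
```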
}
// TODO: verify remote idle timeout is compatible with local idle timeout |
Related question: what are compatible idle timeouts?
Do we take into account the setting for the local EnableIdleCheck too?
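One possible reading of "compatible" (purely an illustration of the question, not the actual rule): a side running an idle check only stays healthy if the peer is guaranteed to write something within the check period, so the remote idle timeout must be positive and no larger than the local one. A hedged sketch, with `EnableIdleCheck` folded in:

```cpp
// Hypothetical compatibility rule for the TODO above. This is one possible
// interpretation, not the actual Ice semantics: if the local side checks for
// idleness, the peer must send heartbeats (or other writes) at least as often
// as the local idle timeout, otherwise healthy connections get aborted.
bool idleTimeoutsCompatible(int localIdleTimeout, int remoteIdleTimeout, bool localEnableIdleCheck)
{
    if (!localEnableIdleCheck || localIdleTimeout <= 0)
    {
        return true; // no local idle check, nothing to verify
    }
    return remoteIdleTimeout > 0 && remoteIdleTimeout <= localIdleTimeout;
}
```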
@@ -495,14 +497,9 @@ SessionHelperI::connected(const Glacier2::RouterPrx& router, const optional<Glac
_session = session;
_connected = true;
if (acmTimeout > 0) |
It's not clear to me why the close callback was sent only for acmTimeout > 0.
I also don't know why we do this.
{
    // Ensure all the connections are finished and reapable at this point.
    vector<Ice::ConnectionIPtr> cons;
    _monitor->swapReapedConnections(cons);
I don't understand the reaping logic in the ACM code.
Presumably when the new idle check code aborts a connection, this connection gets cleaned up. It doesn't need to be "reaped".
The connection doesn't get removed from the factory because the connection doesn't have a reference to the factory.
When a connection is done, it's added to a "reap connection" collection. This collection is held by the ACM code and it's indeed not correct since it has nothing to do with ACM.
Each time a new connection is added to the outgoing/incoming connection factories, the "reap connection" collection is checked and connections from this collection are removed from the factory.
Removing this code implies that connections will leak until the factory is destroyed.
Either you add another class to hold the "reap connection" collection or you fix the connection class to have a circular reference to the connection factory.
It seems simpler for the connection to hold a weak_ptr on the connection factory.
Yes, it can probably also just be a shared_ptr given that the connections are closed on factory destruction and the connections maps are cleared in the factories.
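A minimal sketch of that idea, with illustrative names rather than the actual Ice classes: the connection keeps a `weak_ptr` to its factory and removes itself from the factory's map when it is finished.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

class ConnectionFactory;

// Stand-in for Ice::ConnectionI: holds a weak_ptr back to its factory.
class Connection
{
public:
    explicit Connection(std::weak_ptr<ConnectionFactory> factory) : _factory(std::move(factory)) {}
    void finished(); // defined below, after ConnectionFactory

private:
    std::weak_ptr<ConnectionFactory> _factory;
};

// Stand-in for the outgoing/incoming connection factory.
class ConnectionFactory : public std::enable_shared_from_this<ConnectionFactory>
{
public:
    std::shared_ptr<Connection> createConnection()
    {
        auto connection = std::make_shared<Connection>(weak_from_this());
        std::lock_guard<std::mutex> lock(_mutex);
        _connections.push_back(connection);
        return connection;
    }

    void removeConnection(const Connection* connection)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _connections.erase(
            std::remove_if(
                _connections.begin(),
                _connections.end(),
                [connection](const std::shared_ptr<Connection>& c) { return c.get() == connection; }),
            _connections.end());
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(_mutex);
        return _connections.size();
    }

private:
    mutable std::mutex _mutex;
    std::vector<std::shared_ptr<Connection>> _connections;
};

void Connection::finished()
{
    // If the factory is already destroyed, there is nothing to clean up.
    if (auto factory = _factory.lock())
    {
        factory->removeConnection(this);
    }
}
```

Note that in the real code the call into the factory must not be made while holding the connection's own mutex, or the two locks can be acquired in opposite orders.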
I tried to remove the connection from the factory in "reap" (see the latest code) but this doesn't work.
For outgoing connections, this results in a lock acquisition order deadlock because reap is always called with the ConnectionI mutex lock. It also hangs for incoming connections but I didn't debug it to identify the issue.
Any suggestion?
I'd rather not start a thread just to cleanup these "finished" connections from time to time in the background.
Why do we need to hold the connection lock while calling reap? Can we just set a flag and call reap() after we release the connection lock?
Instead of
if (completedUpcallCount > 0)
{
    std::lock_guard lock(_mutex);
    ....
    else if (_state == StateFinished)
    {
        reap();
    }
    _conditionVariable.notify_all();
    }
}
Do something like:
bool reapConnection = false;
if (completedUpcallCount > 0)
{
    std::lock_guard lock(_mutex);
    ....
    else if (_state == StateFinished)
    {
        reapConnection = true;
    }
    _conditionVariable.notify_all();
    }
}
if (reapConnection)
{
    reap();
}
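Reduced to a minimal, self-contained form (illustrative names; the real code has more states and upcall accounting), the flag-based pattern above is:

```cpp
#include <mutex>

enum State { StateActive, StateFinished };

struct ConnectionLike
{
    std::mutex _mutex;
    State _state = StateFinished;
    int reapCount = 0;

    void reap() { ++reapCount; } // stand-in for the real factory cleanup

    void upcallCompleted()
    {
        bool reapConnection = false; // must be initialized on every path
        {
            std::lock_guard<std::mutex> lock(_mutex);
            if (_state == StateFinished)
            {
                reapConnection = true;
            }
        }
        // The connection lock is released here, so reap() is free to take
        // other locks (e.g. the connection factory's) without creating a
        // lock acquisition order deadlock.
        if (reapConnection)
        {
            reap();
        }
    }
};
```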
// Close the connection if we didn't receive a heartbeat in
// the last period.
//
setState(StateClosed, make_exception_ptr(ConnectionTimeoutException(__FILE__, __LINE__)));
As of this PR, ConnectionTimeoutException is no longer created in C++. I will remove it in a follow-up PR.
{
    remove(_connections, p->connector(), p);
    remove(_connectionsByEndpoint, p->endpoint(), p);
    remove(_connectionsByEndpoint, p->endpoint()->compress(true), p);
We no longer remove closed connections; it seems these maps are only cleared when the factory is destroyed.
On the IceGrid failure, it's most likely because the admin session got destroyed by IceGrid because it wasn't kept alive. Is the client still sending heartbeats on the session's connection?
The issue could indeed be due to a difference between the old ACM heartbeat "always" mode (removed by this PR) and the new idle timeout heartbeats, where heartbeats are not sent when some other write occurs. I already updated the IceGrid code to synchronize the default for its 3 SessionTimeout properties to 60 seconds (just like the default idle timeout). But if IceGrid relies on frequent heartbeats (not just any incoming message), this could indeed be the reason.
I commented out the reaping code in IceGrid and it fixes this failure on Windows. It does not appear we have tests for this IceGrid reaping code (to be removed in a follow-up PR).
Looks good. It would be good to open issues for some of the TODOs unless you plan on getting to them soon.
try
{
    acmTimeout = router->getACMTimeout();
    remoteIdleTimeout = router->getACMTimeout();
Do we plan on renaming/replacing this function getACMTimeout?
These are unfortunately Slice-defined operations. As a result, we can't rename them without breaking on-the-wire compatibility.
For Glacier2, we want to keep on the wire compatibility in both directions:
- an Ice 3.7 client using an Ice 3.8 server
- an Ice 3.8 client using an Ice 3.7 server
For IceGrid, I am not sure. Do we want to support deployments with a mix of IceGrid 3.7 and 3.8 for the registry replicas and nodes? If yes, it gets complicated given the way "keep alive" is implemented in IceGrid.
Calling reap() outside the connection lock as suggested by José could indeed be the solution.
Fixed as suggested by José and Benoit.
Several tests now display the following message without failing:
This "error" was introduced by this PR but I don't know yet how. Commenting out the removeConnection code doesn't fix it.
Fixed. I was occasionally calling a null std::function.
We need to fix the tracing code, why is it showing just
The code appears correct - we overload
This PR removes ACM from C++ and the ACM JS test (since it relies on the now removed C++ test).
It's layered over #2059.
Note: this PR does not remove the various SessionTimeout properties and related implementation code (keep alive and the like) from Glacier2 and IceGrid. This is for a follow-up PR.