Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initiating connection migration with tmb? #98

Open
javafanboy opened this issue Mar 14, 2023 · 4 comments
Open

initiating connection migration with tmb? #98

javafanboy opened this issue Mar 14, 2023 · 4 comments
Labels

Comments

@javafanboy
Copy link

javafanboy commented Mar 14, 2023

We are getting quite frequent warnings in our Coherence log on storage enabled nodes mentioning "initiating connection migration with tmb" and I would like some info about what it means on a more technical level and perhaps suggestions on most common causes?

I am guessing some problem on the physical network as well as long GC pauses could result in more or less any network related warning in Coherence but are there also other possible reasons and are there any tips on how to further debug the problem?

2023-03-13 20:17:56.310/95085.948 Oracle Coherence CE 14.1.1.0.12 (thread=SelectionService(channels=112, selector=MultiplexedSelector(sun.nio.ch.EPollSelectorImpl@3676ac27), id=1660451908), member=17): tmb://138.106.96.41:9001.49982 initiating connection migration with tmb://138.106.96.25:33316.40573 after 15s ack timeout health(read=false, write=true), receiptWait=Message "PartialValueResponse"
{
FromMember=Member(Id=17, Timestamp=2023-03-12 17:53:14.974, Address=138.106.96.41:9001, MachineId=46694, Location=site:sss.se.xxxxyyyy.com,machine:l4041p.sss.se.xxxxyyyy.com,process:391,member:l4041p-2, Role=storagenode)
FromMessageId=0
MessagePartCount=0
PendingCount=0
BufferCounter=1
MessageType=70
ToPollId=19827300
Poll=null
Packets
{
}
Service=PartitionedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, OldestMemberId=1, LocalStorage=enabled, PartitionCount=601, BackupCount=1, AssignedPartitions=18, BackupPartitions=19, CoordinatorId=1}
ToMemberSet=MemberSet(Size=1
Member(Id=179, Timestamp=2023-03-12 18:43:36.847, Address=138.106.96.25:33316, MachineId=48549, Location=site:sss.se.xxxxyyyy.com,machine:l4025p,process:3473,member:l4025p_11990, Role=scex)
)
NotifyDelivery=false
}: peer=tmb://138.106.96.25:33316.40573, state=ACTIVE, socket=MultiplexedSocket{Socket[addr=/138.106.96.25,port=38114,localport=9001]}, migrations=17, bytes(in=104371345, out=101784244), flushlock false, bufferedOut=0B, unflushed=0B, delivered(in=203177, out=197772), timeout(ack=0ns), interestOps=1, unflushed receipt=0, receiptReturn 0, isReceiptFlushRequired false, bufferedIn(), msgs(in=95922, out=99203/99206)
2023-03-13 20:17:56.310/95085.948 Oracle Coherence CE 14.1.1.0.12 (thread=SelectionService(channels=112, selector=MultiplexedSelector(sun.nio.ch.EPollSelectorImpl@3676ac27), id=1660451908), member=17): tmb://138.106.96.41:9001.49982 initiating connection migration with tmb://138.106.96.32:41070.40752 after 15s ack timeout health(read=true, write=false), receiptWait=null: peer=tmb://138.106.96.32:41070.40752, state=ACTIVE, socket=MultiplexedSocket{Socket[addr=/138.106.96.32,port=41070,localport=36388]}, migrations=5, bytes(in=95752773, out=99458811), flushlock false, bufferedOut=1.54KB, unflushed=0B, delivered(in=192506, out=187239), timeout(ack=0ns), interestOps=1, unflushed receipt=0, receiptReturn 0, isReceiptFlushRequired false, bufferedIn(), msgs(in=90667, out=93950/93953)

@javafanboy javafanboy added the RFA label Mar 14, 2023
@javafanboy
Copy link
Author

javafanboy commented Mar 17, 2023

We continue seeing this message semi frequently for several days so I do not think we have any underlying network problem (we run on an enterprise class DC network where any disturbances are resolved quickly) and I can't see any excessive GC pauses in our log...

Is this a message we need to act on or not critical?

@javafanboy
Copy link
Author

javafanboy commented Mar 17, 2023

Seem to be triggered in class com.oracle.coherence.common.internal.net.socketbus.BufferedSocketBus method checkHealth(long) but it is still not clear to me exactly what ack it is that is timing out or what the code is trying to "do about it" i.e. the "migration":

        else if (ldtNow >= ldtAckTimeout)
            {
            // timeout expired
            final int cMultCap = 10;
            long      cMillisTimeout = f_driver.getDependencies().getAckTimeoutMillis();
            Duration  dur            = new Duration(cMillisTimeout * Math.min(cMultCap, m_cUnackLast + 1), Duration.Magnitude.MILLI);

            getLogger().log(makeRecord(Level.WARNING,
                    "{0} initiating connection migration with {1} after {2} ack timeout health(read={3}, write={4}), receiptWait={5}: {6}",
                    getLocalEndPoint(), getPeer(), dur, fReadHealthy, fWriteHealthy, oReceiptUnacked, BufferedConnection.this));

@mgamanho
Copy link
Contributor

Hi, these messages can be quite common depending on system size, load, context (application busy, GC, ...) and as you can see they are harmless.

The heartbeat (health check) is exchanged regularly by cluster members to ensure the cluster is whole. Everyone looks after each other, so that generates a fair amount of ancillary activity for which once in a while we detect issues. At that point we decide to "migrate" the connection, which is merely opening a new connection and ditching the old one. In some cases a "fresh" connection may resolve OS or JVM congestion issues. In most cases it is just a precaution that has no consequences.

These heartbeat messages are purely for housekeeping and they are timed relatively aggressively to weed out potential issues. Application/data traffic is not impacted. If you do see them frequently, however, they are an indication that some tuning may be necessary: heap and GC pressure, network health, native memory (OS/hardware) for example. Running network performance tests can show you if network settings are ok, they can be eye opening.

Let us know how it goes.

@javafanboy
Copy link
Author

javafanboy commented Mar 17, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants