Issue: JacORB client opening a new connection to a persistent ACE-TAO server, forwarded via ACE-TAO ImplementationRepository to a process that is dead can enter an infinite loop. Analysis: org.jacorb.orb.factory.AbstractSocketFactory.java: try { return doCreateSocket(host, port, timeout); } catch(SocketTimeoutException e) { throw new TIMEOUT(e.toString()); } Problem (1), this can also throw ConnectException, minor-message="timeout connecting to socket". This is using a 1.6 JVM, but I believe is true since 1.1. Doesn't seem to be a way to translate this to TIMEOUT which it probably should be. At the very least, this means client sees TRANSIENT later, not TIMEOUT. Also means we go through an additional set of retry loops in ClientIIOPConnection.java::connect()). org.jacorb.orb.iiop.ClientIIOPConnection.java::createSocket() if (exception instanceof SocketTimeoutException) { throw new TIMEOUT("connection timeout of " + timeout + " milliseconds expired: " + exception); } else if( exception instanceof IOException ) { throw (IOException)exception; } else { //not expected, because all used methods just throw IOExceptions or TIMEOUT //but... never say never ;o) throw new IOException ( "Unexpected exception occured: " + exception.toString() ); } Problem (2) We possibly just caught a TIMEOUT (as above) and translated it back to IOException?? How's that work then. Not seen this in practice, because I am seeing the ConnectException in my test harness, but seems possible from reading code. Problem (3) org.omg.jacorb.orb.Delegate.java if ( cfe instanceof org.omg.CORBA.TRANSIENT && try_rebind() ) Whoops, that was a TRANSIENT when it should have been a TIMEOUT. I'm about to have a try_rebind(). Here's were I get a bit hazy. I *think* there's code in try_rebind() that is meant to check whether I am rebinding to an IOR I know to be bad and prevent an infinite loop, but it doesn't work. Flow of control I see: original IOR = TAO ImR reference connect to ior, recieve location forward new ior = my broken server connect fails lastFailedIOR = my broken server try_rebind() rebind to original ior ok = TAO ImR reference ooh look, all is well, set lastFailedIOR = null connect to ior, recieve location forward Rinse and repeat. Now either this is wrong: piorLastFailed = null; // supplied byte Kevin Heifner, OCI As this resets the last field ior after rebinding to my original ImR reference, when the forwarded reference fails again I am in a loop. Or if locate_on_bind is true, then the lines that set piorOriginal = null; are surely wrong, as they throw away my ImR reference and it isn't reset by the forward-because I have just done it. If my forwarded reference is ok to start with, but later I need to rebind, I have nothing to rebind to! I don't think we can fix 1 and 2 properly and change to TIMEOUT - that ConnectException is painful, and what if I have a genuine TRANSIENT that my servers cant recover from, puts the JacORB client into a loop anyway - so the fix has to be to make the Delegate code work, regardless of whether locate_on_bind is on or off. Sorry, it's a bit late - might revisit this tomorrow and tidy up.
Things I don't understand that prevent me writing a patch I'm confident to submit: 1) Why would we ever set piorOriginal=null after we have set it originally. 2) How can we make this work when locate_on_bind is false? Initial forwarding is ok, and genuine failover and rebind to new ior is ok, but what happens if I have a genuine temporary comms problem, ImR reforwards me to the /same/ ior=piorLastFailed again, how do I know not to refuse to go to it... Unrecoverable failure is better than infinite loop, but not much.
Created attachment 391 [details] Patch to Delegate class to fix infinite loop bug Attached a patch that applies to the org.jacorb.orb.Delegate class in the method 'public boolean is_a( org.omg.CORBA.Object self, String logical_type_id ). It adds a counter to the while loop at the end of the method that causes it to terminate after 5 iterations to prevent an infinite loop. The patch is based off of the 2.3.0 release of jacorb and not the latest 2.3.1 release, but it's simple enough to apply manually. It's probably not an ideal solution, but it gave us a quick fix that we needed for this issue. Submitting it for consideration for the next planned release of jacorb in hopes that either it will be accepted as a fix or that it will prompt a better fix to be found.
I've added a patch to the Delegate class to address this bug. Since there are several builtin operations that could potentially trigger the same infinite loop, I created a helper function that encapsulates the desired behavior, and applied the fix there. Also, to ensure no change to the status quo, I added a configuration property, jacorb.maxBuiltinRetries, to allow users to specify a limiter. No value set will cause the loop to retry forever. Search for this property in etc/jacorb_properties.template for an example of its use. Note that this is really an interim patch addressing a symptom of a larger problem. The larger problem is that of errant interaction with at least the TAO ImR and servers attached to the same. Part of that behavior is also captured in bug 896. The proper solution, which will likely render this fix redundant, I believe is sufficiently difficult as to require funding to develop. Please visit the support page of jacorb.org to contact a company to work who can develop a proper solution.