Read only archive ; use https://github.com/JacORB/JacORB/issues for new issues
Bug 810 - is_a infinite loop due to socket creation timeout & try_rebind
Summary: is_a infinite loop due to socket creation timeout & try_rebind
Status: RESOLVED FIXED
Alias: None
Product: JacORB
Classification: Unclassified
Component: ORB (show other bugs)
Version: 2.3.0
Hardware: PC All
: P2 normal
Assignee: Gerald Brose
URL:
Depends on:
Blocks: 896
  Show dependency tree
 
Reported: 2007-11-12 22:29 CET by Richard Ridgway
Modified: 2011-12-01 14:46 CET (History)
1 user (show)

See Also:


Attachments
Patch to Delegate class to fix infinite loop bug (304 bytes, application/octet-stream)
2011-11-15 22:33 CET, Richard
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Richard Ridgway 2007-11-12 22:29:02 CET
Issue:
JacORB client opening a new connection to a persistent ACE-TAO server, 
forwarded via ACE-TAO ImplementationRepository to a process that is dead can 
enter an infinite loop.

Analysis:

org.jacorb.orb.factory.AbstractSocketFactory.java:

try
{
    return doCreateSocket(host, port, timeout);
}
catch(SocketTimeoutException e)
{
    throw new TIMEOUT(e.toString());
}

Problem (1), this can also throw ConnectException, minor-message="timeout 
connecting to socket". This is using a 1.6 JVM, but I believe is true since 
1.1. Doesn't seem to be a way to translate this to TIMEOUT which it probably 
should be. At the very least, this means client sees TRANSIENT later, not 
TIMEOUT. Also means we go through an additional set of retry loops in 
ClientIIOPConnection.java::connect()).

org.jacorb.orb.iiop.ClientIIOPConnection.java::createSocket()

if (exception instanceof SocketTimeoutException)
{
    throw new TIMEOUT("connection timeout of " + timeout + " milliseconds 
expired: " + exception);
}
else if( exception instanceof IOException )
{
    throw (IOException)exception;
}
else
{
    //not expected, because all used methods just throw IOExceptions or TIMEOUT
    //but... never say never ;o)
    throw new IOException ( "Unexpected exception occured: " + 
exception.toString() );
}

Problem (2) We possibly just caught a TIMEOUT (as above) and translated it 
back to IOException?? How's that work then. Not seen this in practice, because 
I am seeing the ConnectException in my test harness, but seems possible from 
reading code.


Problem (3)
org.omg.jacorb.orb.Delegate.java

if ( cfe instanceof org.omg.CORBA.TRANSIENT && try_rebind() )

Whoops, that was a TRANSIENT when it should have been a TIMEOUT. I'm about to 
have a try_rebind().

Here's were I get a bit hazy. I *think* there's code in try_rebind() that is 
meant to check whether I am rebinding to an IOR I know to be bad and prevent 
an infinite loop, but it doesn't work.

Flow of control I see:

original IOR = TAO ImR reference
connect to ior, recieve location forward
new ior = my broken server
connect fails
lastFailedIOR = my broken server
try_rebind()
rebind to original ior ok = TAO ImR reference
ooh look, all is well, set lastFailedIOR = null
connect to ior, recieve location forward

Rinse and repeat.


Now either this is wrong:
piorLastFailed = null; // supplied byte Kevin Heifner, OCI

As this resets the last field ior after rebinding to my original ImR 
reference, when the forwarded reference fails again I am in a loop.

Or if locate_on_bind is true, then the lines that set
piorOriginal = null;
are surely wrong, as they throw away my ImR reference and it isn't reset by 
the forward-because I have just done it. If my forwarded reference is ok to 
start with, but later I need to rebind, I have nothing to rebind to!


I don't think we can fix 1 and 2 properly and change to TIMEOUT - that 
ConnectException is painful, and what if I have a genuine TRANSIENT that my 
servers cant recover from, puts the JacORB client into a loop anyway - so the 
fix has to be to make the Delegate code work, regardless of whether 
locate_on_bind is on or off.


Sorry, it's a bit late - might revisit this tomorrow and tidy up.
Comment 1 Richard Ridgway 2007-11-12 22:38:11 CET
Things I don't understand that prevent me writing a patch I'm confident to 
submit:

1) Why would we ever set piorOriginal=null after we have set it originally.
2) How can we make this work when locate_on_bind is false? Initial forwarding 
is ok, and genuine failover and rebind to new ior is ok, but what happens if I 
have a genuine temporary comms problem, ImR reforwards me to the /same/ 
ior=piorLastFailed again, how do I know not to refuse to go to it... 
Unrecoverable failure is better than infinite loop, but not much.


Comment 2 Richard 2011-11-15 22:33:00 CET
Created attachment 391 [details]
Patch to Delegate class to fix infinite loop bug

Attached a patch that applies to the org.jacorb.orb.Delegate class in the method 'public boolean is_a( org.omg.CORBA.Object self, String logical_type_id ).  It adds a counter to the while loop at the end of the method that causes it to terminate after 5 iterations to prevent an infinite loop.  The patch is based off of the 2.3.0 release of jacorb and not the latest 2.3.1 release, but it's simple enough to apply manually.

It's probably not an ideal solution, but it gave us a quick fix that we needed for this issue.  Submitting it for consideration for the next planned release of jacorb in hopes that either it will be accepted as a fix or that it will prompt a better fix to be found.
Comment 3 Phil Mesnier 2011-12-01 14:45:03 CET
I've added a patch to the Delegate class to address this bug. Since there are several builtin operations that could potentially trigger the same infinite loop, I created a helper function that encapsulates the desired behavior, and applied the fix there. Also, to ensure no change to the status quo, I added a configuration property, jacorb.maxBuiltinRetries, to allow users to specify a limiter. No value set will cause the loop to retry forever. Search for this property in etc/jacorb_properties.template for an example of its use.

Note that this is really an interim patch addressing a symptom of a larger problem. The larger problem is that of errant interaction with at least the TAO ImR and servers attached to the same. Part of that behavior is also captured in bug 896. The proper solution, which will likely render this fix redundant, I believe is sufficiently difficult as to require funding to develop. Please visit the support page of jacorb.org to contact a company to work who can develop a proper solution.