Test Environment:
The environment is HP-UX B11.31 ia64, java 184.108.40.206 and jacorb 2.3.1 (or jacorb 3.0 beta)
The same bug can be noticed in jacorb 3.0 also.
The bug occurs only for CORBA servers that use the PERSISTENT lifespan policy value. TRANSIENT CORBA servers work fine.
Around 1000 client threads are created simultaneously to call two different interface methods (500 threads for each method),
each belonging to a different PERSISTENT CORBA server.
The methods are dummy methods that just sleep for 4 seconds.
Two problems were noticed.
1. Some methods are executed more than once. While the client logs show that a method was invoked only 500 times,
the server logs show that it was executed more than 500 times, sometimes around 510 and at most about 600 times.
This problem is CONSISTENTLY noticed in both jacorb 2.3.1 and jacorb 3.0 beta.
2. Not all of the calls completed successfully. A few client calls (around 1 to 3) failed with the following error:
Unexpected Exception:org.omg.CORBA.COMM_FAILURE: vmcid: 0x0 minor code: 0 completed: Maybe
org.omg.CORBA.COMM_FAILURE: vmcid: 0x0 minor code: 0 completed: Maybe
The above problem is NOT consistently noticed in jacorb 2.3.1; it shows up approximately once in every 2 or 3 runs.
This problem is CONSISTENTLY noticed in jacorb 3.0 beta.
Created attachment 395
Contains all the files for testing the bugs. BugDescription.txt has all the details about the bug, the source files involved and the test procedure.
Here's what is happening to cause this problem.
The client is firing off 100 concurrent requests, one per thread.
Several of these go off to the IMR, which dutifully sends back location forwarding details.
The Delegate processes the location forward, rebinding and "resetting" all waiting requests, because these must all be waiting on the IMR, right? Unfortunately in this case, this is not right.
Upon receiving the first location forward, the Delegate rebinds, which releases the connection to the IMR and opens a new connection to the server. Since there were several requests sent to the IMR, multiple location forward exceptions are received from the IMR before the connection is released, and each one triggers a reset/rebind.
Since the machine is loaded, it takes time to actually spawn all 100 threads, so some requests are started after the first rebind has occurred and are therefore sent directly to the real server. When the second rebind occurs, all of these threads dutifully stop waiting, remarshal and resend. Depending on the system load, this can happen several times, leading to 3 or more duplications of a request. This effect can impact many threads, leading to dozens of duplicated requests sent to the server.
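To make the sequence above concrete, here is a small deterministic model of it. This is only a sketch and not JacORB code: the class NaiveRebindModel, its methods and the four-request scenario are all made up for illustration, and it assumes the server is still executing each request (the 4 second sleep) when the later location forwards arrive.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the behaviour described above; not JacORB's Delegate.
final class NaiveRebindModel {
    enum Target { IMR, SERVER }

    private Target binding = Target.IMR;                        // where new requests go
    private final Set<Integer> pending = new LinkedHashSet<>(); // requests awaiting a reply
    private int serverExecutions = 0;                           // what the server log would count

    // Marshal and send a request to whatever the shared reference is bound to.
    void send(int requestId) {
        pending.add(requestId);
        if (binding == Target.SERVER) {
            serverExecutions++; // the real server starts executing the method
        }
    }

    // A location forward reply from the IMR arrives: rebind and reset
    // EVERY waiting request, assuming they were all waiting on the IMR.
    void onLocationForward() {
        binding = Target.SERVER;
        for (Integer id : new ArrayList<>(pending)) {
            send(id); // remarshal and resend, even if already sent to the server
        }
    }

    // Three requests reach the IMR, a fourth thread starts late.
    static int runScenario() {
        NaiveRebindModel client = new NaiveRebindModel();
        client.send(1);
        client.send(2);
        client.send(3);
        client.onLocationForward(); // forward for request 1: first rebind
        client.send(4);             // late thread, goes straight to the server
        client.onLocationForward(); // forward for request 2: everything resent
        client.onLocationForward(); // forward for request 3: everything resent again
        return client.serverExecutions;
    }

    public static void main(String[] args) {
        System.out.println("server executions for 4 logical requests: " + runScenario());
    }
}
```

In this model the four logical requests end up as 12 executions on the server. The real numbers depend on timing and on how many replies complete between forwards, so the model only illustrates the mechanism, not the 510 to 600 counts from the report.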
Now for the solution. The easiest fix is to do nothing to JacORB, and restructure the client application so that a connection to the real server is established prior to starting any MT activity. However, this breaks down if, for instance, the client doesn't obtain the target IOR until it has already entered its MT operational phase, or if the connection to the server is lost and must be reestablished.
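A rough sketch of that client restructuring follows. It deliberately avoids any CORBA types so it runs on its own; WarmUpBeforeThreads and its run method are hypothetical names. In a real client the warm-up would be some cheap remote call on the shared reference, for example the standard _non_existent() operation, on the assumption that this call makes the client follow the IMR's location forward and bind to the real server before any worker thread starts.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: one warm-up call on the calling thread, then the MT phase.
final class WarmUpBeforeThreads {

    // Returns the number of worker calls that completed.
    static int run(Runnable warmUp, Runnable call, int threads) {
        // Single-threaded phase: in a CORBA client this is where the shared
        // reference would be bound to the real server (assumption).
        warmUp.run();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    start.await();   // release all workers at the same moment
                    call.run();
                    completed.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        AtomicInteger warmUps = new AtomicInteger();
        int done = run(warmUps::incrementAndGet, () -> { }, 8);
        System.out.println(warmUps.get() + " warm-up call, " + done + " worker calls");
    }
}
```

The point of the structure is only the ordering: no worker can issue a request until the warm-up call has returned.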
I think the durable solution is to be more intelligent about when to abandon a request. This would involve modifying the ReplyReceiver and the Delegate so that the RR can decide if it truly needs to abandon the current request or not.
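One way to picture "more intelligent" is to have each waiting request remember where it was sent, and only abandon it if that target is the one being replaced. The sketch below is an assumption about the shape of such a fix, not the actual ReplyReceiver or Delegate change; SelectiveRebindModel and its methods are hypothetical, and the scenario is three requests sent to the IMR plus one late request.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a selective reset; not JacORB's ReplyReceiver/Delegate.
final class SelectiveRebindModel {
    enum Target { IMR, SERVER }

    private Target binding = Target.IMR;
    // Each pending request remembers the target it was last sent to.
    private final Map<Integer, Target> pending = new LinkedHashMap<>();
    private int serverExecutions = 0;

    void send(int requestId) {
        pending.put(requestId, binding);
        if (binding == Target.SERVER) {
            serverExecutions++; // the real server starts executing the method
        }
    }

    // On a location forward, resend only the requests still waiting on the IMR.
    void onLocationForward() {
        binding = Target.SERVER;
        List<Integer> stale = new ArrayList<>();
        for (Map.Entry<Integer, Target> entry : pending.entrySet()) {
            if (entry.getValue() == Target.IMR) {
                stale.add(entry.getKey());
            }
        }
        for (int id : stale) {
            send(id); // requests already at the server are left alone
        }
    }

    // Three requests reach the IMR, a fourth thread starts late.
    static int runScenario() {
        SelectiveRebindModel client = new SelectiveRebindModel();
        client.send(1);
        client.send(2);
        client.send(3);
        client.onLocationForward(); // first forward: requests 1-3 move to the server
        client.send(4);             // late thread, goes straight to the server
        client.onLocationForward(); // later forwards find nothing stale
        client.onLocationForward();
        return client.serverExecutions;
    }

    public static void main(String[] args) {
        System.out.println("server executions for 4 logical requests: " + runScenario());
    }
}
```

With the per-request check, the four logical requests are executed four times in this model, no matter how many additional location forwards arrive from the IMR.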
I'm stepping down the severity from Blocker to Major in that the condition under which the bug appears is very narrowly defined - it doesn't happen everywhere even under similar loads. Also, there is a work-around defined which mitigates the likelihood of occurrence even more.
We have another work-around for this bug. Each client thread can make its own duplicate of the desired reference, and then any rebinding and notification will affect only the thread(s) using the duplicate. This is not a perfect solution however, since each duplicate must first interact with the IMR in order to get the target server's reference. Depending on the performance of the client host, this may result in many consecutive connections to the IMR, and possibly the target server as well, depending on reference lifespan.
To test this, modify the narrow call in the original interface1_test.java so that each thread narrows its own duplicate of the reference:
test_interface1 target = test_interface1Helper.narrow (servantRef._duplicate());
The interface2_test can be similarly modified.
Thanks Nick Cross for this suggestion.