[Transaction] Fix initTransaction might wait until request timeout #1991
Conversation
```diff
@@ -584,7 +584,7 @@ private Either<Errors, Optional<CoordinatorEpochAndTxnMetadata>> getAndMaybeAddT
         if (loadingPartitions.contains(partitionId)) {
             log.info("TX Coordinator {} partition {} for transactionalId {} is loading",
                     transactionConfig.getTransactionMetadataTopicName(), partitionId, transactionalId);
-            return Either.left(Errors.COORDINATOR_LOAD_IN_PROGRESS);
+            return Either.left(Errors.COORDINATOR_NOT_AVAILABLE);
```
Should we add a comment to let others know why we changed to `COORDINATOR_NOT_AVAILABLE`?
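Something like the following would do it — a sketch of the suggested comment on the changed lines, based on the explanation later in this thread, not the actual committed code:

```java
if (loadingPartitions.contains(partitionId)) {
    log.info("TX Coordinator {} partition {} for transactionalId {} is loading",
            transactionConfig.getTransactionMetadataTopicName(), partitionId, transactionalId);
    // Return COORDINATOR_NOT_AVAILABLE instead of COORDINATOR_LOAD_IN_PROGRESS: the Kafka
    // client handles COORDINATOR_NOT_AVAILABLE by looking up the coordinator again and
    // retrying immediately, while COORDINATOR_LOAD_IN_PROGRESS only reenqueues the request,
    // which can then block until request.timeout.ms expires.
    return Either.left(Errors.COORDINATOR_NOT_AVAILABLE);
}
```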
Maybe it's also related to this change. I think the InitProducerId request should be resent after 1 second when receiving `COORDINATOR_LOAD_IN_PROGRESS`.
Yes, it will wait for the request timeout and then send the next request. I tried to change the …
@gaoran10 Good find! I will investigate it further.
Yes, it's because of the throttle. When any response arrives, the client applies the response's throttle time to the connection:

```java
private void maybeThrottle(AbstractResponse response, short apiVersion, String nodeId, long now) {
    int throttleTimeMs = response.throttleTimeMs();
    if (throttleTimeMs > 0 && response.shouldClientThrottle(apiVersion)) {
        connectionStates.throttle(nodeId, now + throttleTimeMs);
        log.trace("Connection to node {} is throttled for {} ms until timestamp {}", nodeId, throttleTimeMs,
                now + throttleTimeMs);
    }
}
```
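For context on where that throttle value comes from: it is just a field in the response body. Here is a minimal sketch, using Kafka's public message classes (illustrative only, not KoP's actual code), of an InitProducerId response carrying a 1-second throttle:

```java
import org.apache.kafka.common.message.InitProducerIdResponseData;
import org.apache.kafka.common.protocol.Errors;
import org.apache.kafka.common.requests.InitProducerIdResponse;

public class ThrottledResponseSketch {
    static InitProducerIdResponse throttledResponse() {
        InitProducerIdResponseData data = new InitProducerIdResponseData()
                .setThrottleTimeMs(1000) // the 1-second throttle discussed in this thread
                .setErrorCode(Errors.COORDINATOR_LOAD_IN_PROGRESS.code())
                .setProducerId(-1L)
                .setProducerEpoch((short) -1);
        // When the client receives this, maybeThrottle() above calls
        // connectionStates.throttle(nodeId, now + 1000), so the connection
        // is not "ready" for the next second.
        return new InitProducerIdResponse(data);
    }
}
```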
However, the throttle is only 1 second. After debugging with the Kafka client, I found the root cause. I added some logs around `awaitNodeReady`:

```java
log.info("XYZ before awaitNodeReady");
if (!awaitNodeReady(targetNode, coordinatorType)) {
    /* ... */
} else {
    log.trace("XYZ awaitNodeReady done: {} {}", targetNode, coordinatorType);
}
```

and found:
```java
public static boolean awaitReady(KafkaClient client, Node node, Time time, long timeoutMs) throws IOException {
    // timeoutMs is the request timeout, which is 10 seconds in our config.
    if (timeoutMs < 0) {
        throw new IllegalArgumentException("Timeout needs to be greater than 0");
    }
    long startTime = time.milliseconds();

    // isReady checks whether `throttleUntilTimeMs` of the connection is before the current timestamp.
    if (isReady(client, node, startTime) || client.ready(node, startTime))
        return true;

    long attemptStartTime = time.milliseconds();
    while (!client.isReady(node, attemptStartTime) && attemptStartTime - startTime < timeoutMs) {
        // Since the throttle time is 1 second, execution always enters this loop.
        if (client.connectionFailed(node)) {
            throw new IOException("Connection to " + node + " failed.");
        }
        // Since the time spent in the `isReady()` and `client.ready()` calls above is negligible,
        // pollTimeout is very close to `timeoutMs` (10 seconds).
        long pollTimeout = timeoutMs - (attemptStartTime - startTime);
        // Then `poll` blocks for nearly 10 seconds because there is no in-flight request.
        client.poll(pollTimeout, attemptStartTime);
        if (client.authenticationException(node) != null)
            throw client.authenticationException(node);
        attemptStartTime = time.milliseconds();
    }
    return client.isReady(node, attemptStartTime);
}
```

The reason why returning a `COORDINATOR_NOT_AVAILABLE` error avoids this wait is that it makes the client's coordinator unknown, so the next attempt takes the branch below and retries right after finding the coordinator again instead of blocking in `awaitNodeReady`:

```java
} else if (coordinatorType != null) {
    log.trace("Coordinator not known for {}, will retry {} after finding coordinator.", coordinatorType, requestBuilder.apiKey());
    maybeFindCoordinatorAndRetry(nextRequestHandler);
    return true;
}
```
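To see the symptom end to end, here is a minimal, hypothetical reproduction: time `initTransactions()` against a broker whose transaction metadata topic is still loading. Before this fix, the call blocks for roughly `request.timeout.ms`. The bootstrap address and transactional id below are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class InitTxnTiming {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder KoP address
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "txn-timing-demo"); // placeholder id
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 10000); // same as the tests discussed here
        try (KafkaProducer<String, String> producer =
                new KafkaProducer<>(props, new StringSerializer(), new StringSerializer())) {
            long start = System.currentTimeMillis();
            producer.initTransactions(); // blocks while the coordinator topic is loading
            System.out.println("initTransactions took " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}
```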
Codecov Report
```
@@            Coverage Diff            @@
##             master    #1991   +/-   ##
=========================================
  Coverage     17.29%   17.29%
  Complexity      728      728
=========================================
  Files           190      190
  Lines         14041    14038       -3
  Branches       1320     1318       -2
=========================================
  Hits           2428     2428
+ Misses        11437    11434       -3
  Partials        176      176
```
Hi @Demogorgon314 @gaoran10, I just updated my PR according to the explanation above. IMO, it does not make sense to set a 1-second throttle time in the InitProducerId response, so I removed the config.
[Transaction] Fix initTransaction might wait until request timeout (streamnative#1991)

### Motivation

Most of the tests in `TransactionTest` take at least 10 seconds because the `request.timeout.ms` config is 10000. See https://github.com/streamnative/kop/blob/1fd3bdb9158fd944c2fb7a241c0d8dca923367ab/tests/src/test/java/io/streamnative/pulsar/handlers/kop/coordinator/transaction/TransactionTest.java#L1150

It's because KoP loads the metadata topic lazily, unlike Kafka, which loads the metadata topic at startup. KoP adopts this approach because KoP starts before the Pulsar broker finishes starting, and in that case there might be some problems with the topic lookup. Therefore, when the Kafka producer sends the 1st INIT_PRODUCER_ID request, it receives a `COORDINATOR_LOAD_IN_PROGRESS` error and the request is reenqueued [1]; the request then expires after `request.timeout.ms` milliseconds.

```java
} else if (error != Errors.NOT_COORDINATOR && error != Errors.COORDINATOR_NOT_AVAILABLE) {
    if (error != Errors.COORDINATOR_LOAD_IN_PROGRESS && error != Errors.CONCURRENT_TRANSACTIONS) {
        /* ... */
    } else {
        this.reenqueue(); // [1]
    }
} else { // [2]
    TransactionManager.this.lookupCoordinator(CoordinatorType.TRANSACTION, TransactionManager.this.transactionalId);
    this.reenqueue();
}
```

### Modifications

Return `COORDINATOR_NOT_AVAILABLE` instead of `COORDINATOR_LOAD_IN_PROGRESS` when the transaction metadata topic is loading. This differs from Kafka's behavior, but it avoids `initTransaction` waiting 30 seconds (the default request timeout) in KoP.

In addition, configure the request timeout to 3 seconds for the cases where `initTransaction` might expire, so that the test time is reduced.

(cherry picked from commit b645003)
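On the test side, lowering the timeout is just a producer property. A sketch, assuming a transactional test producer (the value mirrors the commit message; the id is a placeholder):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class TxnTestProducerProps {
    static Properties txnProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "test-txn"); // placeholder id
        // 3 seconds instead of 10, so a case where initTransaction might
        // expire fails fast and the test finishes sooner.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 3000);
        return props;
    }
}
```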
Documentation
Check the box below.
Need to update docs?
- `doc-required`
  (If you need help on updating docs, create a doc issue)
- `no-need-doc`
  (Please explain why)
- `doc`
  (If this PR contains doc changes)