Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JAVA-3145: Implement some notion of retry logic around call to metadata service for Astra clients #1902

Open
wants to merge 2 commits into
base: 4.x
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
*/
package com.datastax.oss.driver.internal.core.config.cloud;

import com.datastax.oss.driver.api.core.DriverTimeoutException;
import com.datastax.oss.driver.api.core.metadata.EndPoint;
import com.datastax.oss.driver.internal.core.metadata.SniEndPoint;
import com.datastax.oss.driver.internal.core.ssl.SniSslEngineFactory;
Expand Down Expand Up @@ -58,6 +59,8 @@
@ThreadSafe
public class CloudConfigFactory {
private static final Logger LOG = LoggerFactory.getLogger(CloudConfigFactory.class);
private static final int METADATA_RETRY_MAX_ATTEMPTS = 4;
private static final int METADATA_RETRY_INITIAL_DELAY_MS = 250;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: per an earlier conversation @SiyaoIsHiding and I decided to leave these as constants for now. We can move them to configurable values if an explicit need prevents itself (or perhaps as a future enhancement).

/**
* Creates a {@link CloudConfig} with information fetched from the specified Cloud configuration
* URL.
Expand Down Expand Up @@ -225,22 +228,40 @@ protected TrustManagerFactory createTrustManagerFactory(
@NonNull
protected BufferedReader fetchProxyMetadata(
@NonNull URL metadataServiceUrl, @NonNull SSLContext sslContext) throws IOException {
try {
HttpsURLConnection connection = (HttpsURLConnection) metadataServiceUrl.openConnection();
connection.setSSLSocketFactory(sslContext.getSocketFactory());
connection.setRequestMethod("GET");
connection.setRequestProperty("host", "localhost");
return new BufferedReader(
new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
} catch (ConnectException e) {
throw new IllegalStateException(
"Unable to connect to cloud metadata service. Please make sure your cluster is not parked or terminated",
e);
} catch (UnknownHostException e) {
throw new IllegalStateException(
"Unable to resolve host for cloud metadata service. Please make sure your cluster is not terminated",
e);
HttpsURLConnection connection = null;
int attempt = 0;
while(attempt < METADATA_RETRY_MAX_ATTEMPTS){
try {
connection = (HttpsURLConnection) metadataServiceUrl.openConnection();
connection.setSSLSocketFactory(sslContext.getSocketFactory());
connection.setRequestMethod("GET");
connection.setRequestProperty("host", "localhost");
attempt++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a problem logically but I'd argue it's a bit cleaner to increment the number of attempts at the very top of this loop. You can probably do it even before going into the try block honestly.

// if this is the last attempt, throw
// else if the response code is not 200, retry
// else, throw
if (attempt != METADATA_RETRY_MAX_ATTEMPTS && connection.getResponseCode() != 200) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be attempt <= METADATA_RETRY_MAX_ATTEMPTS. You're incrementing "attempt" at the very top of this block so it will always point to the current attempt (i.e. it'll be one on the first attempt, two on the second, etc.). You want to check here that the current attempt is less than or equal to the max number of attempts allowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want it to throw the error at the last attempt. Otherwise, a FileNotFoundException will become a DriverTimeoutException (the dead code at the end).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to make this a bit more granular. For example: 300 errors (3xx) indicate that a resource has been moved (or that we're being redirected someplace else). We should probably follow those if URLConnection doesn't do so already.

4xx errors are all client errors; my guess is that in most cases we won't want to retry these, although we have seen at least one case with a 421 (misdirected request) response which could benefit from a retry. JAVA-3033 also makes reference to a case in which 401 errors can be returned apparently due to a hibernated Astra instance.

5xx errors indicate a failure server-side; these should probably be retried in case there's a transient server error of some kind.

It might be worth changing this to an if test only here and then having a set of if tests underneath that to check for the various error codes; that way you can handle each case independently without making the code too complex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3xx should be redirected as long as the redirection is in the same protocol. How about if (connection.getResponseCode() == 401 || connection.getResponseCode() == 421 || connection.getResponseCode() >= 500), retry?

"a set of if tests underneath that" do you mean extracting the conditions into a function, like bool shouldRetryGivenStatusCode(int)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HttpURLConnection.HTTP_OK is probably preferred here to the literal 200

try {
Thread.sleep(METADATA_RETRY_INITIAL_DELAY_MS);
continue;
} catch (InterruptedException interruptedException) {
Thread.currentThread().interrupt();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need to do an explicit interrupt of the current thread here; that's already happened (in order to get the InterruptedException to you). :)

throw new IOException("Interrupted while waiting to retry metadata fetch", interruptedException);
}
}
return new BufferedReader(
new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know that connection.getInputStream() can throw an IOException under various cases (including some non-200 response codes). Presumably if we add checks above to make sure we only hit this code if we have a 200 response code we can avoid that issue... but any such code is probably a bit more brittle than we'd like. While we're here it may be best to just wrap this line of code in it's own try/catch and either (a) return an IllegalStateException if we get an IOException (like we do in the other catch blocks here) or (b) fold any such error into our retry logic.

Note that this issue is really what JAVA-3033 is about.

} catch (ConnectException e) {
throw new IllegalStateException(
"Unable to connect to cloud metadata service. Please make sure your cluster is not parked or terminated",
e);
} catch (UnknownHostException e) {
throw new IllegalStateException(
"Unable to resolve host for cloud metadata service. Please make sure your cluster is not terminated",
e);
}
}
throw new DriverTimeoutException("Unable to fetch metadata from cloud metadata service"); // dead code
}

@NonNull
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;
import static com.github.tomakehurst.wiremock.client.WireMock.urlPathEqualTo;
import static com.github.tomakehurst.wiremock.core.WireMockConfiguration.wireMockConfig;
import static com.github.tomakehurst.wiremock.stubbing.Scenario.STARTED;
import static org.assertj.core.api.Assertions.catchThrowable;

import com.datastax.oss.driver.internal.core.ssl.SniSslEngineFactory;
Expand Down Expand Up @@ -139,6 +140,27 @@ public void should_throw_when_metadata_not_found() throws Exception {
assertThat(t).isInstanceOf(FileNotFoundException.class).hasMessageContaining("metadata");
}

@Test
public void should_retry_when_metadata_return_non_200() throws Exception {
// given, first attempt 421, subsequent attempts 200
mockHttpSecureBundle(secureBundle());
URL configFile = new URL("http", "localhost", wireMockRule.port(), BUNDLE_PATH);
stubFor(any(urlPathEqualTo("/metadata")).inScenario("retry")
.whenScenarioStateIs(STARTED)
.willReturn(aResponse().withStatus(421))
.willSetStateTo("second-200"));
stubFor(any(urlPathEqualTo("/metadata")).inScenario("retry")
.whenScenarioStateIs("second-200")
.willReturn(aResponse().withStatus(200)
.withHeader("Content-Type", "application/json")
.withBody(jsonMetadata())));
// when
CloudConfigFactory cloudConfigFactory = new CloudConfigFactory();
CloudConfig cloudConfig = cloudConfigFactory.createCloudConfig(configFile);
// then
assertCloudConfig(cloudConfig);
}

@Test
public void should_throw_when_metadata_not_readable() throws Exception {
// given
Expand Down