-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor code around cloud support #1975
base: master
Are you sure you want to change the base?
Conversation
@cmnbroad will you review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tsato The PicardHtsPath
to IOPath
changes seem to be mostly ok (although there are other places where the change can be made), but there are a few other problems that I didn't even notice before, the biggest one being that we need to reconcile GCloudTestUtils
with PicardBucketUtils
. (I didn't realize, for instance, that GCloudTestUtils
has support for a custom staging area specified via the environment, which PicardBucketUtils
completely ignores.) I know thats probably a bigger issue than you were trying to solve here - we should address that in a separate PR (again, I'm happy to do that if you want), but I really think we need to resolve these issues before we do any more cloudification.
@@ -182,7 +182,7 @@ public static PicardHtsPath replaceExtension(final IOPath path, final String new | |||
/** | |||
* Wrapper for Path.resolve() | |||
*/ | |||
public static PicardHtsPath resolve(final PicardHtsPath absPath, final String relativePath){ | |||
public static PicardHtsPath resolve(final IOPath absPath, final String relativePath){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the return value for this should also be IOPath, unless there is some reason that would cause problems. (Ideally this would be a method on IOPath, but since it has to return a new object of the subclass type, it can't live there, unless we add some kind of abstract "toSelf" method that each subclass has to implement. For another day).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and I see your point. Another day sounds good.
return scheme != null | ||
&& (scheme.equals(GOOGLE_CLOUD_STORAGE_FILESYSTEM_SCHEME) | ||
|| scheme.equals(HTTP_FILESYSTEM_PROVIDER_SCHEME) | ||
|| scheme.equals(HTTPS_FILESYSTEM_PROVIDER_SCHEME)); | ||
} | ||
|
||
public static boolean isLocalPath(final IOPath path){ | ||
return path.getScheme().equals(FILE_SCHEME); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should really be an instance method on PicardHtsPath
for now (ultimately, really it should be on IOPath - I may create a PR to add it to the IOPath
interface in htsjdk).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to happen in IOPath, because all the objects that would call the instance method isLocalPath have type IOPath, not PicardHtsPath. If you could add it to htsjdk, that would be great.
PicardIOUtils.deleteOnExit(PicardHtsPath.replaceExtension(path, ".md5", true).toPath()); | ||
return path; | ||
} else { | ||
throw new PicardException("Unsupported cloud filesystem: " + directory.getURIString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isSupportedCloudFileSystem
currently throws (I think it should not - see my comments there). If you keep this code, I suggest it to IllegalArgumentException
(I'm still not sure what the intended use of PicardException is in this codebase, but IllegalArgumentException
would seem more correct to me), and changing isSupportedFileSystem
to just return a boolean and never throw.
@@ -61,7 +62,7 @@ default Path getReferencePath(){ | |||
* | |||
* @return The reference provided by the user, if any, or the default, if any, as a PicardHtsPath. May be null. | |||
*/ | |||
default PicardHtsPath getHtsPath(){ | |||
default IOPath getHtsPath(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this method should probably change to reflect the new return type , i.e., getIOPath
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
private static boolean isHadoopUrl(String path) { | ||
return path.startsWith(HDFS_SCHEME + "://"); | ||
public static boolean isSupportedCloudFilesystem(final IOPath path){ | ||
ValidationUtils.validateArg(! isLocalPath(path), "isSupportedCloudFilesystem should be called on a cloud path but was given: " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't actually think this method should be retained (or adds much value), for several reasons.
First, enumerating all of the supported protocol schemes may become difficult as we add additional cloud providers (I think azure, for example, uses multiple schemes). Also, what schemes will work may depend on what additional providers the user has on their classpath.
Having said that, if you're going to keep it, it should be moved to PicardHtsPath, and a test added, with test cases at least for file
, http
, hdfs
, s3
, and gcs
schemes.
Finally, for a predicate method like this, where you're just testing the protocol scheme, it should be pure and not have a side effect like throwing an exception (with the possible exception of requiring that the input is non-null, since the code can't recover from that). I don't see any reason to require the caller to first test if the path has a cloud scheme to prevent it from throwing. It should just return a boolean if the scheme is a supported cloud scheme, and let the caller decide when to throw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing sounds good
public static final String GCLOUD_PICARD_STAGING_DIRECTORY = "gs://hellbender-test-logs/staging/picard/"; | ||
public static final String GCLOUD_PICARD_STAGING_DIRECTORY_STR = "gs://hellbender-test-logs/staging/picard/"; | ||
public static final PicardHtsPath GCLOUD_PICARD_STAGING_DIRECTORY = new PicardHtsPath(GCLOUD_PICARD_STAGING_DIRECTORY_STR); | ||
|
||
|
||
// slashes omitted since hdfs paths seem to only have 1 slash which would be weirder to include than no slashes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git blame seems to implicate me on this comment, but I have no clue what it means ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a remnant from when we copied over code from GATK. Removed.
@@ -289,7 +290,7 @@ final Path makeTempDictionary(final Path inputFasta, final String dictNamePrefix | |||
final PicardHtsPath HG19_MINI = PicardHtsPath.resolve(GCloudTestUtils.TEST_INPUTS_DEFAULT_GCLOUD, "picard/references/hg19mini.fasta"); | |||
final PicardHtsPath HG19_MINI_LOCAL = new PicardHtsPath("testdata/picard/reference/hg19mini.fasta"); | |||
|
|||
final PicardHtsPath CLOUD_OUTPUT_DIR = PicardHtsPath.resolve(GCloudTestUtils.TEST_STAGING_DEFAULT_GCLOUD, "picard/"); | |||
final PicardHtsPath CLOUD_OUTPUT_DIR = PicardHtsPath.resolve(GCloudTestUtils.TEST_STAGING_DEFAULT_GCLOUD, "picard/CreateSequenceDictionary/"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should be consistent with the types of these test constants. Elsewhere in this PR, you changed some of them to have the IOPath
type, which is good. But most of these remaining PicardHtsPath
constants can also be changed (of course there may be exceptions, such as if you need to subsequently call a method that is only on to PicardHtsPath
, but we should try to minimize the number of those).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I changed them to IOPath
return new PicardHtsPath(PicardIOUtils.createTempFileInDirectory(prefix, extension, new File(directory))); | ||
return new PicardHtsPath(PicardIOUtils.createTempFileInDirectory(prefix, extension, directory.toPath().toFile())); | ||
} else { | ||
if (isSupportedCloudFilesystem(directory)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the right way to branch here is to invert the logic and test for a file
protocol scheme. See my comments on isSupportedCloudFileSystem
.
ValidationUtils.validateArg(relativePath.endsWith("/"), "relativePath must end in backslash '/': " + relativePath); | ||
|
||
return PicardHtsPath.fromPath(PicardBucketUtils.randomRemotePath(GCLOUD_PICARD_STAGING_DIRECTORY + relativePath, "", "/")); | ||
return PicardBucketUtils.randomRemotePath(PicardHtsPath.resolve(GCLOUD_PICARD_STAGING_DIRECTORY, relativePath), "", "/"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every direct reference to GCLOUD_PICARD_STAGING_DIRECTORY
like this is bypassing the GCloudUtils
methods such as getTestStaging
that check the environment to determine if the user has specified an override staging area to use.
public static final String FILE_SCHEME = "file"; | ||
|
||
// This Picard test staging bucket has a TTL of 180 days (DeleteAction with Age = 180) | ||
public static final String GCLOUD_PICARD_STAGING_DIRECTORY = "gs://hellbender-test-logs/staging/picard/"; | ||
public static final String GCLOUD_PICARD_STAGING_DIRECTORY_STR = "gs://hellbender-test-logs/staging/picard/"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These really need to be reconciled with the ones in GCloudTestUtils
, where there is a redundant definition of the staging area root gs://hellbender-test-logs/staging/
. There really should be only one definition, and all direct usages of this constant should be replaced with calls to the methods in GCloudTestUtils
that interrogate where the user wants the staging area to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. I think it should be handled in a separate PR, if that's alright with you.
Description
As requested by @cmnbroad. The changes include:
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests