Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support cloud input/output for IntervalListTools #1852

Merged
merged 28 commits into from
Jul 20, 2023
Merged

Support cloud input/output for IntervalListTools #1852

merged 28 commits into from
Jul 20, 2023

Conversation

takutosato
Copy link
Contributor

Description

  • Update Files to PicardHTSPath to support files stored in the cloud

Also see #1804.


Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@takutosato
Copy link
Contributor Author

@cmnbroad @droazen @lbergelson here is the PR as requested. I converted the input Files to PicardHtsPath (output is not updated yet). It compiles but when I run it as follows, I get the google exception "400 Bad Request."

cloud=gs://gcp-public-data--broad-references/hg38/v0/HybSelOligos/whole_exome_illumina_coding_v1/whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list

javadebug -jar ./build/libs/picardcloud.jar IntervalListTools \
ACTION=SUBTRACT \
INPUT=$cloud \
SECOND_INPUT=$cloud \
OUTPUT=test.interval_list

@takutosato takutosato marked this pull request as ready for review February 10, 2023 18:46
@takutosato
Copy link
Contributor Author

@droazen @lbergelson @cmnbroad this PR is ready for review. I left some questions in comments throughout the code. In particular, creating a temp test file/directory was not as natural as for the local case, I would love to hear how you would go about it.

@droazen droazen requested a review from cmnbroad February 13, 2023 17:18
try {
// tsato: fine for now, but maybe I should put the following code inside try.
// do I need to define a new variable here? Giving OUTPUT to assertDirectoryIsWritable leads to not found
Files.createDirectories(OUTPUT.toPath());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the gcs provider does in this case. It looks like a folder isn't a unique entity in gcs (https://cloud.google.com/storage/docs/folders), rather it just defines part of an object's name. So its unclear to me what happens when this is executed within a gcs bucket, or whether the resulting directory is navigable after this.

If its problematic, one option might be to only make this call when the PicardHtsPath URI scheme is "file:" (picardHtsPath.getScheme().equals("file")). Ideally, rather than implementing that logic wherever we need it, we would extend the base IOPath interface with additional methods for creating a directory, resolving a sibling, asserting something is writable, etc., and then add default implementations toHtsPath, PicardHtsPath, and GATKPath, but maybe thats for another day. We can discuss at the next meeting whether we want to do this as part of this PR, or defer, (in which case we might want to choose tools that don't write out folders).

// do I need to define a new variable here? Giving OUTPUT to assertDirectoryIsWritable leads to not found
Files.createDirectories(OUTPUT.toPath());
// tsato: this line below always throws an error even if the directory is there and writable, so temporarily turn this off
// IOUtil.assertDirectoryIsWritable(yo);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above, this issue may be due to bucket/folder semantics.

Files.createDirectory(result.getParent());
return result;
} catch (IOException e){
// tsato: Why is throwing a PicardException not a static error, when throwing an IOException is?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOException is a "checked" exception, meaning the compiler requires you to either handle (catch) it, or else declare that your method throws it (in which case the compiler will apply the same requirement to any methods that call your method). Checked exceptions are any method that does not derive from RuntimeException. PicardException is derived from RuntimeException, and therefore is not subject to the compiler's static requirements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. So the best practice is to either

  1. catch a checked exception, then throw a RuntimeException.
  2. let the checked exception get thrown without catching (and add throws IOException keyword),

and whether we do 1 or 2 depends on the case at hand? I feel like I've done both in the past, without thinking too hard about the pros/cons of one way over the other...

if (!directory.exists() && !directory.mkdir()) {
throw new PicardException("Unable to create directory: " + directory.getAbsolutePath());
private static Path createSubDirectoryAndGetScatterFile(final PicardHtsPath outputDirectory, final long scatterCount, final String formattedIndex) {
final String newFileName = "temp_" + formattedIndex + "_of_" + scatterCount + "/scattered" + FileExtensions.INTERVAL_LIST;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably fine as it is since you're creating a directory name, but just for future reference, you can't use "_" in a name if it's used as a bucket name. Again, probably doesn't matter here since this isn't a bucket, but its worth knowing if you're making up a bucket name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know, thanks

args.add("SCATTER_COUNT=" +scatterCount);
args.add("OUTPUT=" + outputDir.toUri());
Assert.assertEquals(runPicardCommandLine(args), 0);
Assert.assertEquals(Files.list(outputDir).count(), scatterCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should manually verify that Files.list works correctly on a "directory" in a bucket, especially in the case where the "directory has been created, but nothing has been added to it yet. Since it's not really a directory at that point, we might have problems in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it turns out, Files.list works as expected, but Files.createTempDirectory seems to be a no-op. The directories are created when we run IntervalListTools via runPicardCommandLine

private ScatterSummary writeScatterIntervals(final IntervalList list) {
private ScatterSummary writeScatterIntervals(final IntervalList list) {
// tsato: In gatk I would use Utils.validate(); what's the equivalent in Picard? Although I could def live without this particular assert.
assert SCATTER_COUNT > 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use validateArg in ValidationUtils (from htsjdk) to do this kind of thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you

Comment on lines 596 to 598
final Path result = outputDirectory.toPath().resolve(newFileName);
// tsato: https://stackoverflow.com/questions/3634853/how-to-create-a-directory-in-java
Files.createDirectory(result.getParent());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused by this at first. Although I think it works, it would be cleaner to do two separate operations: first make the containing subdirectory, and then use that to derive the file name within that directory:

Suggested change
final Path result = outputDirectory.toPath().resolve(newFileName);
// tsato: https://stackoverflow.com/questions/3634853/how-to-create-a-directory-in-java
Files.createDirectory(result.getParent());
final String subDirName = "temp_" + formattedIndex + "_of_" + scatterCount;
final Path subDir = Files.createDirectory(outputDirectory.toPath().resolve(subDirName));
return subDir.resolve("scattered" + FileExtensions.INTERVAL_LIST);

IOUtil.assertDirectoryIsWritable(OUTPUT);

final ScatterSummary scattered = writeScatterIntervals(output);
final ScatterSummary scattered = writeScatterIntervals(outputIntervals); // tsato: I forget how this all works with try catch vs throws IOException...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments elsewhere on runtime vs checked exceptions.

IOUtil.assertFilesAreReadable(INPUT);
IOUtil.assertFilesAreReadable(SECOND_INPUT);
IOUtil.assertPathsAreReadable(INPUT.stream().map(PicardHtsPath::toPath).collect(Collectors.toList()));
IOUtil.assertPathsAreReadable(SECOND_INPUT.stream().map(PicardHtsPath::toPath).collect(Collectors.toList()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might try using PicardHtsPath.toPaths here, though I just noticed that it returns a list - it should probably return a Collection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much cleaner that way. Thank you

final int scatterCount = 3;
// tsato: vs Files.createTempFile()?
try {
final Path outputDir = Files.createTempDirectory(new PicardHtsPath(CLOUD_DATA_DIR).toPath(), "interval_list_scatter_test");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Louis is working on some test utils that we can use for these kinds of things (creating temp fields/dirs).

@takutosato
Copy link
Contributor Author

@cmnbroad back to you. I ported BucketUtils and associated classes from GATK (but not verbatim—some of it had to be edited to live in Picard), and now the automatic deleting of temp files is working.

I think now we just need to figure out how the tests will run on github with @lbergelson.

private static final String TEST_DATA_DIR = "testdata/picard/util/";
// tsato: place these files in an appropriate cloud bucket
// private static final String CLOUD_DATA_DIR = "gs://broad-dsde-methods-takuto/gatk/test/";
private static final String CLOUD_DATA_DIR = "gs://hellbender/test/resources/picard/intervals/"; // tsato: picard/utils might be better for consistency...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a good place to put test files within gs://hellbender/test/resources/?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SCATTER_COUNT = (int)Math.round((double) listSize / SCATTER_CONTENT);
LOG.info(String.format("Using SCATTER_CONTENT = %d and an interval of size %d, attempting to scatter into %s intervals.", SCATTER_CONTENT, listSize, SCATTER_COUNT));
}

if (OUTPUT != null && SCATTER_COUNT > 1) {

IOUtil.assertDirectoryIsWritable(OUTPUT);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad @droazen I'm removing these directory writable checks because it always throws an error in google cloud (directories 'don't exist' in gcloud…) Let me know if you think it's problematic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems ok to me.

@@ -0,0 +1,160 @@
package picard.nio;
Copy link
Contributor Author

@takutosato takutosato Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad @droazen @lbergelson these classes were "ported" from GATK in order to make temporaryFiles for tests

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato Some minor comments and 1 big one about the cloud jar dependency. @droazen What do you think we should do about that? Give up on separate cloud/non-cloud picards or try to make it work either way?

* @return the resulting {@code Path}
* @throws UserException if an I/O error occurs when creating the file system
*/
public static Path getPath(String uriString) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't really thought about it when we talked before, but bringing these files in adds hard references to the google libraries which will cause runtime errors if these methods are invoked in the non-cloud jar. Are we sure we want to do that?

We either have to remove the references to CloudStorageFileSystem, make the gcloud dependency a hard one instead of optional, or handle the cases where it's missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We make it mandatory is the decision.


/**
*
* Copied from GATK; to be removed once the original GATK code is ported to htsjdk
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should push this to htsjdk as you say here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in The Futuure

SCATTER_COUNT = (int)Math.round((double) listSize / SCATTER_CONTENT);
LOG.info(String.format("Using SCATTER_CONTENT = %d and an interval of size %d, attempting to scatter into %s intervals.", SCATTER_CONTENT, listSize, SCATTER_COUNT));
}

if (OUTPUT != null && SCATTER_COUNT > 1) {

IOUtil.assertDirectoryIsWritable(OUTPUT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems ok to me.

@@ -477,29 +475,30 @@ protected int doWork() {

} else {
if (OUTPUT != null) {
output.write(OUTPUT);
outputIntervals.write(OUTPUT.toPath());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if this supported htspath directly but this seems reasonable given that it doesn't.

}

LOG.info("Produced " + resultIntervals.intervalCount + " intervals totalling " + resultIntervals.baseCount + " bases.");
if (COUNT_OUTPUT != null) {
try (final PrintStream countStream = new PrintStream(COUNT_OUTPUT)) {
try (final PrintStream countStream = new PrintStream(Files.newOutputStream(COUNT_OUTPUT.toPath()))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're better off using COUNT_OUTPUT.getOutputStream() here I believe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson turns out COUNT_OUTPUT.getoutputStream() doesn't produce a file for whatever reason, even though both ways call Files.newOutputStream(COUNT_OUTPUT.toPath())…something subtle going on I think, but I reverted it in the meantime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is very weird. We should investigate what's happening but reverting makes sense.

*
* @param list The list of intervals to scatter
* @return The scattered intervals, represented as a {@link List} of {@link IntervalList}
*/
private ScatterSummary writeScatterIntervals(final IntervalList list) {
private ScatterSummary writeScatterIntervals(final IntervalList list) {
ValidationUtils.validateArg(SCATTER_COUNT > 0, "Scatter count should be a positive integer.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good to include the value in the error message i.e. "Scatter count should be a positive integer but it was " + SCATTER_COUNT + "."

In methods which will be run millions of times you have to be careful about that sort of thing to avoid accidentally wasting cycles on useless too strings, but this is only going to happen ~once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, done

@@ -629,8 +625,8 @@ static IntervalListInputType forFile(final File intervalListExtractable) {
return INTERVAL_LIST;
}

public static IntervalList getIntervalList(final File file, final boolean includeFiltered) {
return forFile(file).getIntervalListInternal(file, includeFiltered);
public static IntervalList getIntervalList(final PicardHtsPath file, final boolean includeFiltered) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably take HtsPath instead of PicardHtsPath so it is more general. Usuall

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@droazen
Copy link
Contributor

droazen commented Jul 13, 2023

@lbergelson I would support giving up on having a "non-cloud" Picard, and just make the gcloud dependency a hard dependency.

@lbergelson
Copy link
Member

I hope everyone enjoys THE ENTIRE GOOGLE ECOSYSTEM. :)

@lbergelson
Copy link
Member

@takutosato If you're not sure how to make the change to the cloud jar we can do it in a follow up PR.

@takutosato
Copy link
Contributor Author

@lbergelson back to you, let me know we need further changes

@lbergelson lbergelson merged commit 6420484 into master Jul 20, 2023
@lbergelson lbergelson deleted the ts_cloud branch July 20, 2023 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants