-
Notifications
You must be signed in to change notification settings - Fork 371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support cloud input/output for IntervalListTools #1852
Conversation
@cmnbroad @droazen @lbergelson here is the PR as requested. I converted the input Files to PicardHtsPath (output is not updated yet). It compiles but when I run it as follows, I get the google exception "400 Bad Request." cloud=gs://gcp-public-data--broad-references/hg38/v0/HybSelOligos/whole_exome_illumina_coding_v1/whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list
javadebug -jar ./build/libs/picardcloud.jar IntervalListTools \
ACTION=SUBTRACT \
INPUT=$cloud \
SECOND_INPUT=$cloud \
OUTPUT=test.interval_list |
@droazen @lbergelson @cmnbroad this PR is ready for review. I left some questions in comments throughout the code. In particular, creating a temp test file/directory was not as natural as for the local case, I would love to hear how you would go about it. |
try { | ||
// tsato: fine for now, but maybe I should put the following code inside try. | ||
// do I need to define a new variable here? Giving OUTPUT to assertDirectoryIsWritable leads to not found | ||
Files.createDirectories(OUTPUT.toPath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the gcs provider does in this case. It looks like a folder isn't a unique entity in gcs (https://cloud.google.com/storage/docs/folders), rather it just defines part of an object's name. So its unclear to me what happens when this is executed within a gcs bucket, or whether the resulting directory is navigable after this.
If its problematic, one option might be to only make this call when the PicardHtsPath
URI scheme is "file:" (picardHtsPath.getScheme().equals("file")
). Ideally, rather than implementing that logic wherever we need it, we would extend the base IOPath
interface with additional methods for creating a directory, resolving a sibling, asserting something is writable, etc., and then add default implementations toHtsPath
, PicardHtsPath
, and GATKPath
, but maybe thats for another day. We can discuss at the next meeting whether we want to do this as part of this PR, or defer, (in which case we might want to choose tools that don't write out folders).
// do I need to define a new variable here? Giving OUTPUT to assertDirectoryIsWritable leads to not found | ||
Files.createDirectories(OUTPUT.toPath()); | ||
// tsato: this line below always throws an error even if the directory is there and writable, so temporarily turn this off | ||
// IOUtil.assertDirectoryIsWritable(yo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above, this issue may be due to bucket/folder semantics.
Files.createDirectory(result.getParent()); | ||
return result; | ||
} catch (IOException e){ | ||
// tsato: Why is throwing a PicardException not a static error, when throwing an IOException is? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IOException
is a "checked" exception, meaning the compiler requires you to either handle (catch) it, or else declare that your method throws it (in which case the compiler will apply the same requirement to any methods that call your method). Checked exceptions are any method that does not derive from RuntimeException
. PicardException
is derived from RuntimeException
, and therefore is not subject to the compiler's static requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. So the best practice is to either
- catch a checked exception, then throw a RuntimeException.
- let the checked exception get thrown without catching (and add
throws IOException
keyword),
and whether we do 1 or 2 depends on the case at hand? I feel like I've done both in the past, without thinking too hard about the pros/cons of one way over the other...
if (!directory.exists() && !directory.mkdir()) { | ||
throw new PicardException("Unable to create directory: " + directory.getAbsolutePath()); | ||
private static Path createSubDirectoryAndGetScatterFile(final PicardHtsPath outputDirectory, final long scatterCount, final String formattedIndex) { | ||
final String newFileName = "temp_" + formattedIndex + "_of_" + scatterCount + "/scattered" + FileExtensions.INTERVAL_LIST; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably fine as it is since you're creating a directory name, but just for future reference, you can't use "_" in a name if it's used as a bucket name. Again, probably doesn't matter here since this isn't a bucket, but its worth knowing if you're making up a bucket name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know, thanks
args.add("SCATTER_COUNT=" +scatterCount); | ||
args.add("OUTPUT=" + outputDir.toUri()); | ||
Assert.assertEquals(runPicardCommandLine(args), 0); | ||
Assert.assertEquals(Files.list(outputDir).count(), scatterCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should manually verify that Files.list
works correctly on a "directory" in a bucket, especially in the case where the "directory has been created, but nothing has been added to it yet. Since it's not really a directory at that point, we might have problems in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it turns out, Files.list
works as expected, but Files.createTempDirectory
seems to be a no-op. The directories are created when we run IntervalListTools via runPicardCommandLine
private ScatterSummary writeScatterIntervals(final IntervalList list) { | ||
private ScatterSummary writeScatterIntervals(final IntervalList list) { | ||
// tsato: In gatk I would use Utils.validate(); what's the equivalent in Picard? Although I could def live without this particular assert. | ||
assert SCATTER_COUNT > 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use validateArg
in ValidationUtils
(from htsjdk) to do this kind of thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thank you
final Path result = outputDirectory.toPath().resolve(newFileName); | ||
// tsato: https://stackoverflow.com/questions/3634853/how-to-create-a-directory-in-java | ||
Files.createDirectory(result.getParent()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was confused by this at first. Although I think it works, it would be cleaner to do two separate operations: first make the containing subdirectory, and then use that to derive the file name within that directory:
final Path result = outputDirectory.toPath().resolve(newFileName); | |
// tsato: https://stackoverflow.com/questions/3634853/how-to-create-a-directory-in-java | |
Files.createDirectory(result.getParent()); | |
final String subDirName = "temp_" + formattedIndex + "_of_" + scatterCount; | |
final Path subDir = Files.createDirectory(outputDirectory.toPath().resolve(subDirName)); | |
return subDir.resolve("scattered" + FileExtensions.INTERVAL_LIST); |
IOUtil.assertDirectoryIsWritable(OUTPUT); | ||
|
||
final ScatterSummary scattered = writeScatterIntervals(output); | ||
final ScatterSummary scattered = writeScatterIntervals(outputIntervals); // tsato: I forget how this all works with try catch vs throws IOException... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comments elsewhere on runtime vs checked exceptions.
IOUtil.assertFilesAreReadable(INPUT); | ||
IOUtil.assertFilesAreReadable(SECOND_INPUT); | ||
IOUtil.assertPathsAreReadable(INPUT.stream().map(PicardHtsPath::toPath).collect(Collectors.toList())); | ||
IOUtil.assertPathsAreReadable(SECOND_INPUT.stream().map(PicardHtsPath::toPath).collect(Collectors.toList())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might try using PicardHtsPath.toPaths
here, though I just noticed that it returns a list - it should probably return a Collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much cleaner that way. Thank you
final int scatterCount = 3; | ||
// tsato: vs Files.createTempFile()? | ||
try { | ||
final Path outputDir = Files.createTempDirectory(new PicardHtsPath(CLOUD_DATA_DIR).toPath(), "interval_list_scatter_test"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Louis is working on some test utils that we can use for these kinds of things (creating temp fields/dirs).
@cmnbroad back to you. I ported BucketUtils and associated classes from GATK (but not verbatim—some of it had to be edited to live in Picard), and now the automatic deleting of temp files is working. I think now we just need to figure out how the tests will run on github with @lbergelson. |
private static final String TEST_DATA_DIR = "testdata/picard/util/"; | ||
// tsato: place these files in an appropriate cloud bucket | ||
// private static final String CLOUD_DATA_DIR = "gs://broad-dsde-methods-takuto/gatk/test/"; | ||
private static final String CLOUD_DATA_DIR = "gs://hellbender/test/resources/picard/intervals/"; // tsato: picard/utils might be better for consistency... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a good place to put test files within gs://hellbender/test/resources/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SCATTER_COUNT = (int)Math.round((double) listSize / SCATTER_CONTENT); | ||
LOG.info(String.format("Using SCATTER_CONTENT = %d and an interval of size %d, attempting to scatter into %s intervals.", SCATTER_CONTENT, listSize, SCATTER_COUNT)); | ||
} | ||
|
||
if (OUTPUT != null && SCATTER_COUNT > 1) { | ||
|
||
IOUtil.assertDirectoryIsWritable(OUTPUT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems ok to me.
@@ -0,0 +1,160 @@ | |||
package picard.nio; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad @droazen @lbergelson these classes were "ported" from GATK in order to make temporaryFiles for tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takutosato Some minor comments and 1 big one about the cloud jar dependency. @droazen What do you think we should do about that? Give up on separate cloud/non-cloud picards or try to make it work either way?
* @return the resulting {@code Path} | ||
* @throws UserException if an I/O error occurs when creating the file system | ||
*/ | ||
public static Path getPath(String uriString) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't really thought about it when we talked before, but bringing these files in adds hard references to the google libraries which will cause runtime errors if these methods are invoked in the non-cloud jar. Are we sure we want to do that?
We either have to remove the references to CloudStorageFileSystem, make the gcloud dependency a hard one instead of optional, or handle the cases where it's missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We make it mandatory is the decision.
|
||
/** | ||
* | ||
* Copied from GATK; to be removed once the original GATK code is ported to htsjdk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should push this to htsjdk as you say here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in The Futuure
SCATTER_COUNT = (int)Math.round((double) listSize / SCATTER_CONTENT); | ||
LOG.info(String.format("Using SCATTER_CONTENT = %d and an interval of size %d, attempting to scatter into %s intervals.", SCATTER_CONTENT, listSize, SCATTER_COUNT)); | ||
} | ||
|
||
if (OUTPUT != null && SCATTER_COUNT > 1) { | ||
|
||
IOUtil.assertDirectoryIsWritable(OUTPUT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems ok to me.
@@ -477,29 +475,30 @@ protected int doWork() { | |||
|
|||
} else { | |||
if (OUTPUT != null) { | |||
output.write(OUTPUT); | |||
outputIntervals.write(OUTPUT.toPath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if this supported htspath directly but this seems reasonable given that it doesn't.
} | ||
|
||
LOG.info("Produced " + resultIntervals.intervalCount + " intervals totalling " + resultIntervals.baseCount + " bases."); | ||
if (COUNT_OUTPUT != null) { | ||
try (final PrintStream countStream = new PrintStream(COUNT_OUTPUT)) { | ||
try (final PrintStream countStream = new PrintStream(Files.newOutputStream(COUNT_OUTPUT.toPath()))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're better off using COUNT_OUTPUT.getOutputStream() here I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbergelson turns out COUNT_OUTPUT.getoutputStream()
doesn't produce a file for whatever reason, even though both ways call Files.newOutputStream(COUNT_OUTPUT.toPath())
…something subtle going on I think, but I reverted it in the meantime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is very weird. We should investigate what's happening but reverting makes sense.
* | ||
* @param list The list of intervals to scatter | ||
* @return The scattered intervals, represented as a {@link List} of {@link IntervalList} | ||
*/ | ||
private ScatterSummary writeScatterIntervals(final IntervalList list) { | ||
private ScatterSummary writeScatterIntervals(final IntervalList list) { | ||
ValidationUtils.validateArg(SCATTER_COUNT > 0, "Scatter count should be a positive integer."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good to include the value in the error message i.e. "Scatter count should be a positive integer but it was " + SCATTER_COUNT + "."
In methods which will be run millions of times you have to be careful about that sort of thing to avoid accidentally wasting cycles on useless too strings, but this is only going to happen ~once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, done
@@ -629,8 +625,8 @@ static IntervalListInputType forFile(final File intervalListExtractable) { | |||
return INTERVAL_LIST; | |||
} | |||
|
|||
public static IntervalList getIntervalList(final File file, final boolean includeFiltered) { | |||
return forFile(file).getIntervalListInternal(file, includeFiltered); | |||
public static IntervalList getIntervalList(final PicardHtsPath file, final boolean includeFiltered) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably take HtsPath instead of PicardHtsPath so it is more general. Usuall
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@lbergelson I would support giving up on having a "non-cloud" Picard, and just make the gcloud dependency a hard dependency. |
I hope everyone enjoys THE ENTIRE GOOGLE ECOSYSTEM. :) |
@takutosato If you're not sure how to make the change to the cloud jar we can do it in a follow up PR. |
@lbergelson back to you, let me know we need further changes |
Description
Also see #1804.
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests