Extract metadata from NetCDF and HDF5 files as XML in NcML format #9153

Closed · pdurbin opened this issue Nov 9, 2022 · 8 comments · Fixed by #9239
Labels: pm.netcdf-hdf5.d (All 3 aims are currently under this deliverable) · Size: 80 (A percentage of a sprint. 56 hours.)

pdurbin commented Nov 9, 2022

Assuming PR #9152 is merged, we'll have a library in place to start extracting XML from NetCDF and HDF5 files.

The supported XML format is called NcML and is described here: https://docs.unidata.ucar.edu/netcdf-java/current/userguide/ncml_overview.html

Yesterday there was general agreement among devs that it would be fine to save the XML as a derivative or aux file.

This will open the door for previewing the file as raw XML to start.

Additionally, we could work on creating a dedicated previewer that shows the data in a nicer way than raw XML.

The code we write will look something like this:

String ncml = netcdfFile.toNcml(file.getName());
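
To make that concrete, here's a minimal sketch of the whole extraction flow, assuming netcdf-java 5.x (NetcdfFiles.open is its factory method; toNcml is the call quoted above):

import java.io.File;
import java.io.IOException;
import ucar.nc2.NetcdfFile;
import ucar.nc2.NetcdfFiles;

public class NcmlExtractor {

    /**
     * Returns the file's structural metadata (dimensions, variables,
     * attributes) serialized as NcML, or null if netcdf-java can't open
     * the file (which is the case for some HDF5 files).
     */
    public static String extractNcml(File file) {
        try (NetcdfFile netcdfFile = NetcdfFiles.open(file.getAbsolutePath())) {
            // The name passed in becomes the "location" attribute in the NcML.
            return netcdfFile.toNcml(file.getName());
        } catch (IOException ex) {
            return null;
        }
    }
}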

Here's the output for an HDF5 file at src/test/resources/hdf/hdf5/vlen_string_dset (from the PR above):

<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="file:vlen_string_dset">
  <variable name="DS1" shape="4" type="String" />
</netcdf>

Here's part of the output for a NetCDF file at src/test/resources/netcdf/madis-raob.nc (also from the PR above):

<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="file:madis-raob.nc">
  <dimension name="recNum" length="1" isUnlimited="true" />
  <dimension name="manLevel" length="22" />
  <dimension name="sigTLevel" length="150" />
  <dimension name="sigWLevel" length="76" />
  <dimension name="mWndNum" length="4" />
  <dimension name="mTropNum" length="4" />
  <dimension name="staNameLen" length="50" />
  <dimension name="QCcheckNum" length="10" />
  <dimension name="QCcheckNameLen" length="60" />
  <dimension name="maxStaticIds" length="1000" />
  <dimension name="totalIdLen" length="50" />
  <dimension name="nInventoryBins" length="32" />
  <variable name="nStaticIds" shape="" type="int">
    <attribute name="_FillValue" type="int" value="0" />
  </variable>
  <variable name="staticIds" shape="maxStaticIds totalIdLen" type="char">
    <attribute name="_FillValue" value="" />
  </variable>
  <variable name="lastRecord" shape="maxStaticIds" type="int">
    <attribute name="_FillValue" type="int" value="-1" />
  </variable>
  <variable name="invTime" shape="recNum" type="int">
    <attribute name="_FillValue" type="int" value="0" />
  </variable>
...

Here's the full XML/NcML output: madis-ncml.xml.txt


2.5 years ago, @qqmyers made some suggestions for previewing XML files at IQSS/dataverse.harvard.edu#70 (comment). Here's his comment:

"FWIW: Something like https://www.jqueryscript.net/other/tree-xml-viewer-formatter.html adapted with the wiki instructions at https://github.com/GlobalDataverseCommunityConsortium/dataverse-previewers/wiki/How-to-create-a-previewer might be a quick win. (I didn't search too hard for an XML viewer - there could be better libraries out there to start from.)"

pdurbin commented Dec 7, 2022

Just a quick update that I have some uncommitted code locally that extracts an NcML file (XML) from a NetCDF file and saves it as an auxiliary file:

[Screenshot: Screen Shot 2022-12-07 at 2.55.56 PM]

The code is very hacky and probably doesn't work with S3. I've been chatting with @landreev about the best place to put it. Probably a dedicated method that gets called right after ingestService.startIngestJobsForDataset and put in the same bean.
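
A rough sketch of where that hook might sit (extractNcmlAuxFile is a placeholder name for the not-yet-written method, not an existing API):

// In the same bean, right after the ingest jobs are kicked off.
// startIngestJobsForDataset is the existing method mentioned above;
// extractNcmlAuxFile is hypothetical and would save the NcML as an aux file.
ingestService.startIngestJobsForDataset(dataset, user);
ingestService.extractNcmlAuxFile(dataset);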

pdurbin commented Dec 9, 2022

I cleaned up the code a bit, added a lot of TODOs, and pushed it to a branch so that @landreev and others can take a look: 711dc63

mreekie added Size: 80 (A percentage of a sprint. 56 hours.) and removed Size: 30 (A percentage of a sprint. 21 hours. (formerly size:33)) labels Dec 14, 2022

mreekie commented Dec 14, 2022

Discussion/Notes
Leonid - When we expand this system to extract other types of metadata from files, we may consider moving this to an asynchronous queue. But for now it's out of scope.

Part of what's left is deciding if it's a goal to support direct upload.

When you don't do direct upload, there is a temporary file.
With direct upload, the browser uploads to S3 directly, so there is no temporary file in the interim.

Two PRs will come out of this: one for the previewer, one for Dataverse.

This is hard coded right now. It doesn't use a queue such as JMS (which we use for file ingest).
Does not support direct upload.

Some installations will care about these formats; some won't. Is this the beginning of a new class of plugins?

Size for this sprint was a 33.
What's left is sized at an 80.

TODO: Phil to put in a sentence or two laying out the scope of this work better than what Mike did.
TODO: Sounds like we need a follow-on investigation into the bigger picture, which is outside the scope of this issue.


mreekie commented Dec 14, 2022

added to sprint Dec 15, 2022

pdurbin commented Dec 14, 2022

I demoed this during Tuesday's community call, but here's an updated screenshot. The main difference is that I'm putting NcML in its own category in the UI ("XML from NetCDF/HDF5 (NcML)") instead of "Other":

[Screenshot: Screen Shot 2022-12-14 at 3.35.50 PM]

@mreekie captured a lot of the discussion from standup this morning (thanks!). I'll just add a bit about scope:

  • @landreev and I seem to agree that we're fine with not supporting S3 direct at this time. (My last commit, 438b86c, adds support for S3 non-direct upload, at least).
  • Yes, another PR is coming to https://github.com/gdcc/dataverse-previewers to add a previewer for the NcML file.
  • I need to think more about existing NetCDF/HDF5 files. Probably there should be an API to try to extract an NcML file from these existing files.
  • Currently, there are no limits. Any file that the netcdf-java library can open will go through the "extract NcML" process. I'm hoping this is lightweight (just reading the header at the front of the file) but I don't know (needs testing). Adding a limit might also be a way to turn off the functionality if an installation doesn't want it.
  • I'm concerned that the eyeball and Preview tab will be shown for all files of type application/netcdf or application/x-hdf5 even when an NcML aux file could not be created. How to solve this?

Here are a couple of screenshots of the poor UX if you enable an NcML preview tool when the NetCDF or HDF5 file already exists but hasn't had a chance to be processed (have its NcML extracted) or cannot be processed (such as certain HDF5 files).

You think you'll see a preview...

[Screenshot: Screen Shot 2022-12-15 at 12.32.48 PM]

... but you don't because the NcML file doesn't exist (sad trombone)

[Screenshot: Screen Shot 2022-12-15 at 9.58.11 AM]

mreekie moved this from 5▶🏁Been In a Sprint to 4▶⏱In This Sprint in IQSS Dataverse Project Dec 14, 2022

pdurbin commented Dec 15, 2022

Some quick hacking and thoughts on the idea of extending the external tools framework so that tools can say if they need an aux file.

(@landreev also just suggested maybe a custom mimetype like application/netcdf+auxfile or something)

diff --git a/doc/sphinx-guides/source/_static/installation/files/root/external-tools/fabulousFileTool.json b/doc/sphinx-guides/source/_static/installation/files/root/external-tools/fabulousFileTool.json
index 1c13257609..816a2e6441 100644
--- a/doc/sphinx-guides/source/_static/installation/files/root/external-tools/fabulousFileTool.json
+++ b/doc/sphinx-guides/source/_static/installation/files/root/external-tools/fabulousFileTool.json
@@ -22,6 +22,16 @@
         "locale":"{localeCode}"
       }
     ],
+    "requirements": [
+      {
+        "auxFilesExists": [
+          {
+            "fileTag": "NcML",
+            "fileVersion": "1.0"
+          }
+        ]
+      }
+    ],
     "allowedApiCalls": [
       {
         "name":"retrieveDataFile",
diff --git a/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java b/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java
index 6e71f6c504..8bb1167afc 100644
--- a/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java
+++ b/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java
@@ -5490,7 +5490,7 @@ public class DatasetPage implements java.io.Serializable {
             return cachedTools;
         }
         DataFile dataFile = datafileService.find(fileId);
-        cachedTools = ExternalToolServiceBean.findExternalToolsByFile(externalTools, dataFile);
+        cachedTools = externalToolService.findExternalToolsByFile(externalTools, dataFile);
         cachedToolsByFileId.put(fileId, cachedTools); //add to map so we don't have to do the lifting again
         return cachedTools;
     }
diff --git a/src/main/java/edu/harvard/iq/dataverse/FilePage.java b/src/main/java/edu/harvard/iq/dataverse/FilePage.java
index 85eb79d2dd..49680c5caf 100644
--- a/src/main/java/edu/harvard/iq/dataverse/FilePage.java
+++ b/src/main/java/edu/harvard/iq/dataverse/FilePage.java
@@ -125,6 +125,8 @@ public class FilePage implements java.io.Serializable {
     ExternalToolServiceBean externalToolService;
     @EJB
     PrivateUrlServiceBean privateUrlService;
+    @EJB
+    AuxiliaryFileServiceBean auxiliaryFileService;
 
     @Inject
     DataverseRequestServiceBean dvRequestService;
@@ -237,7 +239,7 @@ public class FilePage implements java.io.Serializable {
             configureTools = externalToolService.findFileToolsByTypeAndContentType(ExternalTool.Type.CONFIGURE, contentType);
             exploreTools = externalToolService.findFileToolsByTypeAndContentType(ExternalTool.Type.EXPLORE, contentType);
             Collections.sort(exploreTools, CompareExternalToolName);
-            toolsWithPreviews  = sortExternalTools();
+            toolsWithPreviews = findPreviewTools();
             if(!toolsWithPreviews.isEmpty()){
                 setSelectedTool(toolsWithPreviews.get(0));                
             }
@@ -285,8 +287,24 @@ public class FilePage implements java.io.Serializable {
         this.datasetVersionId = datasetVersionId;
     }
 
-    private List<ExternalTool> sortExternalTools(){
-        List<ExternalTool> retList = externalToolService.findFileToolsByTypeAndContentType(ExternalTool.Type.PREVIEW, file.getContentType());
+    private List<ExternalTool> findPreviewTools(){
+        List<ExternalTool> retList = new ArrayList<>();
+        String contentType = file.getContentType();
+        List<ExternalTool> previewTools = externalToolService.findFileToolsByTypeAndContentType(ExternalTool.Type.PREVIEW, file.getContentType());
+        for (ExternalTool previewTool : previewTools) {
+            if (contentType.equals("application/netcdf") || contentType.equals("application/x-hdf5")) { // TODO: factor this out
+                String formatTag = "NcML"; // TODO factor this out
+                String formatVersion = "0.1"; // TODO factor this out
+                AuxiliaryFile auxFile = auxiliaryFileService.lookupAuxiliaryFile(file, formatTag, formatVersion);
+                if (auxFile == null) {
+                    logger.info("findPreviewTools: Can't find an aux file for netcdf/hdf5, skipping.");
+                    continue;
+                } else {
+                    logger.info("findPreviewTools: found an aux file!");
+                }
+            }
+            retList.add(previewTool);
+        }
         Collections.sort(retList, CompareExternalToolName);
         return retList;
     }
diff --git a/src/main/java/edu/harvard/iq/dataverse/externaltools/ExternalTool.java b/src/main/java/edu/harvard/iq/dataverse/externaltools/ExternalTool.java
index 1789b7a90c..ae0f7bd7a4 100644
--- a/src/main/java/edu/harvard/iq/dataverse/externaltools/ExternalTool.java
+++ b/src/main/java/edu/harvard/iq/dataverse/externaltools/ExternalTool.java
@@ -103,6 +103,16 @@ public class ExternalTool implements Serializable {
     @Column(nullable = true, columnDefinition = "TEXT")
     private String allowedApiCalls;
 
+    /**
+     * When non-null, the tool operates on (downloads) an AuxiliaryFile and here
+     * we specify a JSON object containing the parameters needed to look up the
+     * aux file. For example, formatTag=NcML and formatVersion=1.0 could be used
+     * to look up an NcML aux file. If the aux file can't be found, the tool
+     * won't be offered.
+     */
+    @Column(nullable = true, columnDefinition = "TEXT")
+    private String auxFileParams;
+
     /**
      * This default constructor is only here to prevent this error at
      * deployment:
@@ -326,5 +336,12 @@ public class ExternalTool implements Serializable {
         this.allowedApiCalls = allowedApiCalls;
     }
 
+    public String getAuxFileParams() {
+        return auxFileParams;
+    }
+
+    public void setAuxFileParams(String auxFile) {
+        this.auxFileParams = auxFile;
+    }
 
 }
diff --git a/src/main/java/edu/harvard/iq/dataverse/externaltools/ExternalToolServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/externaltools/ExternalToolServiceBean.java
index a65ad2427b..b47051f188 100644
--- a/src/main/java/edu/harvard/iq/dataverse/externaltools/ExternalToolServiceBean.java
+++ b/src/main/java/edu/harvard/iq/dataverse/externaltools/ExternalToolServiceBean.java
@@ -1,5 +1,7 @@
 package edu.harvard.iq.dataverse.externaltools;
 
+import edu.harvard.iq.dataverse.AuxiliaryFile;
+import edu.harvard.iq.dataverse.AuxiliaryFileServiceBean;
 import edu.harvard.iq.dataverse.DataFile;
 import edu.harvard.iq.dataverse.DataFileServiceBean;
 import edu.harvard.iq.dataverse.authorization.users.ApiToken;
@@ -30,6 +32,7 @@ import javax.persistence.TypedQuery;
 import static edu.harvard.iq.dataverse.externaltools.ExternalTool.*;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
+import javax.ejb.EJB;
 
 @Stateless
 @Named
@@ -40,6 +43,9 @@ public class ExternalToolServiceBean {
     @PersistenceContext(unitName = "VDCNet-ejbPU")
     private EntityManager em;
 
+    @EJB
+    AuxiliaryFileServiceBean auxiliaryFileService;
+
     public List<ExternalTool> findAll() {
         TypedQuery<ExternalTool> typedQuery = em.createQuery("SELECT OBJECT(o) FROM ExternalTool AS o ORDER BY o.id", ExternalTool.class);
         return typedQuery.getResultList();
@@ -133,13 +139,24 @@ public class ExternalToolServiceBean {
      * file supports The list of tools is passed in so it doesn't hit the
      * database each time
      */
-    public static List<ExternalTool> findExternalToolsByFile(List<ExternalTool> allExternalTools, DataFile file) {
+    public List<ExternalTool> findExternalToolsByFile(List<ExternalTool> allExternalTools, DataFile file) {
         List<ExternalTool> externalTools = new ArrayList<>();
         //Map tabular data to it's mimetype (the isTabularData() check assures that this code works the same as before, but it may need to change if tabular data is split into subtypes with differing mimetypes)
         final String contentType = file.isTabularData() ? DataFileServiceBean.MIME_TYPE_TSV_ALT : file.getContentType();
         allExternalTools.forEach((externalTool) -> {
             //Match tool and file type 
             if (contentType.equals(externalTool.getContentType())) {
+                if (contentType.equals("application/netcdf") || contentType.equals("application/x-hdf5")) { // TODO: factor this out
+                    String formatTag = "NcML"; // TODO factor this out
+                    String formatVersion = "0.1"; // TODO factor this out
+                    AuxiliaryFile auxFile = auxiliaryFileService.lookupAuxiliaryFile(file, formatTag, formatVersion);
+                    if (auxFile == null) {
+                        logger.info("Can't find an aux file for netcdf/hdf5, skipping.");
+                        return; // like `continue` when in a forEach
+                    } else {
+                        logger.info("found an aux file!");
+                    }
+                }
                 externalTools.add(externalTool);
             }
         });
diff --git a/src/test/java/edu/harvard/iq/dataverse/api/NetcdfIT.java b/src/test/java/edu/harvard/iq/dataverse/api/NetcdfIT.java
index 74179b9883..a17db9e302 100644
--- a/src/test/java/edu/harvard/iq/dataverse/api/NetcdfIT.java
+++ b/src/test/java/edu/harvard/iq/dataverse/api/NetcdfIT.java
@@ -38,7 +38,9 @@ public class NetcdfIT {
         Integer datasetId = UtilIT.getDatasetIdFromResponse(createDataset);
         String datasetPid = UtilIT.getDatasetPersistentIdFromResponse(createDataset);
 
-        String pathToFile = "src/test/resources/netcdf/madis-raob";
+//        String pathToFile = "src/test/resources/netcdf/madis-raob";
+        String pathToFile = "/tmp/beta0.300000_0.hdf5";
+//        String pathToFile = "src/test/resources/hdf/hdf5/vlen_string_dset";
 
         Response uploadFile = UtilIT.uploadFileViaNative(datasetId.toString(), pathToFile, apiToken);
         uploadFile.prettyPrint();
diff --git a/src/test/java/edu/harvard/iq/dataverse/externaltools/ExternalToolServiceBeanTest.java b/src/test/java/edu/harvard/iq/dataverse/externaltools/ExternalToolServiceBeanTest.java
index 74e10d6735..d4431e64f9 100644
--- a/src/test/java/edu/harvard/iq/dataverse/externaltools/ExternalToolServiceBeanTest.java
+++ b/src/test/java/edu/harvard/iq/dataverse/externaltools/ExternalToolServiceBeanTest.java
@@ -49,8 +49,8 @@ public class ExternalToolServiceBeanTest {
         ExternalToolHandler externalToolHandler4 = new ExternalToolHandler(externalTool, dataFile, apiToken, fmd, null);
         List<ExternalTool> externalTools = new ArrayList<>();
         externalTools.add(externalTool);
-        List<ExternalTool> availableExternalTools = ExternalToolServiceBean.findExternalToolsByFile(externalTools, dataFile);
-        assertEquals(availableExternalTools.size(), 1);
+//        List<ExternalTool> availableExternalTools = ExternalToolServiceBean.findExternalToolsByFile(externalTools, dataFile);
+//        assertEquals(availableExternalTools.size(), 1);
     }
 
     @Test

pdurbin added a commit that referenced this issue Dec 20, 2022
The use case is an external tool that operates on aux files pulled out
of NetCDF/HDF5 files.
pdurbin added a commit that referenced this issue Dec 20, 2022

pdurbin commented Dec 20, 2022

Here are a couple of screenshots of the poor UX if you enable an NcML preview tool when the NetCDF or HDF5 file already exists but hasn't had a chance to be processed (have its NcML extracted) or if the file cannot be processed

I fixed this in 9edaf59 by adding a new "requirements" option for external tools. First, let's look at how the eyeball is hidden when the HDF5 file can't be parsed:

[Screenshot: Screen Shot 2022-12-20 at 10.05.08 AM]

For the good HDF5, we still see the preview:

[Screenshot: Screen Shot 2022-12-20 at 10.10.44 AM]

Here's how I documented the "requirements" option:

[Screenshots: Screen Shot 2022-12-20 at 10.12.48 AM and Screen Shot 2022-12-20 at 10.12.31 AM]
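
For reference, a full external tool manifest using the new option might look something like this. This is only a sketch: the tool name and URL are made up, and the requirement key names are taken from the quick-hack diff above, so the names that actually landed in 9edaf59 may differ.

{
  "displayName": "NcML Previewer",
  "description": "Previews NcML extracted from NetCDF/HDF5 files.",
  "scope": "file",
  "types": ["preview"],
  "toolUrl": "https://example.com/ncml-previewer.html",
  "contentType": "application/x-hdf5",
  "requirements": [
    {
      "auxFilesExists": [
        {
          "fileTag": "NcML",
          "fileVersion": "1.0"
        }
      ]
    }
  ],
  "toolParameters": {
    "queryParameters": [
      { "fileid": "{fileId}" },
      { "siteUrl": "{siteUrl}" }
    ]
  }
}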

I'll go ahead and make a pull request so I can get some feedback. I have the other PR on the previewers side to work on, so I'll leave this issue assigned to me rather than taking it off the board.

pdurbin commented Dec 20, 2022

I just created a pull request to add an NcML previewer:

I can't put that PR on our project board because it isn't under IQSS so I'll put this issue in "ready for review".

Update: I'm taking this issue off the board (as usual, the main PR will close it). I just created this issue to track (on the board) the previewer PR:

pdurbin removed their assignment Dec 20, 2022
pdurbin assigned pdurbin and unassigned pdurbin Dec 21, 2022
pdurbin added a commit that referenced this issue Jan 5, 2023
mreekie moved this from 4️⃣▶⏱In This Sprint to 5️⃣▶🏁Been In a Sprint in IQSS Dataverse Project Jan 11, 2023
pdurbin added a commit that referenced this issue Jan 19, 2023
"Use a version like '4.11.0.1' in the example above where the
previously released version was 4.11" -- dev guide

That is, these scripts should have been 5.12.1.whatever since
the last release was 5.12.1. Fixing. (They were 5.13.whatever.)
pdurbin added this to the 5.13 milestone Jan 20, 2023
mreekie moved this to 🚮Clear of the Backlog in IQSS Dataverse Project Jan 28, 2023