
2022 02 16 meetings

Mark A. Miller edited this page Feb 16, 2022 · 4 revisions

sample-annotator repo updates

I have switched my fork (turbomam/sample-annotator: NMDC Sample Annotator) to poetry.

I have also added an example for comparing rel_to_oxygen values to MIxS' expectations, as a starting point for DataGood.

I'd like to merge this into main now.

Pull Request #48 · microbiomedata/sample-annotator

Highlights:

  • installing the poetry application as a system requirement

  • run poetry install once after switching to this new branch

  • dependencies are specified in pyproject.toml

  • are we ready to publish to PyPI? what metadata to use? some was carried forward from setup.cfg into pyproject.toml

  • still need to reinstate some command line scripts under [tool.poetry.scripts] in pyproject.toml

    • sample-util = sample_annotator.sample_utils.main

    • goldapi = sample_annotator.clients.gold_api

  • moved non-poetry configuration files to pre-poetry/

  • removed a few dependencies

    • pint... I saw several similar looking options when following the poetry init guided process

    • anything related to pipenv

  • changed the source for importing Message... see below

  • requiring python 3.9

    • test older versions with tox?
  • tests pass

  • new poetry-based GH actions pass

  • actively working on Makefile

  • refactored .gitignore based on

  • did some semi-manual reformatting, in the PyCharm default style

  • have since agreed with Harshad and Marcin to autoformat with Black on save within IDE

  • which testing framework? how to invoke?

    • pytest
    • unittest
  • difference between tests and examples

    • where do inputs and outputs go
  • new logs directory

  • what documentation framework? I haven't touched any of this:

    • config/...
    • docs
    • sphinx
  • documentation needs to be updated in general, esp. for poetry

    • ABOUT.md
    • CONTRIBUTING.md
    • README.md
  • logging best practices?


I wrote sample_annotator/clients/biosample_sqlite_client.py. It documents the expected values from Enum: rel_to_oxygen_enum - MIxS, as well as the observed values in the harmonized_wide_sel_envs.rel_to_oxygen column of biosample_basex_data_good_subset.db.
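The expected-versus-observed comparison can be sketched roughly as follows. This is not the code in biosample_sqlite_client.py; it is a minimal illustration using only the standard library. The list of permissible values is my understanding of the MIxS rel_to_oxygen enum and should be treated as an assumption (in practice it would be read from the MIxS source, e.g. mixs6_core.tsv). The helper names and the in-memory demo data are hypothetical.

```python
import sqlite3
from collections import Counter

# Assumption: permissible values as understood from the MIxS rel_to_oxygen enum.
# A real run would load these from the MIxS source rather than hard-coding them.
EXPECTED_REL_TO_OXYGEN = {
    "aerobe",
    "anaerobe",
    "facultative",
    "microaerophilic",
    "microanaerobe",
    "obligate aerobe",
    "obligate anaerobe",
}


def observed_value_counts(conn, table="harmonized_wide_sel_envs", column="rel_to_oxygen"):
    """Count each distinct non-null value observed in the given column."""
    # Table and column names are trusted constants here, not user input.
    query = (
        f'SELECT "{column}", COUNT(*) FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL GROUP BY "{column}"'
    )
    return Counter(dict(conn.execute(query).fetchall()))


def split_by_expectation(counts, expected):
    """Separate observed values into those in the enum and those outside it."""
    legal = {value: n for value, n in counts.items() if value in expected}
    illegal = {value: n for value, n in counts.items() if value not in expected}
    return legal, illegal


def demo():
    """Run the comparison against a tiny, made-up in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE harmonized_wide_sel_envs (id TEXT, rel_to_oxygen TEXT)")
    rows = [("a", "aerobe"), ("b", "aerobe"), ("c", "Aerobic"), ("d", None), ("e", "anaerobe")]
    conn.executemany("INSERT INTO harmonized_wide_sel_envs VALUES (?, ?)", rows)
    counts = observed_value_counts(conn)
    conn.close()
    return split_by_expectation(counts, EXPECTED_REL_TO_OXYGEN)
```

Against the real database, `sqlite3.connect(":memory:")` would be replaced by a connection to the local copy of biosample_basex_data_good_subset.db.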

I had trouble running it from a poetry script wrapper (sqlite_client_cli)

ImportError: cannot import name 'Message' from 'sample_annotator.sample_annotator' (/Users/MAM/Documents/gitrepos/turbomam/sample-annotator/sample_annotator/sample_annotator.py)

But I didn't have any trouble running it directly as

python sample_annotator/clients/biosample_sqlite_client.py ...

So I commented out the Message import from .sample_annotator in sample_annotator/__init__.py and replaced that with

from report_model import Message

rel_to_oxygen entry point in Makefile:

rel_to_oxygen_example: downloads/mixs6_core.tsv
	$(RUN) rel_to_oxygen_example \
		--sqlite_path $(biosample_sqlite_file) \
		--mixs_core_path $<
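For orientation, a command line entry point that accepts the two options used by this Makefile target could look roughly like the sketch below. The actual script in the repo may be built with a different library; argparse from the standard library is used here only as a stand-in, and the function names and help strings are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the two options passed by the Makefile target."""
    parser = argparse.ArgumentParser(prog="rel_to_oxygen_example")
    parser.add_argument(
        "--sqlite_path",
        required=True,
        help="path to a local copy of the biosample SQLite database",
    )
    parser.add_argument(
        "--mixs_core_path",
        required=True,
        help="path to the downloaded MIxS core TSV file",
    )
    return parser


def main(argv=None):
    """Parse the arguments; the comparison itself would be called from here."""
    args = build_parser().parse_args(argv)
    # Placeholder: the real entry point would open the database and the TSV here.
    return args
```

Accepting `argv` as a parameter lets the same function be called from a script wrapper (where it falls back to `sys.argv`) and from tests with an explicit list.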

rel_to_oxygen module

9:00 PT meeting with Huy, Ichchitaa, Mark and Marcin

regrets from Kjiersten

links:

The LBL team has a separate repository for converting NCBI's biosample_set.xml.gz into SQLite databases like biosample_basex_data_good_subset.db.

The DataGood team can use the SQLite products as their input and do not need to be concerned with the conversion, which takes place in a separate repo.

The SQLite databases are available at https://portal.nersc.gov/project/m3513/biosample

Each developer will have their own local copy of the SQLite database, and those copies will certainly drift out of sync. That's one of the many reasons why the #1 deliverable is committing code into microbiomedata/sample-annotator, so that LBL people can rerun or extend the transformations on other databases in the future.

LBL people can help think about ways to expose this work to the public through static reports or lightweight web APIs built with frameworks like Flask or FastAPI.

| column (from harmonized_wide_sel_envs table) | action |
| --- | --- |
| rel_to_oxygen | replace illegal values with terms from the controlled vocabulary, or flag as un-repairable. Will require some subject matter knowledge. |
| depth, temp... | break out into value and unit parts with quantulum3 |
| env_broad_scale | lightweight NER |
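As a starting point for the rel_to_oxygen and depth/temp actions listed above, here is a rough sketch. Everything in it is an assumption for illustration: the enum values are my understanding of the MIxS rel_to_oxygen enum, the synonym map is hypothetical and would need subject matter review, and the value/unit split uses a plain regular expression as a stand-in for quantulum3 (it does not handle ranges or compound expressions).

```python
import re

# Assumption: permissible values as understood from the MIxS rel_to_oxygen enum.
EXPECTED = {
    "aerobe", "anaerobe", "facultative", "microaerophilic",
    "microanaerobe", "obligate aerobe", "obligate anaerobe",
}

# Hypothetical synonym map; real mappings need subject matter knowledge.
SYNONYMS = {"aerobic": "aerobe", "anaerobic": "anaerobe"}


def repair_rel_to_oxygen(raw):
    """Return (value, status), with status 'ok', 'repaired' or 'unrepairable'."""
    if raw is None:
        return None, "unrepairable"
    # Normalize case and whitespace before looking the value up.
    cleaned = " ".join(raw.strip().lower().split())
    if cleaned in EXPECTED:
        status = "ok" if cleaned == raw else "repaired"
        return cleaned, status
    if cleaned in SYNONYMS:
        return SYNONYMS[cleaned], "repaired"
    # Leave the original value in place and flag it for manual review.
    return raw, "unrepairable"


# A number, optionally followed by a unit that starts with a letter, degree or percent sign.
QUANTITY = re.compile(
    r"^\s*(?P<value>[-+]?\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z°%µ].*?)?\s*$"
)


def split_quantity(text):
    """Split text like '10 m' into (10.0, 'm'); return None if it does not parse."""
    match = QUANTITY.match(text)
    if match is None:
        return None
    return float(match.group("value")), match.group("unit")
```

Values that come back as 'unrepairable', or quantities that return None (for example ranges such as "10-20 m"), would be collected for manual review rather than silently dropped.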

11:00 PT meeting with Harshad, Marcin and Mark

Most of the notes I took in this meeting have been folded into the bullet points above

What is utils/flatten.py supposed to do? It looks buggy.

  • ':' expected @ line 12
  • Indent expected @ line 12
  • Unresolved reference 'obj' @ line 13
  • It looks like line 12 just needs a colon and line 13 just needs indentation, but I have no idea what obj is supposed to be.

Why is there a static mixs.json in this repo?

sample_annotator/__init__.py:

MIXS_SCHEMA = os.path.join(MAIN_SCHEMA_DIR, 'mixs.json')

sample_annotator/metadata/sample_schema.py:

import json
from typing import Dict

from sample_annotator import MIXS_SCHEMA


class SampleSchema:
    object: Dict = None
    slot_dict_by_alias = None

    def load(self, force=False) -> Dict:
        """
        Load the schema from the config folder
        """
        # Reuse the cached schema unless a reload is forced
        if self.object and not force:
            return self.object
        with open(MIXS_SCHEMA) as stream:
            self.object = json.load(stream)
            return self.object