Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix molecule parsing issue for CDMS #2385

Merged
merged 8 commits into from
May 6, 2022
Merged

Conversation

keflavich
Copy link
Contributor

@keflavich keflavich commented Apr 29, 2022

WIP: this fixes only half the problem. Tests are needed.

Problem is described in #2375: there are molecules with ()'s in the names that break the regex. The current fix escapes those automatically.

There's a second problem: the fixed-width parser fails for some subset of molecules. Todo:

  • figure out what the correct parsing is for H2C(CN)2 (it is not obvious!)
  • find a generalized way to parse different molecular files (there does not appear to be a header for any of them?)

@keflavich keflavich added the cdms label Apr 29, 2022
@pep8speaks
Copy link

pep8speaks commented Apr 29, 2022

Hello @keflavich! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-05-06 00:54:31 UTC

@keflavich
Copy link
Contributor Author

I can't run tests locally because of an astropy bug that I can't correct because I'm on a plane and pip install -e . hangs. pytest is so annoyingly finnicky. I'm complaining here because I have nothing else to do while I wait for it to finish...

@keflavich
Copy link
Contributor Author

The key changes here are to the start locations of rows. I've added several tests and extensively checked the CDMS database; this fixed-format is universal, it just contained more than I had realized originally.

The correction suggested in #2375 (comment) is not correct; the GUP location here is correct.

@codecov
Copy link

codecov bot commented Apr 30, 2022

Codecov Report

Merging #2385 (777d169) into main (105d232) will decrease coverage by 0.30%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #2385      +/-   ##
==========================================
- Coverage   63.31%   63.00%   -0.31%     
==========================================
  Files         133      133              
  Lines       17271    17269       -2     
==========================================
- Hits        10935    10881      -54     
- Misses       6336     6388      +52     
Impacted Files Coverage Δ
astroquery/linelists/cdms/core.py 95.08% <100.00%> (+15.28%) ⬆️
astroquery/nrao/core.py 24.57% <0.00%> (-46.46%) ⬇️
astroquery/cadc/core.py 80.37% <0.00%> (+3.49%) ⬆️

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@keflavich keflavich added this to the v0.4.7 milestone May 1, 2022
@keflavich keflavich requested a review from bsipocz May 1, 2022 14:11
@keflavich
Copy link
Contributor Author

This is ready for review. I squashed the dozens of whitespace commits.

@bsipocz bsipocz changed the title Fix for issue 2375 Fix molecule parsing issue for CDMS May 2, 2022
@bsipocz bsipocz added the bug label May 2, 2022
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test need to be fixed, otherwise it looks good.

@@ -303,12 +331,66 @@ def tryfloat(x):
CDMS = CDMSClass()


def parse_letternumber(st):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add _ as this is not suppose to end up in the public API (and indeed it's not added to all)


return out


def build_lookup():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity, what's the reason of this being factored out from the main class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember. 😬

"""

response = MockResponseSpec('HC7S.data')
tbl = CDMS._parse_result(response)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you ever supposed to call _parse_result directly? If not the test should be refactored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. I see your point. I was copying from an old/previous test, but it would be better to test the user-facing API here, I agree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, yes. Then the previous test was failing, too, I suppose 😅

assert parse_letternumber("ZZ") == 3535


def test_hc7s():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails when run on it's own, it relies on other tests prior to it run successfully, a side effect of calling the private method directly.

astroquery/linelists/cdms/tests/test_cdms.py F                                                                   [100%]

======================================================= FAILURES =======================================================
______________________________________________________ test_hc7s _______________________________________________________

    def test_hc7s():
        """
        Test for a very complicated molecule
    
        CDMS.query_lines_async(100*u.GHz, 100.755608*u.GHz, molecule='HC7S', parse_name_locally=True)
        """
    
        response = MockResponseSpec('HC7S.data')
>       tbl = CDMS._parse_result(response)

astroquery/linelists/cdms/tests/test_cdms.py:97: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <astroquery.linelists.cdms.core.CDMSClass object at 0x135f68df0>
response = <astroquery.linelists.cdms.tests.test_cdms.MockResponseSpec object at 0x14d075ee0>, verbose = False

    def _parse_result(self, response, verbose=False):
        """
        Parse a response into an `~astropy.table.Table`
    
        The catalog data files are composed of fixed-width card images, with
        one card image per spectral line.  The format of each card image is
        similar to the JPL version:
        FREQ, ERR, LGINT, DR,  ELO, GUP, TAG, QNFMT,  QN',  QN"
        (F13.4,F8.4, F8.4,  I2,F10.4,  I3,  I7,    I4,  6I2,  6I2)
        but the formats are somewhat different and are encoded below.
        The first several entries are the same, but more detail is appended at
        the end of the line
    
        FREQ:  Frequency of the line in MHz.
        ERR:   Estimated or experimental error of FREQ in MHz.
        LGINT: Base 10 logarithm of the integrated intensity in units of nm^2 MHz at
            300 K.
    
        DR:    Degrees of freedom in the rotational partition function (0 for atoms,
            2 for linear molecules, and 3 for nonlinear molecules).
    
        ELO:   Lower state energy in cm^{-1} relative to the ground state.
        GUP:   Upper state degeneracy.
        MOLWT: The first half of the molecular weight tag, which is the mass in atomic
               mass units (Daltons).
        TAG:   Species tag or molecular identifier.  This only includes the
               last 3 digits of the CDMS tag
    
        A negative value of MOLWT flags that the line frequency has been
        measured in the laboratory.  We record this boolean in the 'Lab'
        column.  ERR is the reported experimental error.
    
        QNFMT: Identifies the format of the quantum numbers
        Ju/Ku/vu and Jl/Kl/vl are the upper/lower QNs
        F: the hyperfine lines
        name: molecule name
    
        The full detailed description is here:
        https://cdms.astro.uni-koeln.de/classic/predictions/description.html#description
        """
    
        if 'Zero lines were found' in response.text:
            raise EmptyResponseError(f"Response was empty; message was '{response.text}'.")
    
        soup = BeautifulSoup(response.text, 'html.parser')
        text = soup.find('pre').text
    
        starts = {'FREQ': 0,
                  'ERR': 14,
                  'LGINT': 25,
                  'DR': 36,
                  'ELO': 38,
                  'GUP': 48,
                  'MOLWT': 51,
                  'TAG': 54,
                  'QNFMT': 58,
                  'Ju': 61,
                  'Ku': 63,
                  'vu': 65,
                  'F1u': 67,
                  'F2u': 69,
                  'F3u': 71,
                  'Jl': 73,
                  'Kl': 75,
                  'vl': 77,
                  'F1l': 79,
                  'F2l': 81,
                  'F3l': 83,
                  'name': 89}
    
        result = ascii.read(text, header_start=None, data_start=0,
                            comment=r'THIS|^\s{12,14}\d{4,6}.*',
                            names=list(starts.keys()),
                            col_starts=list(starts.values()),
                            format='fixed_width', fast_reader=False)
    
        result['FREQ'].unit = u.MHz
        result['ERR'].unit = u.MHz
    
        result['Lab'] = result['MOLWT'] < 0
        result['MOLWT'] = np.abs(result['MOLWT'])
        result['MOLWT'].unit = u.Da
    
        for suf in 'ul':
            for qn in ('J', 'v', 'K', 'F1', 'F2', 'F3'):
                qnind = qn+suf
                if result[qnind].dtype != int:
                    intcol = np.array(list(map(parse_letternumber, result[qnind])),
                                      dtype=int)
                    result[qnind] = intcol
    
        # if there is a crash at this step, something went wrong with the query
        # and the _last_query_temperature was not set.  This shouldn't ever
        # happen, but, well, I anticipate it will.
>       if self._last_query_temperature == 0:
E       AttributeError: 'CDMSClass' object has no attribute '_last_query_temperature'

astroquery/linelists/cdms/core.py:269: AttributeError
=============================================== short test summary info ================================================
FAILED astroquery/linelists/cdms/tests/test_cdms.py::test_hc7s - AttributeError: 'CDMSClass' object has no attribute ...
================================================== 1 failed in 0.93s ===================================================

@bsipocz
Copy link
Member

bsipocz commented May 2, 2022

Does this now fully address #2375? If yes, please link them together making sure the merge will close the issue.

@keflavich keflavich linked an issue May 2, 2022 that may be closed by this pull request
@keflavich
Copy link
Contributor Author

The failing/weird mocks were inherited from jplspec, which has tests that skip the actual query steps and just test the mock responses. So we should refactor that too.

keflavich added 7 commits May 2, 2022 22:20
whitespace cleanup

pep8 cleanup

pep8 cleanup

escaped character fix.  The previous commit fixed the outstanding pep8 issue but the fix was not recognized

fix import

yay tests pass

more cleanup of the tags

more fixes

fix local tests
fix mistake

remote tests pass now

fix docs

code style fixes (whitespace)

rst whitespace text (I don't think my code highlighter works right for
rst files...)

restore an unintentionally-deleted line (but this probably won't fix the
failure b/c this "response" is still not defined... we need this example
to use the previous as context...)
confusingly.... describe how string matching is done.  Also do some
minor performance cleanup
@bsipocz
Copy link
Member

bsipocz commented May 3, 2022

The failing/weird mocks were inherited from jplspec, which has tests that skip the actual query steps and just test the mock responses. So we should refactor that too.

Yeah, I suppose we missed a few things here and there during reviews 😅

Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extremely minor nitpicks that I'll fix in a commit along with the whitespace, so we can merge this.



@pytest.mark.remote_data
def test_2375():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use names (and PR titles) that are descriptive. No one will remember what issue 2375 was when this test starts failing :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_complicated_molecule_parsing, then? (but link #2375 in the docstring?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeap, I went with test_molecule_with_parens, and agree, keeping the reference to the issue number is super useful

CHANGES.rst Outdated
@@ -24,6 +24,11 @@ jplsbdb
- Fix a bug for jplsdbd query when the returned physical quantity contains
a unit with exponential. [#2377]

linelists/cdms
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we are checking for it, but the syntax here follows the namespace rather than the directory structure (separated with .)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I didn't realize

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is something we blindly inherited from core astropy, but I don't think we actually check for it anywhere. So at this point, it's just "historical reasons" 🤷‍♀️

@bsipocz
Copy link
Member

bsipocz commented May 6, 2022

Codecov is confused and messed up due to the totally unrelated nrao disabling, so I just go ahead and merge this now.
Thanks @keflavich!

@bsipocz bsipocz merged commit 729fd4d into astropy:main May 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug while reading information from CDMS table
3 participants