Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Updrade onmt #6

Draft
wants to merge 24 commits into
base: main
Choose a base branch
from
Draft

[WIP] Updrade onmt #6

wants to merge 24 commits into from

Conversation

irinaespejo
Copy link
Member

@irinaespejo irinaespejo commented Apr 9, 2024

This PR is WIP to upgrade the dependency on onmt from a the forked version to latest v.3.5.1 dropping the need of a fork.

And also upgrading to python 3.11

Changes made:

  1. Preprocessing
    Running rxn-onmt-preprocess throws an error about /bin/bash command onmt_preprocess not found. This comes from this line. The onmt_preprocess functionality was dropped by OpenNMT from v.1.2.0 -> v.2.2.0

    Solution: changes can be found in src/rxn/onmt_models/scripts/rxn_onmt_preprocess.py by upgrading the command to onmt_build_vocab here and a helper wrapper function here. The use of onmt_build_vocab is used all over the official docs

  2. Training
    The idea is to still call on cli onmt_train -config /path/to/config.yaml via run_command() but in a way such that it resembles as much as possible the official way here.
    Turns out we only need to: instead of passing onmt_train -- <all arguments> we dump the arguments that rxn-onmt-train receives via cli to a config.yaml in the same way OpenNMT v.3.5.1 expects them.

    Solution: changes can be found in src/rxn/onmt_models/scripts/rxn_onmt_train.py added a wrapper function here because OpenNMT v.3.5.1 expects src_vocab and tgt_vocab in config file. See PR#7in rxn-onmt-utils for changes in class OnmtTrainCommand

Copy link

cla-bot bot commented Apr 9, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: irinaespejo.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@rxn4chemistry rxn4chemistry deleted a comment from cla-bot bot Apr 9, 2024
Copy link

cla-bot bot commented Apr 9, 2024

Thanks a lot for working on rxn4chemistry, we strongly value contributions from our users.

It appears that one of the commiters in the PR did not sign the CLA for contributors. To do so, you can open an issue to sign the CLA clicking here. More details can be found here.

You can then request a new CLA check by commenting "@cla-bot check" on this PR.

@irinaespejo
Copy link
Member Author

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Apr 9, 2024
Copy link

cla-bot bot commented Apr 9, 2024

The cla-bot has been summoned, and re-checked this pull request!

@thegodone
Copy link

Is it working or still WIP ?

@drugilsberg
Copy link

Is it working or still WIP ?

Functionally speaking the integration is complete, but we are leaving it as a WIP until results from both backward and forward models are compatible with the version using the older ONMT.

@thegodone
Copy link

thegodone commented Aug 5, 2024

Are differences of performances influence by the new tokenization ? Have you observed a 1% difference or a 0.1% difference, or larger ? Just to have an idea (order scaled) of the modification effects on performances.

@drugilsberg
Copy link

Are differences of performances influence by the new tokenization ? Have you observed a 1% difference or a 0.1% difference, or larger ? Just to have an idea (order scaled) of the modification effects on performances.

For forward models there is no difference, while for backward we observe a significant drop in round trip accuracy (50%). For the latter it might be due to some issues in running the inference appropriately as the loss and the token-level metrics are the same. We will post updates in the PR as soon as we have them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants