
submission: pangoling: Access to word predictability using large language (transformer) models #575

Open
13 of 29 tasks
bnicenboim opened this issue Feb 21, 2023 · 57 comments


@bnicenboim

bnicenboim commented Feb 21, 2023

Submitting Author Name: Bruno Nicenboim
Submitting Author Github Handle: @bnicenboim
Repository: https://github.com/bnicenboim/pangoling
Version submitted: 0.0.0.9005
Submission type: Standard
Editor: @emilyriederer
Reviewers: @lisalevinson, @utkuturk

Archive: TBD
Version accepted: TBD
Language: en

  • Paste the full DESCRIPTION file inside a code block below:
Package: pangoling
Type: Package
Title: Access to Large Language Model Predictions
Version: 0.0.0.9005
Authors@R: c(
    person("Bruno", "Nicenboim",
    email = "[email protected]",
    role = c( "aut","cre"),
    comment = c(ORCID = "0000-0002-5176-3943")),
    person("Chris", "Emmerly", role = "ctb"),
    person("Giovanni", "Cassani", role = "ctb"))
Description: Access to word predictability using large language (transformer) models.
URL: https://bruno.nicenboim.me/pangoling, https://github.com/bnicenboim/pangoling
BugReports: https://github.com/bnicenboim/pangoling/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: false
Config/reticulate:
  list(
    packages = list(
      list(package = "torch"),
      list(package = "transformers")
    )
  )
Imports: 
    data.table,
    memoise,
    reticulate,
    tidyselect,
    tidytable (>= 0.7.2),
    utils,
    cachem
Suggests: 
    rmarkdown,
    knitr,
    testthat (>= 3.0.0),
    tictoc,
    covr,
    spelling
Config/testthat/edition: 3
RoxygenNote: 7.2.3
Roxygen: list(markdown = TRUE)
Depends: 
    R (>= 4.1.0)
VignetteBuilder: knitr
StagedInstall: yes
Language: en-US

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package is built on top of the Python package transformers, and it offers some basic functionality for text analysis, including tokenization and perplexity calculation. Crucially, pangoling also offers word predictability, which is widely used as a predictor in psycho- and neurolinguistics and is not trivial to obtain. Moreover, transformers works with "tokens" rather than "words", so pangoling takes care of the mapping from tokens to the target words (or even phrases).
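To illustrate the token-to-word mapping, here is a schematic sketch in base R with made-up log-probabilities (not pangoling's actual implementation): in GPT-2-style BPE vocabularies a leading "Ġ" marks a token that begins a new word, so token log-probabilities can be grouped and summed per word.

```r
# Hypothetical tokens and log-probabilities for "The apple doesn't fall"
tokens <- c("The", "Ġapple", "Ġdoesn", "'t", "Ġfall")
lp     <- c(NA, -10.90, -5.50, -0.0008, -3.60)  # made-up values

# A new word starts at the first token or at any token prefixed with "Ġ"
word_id <- cumsum(grepl("^Ġ", tokens) | seq_along(tokens) == 1)

# Word-level log-probability = sum of its tokens' log-probabilities
tapply(lp, word_id, sum)
```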

  • Who is the target audience and what are scientific applications of this package?

This is mostly for psycho- and neurolinguists who use word predictability as a predictor in their research, such as in ERP/EEG and reading studies.

Another R package that acts as a wrapper for transformers is text. However, text is more general, and its focus is on Natural Language Processing and Machine Learning. pangoling is much more specific, and its focus is on measures used as predictors in analyses of data from experiments, rather than NLP. text doesn't allow for generating pangoling's output in a straightforward way; in fact, I'm not sure it's even possible to get token probabilities from text, since it seems more limited than the Python package transformers.

NA

  • If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

#573

  • Explain reasons for any pkgcheck items which your package is unable to pass.

pkgcheck fails only because of the use of <<-, but this is done in .onLoad, as recommended by reticulate. Also see this issue.
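The pattern in question, from reticulate's "Using reticulate in an R Package" vignette, looks roughly like this (a minimal sketch; the module name is illustrative, not pangoling's actual code):

```r
# zzz.R
# Placeholder defined at the package's top level
transformers <- NULL

.onLoad <- function(libname, pkgname) {
  # delay_load = TRUE defers the import until first use, so the package
  # can be loaded even when no Python environment is available.
  # The <<- rebinds the top-level placeholder, which is what pkgcheck flags.
  transformers <<- reticulate::import("transformers", delay_load = TRUE)
}
```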

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?

  • Do you intend for this package to go on Bioconductor?

  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

@ropensci-review-bot
Collaborator

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

@ropensci-review-bot
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Collaborator

Checks for pangoling (v0.0.0.9005)

git hash: 543c11bd

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage is 0.9% (should be at least 75%).
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 97
internal pangoling 41
internal graphics 2
internal stats 1
imports tidytable 20
imports reticulate 5
imports memoise 3
imports cachem 2
imports data.table 1
imports tidyselect 1
imports utils NA
suggests rmarkdown NA
suggests knitr NA
suggests testthat NA
suggests tictoc NA
suggests covr NA
suggests spelling NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

lapply (19), length (7), c (6), dim (6), paste0 (6), t (6), unlist (4), by (3), list (3), names (3), seq_len (3), which (3), do.call (2), getOption (2), matrix (2), ncol (2), rep (2), seq_along (2), sum (2), unname (2), as.list (1), floor (1), for (1), grepl (1), lengths (1), mode (1), new.env (1), options (1), rownames (1), split (1), switch (1), vector (1)

pangoling

create_tensor_lst (5), lst_to_kwargs (5), char_to_token (4), encode (4), get_id (4), get_vocab (4), get_word_by_word_texts (2), masked_lp_mat (2), causal_config (1), causal_lp (1), causal_lp_mats (1), causal_mat (1), causal_next_tokens_tbl (1), causal_preload (1), causal_tokens_lp_tbl (1), chr_detect (1), masked_config (1), num_to_token (1), word_lp (1)

tidytable

map_chr. (4), map2 (3), map. (2), pmap. (2), arrange. (1), map (1), map_dbl. (1), map_dfr (1), map_dfr. (1), map2_dbl. (1), pmap_chr (1), relocate (1), tidytable (1)

reticulate

py_to_r (5)

memoise

memoise (3)

cachem

cache_mem (2)

graphics

text (2)

data.table

chmatch (1)

stats

lm (1)

tidyselect

everything (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 7 files) and
  • 1 authors
  • 2 vignettes
  • no internal data file
  • 7 imported packages
  • 14 exported functions (median 9 lines of code)
  • 57 non-exported functions in R (median 9 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 7 45.7
files_vignettes 2 85.7
files_tests 7 86.4
loc_R 733 59.3
loc_vignettes 154 39.9
loc_tests 293 63.6
num_vignettes 2 89.2
n_fns_r 71 67.2
n_fns_r_exported 14 56.3
n_fns_r_not_exported 57 71.1
n_fns_per_file_r 6 75.6
num_params_per_fn 4 54.6
loc_per_fn_r 9 24.3
loc_per_fn_r_exp 10 21.6
loc_per_fn_r_not_exp 9 27.1
rel_whitespace_R 10 45.0
rel_whitespace_vignettes 42 47.5
rel_whitespace_tests 12 48.3
doclines_per_fn_exp 44 55.2
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 72 73.5

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

R-CMD-check.yaml

GitHub Workflow Results

id name conclusion sha run_number date
4232217801 pages build and deployment success c6ce46 16 2023-02-21
4232181659 pkgdown success 543c11 44 2023-02-21
4232181660 R-CMD-check success 543c11 37 2023-02-21
4232181654 test-coverage success 543c11 39 2023-02-21

3b. goodpractice results

R CMD check with rcmdcheck

rcmdcheck found no errors, warnings, or notes

Test coverage with covr

Package coverage: 0.89

The following files are not completely covered by tests:

file coverage
R/tr_causal.R 0%
R/tr_masked.R 0%
R/tr_utils.R 0%
R/utils.R 0%
R/zzz.R 0%

Cyclocomplexity with cyclocomp

The following function has cyclocomplexity >= 15:

function cyclocomplexity
word_lp 16

Static code analyses with lintr

lintr found the following 39 potential issues:

message number of times
Avoid library() and require() calls in packages 5
Lines should not be more than 80 characters. 34


Package Versions

package version
pkgstats 0.1.3
pkgcheck 0.1.1.11


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@bnicenboim
Author

Hi,
"Package coverage is 0.9% (should be at least 75%)." -> this seems to be an error; the actual coverage is 94%.
Maybe it's because the tests skip almost everything when no Python is installed (as recommended by reticulate)?

Also when I run pkgcheck::pkgcheck() I get the following:

✔ Package coverage is 94.8%.

@maurolepore
Member

maurolepore commented Feb 22, 2023

Thanks @bnicenboim for your full submission and for explaining the issue with test coverage. Your explanation makes sense, so we can move forward. I'll start searching for a handling editor.

In the meantime you may want to start thinking of potential reviewers to suggest to the handling editor.

@bnicenboim
Author

Is there a pool of potential reviewers that I can have access to?

@maurolepore
Member

maurolepore commented Feb 26, 2023

I guess authors are mostly guided by their knowledge of their intended audience. But for inspiration, see how editors look for reviewers. Editors have access to a private Airtable database, but often we look elsewhere.

@maurolepore maurolepore pinned this issue Feb 26, 2023
@maurolepore maurolepore unpinned this issue Feb 26, 2023
@maurolepore
Member

maurolepore commented Mar 7, 2023

Dear @bnicenboim, I'm sorry for the extraordinary delay in finding a handling editor. Most editors are busy, some handling more than one package, and the few who are available are not yet due to take on another submission. Please hold on a bit longer.

@bnicenboim
Author

ok, thanks for letting me know, no problem.

@karthik
Member

karthik commented Mar 10, 2023

@ropensci-review-bot assign @karthik as editor

@ropensci-review-bot
Collaborator

Assigned! @karthik is now the editor

@karthik
Member

karthik commented Mar 10, 2023

👋 @bnicenboim
I'll follow up with next steps in a few days. In the meantime I'll start pinging a few reviewers.

@bnicenboim
Author

Hi, any news about the next steps?

@karthik
Member

karthik commented Mar 30, 2023

Hi @bnicenboim
Apologies for the delay, I've been out sick this past week. I'll update this thread in the next few days.

@karthik
Member

karthik commented May 2, 2023

Editor checks:

  • Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • Is the case for the package well made?
    • Is the reference index page clear (grouped by topic if necessary)?
    • Are vignettes readable, sufficiently detailed and not just perfunctory?
  • Fit: The package meets criteria for fit and overlap.
  • Installation instructions: Are installation instructions clear enough for human users?
  • Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
  • Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • License: The package has a CRAN or OSI accepted license.
  • Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments

No additional comments at this time. I'm looking for reviewers at the moment, but if you've got any suggestions for people with expertise but no conflict, please suggest names.

@karthik
Member

karthik commented May 2, 2023

@ropensci-review-bot seeking reviewers

@ropensci-review-bot
Collaborator

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/575_status.svg)](https://github.com/ropensci/software-review/issues/575)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

@bnicenboim
Author

I really don't know about reviewers, I guess someone involved in the packages named here:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Or maybe someone based on the reverse imports of reticulate:
https://cloud.r-project.org/web/packages/reticulate/index.html

@karthik
Member

karthik commented May 3, 2023

@ropensci-review-bot assign @lisalevinson as reviewer

@ropensci-review-bot
Collaborator

@lisalevinson added to the reviewers list. Review due date is 2023-05-24. Thanks @lisalevinson for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot
Collaborator

@lisalevinson: If you haven't done so, please fill this form for us to update our reviewers records.

@ropensci-review-bot
Collaborator

Submission on hold!

@bnicenboim
Author

> Hi @bnicenboim, I'm checking in on issues that have been static for a while. I see this issue was moving along alright and then fell silent.
>
> Let us know if there is progress to report or any questions you may have. We can move to a "Hold" status, or keep it open as-is if you expect to continue to respond to the reviews.

Hi, I made a lot of progress (and unrelated changes) and I had some questions for @lisalevinson that were never answered, but regardless it's fine to have this on hold until my teaching is over and I can finalize the last things.

@lisalevinson

@bnicenboim So sorry - I never saw your questions at all. I was on medical leave when you posted them and had an away message on my email, but that wouldn't go back to you through GitHub notifications of course! And then it must have gotten lost in a sea of miscellaneous GitHub emails when I tried to sort through everything after.

It would take me some time to try it all out again and remember how everything works - honestly right now this will be hard for me to do before the end of May because I am co-organizing HSP and that is right after our current semester ends. Would that be an OK timeline for when you plan to pick it up again? I may be able to squeeze it in earlier but I don't want to make promises I'm not sure that I can keep!

@bnicenboim
Author

> @bnicenboim So sorry - I never saw your questions at all. I was on medical leave when you posted them and had an away message on my email, but that wouldn't go back to you through GitHub notifications of course! And then it must have gotten lost in a sea of miscellaneous GitHub emails when I tried to sort through everything after.
>
> It would take me some time to try it all out again and remember how everything works - honestly right now this will be hard for me to do before the end of May because I am co-organizing HSP and that is right after our current semester ends. Would that be an OK timeline for when you plan to pick it up again? I may be able to squeeze it in earlier but I don't want to make promises I'm not sure that I can keep!

I don't think I'll touch anything until maybe June, so no worries from my side.

@ropensci-review-bot
Collaborator

@ldecicco-USGS: Please review the holding status

@emilyriederer

Hi @bnicenboim - checking in with the rOpenSci editors team. Do you still plan on picking this back up?

@bnicenboim
Author

Yes, I'm picking it up. Sorry for the long delay. I also wanted to add some other features. I'll see how easy that is, or at least whether the function names I'll have are compatible with the other uses I'm thinking about.

Also, I don't think my answer to @lisalevinson is relevant anymore. She's completely right in that the two uses of causal_lp are confusing. I'll divide it into two functions.

@bnicenboim
Author

Phew, it took some time but I'm mostly done.

I'm checking that I still comply with all the requirements of rOpenSci. Then, I'll report the changes I made and how I addressed the reviewers' issues.

@bnicenboim
Author

@ropensci-review-bot check package

@ropensci-review-bot
Collaborator

Thanks, about to send the query.

@ropensci-review-bot
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Collaborator

Checks for pangoling (v0.0.0.9010)

git hash: 5931c547

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage failed
  • ✖️ R CMD check found 1 error.
  • ✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 171
internal pangoling 53
internal graphics 2
internal stats 1
imports tidytable 28
imports reticulate 14
imports memoise 3
imports cachem 2
imports data.table 1
imports tidyselect 1
imports utils NA
suggests rmarkdown NA
suggests knitr NA
suggests testthat NA
suggests tictoc NA
suggests covr NA
suggests spelling NA
suggests rstudioapi NA
suggests brms NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

lapply (26), names (12), length (11), list (11), c (10), dim (9), paste0 (8), by (7), unlist (6), sequence (5), setdiff (5), t (5), which (5), drop (4), split (4), rep (3), seq_along (3), seq_len (3), unname (3), all (2), as.list (2), do.call (2), matrix (2), sum (2), unsplit (2), version (2), for (1), getOption (1), grepl (1), if (1), ifelse (1), is.null (1), lengths (1), mode (1), ncol (1), new.env (1), nrow (1), options (1), rbind (1), rownames (1), switch (1), vapply (1), vector (1)

pangoling

create_tensor_lst (8), get_vocab (8), encode (6), conc_words (4), get_id (4), lst_to_kwargs (4), causal_config (1), causal_lp (1), causal_lp_mats (1), causal_mat (1), causal_pred_mats (1), causal_preload (1), causal_targets_pred (1), causal_tokens_lp_tbl (1), causal_words_pred (1), char_to_token (1), chr_detect (1), install_py_pangoling (1), is_mac (1), is_really_string (1), ln_p_change (1), masked_lp_mat (1), num_to_token (1), safe_decode (1), word_lp (1)

tidytable

map (6), map_chr (5), map_dfr (3), map2 (3), pmap (3), arrange (2), tidytable (2), map_dbl (1), map2_dbl (1), pmap_chr (1), relocate (1)

reticulate

py_to_r (9), virtualenv_starter (2), import (1), use_virtualenv (1), virtualenv_remove (1)

memoise

memoise (3)

cachem

cache_mem (2)

graphics

text (2)

data.table

chmatch (1)

stats

lm (1)

tidyselect

everything (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (94% in 9 files) and TeX (6% in 1 files)
  • 1 authors
  • no vignette
  • 2 internal data files
  • 7 imported packages
  • 23 exported functions (median 28 lines of code)
  • 82 non-exported functions in R (median 10 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 9 52.3
files_inst 1 96.7
files_vignettes 0 0.0 TRUE
files_tests 9 85.7
loc_R 1383 73.9
loc_inst 83 38.4
loc_tests 536 72.4
num_vignettes 0 0.0 TRUE
data_size_total 19169 73.4
data_size_median 9584 78.2
n_fns_r 105 76.1
n_fns_r_exported 23 70.8
n_fns_r_not_exported 82 77.8
n_fns_per_file_r 8 83.4
num_params_per_fn 7 84.1
loc_per_fn_r 11 32.9
loc_per_fn_r_exp 28 60.3
loc_per_fn_r_not_exp 10 35.6
rel_whitespace_R 13 66.2
rel_whitespace_inst 14 36.1
rel_whitespace_tests 16 66.2
doclines_per_fn_exp 78 83.4
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 141 83.4

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

R-CMD-check.yaml

GitHub Workflow Results

id name conclusion sha run_number date
12864365257 pages build and deployment success 3a7bec 24 2025-01-20
12864039437 pkgdown success 5931c5 84 2025-01-20
12864039448 R-CMD-check success 5931c5 77 2025-01-20
12864039440 test-coverage success 5931c5 79 2025-01-20

3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following error:

  1. checking examples ... ERROR
    Running examples in ‘pangoling-Ex.R’ failed
    The error most likely occurred in:

Name: causal_config

Title: Returns the configuration of a causal model

Aliases: causal_config

** Examples

causal_config(model = "gpt2")
Error in py_run_string_impl(code, local, convert) :
ModuleNotFoundError: No module named 'torch'
Run reticulate::py_last_error() for details.
Calls: causal_config ... eval -> -> -> py_run_string_impl
Execution halted

R CMD check generated the following note:

  1. checking data for non-ASCII characters ... NOTE
    Note: found 25872 marked UTF-8 strings

R CMD check generated the following check_fails:

  1. rcmdcheck_non_ascii_characters_in_data
  2. rcmdcheck_examples_run

Test coverage with covr

ERROR: Test Coverage Failed

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function cyclocomplexity
install_py_pangoling 17
word_lp 17

Static code analyses with lintr

lintr found no issues with this package!


Package Versions

package version
pkgstats 0.2.0.48
pkgcheck 0.1.2.77


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@bnicenboim
Author

Hi, I'm not sure what to do about the errors. I don't get the warnings or notes here:
https://github.com/bnicenboim/pangoling/actions/runs/12864039448/job/35861748929

It might be that rcmdcheck doesn't install the needed Python dependencies? But I'm following the latest advice from reticulate (https://rstudio.github.io/reticulate/articles/package.html), both in creating a function to install the Python dependencies rather than installing them automatically, and in how to test them using GitHub Actions.

@mpadge
Member

mpadge commented Jan 20, 2025

Hi @bnicenboim, Mark here from the rOpenSci team. Please ignore those check failures; they occur because we don't have the ability to install package-specific Python dependencies in our check system. Your link to the successful GitHub workflow is fine. We also no longer have an editor assigned to this submission, since Karthik has stepped back from rOpenSci, so I'll make sure a new one is assigned, and you should hear from the newly assigned editor asap. Thanks for returning to your submission - we always appreciate things getting moving again!

@bnicenboim
Author

Hi everyone, after a long time, I have an update! I made a lot of changes (see the NEWS file): renamed most functions, added more arguments, simplified the installation, and added a troubleshooting vignette and a worked-out example with Chinese sentences and surprisal values.

Thanks again for the reviews! The long delay turned out to be beneficial, as looking at it with fresh eyes helped me improve the documentation.

I address the reviewers' comments below.


@utkuturk's Review

Clarity issue: Python dependencies

@utkuturk mentioned difficulties with installing the Python dependencies.

Response: I updated the installation process to follow the latest reticulate recommendations. There's now an install_py_pangoling() function to handle installation, along with a troubleshooting vignette.

Clarity issue: Downloading models

@utkuturk suggested making it clearer that the package downloads models locally and does not operate through an API.

Response: This is now explicitly stated in the startup message:

pangoling version 0.0.0.9010  
An introduction to the package can be found at https://bruno.nicenboim.me/pangoling/articles/  
Notice that pretrained models and tokenizers are downloaded from https://huggingface.co/ the first time they are used.  
To change the cache folder, use:  
set_cache_folder(my_new_path)  

Advanced Configuration: NULL Arguments

@utkuturk noted that the documentation lacks information on how NULL arguments work in exported functions.

Response: I think this is out of scope. The package is fully functional without users needing to configure models. However, I’ve ensured that advanced users can still modify configurations if needed.


@lisalevinson's Review

Workflow: Extracting probabilities for all words

@lisalevinson typically gathers probabilities for all words in a sentence rather than just one target word and found the argument structure cumbersome, especially for masked models.

Response: The challenge with a position-based approach (e.g., target = 8) is defining what "8" refers to—words (what counts as a word?) or tokens? Instead, here’s how it's done now:

(preds <- causal_tokens_pred_lst("The apple doesn't fall far from the tree."))
[[1]]
          The        Ġapple        Ġdoesn            't         Ġfall          Ġfar 
           NA -1.090052e+01 -5.499081e+00 -8.279233e-04 -3.597742e+00 -2.911830e+00 
        Ġfrom          Ġthe         Ġtree             . 
-7.454723e-01 -2.066519e-01 -2.808024e-01 -1.300929e+00 
# a list is returned because several sentences can be processed at once
preds[[1]][9]  # log-probability of "tree"

For masked models, one needs to explicitly split the sentence at the target position and supply the candidate targets:

masked_targets_pred(
  prev_contexts = c("The", "The"),
  targets = c("apple", "pear"),
  after_contexts = c("doesn't fall far from the tree.", "doesn't fall far from the tree."),
  model = "bert-base-uncased"
)

Reporting and documentation of probabilities

@lisalevinson noted that the log probabilities are in natural log but couldn't find this documented. Since base-2 logs (bits) are common in psycholinguistics, it would be useful to specify the log base.

Response: This was a great suggestion! I added a log.p argument:

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) 
is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying 
the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.
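As a sanity check on the log-base conversion (plain base R, independent of pangoling): a base of 1/2 flips the sign of the base-2 logarithm, which is exactly surprisal in bits.

```r
p <- 0.25                 # some word probability
log(p)                    # natural-log predictability (the default)
log(p, base = 2)          # base-2 log-probability: -2
log(p, base = 1/2)        # surprisal in bits: 2
stopifnot(all.equal(log(p, base = 1/2), -log2(p)))
```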

Output formats

@lisalevinson found the function outputs inconsistent and suggested a vignette on reshaping the data for analysis.

Response: I standardized the outputs as much as possible and added a vignette on using pangoling in psycholinguistic research.

Non-English usage and examples

@lisalevinson suggested including examples for non-English languages, particularly those without spaces (e.g., Chinese).

Response: I added a sep argument to handle languages without spaces:

A string specifying how words are separated within contexts or groups. Default is " ". 
For languages that don't have spaces between words (e.g., Chinese), set sep = "".
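For instance, a call for a pre-segmented Chinese sentence might look like the sketch below (not run here; the model id and the word segmentation are illustrative assumptions, not taken from the package docs):

```r
library(pangoling)
# sep = "" glues the words back together with no spaces
# before the model tokenizes the text.
causal_words_pred(
  x     = c("这", "是", "一个", "例子"),       # "This is an example."
  model = "uer/gpt2-chinese-cluecorpussmall", # assumed model id
  sep   = ""
)
```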

The vignette also uses an example in Chinese, and works with surprisal rather than predictability ;)

Multi-token words are already handled correctly—the package sums log probabilities across tokens.
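The summing is just the log of a product: if a word is split into tokens with probabilities p1 and p2, then log(p1 * p2) = log(p1) + log(p2). A minimal base-R check with made-up token probabilities:

```r
# Two made-up token log-probabilities, e.g., for "doesn" and "'t".
lp_tokens <- c(log(0.5), log(0.1))
lp_word   <- sum(lp_tokens)              # log-probability of the whole word
stopifnot(all.equal(lp_word, log(0.05))) # log(0.5 * 0.1)
```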


Specific Function Comments

causal_lp()

@lisalevinson found it confusing that x could be a full sentence or a set of targets.

Response: This is now split into two functions.

batch_size Documentation

@lisalevinson found the explanation unclear.

Response: I clarified the documentation:

"batch_size: Maximum batch size. Larger batches speed up processing but require more memory."

causal_lp_mats() (now causal_pred_mats())

@lisalevinson suggested clarifying why it returns a matrix instead of a dataframe.

Response: It returns a matrix due to memory constraints and because all values are numeric. The documentation now explicitly states:

The function splits the input x into groups specified by the by argument and processes each group independently. For each group, the model computes the predictability of each token in its vocabulary based on preceding context.

Each matrix contains:
- Rows representing the model's vocabulary.
- Columns corresponding to tokens in the group (e.g., a sentence or paragraph).
- By default, values in the matrices are the natural logarithm of word probabilities.
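If a data frame is nevertheless wanted, one matrix from the returned list can be reshaped in a few lines; the sketch below is illustrative (the object name mats and the presence of dimnames are assumptions, not verified against the package):

```r
mat  <- mats[[1]]  # one group's vocabulary-by-token matrix
long <- data.frame(
  vocab    = rep(rownames(mat), times = ncol(mat)), # model vocabulary
  token    = rep(colnames(mat), each  = nrow(mat)), # tokens in the group
  log_prob = as.vector(mat)                         # column-major values
)
```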

She also suggested including examples for multiple sentences and how to wrangle the output.

Response: I added examples for processing multiple sentences.

Token vs. Word Confusion

@lisalevinson pointed out that column names in some functions refer to "words," but they are actually tokens (e.g., "doesn" and "’t" are separate).

Response: I corrected the documentation to consistently refer to "tokens" instead of "words" where applicable. The function descriptions now clarify when they operate on tokenized text rather than full words.
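To inspect how a word is split, the tokenizer can be queried directly; a sketch assuming pangoling's tokenize_lst() helper with the "gpt2" model:

```r
library(pangoling)
tokenize_lst("The apple doesn't fall far from the tree.", model = "gpt2")
# "doesn" and "'t" appear as separate tokens
```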

causal_tokens_lp_tbl() (now causal_tokens_pred_lst())

@lisalevinson found the argument name .id unintuitive.

Response: .id has been removed—it’s no longer needed.

She also found the distinction between texts (full sentences) and x (target words) inconsistent across functions.

Response: causal_tokens_pred_lst() still takes texts because it processes full sentences or paragraphs and returns token probabilities. In contrast, causal_words_pred() takes x, which represents smaller linguistic units such as words or phrases. This distinction is now explicitly explained in the documentation.

masked_lp()

@lisalevinson preferred a different argument structure.

Response: This is addressed above, in the masked_targets_pred() example.

masked_tokens_tbl()

@lisalevinson suggested a clearer argument structure and better differentiation from causal_tokens_lp_tbl().

Response: The difference comes from how masked models work: masked predictions return probabilities for a single masked token position, whereas causal models generate predictions for every possible next token given the preceding context. I updated the documentation to clarify this distinction.

For masked models, this follows the standard Hugging Face approach:

masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.", model = "bert-base-uncased")

This returns all possible completions with their log probabilities.

Test Failures:

Response: All tests now pass.


This should now be clearer and more consistent. I tried to summarize the points in order to answer them; if something is missing (or I misunderstood it), please let me know.

@emilyriederer

Thank you so much for the detailed notes and updates, @bnicenboim !

@lisalevinson, @utkuturk - I know it has been a while since you completed your reviews. Could you please review @bnicenboim's update comments above and assess whether they address the themes you raised? Ideally, please acknowledge using this template.

@emilyriederer

@ropensci-review-bot assign @emilyriederer as editor

@ropensci-review-bot

Assigned! @emilyriederer is now the editor

@emilyriederer

@ropensci-review-bot submit review time 10

@ropensci-review-bot

Logged review for utkuturk (hours: 10)

@emilyriederer

@ropensci-review-bot submit review #575 (comment) time 11

@ropensci-review-bot

Logged review for lisalevinson (hours: 11)
