
submission: pangoling: Access to word predictability using large language (transformer) models #575

Closed
13 of 29 tasks
bnicenboim opened this issue Feb 21, 2023 · 67 comments

@bnicenboim

bnicenboim commented Feb 21, 2023

Date accepted: 2025-02-28

Submitting Author Name: Bruno Nicenboim
Submitting Author Github Handle: @bnicenboim
Repository: https://github.com/bnicenboim/pangoling
Version submitted: 0.0.0.9005
Submission type: Standard
Editor: @emilyriederer
Reviewers: @lisalevinson, @utkuturk

Archive: TBD
Version accepted: TBD
Language: en

  • Paste the full DESCRIPTION file inside a code block below:
Package: pangoling
Type: Package
Title: Access to Large Language Model Predictions
Version: 0.0.0.9005
Authors@R: c(
    person("Bruno", "Nicenboim",
    email = "[email protected]",
    role = c( "aut","cre"),
    comment = c(ORCID = "0000-0002-5176-3943")),
    person("Chris", "Emmerly", role = "ctb"),
    person("Giovanni", "Cassani", role = "ctb"))
Description: Access to word predictability using large language (transformer) models.
URL: https://bruno.nicenboim.me/pangoling, https://github.com/bnicenboim/pangoling
BugReports: https://github.com/bnicenboim/pangoling/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: false
Config/reticulate:
  list(
    packages = list(
      list(package = "torch"),
      list(package = "transformers")
    )
  )
Imports: 
    data.table,
    memoise,
    reticulate,
    tidyselect,
    tidytable (>= 0.7.2),
    utils,
    cachem
Suggests: 
    rmarkdown,
    knitr,
    testthat (>= 3.0.0),
    tictoc,
    covr,
    spelling
Config/testthat/edition: 3
RoxygenNote: 7.2.3
Roxygen: list(markdown = TRUE)
Depends: 
    R (>= 4.1.0)
VignetteBuilder: knitr
StagedInstall: yes
Language: en-US

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package is built on top of the Python package transformers, and it offers some basic functionality for text analysis, including tokenization and perplexity calculation. Crucially, pangoling also offers word predictability, which is widely used as a predictor in psycho- and neurolinguistics and is not trivial to obtain. Moreover, transformers works with "tokens" rather than "words", so pangoling takes care of the mapping between the tokens and the target words (or even phrases).

  • Who is the target audience and what are scientific applications of this package?

This is mostly for psycho- and neurolinguists who use word predictability as a predictor in their research, such as in ERP/EEG and reading studies.

Another R package that acts as a wrapper for transformers is text. However, text is more general, and its focus is on Natural Language Processing and Machine Learning. pangoling is much more specific: its focus is on measures used as predictors in analyses of data from experiments, rather than on NLP. text doesn't allow for generating pangoling's output in a straightforward way; in fact, I'm not sure it's even possible to get token probabilities from text, since it seems more limited than the Python package transformers.

NA

  • If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

#573

  • Explain reasons for any pkgcheck items which your package is unable to pass.

pkgcheck fails only because of the use of <<-, but this is done in .onLoad as recommended by reticulate. Also see this issue.
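For reference, the pattern recommended in reticulate's packaging vignette looks roughly like this (a sketch; pangoling's actual .onLoad may differ):

# R/zzz.R
transformers <- NULL

.onLoad <- function(libname, pkgname) {
  # delay_load = TRUE lets the package load even when Python is absent;
  # the `<<-` flagged by pkgcheck updates the module binding above
  transformers <<- reticulate::import("transformers", delay_load = TRUE)
}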

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?

  • Do you intend for this package to go on Bioconductor?

  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

@ropensci-review-bot
Collaborator

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

@ropensci-review-bot
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Collaborator

Checks for pangoling (v0.0.0.9005)

git hash: 543c11bd

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage is 0.9% (should be at least 75%).
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

| type | package | ncalls |
|---|---|---|
| internal | base | 97 |
| internal | pangoling | 41 |
| internal | graphics | 2 |
| internal | stats | 1 |
| imports | tidytable | 20 |
| imports | reticulate | 5 |
| imports | memoise | 3 |
| imports | cachem | 2 |
| imports | data.table | 1 |
| imports | tidyselect | 1 |
| imports | utils | NA |
| suggests | rmarkdown | NA |
| suggests | knitr | NA |
| suggests | testthat | NA |
| suggests | tictoc | NA |
| suggests | covr | NA |
| suggests | spelling | NA |
| linking_to | NA | NA |

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

lapply (19), length (7), c (6), dim (6), paste0 (6), t (6), unlist (4), by (3), list (3), names (3), seq_len (3), which (3), do.call (2), getOption (2), matrix (2), ncol (2), rep (2), seq_along (2), sum (2), unname (2), as.list (1), floor (1), for (1), grepl (1), lengths (1), mode (1), new.env (1), options (1), rownames (1), split (1), switch (1), vector (1)

pangoling

create_tensor_lst (5), lst_to_kwargs (5), char_to_token (4), encode (4), get_id (4), get_vocab (4), get_word_by_word_texts (2), masked_lp_mat (2), causal_config (1), causal_lp (1), causal_lp_mats (1), causal_mat (1), causal_next_tokens_tbl (1), causal_preload (1), causal_tokens_lp_tbl (1), chr_detect (1), masked_config (1), num_to_token (1), word_lp (1)

tidytable

map_chr. (4), map2 (3), map. (2), pmap. (2), arrange. (1), map (1), map_dbl. (1), map_dfr (1), map_dfr. (1), map2_dbl. (1), pmap_chr (1), relocate (1), tidytable (1)

reticulate

py_to_r (5)

memoise

memoise (3)

cachem

cache_mem (2)

graphics

text (2)

data.table

chmatch (1)

stats

lm (1)

tidyselect

everything (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 7 files) and
  • 1 authors
  • 2 vignettes
  • no internal data file
  • 7 imported packages
  • 14 exported functions (median 9 lines of code)
  • 57 non-exported functions in R (median 9 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

| measure | value | percentile | noteworthy |
|---|---|---|---|
| files_R | 7 | 45.7 | |
| files_vignettes | 2 | 85.7 | |
| files_tests | 7 | 86.4 | |
| loc_R | 733 | 59.3 | |
| loc_vignettes | 154 | 39.9 | |
| loc_tests | 293 | 63.6 | |
| num_vignettes | 2 | 89.2 | |
| n_fns_r | 71 | 67.2 | |
| n_fns_r_exported | 14 | 56.3 | |
| n_fns_r_not_exported | 57 | 71.1 | |
| n_fns_per_file_r | 6 | 75.6 | |
| num_params_per_fn | 4 | 54.6 | |
| loc_per_fn_r | 9 | 24.3 | |
| loc_per_fn_r_exp | 10 | 21.6 | |
| loc_per_fn_r_not_exp | 9 | 27.1 | |
| rel_whitespace_R | 10 | 45.0 | |
| rel_whitespace_vignettes | 42 | 47.5 | |
| rel_whitespace_tests | 12 | 48.3 | |
| doclines_per_fn_exp | 44 | 55.2 | |
| doclines_per_fn_not_exp | 0 | 0.0 | TRUE |
| fn_call_network_size | 72 | 73.5 | |

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

R-CMD-check.yaml

GitHub Workflow Results

| id | name | conclusion | sha | run_number | date |
|---|---|---|---|---|---|
| 4232217801 | pages build and deployment | success | c6ce46 | 16 | 2023-02-21 |
| 4232181659 | pkgdown | success | 543c11 | 44 | 2023-02-21 |
| 4232181660 | R-CMD-check | success | 543c11 | 37 | 2023-02-21 |
| 4232181654 | test-coverage | success | 543c11 | 39 | 2023-02-21 |

3b. goodpractice results

R CMD check with rcmdcheck

rcmdcheck found no errors, warnings, or notes

Test coverage with covr

Package coverage: 0.89

The following files are not completely covered by tests:

| file | coverage |
|---|---|
| R/tr_causal.R | 0% |
| R/tr_masked.R | 0% |
| R/tr_utils.R | 0% |
| R/utils.R | 0% |
| R/zzz.R | 0% |

Cyclocomplexity with cyclocomp

The following function has cyclocomplexity >= 15:

| function | cyclocomplexity |
|---|---|
| word_lp | 16 |

Static code analyses with lintr

lintr found the following 39 potential issues:

| message | number of times |
|---|---|
| Avoid library() and require() calls in packages | 5 |
| Lines should not be more than 80 characters. | 34 |


Package Versions

| package | version |
|---|---|
| pkgstats | 0.1.3 |
| pkgcheck | 0.1.1.11 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@bnicenboim
Author

Hi,
" Package coverage is 0.9% (should be at least 75%)." -> this seems to be an error.
See that the coverage is 94%.
Maybe it is because the tests skip almost everything, if there is no python installed? (As recommended by reticulate).

Also when I run pkgcheck::pkgcheck() I get the following:

✔ Package coverage is 94.8%.

@maurolepore
Member

maurolepore commented Feb 22, 2023

Thanks @bnicenboim for your full submission and for explaining the issue with test coverage. Your explanation makes sense, so we can move forward. I'll start searching for a handling editor.

In the meantime you may want to start thinking of potential reviewers to suggest to the handling editor.

@bnicenboim
Author

Is there a pool of potential reviewers that I can have access to?

@maurolepore
Member

maurolepore commented Feb 26, 2023

I guess authors are mostly guided by their knowledge of their intended audience. But for inspiration, see how editors look for reviewers. Editors have access to a private Airtable database, but often we look elsewhere.

@maurolepore maurolepore pinned this issue Feb 26, 2023
@maurolepore maurolepore unpinned this issue Feb 26, 2023
@maurolepore
Member

maurolepore commented Mar 7, 2023

Dear @bnicenboim, I'm sorry for the extraordinary delay in finding a handling editor. Most editors are busy, some are handling more than one package, and the very few who are available are not yet due to handle another submission. Please hold on a bit longer.

@bnicenboim
Author

ok, thanks for letting me know, no problem.

@karthik
Member

karthik commented Mar 10, 2023

@ropensci-review-bot assign @karthik as editor

@ropensci-review-bot
Collaborator

Assigned! @karthik is now the editor

@karthik
Member

karthik commented Mar 10, 2023

👋 @bnicenboim
I'll follow up with next steps in a few days. In the meantime I'll start pinging a few reviewers.

@bnicenboim
Author

Hi, any news about the next steps?

@karthik
Member

karthik commented Mar 30, 2023

Hi @bnicenboim
Apologies for the delay, I've been out sick this past week. I'll update this thread in the next few days.

@karthik
Member

karthik commented May 2, 2023

Editor checks:

  • Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • Is the case for the package well made?
    • Is the reference index page clear (grouped by topic if necessary)?
    • Are vignettes readable, sufficiently detailed and not just perfunctory?
  • Fit: The package meets criteria for fit and overlap.
  • Installation instructions: Are installation instructions clear enough for human users?
  • Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
  • Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • License: The package has a CRAN or OSI accepted license.
  • Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments

No additional comments at this time. I'm looking for reviewers at the moment, but if you've got any suggestions for people with expertise but no conflict, please suggest names.

@karthik
Member

karthik commented May 2, 2023

@ropensci-review-bot seeking reviewers

@ropensci-review-bot
Collaborator

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/575_status.svg)](https://github.com/ropensci/software-review/issues/575)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

@bnicenboim
Author

I really don't know about reviewers. I guess someone involved in the packages listed here:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Or maybe someone based on the reverse imports of reticulate:
https://cloud.r-project.org/web/packages/reticulate/index.html

@karthik
Member

karthik commented May 3, 2023

@ropensci-review-bot assign @lisalevinson as reviewer

@ropensci-review-bot
Collaborator

@lisalevinson added to the reviewers list. Review due date is 2023-05-24. Thanks @lisalevinson for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot
Collaborator

@lisalevinson: If you haven't done so, please fill this form for us to update our reviewers records.

@bnicenboim
Author

Hi, I'm not sure what to do about the errors. I don't get the warnings or notes here:
https://github.com/bnicenboim/pangoling/actions/runs/12864039448/job/35861748929

It might be that rcmdcheck doesn't install the Python dependencies needed? But I'm following the latest advice from reticulate (https://rstudio.github.io/reticulate/articles/package.html), both in creating a function to install the Python dependencies rather than doing it automatically and in how to test them using GitHub Actions.

@mpadge
Member

mpadge commented Jan 20, 2025

Hi @bnicenboim, Mark here from the rOpenSci team. Please ignore those check failures; they are because we don't have the ability to install package-specific Python dependencies in our check system. Your link to the successful GitHub workflow is fine. We also no longer have an editor assigned to this submission, since Karthik has stepped back from rOpenSci, so I'll make sure a new editor is assigned, and you should hear from them asap. Thanks for returning to your submission; we always appreciate things getting moving again!

@bnicenboim
Author

Hi everyone, after a long time, I have an update! I made a lot of changes (see the NEWS file): renamed most functions, added more arguments, simplified the installation, and added a troubleshooting vignette and a worked-out example with Chinese sentences and surprisal values.

Thanks again for the reviews! The long delay turned out to be beneficial, as looking at it with fresh eyes helped me improve the documentation.

I address the reviewers' comments below.


@utkuturk's Review

Clarity issue: Python dependencies

@utkuturk mentioned difficulties with installing the Python dependencies.

Response: I updated the installation process to follow the latest reticulate recommendations. There's now an install_py_pangoling() function to handle installation, along with a troubleshooting vignette.
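For example, a one-time setup would look like this (default arguments assumed):

# Installs the Python dependencies (torch, transformers) via reticulate
pangoling::install_py_pangoling()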

Clarity issue: Downloading models

@utkuturk suggested making it clearer that the package downloads models locally and does not operate through an API.

Response: This is now explicitly stated in the startup message:

pangoling version 0.0.0.9010  
An introduction to the package can be found at https://bruno.nicenboim.me/pangoling/articles/  
Notice that pretrained models and tokenizers are downloaded from https://huggingface.co/ the first time they are used.  
To change the cache folder, use:  
set_cache_folder(my_new_path)  

Advanced Configuration: NULL Arguments

@utkuturk noted that the documentation lacks information on how NULL arguments work in exported functions.

Response: I think this is out of scope. The package is fully functional without users needing to configure models. However, I’ve ensured that advanced users can still modify configurations if needed.


@lisalevinson's Review

Workflow: Extracting probabilities for all words

@lisalevinson typically gathers probabilities for all words in a sentence rather than just one target word and found the argument structure cumbersome, especially for masked models.

Response: The challenge with a position-based approach (e.g., target = 8) is defining what "8" refers to: words (and what counts as a word?) or tokens? Instead, here's how it's done now:

(preds <- causal_tokens_pred_lst("The apple doesn't fall far from the tree."))
[[1]]
          The        Ġapple        Ġdoesn            't         Ġfall          Ġfar 
           NA -1.090052e+01 -5.499081e+00 -8.279233e-04 -3.597742e+00 -2.911830e+00 
        Ġfrom          Ġthe         Ġtree             . 
-7.454723e-01 -2.066519e-01 -2.808024e-01 -1.300929e+00 
# This returns a list, because several texts can be passed at once
preds[[1]][9]  # log probability of "tree"

For masked models, one needs to explicitly cut the sentence where the targets are and propose the possible targets:

masked_targets_pred(
  prev_contexts = c("The", "The"),
  targets = c("apple", "pear"),
  after_contexts = c("doesn't fall far from the tree.", "doesn't fall far from the tree."),
  model = "bert-base-uncased"
)

Reporting and documentation of probabilities

@lisalevinson noted that the log probabilities are in natural log but couldn't find this documented. Since base-2 logs (bits) are common in psycholinguistics, it would be useful to specify the log base.

Response: This was a great suggestion! I added a log.p argument:

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) 
is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying 
the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.
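For instance, a minimal sketch using the function shown above (assuming log.p is accepted there as described; output omitted):

# Default: natural-log predictability
causal_tokens_pred_lst("The apple doesn't fall far from the tree.")

# Surprisal in bits: log base 1/2, since log_{1/2}(p) = -log2(p)
causal_tokens_pred_lst(
  "The apple doesn't fall far from the tree.",
  log.p = 1/2
)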

Output formats

@lisalevinson found the function outputs inconsistent and suggested a vignette on reshaping the data for analysis.

Response: I standardized the outputs as much as possible and added a vignette on using pangoling in psycholinguistic research.

Non-English usage and examples

@lisalevinson suggested including examples for non-English languages, particularly those without spaces (e.g., Chinese).

Response: I added a sep argument to handle languages without spaces:

A string specifying how words are separated within contexts or groups. Default is " ". 
For languages that don't have spaces between words (e.g., Chinese), set sep = "".

Also, the vignette now uses an example in Chinese, with surprisal rather than predictability ;)

Multi-token words are already handled correctly—the package sums log probabilities across tokens.
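A sketch of what this looks like (hypothetical Mandarin words; causal_words_pred() and its sep argument as described in this thread, with a suitable Chinese model passed via the model argument, not shown):

causal_words_pred(
  x   = c("树上", "的", "苹果", "熟", "了"),  # hypothetical example words
  sep = ""  # concatenate words without spaces, as in Chinese
)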


Specific Function Comments

causal_lp()

@lisalevinson found it confusing that x could be a full sentence or a set of targets.

Response: This is now split into two functions, causal_targets_pred() and causal_words_pred().

batch_size documentation

@lisalevinson found the explanation unclear.

Response: I clarified the documentation:

"batch_size: Maximum batch size. Larger batches speed up processing but require more memory."

causal_lp_mats() (now causal_pred_mats())

@lisalevinson suggested clarifying why it returns a matrix instead of a dataframe.

Response: It returns a matrix due to memory constraints and because all values are numeric. The documentation now explicitly states:

The function splits the input x into groups specified by the by argument and processes each group independently. For each group, the model computes the predictability of each token in its vocabulary based on preceding context.

Each matrix contains:
- Rows representing the model's vocabulary.
- Columns corresponding to tokens in the group (e.g., a sentence or paragraph).
- By default, values in the matrices are the natural logarithm of word probabilities.

She also suggested including examples for multiple sentences and how to wrangle the output.

Response: I added examples for processing multiple sentences.
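A sketch for two sentences, based on the x/by description above (argument usage assumed; output omitted):

sentences <- list(
  c("The", "apple", "fell."),
  c("The", "pear", "fell.")
)
mats <- causal_pred_mats(
  x  = unlist(sentences),
  by = rep(1:2, times = lengths(sentences))  # one group per sentence
)
length(mats)    # one matrix per group
dim(mats[[1]])  # rows: model vocabulary; columns: tokens in the group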

Token vs. word confusion

@lisalevinson pointed out that column names in some functions refer to "words," but they are actually tokens (e.g., "doesn" and "’t" are separate).

Response: I corrected the documentation to consistently refer to "tokens" instead of "words" where applicable. The function descriptions now clarify when they operate on tokenized text rather than full words.

causal_tokens_lp_tbl() (now causal_tokens_pred_lst())

@lisalevinson found the argument name .id unintuitive.

Response: .id has been removed—it’s no longer needed.

She also found the distinction between texts (full sentences) and x (target words) inconsistent across functions.

Response: causal_tokens_pred_lst() still takes texts because it processes full sentences or paragraphs and returns token probabilities. In contrast, causal_words_pred() takes x, which represents smaller linguistic units like words or phrases. This distinction is now explicitly explained in the documentation.

masked_lp()

@lisalevinson preferred a different argument structure.

Response: See my answer above.

masked_tokens_tbl()

@lisalevinson suggested a clearer argument structure and better differentiation from causal_tokens_lp_tbl().

Response: The difference comes from how masked models work: masked predictions return probabilities for a single masked token position, whereas causal models generate predictions for all possible next tokens based on the context. I updated the documentation to clarify this distinction.

For masked models, this follows the standard Hugging Face approach:

masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.", model = "bert-base-uncased")

This returns all possible completions with their log probabilities.
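For contrast, the analogous causal-model call predicts the next token from the left context alone (a sketch; the function name follows the renaming scheme used in this thread):

causal_next_tokens_pred_tbl("The apple doesn't fall far from the")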

Test Failures:

Response: All tests pass.


This should now be clearer and more consistent. I tried to summarize the points in order to answer them; if something is missing (or I misunderstood it), please let me know.

@emilyriederer

Thank you so much for the detailed notes and updates, @bnicenboim !

@lisalevinson, @utkuturk - I know it has been a bit since you completed your reviews. Could you please review @bnicenboim's update comments above and assess whether you feel they address the themes you raised? Ideally, please acknowledge using this template.

@emilyriederer

@ropensci-review-bot assign @emilyriederer as editor

@ropensci-review-bot
Collaborator

Assigned! @emilyriederer is now the editor

@emilyriederer

@ropensci-review-bot submit review time 10

@ropensci-review-bot
Collaborator

Logged review for utkuturk (hours: 10)

@emilyriederer

@ropensci-review-bot submit review #575 (comment) time 11

@ropensci-review-bot
Collaborator

Logged review for lisalevinson (hours: 11)

@utkuturk

Reviewer Response

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 11

@maurolepore
Member

Dear @bnicenboim, this is to mark the start of my EiC rotation. I'm reviewing all open issues and noting what I see.

I see @bnicenboim addressed reviewers' requests, @emilyriederer asked reviewers for approvals and @utkuturk already approved. We're waiting for @lisalevinson to approve or request further changes.

@ropensci-review-bot
Collaborator

@bnicenboim: please post your response with @ropensci-review-bot submit response <url to issue comment> if you haven't done so already (this is an automatic reminder).

Here's the author guide for response. https://devguide.ropensci.org/authors-guide.html

@lisalevinson

I followed up by email with Emily, but just to flag here as well: I have it on my to-do list to review this again, but I haven't yet found time to fit it in, since my schedule is usually booked out for several weeks.

@emilyriederer

Thank you for keeping this in mind, @lisalevinson !

@lisalevinson

Reviewer Response

Thanks for this new revised version - it is much improved! I approve of this revision, but I have a couple of minor comments/questions for consideration at some point.

For the batch size in causal_words_pred(), I don't understand from the documentation what a batch is: batches of the groups/sentences, or batches of tokens/words? In the Mandarin example, I switched the batch size from 1 to 10; the former had around 27-33 tokens (characters) per batch, which makes sense as one sentence per batch, but with the latter every batch reported 33 tokens. I don't follow what that means. Not sure if something is off with the output message or with the batches themselves.

Another concern is about the word/token order within groups/sentences. Is the word context/prefix based purely on row order? It doesn't seem like the wordn value is used at all. This seems fragile if rows ever get sorted or rearranged, and it might not be an obvious problem if one is not carefully inspecting the order of tokens in the outputs. I know it will usually be fine, but I'm always nervous when anything in data analysis crucially depends on row order.

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 13

@bnicenboim
Author

Hi, thanks a lot @utkuturk and @lisalevinson!!
Now some questions for @emilyriederer and @maurolepore:
What is the next step in following up on this? Would it be possible to publish something in JOSS based on the package? And what role does this review play in that?

@maurolepore
Member

maurolepore commented Feb 28, 2025

@emilyriederer I hope you don't mind ... I'll jump in while I'm doing an overall review of issues.

@bnicenboim, I see both reviewers approved, but @lisalevinson has some thoughtful comments. I'll go ahead and call the bot to approve the package, but please do acknowledge @lisalevinson's comments and share your thoughts.

RE JOSS, our guidelines suggest your next step is to reach out to them, letting them know of our approval:

If you feel your package is in scope for the Journal of Open-Source Software (JOSS), do not submit it for JOSS consideration until after the rOpenSci review process is over: if your package is deemed in scope by JOSS editors, only the accompanying short paper would be reviewed (not the software, which will have been extensively reviewed by rOpenSci by that time). Not all rOpenSci packages will meet the criteria for JOSS.
-- https://devdevguide.netlify.app/softwarereview_author.html#preparing-for-submission

Also the bot will reply with a number of tasks. Please complete them ASAP and do let us know if you need help.

@maurolepore
Member

@ropensci-review-bot approve pangoling

@ropensci-review-bot
Collaborator

Approved! Thanks @bnicenboim for submitting and @lisalevinson, @utkuturk for your reviews! 😁

To-dos:

  • Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You will need to enable two-factor authentication for your GitHub account.
    This invitation will expire after one week. If that happens, write a comment @ropensci-review-bot invite me to ropensci/<package-name>, which will re-send an invitation.
  • After transfer write a comment @ropensci-review-bot finalize transfer of <package-name> where <package-name> is the repo/package name. This will give you admin access back.
  • Fix all links to the GitHub repo to point to the repo under the ropensci organization.
  • Delete your current code of conduct file if you had one since rOpenSci's default one will apply, see https://devguide.ropensci.org/collaboration.html#coc-file
  • If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
    • deactivate the automatic deployment you might have set up
    • remove styling tweaks from your pkgdown config but keep that config file
    • replace the whole current pkgdown website with a redirecting page
    • replace your package docs URL with https://docs.ropensci.org/package_name
    • In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar, https://github.com/ropensci/foobar
  • Skim the docs of the pkgdown automatic deployment, in particular if your website needs MathJax.
  • Fix any links in badges for CI and coverage to point to the new repository URL.
  • Increment the package version to reflect the changes you made during review. In NEWS.md, add a heading for the new version and one bullet for each user-facing change, and each developer-facing change that you think is relevant.
  • We're starting to roll out software metadata files to all rOpenSci packages via the Codemeta initiative, see https://docs.ropensci.org/codemetar/ for how to include it in your package; after installing the package, it should be as easy as running codemetar::write_codemeta() in the root of your package.
  • You can add this installation method to your package README install.packages("<package-name>", repos = "https://ropensci.r-universe.dev") thanks to R-universe.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.

We maintain an online book with our best practices and tips; this chapter starts the third section, which is about guidance for after onboarding (with advice on releases, package marketing, and GitHub grooming); the guide also features CRAN gotchas. Please tell us what could be improved.

Last but not least, you can volunteer as a reviewer via filling a short form.
