
submission: pangoling: Access to word predictability using large language (transformer) models #575

Open
13 of 29 tasks
bnicenboim opened this issue Feb 21, 2023 · 57 comments


@bnicenboim

bnicenboim commented Feb 21, 2023

Submitting Author Name: Bruno Nicenboim
Submitting Author Github Handle: @bnicenboim
Repository: https://github.com/bnicenboim/pangoling
Version submitted: 0.0.0.9005
Submission type: Standard
Editor: @emilyriederer
Reviewers: @lisalevinson, @utkuturk

Archive: TBD
Version accepted: TBD
Language: en

  • Paste the full DESCRIPTION file inside a code block below:
Package: pangoling
Type: Package
Title: Access to Large Language Model Predictions
Version: 0.0.0.9005
Authors@R: c(
    person("Bruno", "Nicenboim",
    email = "[email protected]",
    role = c( "aut","cre"),
    comment = c(ORCID = "0000-0002-5176-3943")),
    person("Chris", "Emmerly", role = "ctb"),
    person("Giovanni", "Cassani", role = "ctb"))
Description: Access to word predictability using large language (transformer) models.
URL: https://bruno.nicenboim.me/pangoling, https://github.com/bnicenboim/pangoling
BugReports: https://github.com/bnicenboim/pangoling/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: false
Config/reticulate:
  list(
    packages = list(
      list(package = "torch"),
      list(package = "transformers")
    )
  )
Imports: 
    data.table,
    memoise,
    reticulate,
    tidyselect,
    tidytable (>= 0.7.2),
    utils,
    cachem
Suggests: 
    rmarkdown,
    knitr,
    testthat (>= 3.0.0),
    tictoc,
    covr,
    spelling
Config/testthat/edition: 3
RoxygenNote: 7.2.3
Roxygen: list(markdown = TRUE)
Depends: 
    R (>= 4.1.0)
VignetteBuilder: knitr
StagedInstall: yes
Language: en-US

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package is built on top of the Python package transformers, and it offers some basic functionality for text analysis, including tokenization and perplexity calculation. Crucially, pangoling also offers word predictability, which is widely used as a predictor in psycho- and neurolinguistics and is not trivial to obtain. Moreover, transformers works with "tokens" rather than "words", so pangoling takes care of the mapping from tokens to the target words (or even phrases).
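To illustrate the token-to-word mapping, here is a schematic sketch in base R with made-up log-probabilities (not pangoling's actual implementation): in GPT-2-style BPE vocabularies a leading "Ġ" marks a token that begins a new word, so token log-probabilities can be grouped and summed per word.

```r
# Hypothetical tokens and log-probabilities for "The apple doesn't fall"
tokens <- c("The", "Ġapple", "Ġdoesn", "'t", "Ġfall")
lp     <- c(NA, -10.90, -5.50, -0.0008, -3.60)  # made-up values

# A new word starts at the first token or at any token prefixed with "Ġ"
word_id <- cumsum(grepl("^Ġ", tokens) | seq_along(tokens) == 1)

# Word-level log-probability = sum of its tokens' log-probabilities
tapply(lp, word_id, sum)
```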

  • Who is the target audience and what are scientific applications of this package?

This is mostly for psycho- and neurolinguists who use word predictability as a predictor in their research, such as in ERP/EEG and reading studies.

Another R package that acts as a wrapper for transformers is text. However, text is more general, and its focus is on Natural Language Processing and Machine Learning. pangoling is much more specific, and its focus is on measures used as predictors in analyses of data from experiments, rather than NLP. text doesn't allow for generating pangoling's output in a straightforward way; in fact, I'm not sure it's even possible to get token probabilities from text, since it seems more limited than the Python package transformers.

NA

  • If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

#573

  • Explain reasons for any pkgcheck items which your package is unable to pass.

pkgcheck fails only because of the use of <<-, but this is done in .onLoad, as recommended by reticulate. Also see this issue.
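The pattern in question, from reticulate's "Using reticulate in an R Package" vignette, looks roughly like this (a minimal sketch; the module name is illustrative, not pangoling's actual code):

```r
# zzz.R
# Placeholder defined at the package's top level
transformers <- NULL

.onLoad <- function(libname, pkgname) {
  # delay_load = TRUE defers the import until first use, so the package
  # can be loaded even when no Python environment is available.
  # The <<- rebinds the top-level placeholder, which is what pkgcheck flags.
  transformers <<- reticulate::import("transformers", delay_load = TRUE)
}
```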

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?

  • Do you intend for this package to go on Bioconductor?

  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

@ropensci-review-bot
Collaborator

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

@ropensci-review-bot
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Collaborator

Checks for pangoling (v0.0.0.9005)

git hash: 543c11bd

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage is 0.9% (should be at least 75%).
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 97
internal pangoling 41
internal graphics 2
internal stats 1
imports tidytable 20
imports reticulate 5
imports memoise 3
imports cachem 2
imports data.table 1
imports tidyselect 1
imports utils NA
suggests rmarkdown NA
suggests knitr NA
suggests testthat NA
suggests tictoc NA
suggests covr NA
suggests spelling NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

lapply (19), length (7), c (6), dim (6), paste0 (6), t (6), unlist (4), by (3), list (3), names (3), seq_len (3), which (3), do.call (2), getOption (2), matrix (2), ncol (2), rep (2), seq_along (2), sum (2), unname (2), as.list (1), floor (1), for (1), grepl (1), lengths (1), mode (1), new.env (1), options (1), rownames (1), split (1), switch (1), vector (1)

pangoling

create_tensor_lst (5), lst_to_kwargs (5), char_to_token (4), encode (4), get_id (4), get_vocab (4), get_word_by_word_texts (2), masked_lp_mat (2), causal_config (1), causal_lp (1), causal_lp_mats (1), causal_mat (1), causal_next_tokens_tbl (1), causal_preload (1), causal_tokens_lp_tbl (1), chr_detect (1), masked_config (1), num_to_token (1), word_lp (1)

tidytable

map_chr. (4), map2 (3), map. (2), pmap. (2), arrange. (1), map (1), map_dbl. (1), map_dfr (1), map_dfr. (1), map2_dbl. (1), pmap_chr (1), relocate (1), tidytable (1)

reticulate

py_to_r (5)

memoise

memoise (3)

cachem

cache_mem (2)

graphics

text (2)

data.table

chmatch (1)

stats

lm (1)

tidyselect

everything (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 7 files) and
  • 1 authors
  • 2 vignettes
  • no internal data file
  • 7 imported packages
  • 14 exported functions (median 9 lines of code)
  • 57 non-exported functions in R (median 9 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 7 45.7
files_vignettes 2 85.7
files_tests 7 86.4
loc_R 733 59.3
loc_vignettes 154 39.9
loc_tests 293 63.6
num_vignettes 2 89.2
n_fns_r 71 67.2
n_fns_r_exported 14 56.3
n_fns_r_not_exported 57 71.1
n_fns_per_file_r 6 75.6
num_params_per_fn 4 54.6
loc_per_fn_r 9 24.3
loc_per_fn_r_exp 10 21.6
loc_per_fn_r_not_exp 9 27.1
rel_whitespace_R 10 45.0
rel_whitespace_vignettes 42 47.5
rel_whitespace_tests 12 48.3
doclines_per_fn_exp 44 55.2
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 72 73.5

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

R-CMD-check.yaml

GitHub Workflow Results

id name conclusion sha run_number date
4232217801 pages build and deployment success c6ce46 16 2023-02-21
4232181659 pkgdown success 543c11 44 2023-02-21
4232181660 R-CMD-check success 543c11 37 2023-02-21
4232181654 test-coverage success 543c11 39 2023-02-21

3b. goodpractice results

R CMD check with rcmdcheck

rcmdcheck found no errors, warnings, or notes

Test coverage with covr

Package coverage: 0.89

The following files are not completely covered by tests:

file coverage
R/tr_causal.R 0%
R/tr_masked.R 0%
R/tr_utils.R 0%
R/utils.R 0%
R/zzz.R 0%

Cyclocomplexity with cyclocomp

The following function has cyclocomplexity >= 15:

function cyclocomplexity
word_lp 16

Static code analyses with lintr

lintr found the following 39 potential issues:

message number of times
Avoid library() and require() calls in packages 5
Lines should not be more than 80 characters. 34


Package Versions

package version
pkgstats 0.1.3
pkgcheck 0.1.1.11


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@bnicenboim
Author

Hi,
"Package coverage is 0.9% (should be at least 75%)." -> this seems to be an error; the actual coverage is 94%.
Maybe it's because the tests skip almost everything when no Python is installed (as recommended by reticulate)?

Also when I run pkgcheck::pkgcheck() I get the following:

✔ Package coverage is 94.8%.

@maurolepore
Member

maurolepore commented Feb 22, 2023

Thanks @bnicenboim for your full submission and for explaining the issue with test coverage. Your explanation makes sense, so we can move forward. I'll start searching for a handling editor.

In the meantime you may want to start thinking of potential reviewers to suggest to the handling editor.

@bnicenboim
Author

Is there a pool of potential reviewers that I can have access to?

@maurolepore
Member

maurolepore commented Feb 26, 2023

I guess authors are mostly guided by their knowledge of their intended audience. But for inspiration, see how editors look for reviewers. Editors have access to a private Airtable database, but often we look elsewhere.

@maurolepore maurolepore pinned this issue Feb 26, 2023
@maurolepore maurolepore unpinned this issue Feb 26, 2023
@maurolepore
Member

maurolepore commented Mar 7, 2023

Dear @bnicenboim, I'm sorry for the extraordinary delay in finding a handling editor. Most editors are busy, some handling more than one package, and the few who are available are not yet due to take on another submission. Please hold on a bit longer.

@bnicenboim
Author

ok, thanks for letting me know, no problem.

@karthik
Member

karthik commented Mar 10, 2023

@ropensci-review-bot assign @karthik as editor

@ropensci-review-bot
Collaborator

Assigned! @karthik is now the editor

@karthik
Member

karthik commented Mar 10, 2023

👋 @bnicenboim
I'll follow up with next steps in a few days. In the meantime I'll start pinging a few reviewers.

@bnicenboim
Author

Hi, any news about the next steps?

@karthik
Member

karthik commented Mar 30, 2023

Hi @bnicenboim
Apologies for the delay, I've been out sick this past week. I'll update this thread in the next few days.

@karthik
Member

karthik commented May 2, 2023

Editor checks:

  • Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • Is the case for the package well made?
    • Is the reference index page clear (grouped by topic if necessary)?
    • Are vignettes readable, sufficiently detailed and not just perfunctory?
  • Fit: The package meets criteria for fit and overlap.
  • Installation instructions: Are installation instructions clear enough for human users?
  • Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
  • Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • License: The package has a CRAN or OSI accepted license.
  • Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments

No additional comments at this time. I'm looking for reviewers at the moment, but if you've got any suggestions for people with expertise but no conflict, please suggest names.

@karthik
Member

karthik commented May 2, 2023

@ropensci-review-bot seeking reviewers

@ropensci-review-bot
Collaborator

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/575_status.svg)](https://github.com/ropensci/software-review/issues/575)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

@bnicenboim
Author

I really don't know about reviewers, I guess someone involved in the packages named here:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Or maybe someone based on the reverse imports of reticulate:
https://cloud.r-project.org/web/packages/reticulate/index.html

@karthik
Member

karthik commented May 3, 2023

@ropensci-review-bot assign @lisalevinson as reviewer

@ropensci-review-bot
Collaborator

@lisalevinson added to the reviewers list. Review due date is 2023-05-24. Thanks @lisalevinson for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot
Collaborator

@lisalevinson: If you haven't done so, please fill this form for us to update our reviewers records.

@ropensci-review-bot
Collaborator

Submission on hold!

@bnicenboim
Author

> Hi @bnicenboim, I'm checking in on issues that have been static for a while. I see this issue was moving along alright and then fell silent.
>
> Let us know if there is progress to report or any questions you may have. We can move to a "Hold" status, or keep it open as-is if you expect to continue to respond to the reviews.

Hi, I made a lot of progress (and unrelated changes) and I had some questions for @lisalevinson that were never answered, but regardless it's fine to have this on hold until my teaching is over and I can finalize the last things.

@lisalevinson

@bnicenboim So sorry - I never saw your questions at all. I was on medical leave when you posted them and had an away message on my email, but that wouldn't go back to you through GitHub notifications of course! And then it must have gotten lost in a sea of miscellaneous GitHub emails when I tried to sort through everything after.

It would take me some time to try it all out again and remember how everything works - honestly right now this will be hard for me to do before the end of May because I am co-organizing HSP and that is right after our current semester ends. Would that be an OK timeline for when you plan to pick it up again? I may be able to squeeze it in earlier but I don't want to make promises I'm not sure that I can keep!

@bnicenboim
Author

> @bnicenboim So sorry - I never saw your questions at all. I was on medical leave when you posted them and had an away message on my email, but that wouldn't go back to you through GitHub notifications of course! And then it must have gotten lost in a sea of miscellaneous GitHub emails when I tried to sort through everything after.
>
> It would take me some time to try it all out again and remember how everything works - honestly right now this will be hard for me to do before the end of May because I am co-organizing HSP and that is right after our current semester ends. Would that be an OK timeline for when you plan to pick it up again? I may be able to squeeze it in earlier but I don't want to make promises I'm not sure that I can keep!

I don't think I'll touch anything until maybe June, so no worries from my side.

@ropensci-review-bot
Collaborator

@ldecicco-USGS: Please review the holding status

@emilyriederer

Hi @bnicenboim - checking in with the rOpenSci editors team. Do you still plan on picking this back up?

@bnicenboim
Author

Yes, I'm picking it up. Sorry for the long delay. I also wanted to add some other features. I'll see how easy that is, or at least whether the function names I'll have are compatible with the other uses I'm thinking about.

Also, I don't think my answer to @lisalevinson is relevant anymore. She's completely right in that the two uses of causal_lp are confusing. I'll divide it into two functions.

@bnicenboim
Author

Phew, it took some time but I'm mostly done.

I'm checking that I still comply with all the requirements of rOpenSci. Then, I'll report the changes I made and how I addressed the reviewers' issues.

@bnicenboim
Author

@ropensci-review-bot check package

@ropensci-review-bot
Collaborator

Thanks, about to send the query.

@ropensci-review-bot
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Collaborator

Checks for pangoling (v0.0.0.9010)

git hash: 5931c547

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage failed
  • ✖️ R CMD check found 1 error.
  • ✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 171
internal pangoling 53
internal graphics 2
internal stats 1
imports tidytable 28
imports reticulate 14
imports memoise 3
imports cachem 2
imports data.table 1
imports tidyselect 1
imports utils NA
suggests rmarkdown NA
suggests knitr NA
suggests testthat NA
suggests tictoc NA
suggests covr NA
suggests spelling NA
suggests rstudioapi NA
suggests brms NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

lapply (26), names (12), length (11), list (11), c (10), dim (9), paste0 (8), by (7), unlist (6), sequence (5), setdiff (5), t (5), which (5), drop (4), split (4), rep (3), seq_along (3), seq_len (3), unname (3), all (2), as.list (2), do.call (2), matrix (2), sum (2), unsplit (2), version (2), for (1), getOption (1), grepl (1), if (1), ifelse (1), is.null (1), lengths (1), mode (1), ncol (1), new.env (1), nrow (1), options (1), rbind (1), rownames (1), switch (1), vapply (1), vector (1)

pangoling

create_tensor_lst (8), get_vocab (8), encode (6), conc_words (4), get_id (4), lst_to_kwargs (4), causal_config (1), causal_lp (1), causal_lp_mats (1), causal_mat (1), causal_pred_mats (1), causal_preload (1), causal_targets_pred (1), causal_tokens_lp_tbl (1), causal_words_pred (1), char_to_token (1), chr_detect (1), install_py_pangoling (1), is_mac (1), is_really_string (1), ln_p_change (1), masked_lp_mat (1), num_to_token (1), safe_decode (1), word_lp (1)

tidytable

map (6), map_chr (5), map_dfr (3), map2 (3), pmap (3), arrange (2), tidytable (2), map_dbl (1), map2_dbl (1), pmap_chr (1), relocate (1)

reticulate

py_to_r (9), virtualenv_starter (2), import (1), use_virtualenv (1), virtualenv_remove (1)

memoise

memoise (3)

cachem

cache_mem (2)

graphics

text (2)

data.table

chmatch (1)

stats

lm (1)

tidyselect

everything (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (94% in 9 files) and TeX (6% in 1 files)
  • 1 authors
  • no vignette
  • 2 internal data files
  • 7 imported packages
  • 23 exported functions (median 28 lines of code)
  • 82 non-exported functions in R (median 10 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 9 52.3
files_inst 1 96.7
files_vignettes 0 0.0 TRUE
files_tests 9 85.7
loc_R 1383 73.9
loc_inst 83 38.4
loc_tests 536 72.4
num_vignettes 0 0.0 TRUE
data_size_total 19169 73.4
data_size_median 9584 78.2
n_fns_r 105 76.1
n_fns_r_exported 23 70.8
n_fns_r_not_exported 82 77.8
n_fns_per_file_r 8 83.4
num_params_per_fn 7 84.1
loc_per_fn_r 11 32.9
loc_per_fn_r_exp 28 60.3
loc_per_fn_r_not_exp 10 35.6
rel_whitespace_R 13 66.2
rel_whitespace_inst 14 36.1
rel_whitespace_tests 16 66.2
doclines_per_fn_exp 78 83.4
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 141 83.4

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

R-CMD-check.yaml

GitHub Workflow Results

id name conclusion sha run_number date
12864365257 pages build and deployment success 3a7bec 24 2025-01-20
12864039437 pkgdown success 5931c5 84 2025-01-20
12864039448 R-CMD-check success 5931c5 77 2025-01-20
12864039440 test-coverage success 5931c5 79 2025-01-20

3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following error:

  1. checking examples ... ERROR
    Running examples in ‘pangoling-Ex.R’ failed
    The error most likely occurred in:

Name: causal_config

Title: Returns the configuration of a causal model

Aliases: causal_config

** Examples

causal_config(model = "gpt2")
Error in py_run_string_impl(code, local, convert) :
ModuleNotFoundError: No module named 'torch'
Run reticulate::py_last_error() for details.
Calls: causal_config ... eval -> -> -> py_run_string_impl
Execution halted

R CMD check generated the following note:

  1. checking data for non-ASCII characters ... NOTE
    Note: found 25872 marked UTF-8 strings

R CMD check generated the following check_fails:

  1. rcmdcheck_non_ascii_characters_in_data
  2. rcmdcheck_examples_run

Test coverage with covr

ERROR: Test Coverage Failed

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function cyclocomplexity
install_py_pangoling 17
word_lp 17

Static code analyses with lintr

lintr found no issues with this package!


Package Versions

package version
pkgstats 0.2.0.48
pkgcheck 0.1.2.77


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@bnicenboim
Author

Hi, I'm not sure what to do about the errors. I don't get the warnings or notes here:
https://github.com/bnicenboim/pangoling/actions/runs/12864039448/job/35861748929

It might be that rcmdcheck doesn't install the needed Python dependencies? But I'm following the latest advice from reticulate (https://rstudio.github.io/reticulate/articles/package.html), both in creating a function to install the Python dependencies rather than installing them automatically, and in how to test them using GitHub Actions.

@mpadge
Member

mpadge commented Jan 20, 2025

Hi @bnicenboim, Mark here from the rOpenSci team. Please ignore those check failures; they occur because we don't have the ability to install package-specific Python dependencies in our check system. Your link to the successful GitHub workflow is fine. We also no longer have an editor assigned to this submission, since Karthik has stepped back from rOpenSci, so I'll make sure a new one is assigned, and you should hear from the newly assigned editor asap. Thanks for returning to your submission - we always appreciate things getting moving again!

@bnicenboim
Author

Hi everyone, after a long time, I have an update! I made a lot of changes (see the NEWS file): renamed most functions, added more arguments, simplified the installation, and added a troubleshooting vignette and a worked-out example with Chinese sentences and surprisal values.

Thanks again for the reviews! The long delay turned out to be beneficial, as looking at it with fresh eyes helped me improve the documentation.

I address the reviewers' comments below.


@utkuturk's Review

Clarity issue: Python dependencies

@utkuturk mentioned difficulties with installing the Python dependencies.

Response: I updated the installation process to follow the latest reticulate recommendations. There's now an install_py_pangoling() function to handle installation, along with a troubleshooting vignette.

Clarity issue: Downloading models

@utkuturk suggested making it clearer that the package downloads models locally and does not operate through an API.

Response: This is now explicitly stated in the startup message:

pangoling version 0.0.0.9010  
An introduction to the package can be found at https://bruno.nicenboim.me/pangoling/articles/  
Notice that pretrained models and tokenizers are downloaded from https://huggingface.co/ the first time they are used.  
To change the cache folder, use:  
set_cache_folder(my_new_path)  

Advanced Configuration: NULL Arguments

@utkuturk noted that the documentation lacks information on how NULL arguments work in exported functions.

Response: I think this is out of scope. The package is fully functional without users needing to configure models. However, I’ve ensured that advanced users can still modify configurations if needed.


@lisalevinson's Review

Workflow: Extracting probabilities for all words

@lisalevinson typically gathers probabilities for all words in a sentence rather than just one target word and found the argument structure cumbersome, especially for masked models.

Response: The challenge with a position-based approach (e.g., target = 8) is defining what "8" refers to—words (what counts as a word?) or tokens? Instead, here’s how it's done now:

(preds <- causal_tokens_pred_lst("The apple doesn't fall far from the tree."))
[[1]]
          The        Ġapple        Ġdoesn            't         Ġfall          Ġfar 
           NA -1.090052e+01 -5.499081e+00 -8.279233e-04 -3.597742e+00 -2.911830e+00 
        Ġfrom          Ġthe         Ġtree             . 
-7.454723e-01 -2.066519e-01 -2.808024e-01 -1.300929e+00 
# a list is returned because several sentences can be processed at once
preds[[1]][9]  # log-probability of "tree"

For masked models, one needs to explicitly split the sentence at the target position and supply the candidate targets:

masked_targets_pred(
  prev_contexts = c("The", "The"),
  targets = c("apple", "pear"),
  after_contexts = c("doesn't fall far from the tree.", "doesn't fall far from the tree."),
  model = "bert-base-uncased"
)

Reporting and documentation of probabilities

@lisalevinson noted that the log probabilities are in natural log but couldn't find this documented. Since base-2 logs (bits) are common in psycholinguistics, it would be useful to specify the log base.

Response: This was a great suggestion! I added a log.p argument:

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) 
is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying 
the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.
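As a sanity check on the log-base conversion (plain base R, independent of pangoling): a base of 1/2 flips the sign of the base-2 logarithm, which is exactly surprisal in bits.

```r
p <- 0.25                 # some word probability
log(p)                    # natural-log predictability (the default)
log(p, base = 2)          # base-2 log-probability: -2
log(p, base = 1/2)        # surprisal in bits: 2
stopifnot(all.equal(log(p, base = 1/2), -log2(p)))
```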

Output formats

@lisalevinson found the function outputs inconsistent and suggested a vignette on reshaping the data for analysis.

Response: I standardized the outputs as much as possible and added a vignette on using pangoling in psycholinguistic research.

Non-English usage and examples

@lisalevinson suggested including examples for non-English languages, particularly those without spaces (e.g., Chinese).

Response: I added a sep argument to handle languages without spaces:

A string specifying how words are separated within contexts or groups. Default is " ". 
For languages that don't have spaces between words (e.g., Chinese), set sep = "".
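For instance, a call for a pre-segmented Chinese sentence might look like the sketch below (not run here; the model id and the word segmentation are illustrative assumptions, not taken from the package docs):

```r
library(pangoling)
# sep = "" glues the words back together with no spaces
# before the model tokenizes the text.
causal_words_pred(
  x     = c("这", "是", "一个", "例子"),       # "This is an example."
  model = "uer/gpt2-chinese-cluecorpussmall", # assumed model id
  sep   = ""
)
```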

The vignette also uses an example in Chinese, and works with surprisal rather than predictability ;)

Multi-token words are already handled correctly—the package sums log probabilities across tokens.
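The summing is just the log of a product: if a word is split into tokens with probabilities p1 and p2, then log(p1 * p2) = log(p1) + log(p2). A minimal base-R check with made-up token probabilities:

```r
# Two made-up token log-probabilities, e.g., for "doesn" and "'t".
lp_tokens <- c(log(0.5), log(0.1))
lp_word   <- sum(lp_tokens)              # log-probability of the whole word
stopifnot(all.equal(lp_word, log(0.05))) # log(0.5 * 0.1)
```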


Specific Function Comments

causal_lp()

@lisalevinson found it confusing that x could be a full sentence or a set of targets.

Response: This is now split into two functions.

batch_size Documentation

@lisalevinson found the explanation unclear.

Response: I clarified the documentation:

"batch_size: Maximum batch size. Larger batches speed up processing but require more memory."

causal_lp_mats() (now causal_pred_mats())

@lisalevinson suggested clarifying why it returns a matrix instead of a dataframe.

Response: It returns a matrix due to memory constraints and because all values are numeric. The documentation now explicitly states:

The function splits the input x into groups specified by the by argument and processes each group independently. For each group, the model computes the predictability of each token in its vocabulary based on preceding context.

Each matrix contains:
- Rows representing the model's vocabulary.
- Columns corresponding to tokens in the group (e.g., a sentence or paragraph).
- By default, values in the matrices are the natural logarithm of word probabilities.
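If a data frame is nevertheless wanted, one matrix from the returned list can be reshaped in a few lines; the sketch below is illustrative (the object name mats and the presence of dimnames are assumptions, not verified against the package):

```r
mat  <- mats[[1]]  # one group's vocabulary-by-token matrix
long <- data.frame(
  vocab    = rep(rownames(mat), times = ncol(mat)), # model vocabulary
  token    = rep(colnames(mat), each  = nrow(mat)), # tokens in the group
  log_prob = as.vector(mat)                         # column-major values
)
```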

She also suggested including examples for multiple sentences and how to wrangle the output.

Response: I added examples for processing multiple sentences.

Token vs. Word Confusion

@lisalevinson pointed out that column names in some functions refer to "words," but they are actually tokens (e.g., "doesn" and "’t" are separate).

Response: I corrected the documentation to consistently refer to "tokens" instead of "words" where applicable. The function descriptions now clarify when they operate on tokenized text rather than full words.
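To inspect how a word is split, the tokenizer can be queried directly; a sketch assuming pangoling's tokenize_lst() helper with the "gpt2" model:

```r
library(pangoling)
tokenize_lst("The apple doesn't fall far from the tree.", model = "gpt2")
# "doesn" and "'t" appear as separate tokens
```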

causal_tokens_lp_tbl() (now causal_tokens_pred_lst())

@lisalevinson found the argument name .id unintuitive.

Response: .id has been removed—it’s no longer needed.

She also found the distinction between texts (full sentences) and x (target words) inconsistent across functions.

Response: causal_tokens_pred_lst() still takes texts because it processes full sentences or paragraphs and returns token probabilities. In contrast, causal_words_pred() takes x, which represents smaller linguistic units such as words or phrases. This distinction is now explicitly explained in the documentation.

masked_lp()

@lisalevinson preferred a different argument structure.

Response: This is addressed above, in the masked_targets_pred() example.

masked_tokens_tbl()

@lisalevinson suggested a clearer argument structure and better differentiation from causal_tokens_lp_tbl().

Response: The difference comes from how masked models work: masked predictions return probabilities for a single masked token position, whereas causal models generate predictions for every possible next token given the preceding context. I updated the documentation to clarify this distinction.

For masked models, this follows the standard Hugging Face approach:

masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.", model = "bert-base-uncased")

This returns all possible completions with their log probabilities.

Test Failures:

Response: All tests now pass.


This should now be clearer and more consistent. I tried to summarize the points in order to answer them; if something is missing (or I misunderstood it), please let me know.

@emilyriederer

Thank you so much for the detailed notes and updates, @bnicenboim !

@lisalevinson, @utkuturk - I know it has been a while since you completed your reviews. Could you please review @bnicenboim's update comments above and assess whether they address the themes you raised? Ideally, please acknowledge using this template.

@emilyriederer

@ropensci-review-bot assign @emilyriederer as editor

@ropensci-review-bot

Assigned! @emilyriederer is now the editor

@emilyriederer

@ropensci-review-bot submit review time 10

@ropensci-review-bot

Logged review for utkuturk (hours: 10)

@emilyriederer

@ropensci-review-bot submit review #575 (comment) time 11

@ropensci-review-bot

Logged review for lisalevinson (hours: 11)
