
Pre-submission inquiry for {kgrams}: Classical k-gram Language Models #452

Closed
1 of 19 tasks
vgherard opened this issue Jul 14, 2021 · 9 comments

@vgherard

Submitting Author: Valerio Gherardi (@vgherard)
Repository: https://github.com/vgherard/kgrams
Submission type: Pre-submission


  • Paste the full DESCRIPTION file inside a code block below:
```
Package: kgrams
Title: Classical k-gram Language Models
Version: 0.1.0.9000
Authors@R: 
    person(given = "Valerio",
           family = "Gherardi",
           role = c("aut", "cre"),
           email = "[email protected]",
           comment = c(ORCID = "0000-0002-8215-3013"))
Description: 
        Tools for training and evaluating k-gram language models in R, 
        supporting several probability smoothing techniques, 
        perplexity computations, random text generation and more.
License: GPL (>= 3)
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
SystemRequirements: C++11
LinkingTo: 
    Rcpp, RcppProgress
Imports: 
    Rcpp, rlang, methods, utils, RcppProgress (>= 0.1), Rdpack
Depends: 
    R (>= 3.5)
Suggests: 
    testthat (>= 3.0.0),
    covr,
    knitr,
    rmarkdown
Config/testthat/edition: 3
RdMacros: Rdpack
VignetteBuilder: knitr
URL: https://vgherard.github.io/kgrams/,
    https://github.com/vgherard/kgrams
BugReports: https://github.com/vgherard/kgrams/issues
```

Scope

  • Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):

    Data Lifecycle Packages

    • data retrieval
    • data extraction
    • database access
    • data munging
    • data deposition
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • database software bindings
    • geospatial data
    • text data

    Statistical Packages

    • Bayesian and Monte Carlo Routines
    • Dimensionality Reduction, Clustering, and Unsupervised Learning
    • Machine Learning
    • Regression and Supervised Learning
    • Exploratory Data Analysis (EDA) and Summary Statistics
    • Spatial Analyses
    • Time Series Analyses
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

This package implements classical k-gram language model algorithms, including utilities for training, evaluation and text prediction. Language models are a cornerstone of Natural Language Processing applications, and the conceptual simplicity of k-gram models makes them a good baseline, with pedagogical value as well.

k-gram models are a simple form of machine learning applied to text data; as such, Machine Learning is definitely the most appropriate category among the ones above. I would be inclined to call this an "Unsupervised" learning problem, since the target function being learned (the language's probability distribution over sentences) is clearly not explicit in the training data - but I have never seen this particular qualification in the literature.
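For illustration, a typical kgrams workflow looks roughly like the sketch below. Function names are taken from the package's documented API, but the smoothing method and its parameters here are purely illustrative; consult the package reference for exact argument names.

```r
library(kgrams)

# Toy training corpus: a character vector of sentences.
corpus <- c("a b a b a", "b a b a b a")

# Extract k-gram frequency counts up to order N = 3.
freqs <- kgram_freqs(corpus, N = 3)

# Build a language model from the counts; "add_k" smoothing with k = 1
# is shown as an example of the supported smoothers.
model <- language_model(freqs, smoother = "add_k", k = 1)

# Evaluate the model's perplexity on held-out text.
perplexity("a b a b", model)

# Generate random sentences from the fitted model.
sample_sentences(model, n = 3, max_length = 10)
```

This mirrors the train / evaluate / generate cycle described above: counts are collected once, and several smoothed models can then be built and compared on held-out perplexity.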

Not yet (NB: this is a presubmission inquiry).

  • Who is the target audience and what are scientific applications of this package?

The package can be useful to students and researchers performing small-scale Natural Language Processing experiments. In addition, it might be helpful when building more complex language models, by providing quick baselines.

I am not aware of any R package with the same purpose and functionality as kgrams. The CRAN package ngram overlaps somewhat in scope, in that it provides k-gram tokenization algorithms and random text generation, but it offers no support for language model algorithms.

Not applicable.

  • Any other questions or issues we should be aware of?:
  1. The package was accepted some months ago by CRAN.
  2. Despite the "lifecycle: experimental" badge and the development version number, I haven't made significant API changes or added features to this package for a long time. With this submission, I'm hoping to receive your feedback on possible improvements and/or on whether the package is mature enough to be considered stable.
@noamross
Contributor

Thank you for our first statistical package pre-submission, @vgherard! I believe this clearly falls in scope and look forward to a full submission once you have incorporated the srr standards component. I am querying the editorial board to ask for an opinion as to whether this package should also apply standards from the Supervised or Unsupervised learning categories.

@vgherard
Author

Thanks @noamross, great :) I will begin looking into the srr standards, then. It may take me some time, but I'm up for it. Earlier I did a quick check with autotest, and it seems there's some trouble parsing some of my examples; let's see if I can get it to work quickly.

@noamross
Contributor

Please ping me and @mpadge here with any questions. We know we're still working out the kinks in the new system, and we're eager to help make the process better!

@vgherard
Author

Thanks @noamross (@mpadge), I've filed an issue at ropensci-review-tools/autotest#49

@noamross
Contributor

noamross commented Aug 9, 2022

Hello @vgherard! We're going back to some in-progress submissions that got stuck in an ambiguous state. Sorry that we haven't reached out in a while. I just wanted to see if ropensci peer review is something you were still interested in pursuing.

@vgherard
Author

Dear @noamross thanks for checking in and sorry for the long silence, I totally forgot about this process being open.

Sadly, right now I'm too short on time for a relatively demanding submission like this... Apart from that, over time I've become a bit unsatisfied with certain aspects of this package, which I'd at least try to improve before submitting.

I'll close this, with the hope of coming back to it in the not-too-distant future :-)

Thanks!

@mpadge
Member

mpadge commented May 16, 2023

@vgherard Any updates on the status of your package? We'd still be very interested in receiving a full submission 👍

@vgherard
Author

vgherard commented Jun 6, 2023

Dear all, thanks for keeping in touch.

I had a look at the requirements I would need to cover in order to submit {kgrams}, and again, sorry but this is too much for me.

The output of pkgcheck() alone looks intimidating: function names, usage of <<-, usage of sapply(), and so on. Also, I imagine that passing autotest and srr would be much more demanding.

These are in general quick things, but with a package the size of {kgrams} it takes a good amount of effort to finally get the green light - an effort I'm not really interested in, since the only thing I'm doing with that package at the moment is keeping it alive on CRAN :')

It's understood that when I say "too much" I refer only to my individual case - I think the work you're doing by putting up this review process is awesome.

For future package ideas, I will definitely consider implementing the rOpenSci standards from the outset!

@mpadge
Member

mpadge commented Jun 6, 2023

Thanks @vgherard, I definitely understand. It's a shame, but you are probably right that it wouldn't be a trivial amount of work to prepare it. Thanks for considering, and for the kind words, and we look forward to future submissions at any time.
