
Pre-submission inquiry for {kgrams}: Classical k-gram Language Models #452

Closed
1 of 19 tasks
vgherard opened this issue Jul 14, 2021 · 9 comments

@vgherard

Submitting Author: Valerio Gherardi (@vgherard)
Repository: https://github.com/vgherard/kgrams
Submission type: Pre-submission


  • Paste the full DESCRIPTION file inside a code block below:
```
Package: kgrams
Title: Classical k-gram Language Models
Version: 0.1.0.9000
Authors@R: 
    person(given = "Valerio",
           family = "Gherardi",
           role = c("aut", "cre"),
           email = "[email protected]",
           comment = c(ORCID = "0000-0002-8215-3013"))
Description: 
        Tools for training and evaluating k-gram language models in R, 
        supporting several probability smoothing techniques, 
        perplexity computations, random text generation and more.
License: GPL (>= 3)
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
SystemRequirements: C++11
LinkingTo: 
    Rcpp, RcppProgress
Imports: 
    Rcpp, rlang, methods, utils, RcppProgress (>= 0.1), Rdpack
Depends: 
    R (>= 3.5)
Suggests: 
    testthat (>= 3.0.0),
    covr,
    knitr,
    rmarkdown
Config/testthat/edition: 3
RdMacros: Rdpack
VignetteBuilder: knitr
URL: https://vgherard.github.io/kgrams/,
    https://github.com/vgherard/kgrams
BugReports: https://github.com/vgherard/kgrams/issues
```

Scope

  • Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):

    Data Lifecycle Packages

    • data retrieval
    • data extraction
    • database access
    • data munging
    • data deposition
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • database software bindings
    • geospatial data
    • text data

    Statistical Packages

    • Bayesian and Monte Carlo Routines
    • Dimensionality Reduction, Clustering, and Unsupervised Learning
    • Machine Learning
    • Regression and Supervised Learning
    • Exploratory Data Analysis (EDA) and Summary Statistics
    • Spatial Analyses
    • Time Series Analyses
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

This package implements classical k-gram language model algorithms, including utilities for training, evaluation and text prediction. Language models are a cornerstone of Natural Language Processing applications, and the conceptual simplicity of k-gram models makes them a good baseline, with pedagogical value as well.

k-gram models are a simple form of machine learning applied to text data; as such, Machine Learning is definitely the most appropriate category among the ones above. I would be inclined to call this an "Unsupervised" learning problem, since the target function being learned (the language's probability distribution over sentences) is clearly not explicit in the training data - but I have never seen this particular qualification in the literature.
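For illustration, a typical kgrams workflow looks roughly like the sketch below. Function names are taken from the package's documented API, but the smoothing method and its parameters here are purely illustrative; consult the package reference for exact argument names.

```r
library(kgrams)

# Toy training corpus: a character vector of sentences.
corpus <- c("a b a b a", "b a b a b a")

# Extract k-gram frequency counts up to order N = 3.
freqs <- kgram_freqs(corpus, N = 3)

# Build a language model from the counts; "add_k" smoothing with k = 1
# is shown as an example of the supported smoothers.
model <- language_model(freqs, smoother = "add_k", k = 1)

# Evaluate the model's perplexity on held-out text.
perplexity("a b a b", model)

# Generate random sentences from the fitted model.
sample_sentences(model, n = 3, max_length = 10)
```

This mirrors the train / evaluate / generate cycle described above: counts are collected once, and several smoothed models can then be built and compared on held-out perplexity.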

Not yet (NB: this is a presubmission inquiry).

  • Who is the target audience and what are scientific applications of this package?

The package can be useful to students and researchers performing small-scale Natural Language Processing experiments. In addition, it might be helpful when building more complex language models, by providing quick baselines.

I am not aware of any R package with the same purpose and functionality as kgrams. The CRAN package ngram overlaps somewhat in scope, in that it provides k-gram tokenization algorithms and random text generation, but it offers no support for language model algorithms.

Not applicable.

  • Any other questions or issues we should be aware of?:
  1. The package was accepted some months ago by CRAN.
  2. Despite the "lifecycle: experimental" badge and the development version number, I haven't made significant API changes or added features to this package for a long time. With this submission, I'm hoping to receive your feedback on possible improvements and/or on whether the package is mature enough to be considered stable.
@noamross
Contributor

Thank you for our first statistical package pre-submission, @vgherard! I believe this clearly falls in scope and look forward to a full submission once you have incorporated the srr standards component. I am querying the editorial board to ask for an opinion as to whether this package should also apply standards from the Supervised or Unsupervised learning categories.

@vgherard
Author

Thanks @noamross, great :) I will begin looking into the srr standards, then. It may take me some time, but I'm up for it. Earlier I did a quick check with autotest, and it seems there's some trouble parsing some of my examples; let's see if I can get it to work quickly.

@noamross
Contributor

Please ping me and @mpadge here with any questions. We know we're still working out the kinks in the new system, and we're eager to help make the process better!

@vgherard
Author

Thanks @noamross (@mpadge), I've filed an issue at ropensci-review-tools/autotest#49

@noamross
Contributor

noamross commented Aug 9, 2022

Hello @vgherard! We're going back to some in-progress submissions that got stuck in an ambiguous state. Sorry that we haven't reached out in a while. I just wanted to see if ropensci peer review is something you were still interested in pursuing.

@vgherard
Author

Dear @noamross thanks for checking in and sorry for the long silence, I totally forgot about this process being open.

Sadly, right now I'm too short on time for a relatively demanding submission like this... Apart from that, over time I've become a bit unsatisfied with certain aspects of this package, which I'd at least try to improve before submitting.

I'll close this, with the hope of coming back to it in the not-too-distant future :-)

Thanks!

@mpadge
Member

mpadge commented May 16, 2023

@vgherard Any updates on the status of your package? We'd still be very interested in receiving a full submission 👍

@vgherard
Author

vgherard commented Jun 6, 2023

Dear all, thanks for keeping in touch.

I had a look at the requirements I would need to cover in order to submit {kgrams}, and again, sorry but this is too much for me.

The output of pkgcheck() alone looks intimidating: function names, usage of <<-, usage of sapply(), and so on. Also, I imagine that passing autotest and srr would be much more demanding.

These are in general quick things, but with a package the size of {kgrams} it takes a good amount of effort to finally get the green light - an effort I'm not really interested in, since the only thing I'm doing with that package at the moment is keeping it alive on CRAN :')

It's understood that when I say "too much" I refer only to my individual case - I think the work you're doing by putting up this review process is awesome.

For future package ideas, I will definitely consider implementing the rOpenSci standards from the outset!

@mpadge
Member

mpadge commented Jun 6, 2023

Thanks @vgherard, I definitely understand. It's a shame, but you are probably right that it wouldn't be a trivial amount of work to prepare it. Thanks for considering, and for the kind words, and we look forward to future submissions at any time.
