From 2a7bdabfa3979f805c9a3c1dd6cda4ce30f418b2 Mon Sep 17 00:00:00 2001
From: Indrajeet Patil
Date: Tue, 3 Sep 2024 17:20:33 +0200
Subject: [PATCH] JOSS publication (#1885)

* add boilerplate
* say more about need and benefits of lintr
* update PDF as well
* fix merge issues [skip ci]
* add a few examples
* try width [skip ci]
* discuss specific tags
* only show examples that lint
* add section on best practices
* retain TeX file as well [skip ci]
* address Jim's comment [skip ci]
* Update paper/paper.Rmd

Co-authored-by: Michael Chirico

* mention tidyverse style guide up front
* we only need md document
* Update draft-pdf.yml
* preserve YAML; add initial list of authors
* initial acknowledgments
* change example for best practices
* also add example that doesn't lint
* Update paper/paper.Rmd [skip ci]

Co-authored-by: Michael Chirico

* Update paper/paper.Rmd [skip ci]

Co-authored-by: Michael Chirico

* Update paper/paper.Rmd [skip ci]

Co-authored-by: Michael Chirico

* update ignore regex; update workflow
* add bib entry for style guide
* change a couple of examples
* add common mistakes section
* michael has an orcid somehow :)
* Apply Michael's suggestions from code review

Co-authored-by: Michael Chirico

* reknit
* add chunk labels
* move customizability segment to later
* Just create a new section to highlight extensibility
* more on customization
* add citation to wiki page
* Update paper/paper.Rmd [skip ci]

Co-authored-by: Michael Chirico

* Update paper/paper.Rmd [skip ci]

Co-authored-by: Michael Chirico

* consistently use with-without lint pairing [skip ci]
* use latest, blazingly fast upload artifact action
* authors: Jim first, everyone else alphabetical
* Update paper.md

Co-authored-by: Jim Hester

* update Rmd file for Jim's ORCID

---------

Co-authored-by: Michael Chirico
Co-authored-by: Michael Chirico
Co-authored-by: Jim Hester
---
 .Rbuildignore                   |    1 +
 .github/workflows/draft-pdf.yml |   30 +
 paper/apa.csl                   | 1916 +++++++++++++++++++++++++++++++
 paper/paper.Rmd                 |  246 ++++
 paper/paper.bib                 |   41 +
 paper/paper.md                  |  342 ++++++
 6 files changed, 2576 insertions(+)
 create mode 100644 .github/workflows/draft-pdf.yml
 create mode 100644 paper/apa.csl
 create mode 100644 paper/paper.Rmd
 create mode 100644 paper/paper.bib
 create mode 100644 paper/paper.md

diff --git a/.Rbuildignore b/.Rbuildignore
index e609141bd..257a72963 100644
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -29,3 +29,4 @@
 ^vignettes/[^-]+.gif$
 ^CRAN-SUBMISSION$
 ^CODE_OF_CONDUCT\.md$
+^paper$
diff --git a/.github/workflows/draft-pdf.yml b/.github/workflows/draft-pdf.yml
new file mode 100644
index 000000000..d72255cdc
--- /dev/null
+++ b/.github/workflows/draft-pdf.yml
@@ -0,0 +1,30 @@
+# TODO: delete this file once the paper is published
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  paper:
+    runs-on: ubuntu-latest
+    name: Paper Draft
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Build draft PDF
+        uses: openjournals/openjournals-draft-action@master
+        with:
+          journal: joss
+          # This should be the path to the paper within your repo.
+          paper-path: paper/paper.md
+
+      - name: Upload
+        uses: actions/upload-artifact@v4
+        with:
+          name: paper
+          # This is the output path where Pandoc will write the compiled
+          # PDF. Note, this should be the same directory as the input
+          # paper.md
+          path: paper/paper.pdf
diff --git a/paper/apa.csl b/paper/apa.csl
new file mode 100644
index 000000000..081857d9d
--- /dev/null
+++ b/paper/apa.csl
@@ -0,0 +1,1916 @@
+
+
diff --git a/paper/paper.Rmd b/paper/paper.Rmd
new file mode 100644
index 000000000..4aa995404
--- /dev/null
+++ b/paper/paper.Rmd
@@ -0,0 +1,246 @@
+---
+title: "Static Code Analysis for R"
+date: "`r Sys.Date()`"
+tags: ["R", "linter", "tidyverse"]
+authors:
+  - name: Jim Hester
+    affiliation: 1
+    orcid: 0000-0002-2739-7082
+  - name: Florent Angly
+    affiliation: ~
+    orcid: ~
+  - name: Michael Chirico
+    affiliation: 2
+    orcid: 0000-0003-0787-087X
+  - name: Russ Hyde
+    affiliation: 5
+    orcid: ~
+  - name: Ren Kun
+    affiliation: ~
+    orcid: ~
+  - name: Indrajeet Patil
+    orcid: 0000-0003-1995-6531
+    affiliation: 4
+  - name: Alexander Rosenstock
+    affiliation: 3
+    orcid: ~
+affiliations:
+  - index: 1
+    name: Netflix
+  - index: 2
+    name: Google
+  - index: 3
+    name: Mathematisches Institut der Heinrich-Heine-Universität Düsseldorf
+  - index: 4
+    name: Preisenergie GmbH, Munich, Germany
+  - index: 5
+    name: Jumping Rivers
+output:
+  md_document:
+    variant: "markdown"
+    preserve_yaml: true
+    standalone: true
+bibliography: paper.bib
+csl: apa.csl
+link-citations: yes
+---
+
+```{r setup, warning=FALSE, message=FALSE, echo=FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  out.width = "100%",
+  comment = "#>"
+)
+
+library(lintr)
+
+withr::local_options(list(
+  lintr.format_width = 60L
+))
+```
+
+# Statement of Need
+
+R is an interpreted, dynamically-typed programming language [@base2023]. It is a popular choice for statistical analysis and visualization, and is used by a wide range of researchers and data scientists. The `{lintr}` package is an open-source R package that provides static code analysis [@enwiki:1218663830] to check for a variety of common problems related to readability, efficiency, consistency, style, etc. In particular, by default it enforces the tidyverse style guide [@Wickham2023]. It is designed to be easy to use and integrate into existing workflows, and can be used as part of an automated build or continuous integration process. `{lintr}` also integrates with a number of popular IDEs and text editors, such as RStudio and Visual Studio Code, making it convenient for users to run `{lintr}` checks on their code as they work.
+
+# Features
+
+As of this writing, `{lintr}` offers `r length(all_linters())` linters.
+
+```{r all_linters}
+library(lintr)
+
+length(all_linters())
+```
+
+Naturally, we can't discuss all of them here. To see details about all available linters, we encourage readers to see .
+
+We will showcase one linter for each kind of common problem found in R code.
+
+- **Best practices**
+
+`{lintr}` offers linters that can detect problematic antipatterns and suggest alternatives that follow best practices.
+
+For example, expressions like `ifelse(x, TRUE, FALSE)` and `ifelse(x, FALSE, TRUE)` are redundant; just `x` or `!x` suffice in R code where logical vectors are a core data structure. The `redundant_ifelse_linter()` linter detects such discouraged usages.
+
+```{r redundant_ifelse_linter_with_lint}
+lint(
+  text = "ifelse(x >= 2.5, TRUE, FALSE)",
+  linters = redundant_ifelse_linter()
+)
+```
+
+```{r redundant_ifelse_linter_without_lint}
+lint(
+  text = "x >= 2.5",
+  linters = redundant_ifelse_linter()
+)
+```
+
+- **Efficiency**
+
+Sometimes the users might not be aware of a more efficient way offered by R for carrying out a computation. `{lintr}` offers linters to improve code efficiency by avoiding common inefficient patterns.
+
+For example, the `any_is_na_linter()` linter detects usages of `any(is.na(x))` and suggests `anyNA(x)` as a more efficient alternative to detect the presence of *any* missing values.
+
+```{r any_is_na_linter_with_lint}
+lint(
+  text = "any(is.na(x), na.rm = TRUE)",
+  linters = any_is_na_linter()
+)
+```
+
+`anyNA()` in R is more efficient than `any(is.na())` because it stops execution once a missing value is found, while `is.na()` evaluates the entire vector.
+
+```{r any_is_na_linter_without_lint}
+lint(
+  text = "anyNA(x)",
+  linters = any_is_na_linter()
+)
+```
+
+- **Readability**
+
+Coders spend significantly more time reading than writing code [@mcconnell2004code]. Thus, writing readable code makes the code more maintainable and reduces the possibility of introducing bugs stemming from a poor understanding of the code.
+
+`{lintr}` provides a number of linters that suggest more readable alternatives. For example, `comparison_negation_linter()` blocks usages like `!(x == y)` where a direct relational operator is appropriate.
+
+```{r comparison_negation_linter_with_lint}
+lint(
+  text = "!x == 2",
+  linters = comparison_negation_linter()
+)
+```
+
+Note also the complicated operator precedence. The more readable alternative here uses `!=`:
+
+```{r comparison_negation_linter_without_lint}
+lint(
+  text = "x != 2",
+  linters = comparison_negation_linter()
+)
+```
+
+- **Tidyverse style**
+
+`{lintr}` also provides linters to enforce the style used throughout the `{tidyverse}` [@Wickham2019] ecosystem of R packages. This style of coding has been outlined in the tidyverse style guide [@Wickham2023].
+
+For example, the style guide recommends using snake_case for identifiers:
+
+```{r object_name_linter_with_lint}
+lint(
+  text = "MyVar <- 1L",
+  linters = object_name_linter()
+)
+```
+
+```{r object_name_linter_without_lint}
+lint(
+  text = "my_var <- 1L",
+  linters = object_name_linter()
+)
+```
+
+- **Common mistakes**
+
+One category of linters helps you detect some common mistakes statically and provides early feedback.
+
+For example, duplicate arguments in function calls can sometimes cause run-time errors:
+
+```{r duplicate_args_error_example, error=TRUE}
+mean(x = 1:5, x = 2:3)
+```
+
+But `duplicate_argument_linter()` can check for this statically:
+
+```{r duplicate_argument_linter_with_lint}
+lint(
+  text = "mean(x = 1:5, x = 2:3)",
+  linters = duplicate_argument_linter()
+)
+```
+
+Even for cases where duplicate arguments are not an error, this linter explicitly discourages duplicate arguments.
+
+```{r duplicate_argument_linter_without_lint}
+lint(
+  text = "list(x = TRUE, x = FALSE)",
+  linters = duplicate_argument_linter()
+)
+```
+
+This is because objects with duplicated names can be hard to work with programmatically and should typically be avoided.
+
+```{r duplicate_arguments_example}
+l <- list(x = TRUE, x = FALSE)
+l["x"]
+l[names(l) == "x"]
+```
+
+# Extensibility
+
+`{lintr}` is designed for extensibility by allowing users to easily create custom linting rules.
+There are two main ways to customize it:
+
+ - Use additional arguments in existing linters. For example, although the tidyverse style guide prefers snake_case for identifiers, if a project's conventions require it, the relevant linter can be customized to support it:
+
+    ```{r object_name_linter_with_custom_style}
+    lint(
+      text = "my.var <- 1L",
+      linters = object_name_linter(styles = "dotted.case")
+    )
+    ```
+
+ - Create new linters (by leveraging functions like `lintr::make_linter_from_xpath()`) tailored to match project- or organization-specific coding standards.
+
+# Benefits of using `{lintr}`
+
+There are several benefits to using `{lintr}` to analyze and improve R code. One of the most obvious is that it can help users identify and fix problems in their code, which can save time and effort during the development process. By catching issues early on, `{lintr}` can help prevent bugs and other issues from creeping into code, which can save time and effort when it comes to debugging and testing.
+
+Another benefit of `{lintr}` is that it can help users write more readable and maintainable code. By enforcing a consistent style and highlighting potential issues, `{lintr}` can help users write code that is easier to understand and work with. This is especially important for larger projects or teams, where multiple contributors may be working on the same codebase and it is important to ensure that code is easy to follow and understand, particularly when frequently switching context among code primarily authored by different people.
+
+It can also be a useful tool for teaching and learning R. By providing feedback on code style and potential issues, it can help users learn good coding practices and improve their skills over time. This can be especially useful for beginners, who may not yet be familiar with all of the best practices for writing R code.
+
+Finally, `{lintr}` has had a large and active user community since its birth in 2014 which has contributed to its rapid development, maintenance, and adoption. At the time of writing, `{lintr}` is in a mature and stable state and therefore provides a reliable API that is unlikely to feature fundamental breaking changes.
+
+# Conclusion
+
+`{lintr}` is a valuable tool for R users to help improve the quality and reliability of their code. Its static code analysis capabilities, combined with its flexibility and ease of use, make it relevant and valuable for a wide range of applications.
+
+# Licensing and Availability
+
+`{lintr}` is licensed under the MIT License, with all source code openly developed and stored on GitHub (), along with a corresponding issue tracker for bug reporting and feature enhancements.
+
+# Conflicts of interest
+
+The authors declare no conflict of interest.
+
+# Funding
+
+This work was not financially supported by any of the affiliated institutions of the authors.
+
+# Acknowledgments
+
+`{lintr}` would not be possible without the immense work of the [R-core team](https://www.r-project.org/contributors.html) who maintain the R language and we are deeply indebted to them. We are also grateful to all contributors to `{lintr}`.
+
+# References
diff --git a/paper/paper.bib b/paper/paper.bib
new file mode 100644
index 000000000..f9578f0c9
--- /dev/null
+++ b/paper/paper.bib
@@ -0,0 +1,41 @@
+@Article{Wickham2019,
+  title = {Welcome to the {tidyverse}},
+  author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
+  year = {2019},
+  journal = {Journal of Open Source Software},
+  volume = {4},
+  number = {43},
+  pages = {1686},
+  doi = {10.21105/joss.01686},
+  }
+
+@Manual{Wickham2023,
+  title = {The Tidyverse Style Guide},
+  author = {Hadley Wickham},
+  year = {2023},
+  url = {https://style.tidyverse.org/index.html},
+  }
+
+@Manual{base2023,
+  title = {{R}: A Language and Environment for Statistical Computing},
+  author = {{R Core Team}},
+  organization = {R Foundation for Statistical Computing},
+  address = {Vienna, Austria},
+  year = {2023},
+  url = {https://www.R-project.org/},
+  }
+
+@book{mcconnell2004code,
+  title={Code Complete},
+  author={McConnell, Steve},
+  year={2004},
+  publisher={Pearson Education}
+  }
+
+ @misc{ enwiki:1218663830,
+  author = "{Wikipedia contributors}",
+  title = "Static program analysis --- {Wikipedia}{,} The Free Encyclopedia",
+  year = "2024",
+  url = "https://en.wikipedia.org/w/index.php?title=Static_program_analysis&oldid=1218663830",
+  note = "[Online; accessed 7-May-2024]"
+  }
diff --git a/paper/paper.md b/paper/paper.md
new file mode 100644
index 000000000..079b54841
--- /dev/null
+++ b/paper/paper.md
@@ -0,0 +1,342 @@
+---
+title: "Static Code Analysis for R"
+date: "2024-06-21"
+tags: ["R", "linter", "tidyverse"]
+authors:
+  - name: Jim Hester
+    affiliation: 1
+    orcid: 0000-0002-2739-7082
+  - name: Florent Angly
+    affiliation: ~
+    orcid: ~
+  - name: Michael Chirico
+    affiliation: 2
+    orcid: 0000-0003-0787-087X
+  - name: Russ Hyde
+    affiliation: 5
+    orcid: ~
+  - name: Ren Kun
+    affiliation: ~
+    orcid: ~
+  - name: Indrajeet Patil
+    orcid: 0000-0003-1995-6531
+    affiliation: 4
+  - name: Alexander Rosenstock
+    affiliation: 3
+    orcid: ~
+affiliations:
+  - index: 1
+    name: Netflix
+  - index: 2
+    name: Google
+  - index: 3
+    name: Mathematisches Institut der Heinrich-Heine-Universität Düsseldorf
+  - index: 4
+    name: Preisenergie GmbH, Munich, Germany
+  - index: 5
+    name: Jumping Rivers
+output:
+  md_document:
+    variant: "markdown"
+    preserve_yaml: true
+    standalone: true
+bibliography: paper.bib
+csl: apa.csl
+link-citations: yes
+---
+
+# Statement of Need
+
+R is an interpreted, dynamically-typed programming language [@base2023].
+It is a popular choice for statistical analysis and visualization, and
+is used by a wide range of researchers and data scientists. The
+`{lintr}` package is an open-source R package that provides static code
+analysis [@enwiki:1218663830] to check for a variety of common problems
+related to readability, efficiency, consistency, style, etc. In
+particular, by default it enforces the tidyverse style guide
+[@Wickham2023]. It is designed to be easy to use and integrate into
+existing workflows, and can be used as part of an automated build or
+continuous integration process. `{lintr}` also integrates with a number
+of popular IDEs and text editors, such as RStudio and Visual Studio
+Code, making it convenient for users to run `{lintr}` checks on their
+code as they work.
+
+# Features
+
+As of this writing, `{lintr}` offers 113 linters.
+
+``` r
+library(lintr)
+
+length(all_linters())
+#> [1] 113
+```
+
+Naturally, we can't discuss all of them here. To see details about all
+available linters, we encourage readers to see
+.
+
+We will showcase one linter for each kind of common problem found in R
+code.
+
+- **Best practices**
+
+`{lintr}` offers linters that can detect problematic antipatterns and
+suggest alternatives that follow best practices.
+
+For example, expressions like `ifelse(x, TRUE, FALSE)` and
+`ifelse(x, FALSE, TRUE)` are redundant; just `x` or `!x` suffice in R
+code where logical vectors are a core data structure. The
+`redundant_ifelse_linter()` linter detects such discouraged usages.
+
+``` r
+lint(
+  text = "ifelse(x >= 2.5, TRUE, FALSE)",
+  linters = redundant_ifelse_linter()
+)
+#> :1:1: warning: [redundant_ifelse_linter] Just use the
+#> logical condition (or its negation) directly instead of
+#> calling ifelse(x, TRUE, FALSE)
+#> ifelse(x >= 2.5, TRUE, FALSE)
+#> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+```
+
+``` r
+lint(
+  text = "x >= 2.5",
+  linters = redundant_ifelse_linter()
+)
+```
+
+- **Efficiency**
+
+Sometimes the users might not be aware of a more efficient way offered
+by R for carrying out a computation. `{lintr}` offers linters to improve
+code efficiency by avoiding common inefficient patterns.
+
+For example, the `any_is_na_linter()` linter detects usages of
+`any(is.na(x))` and suggests `anyNA(x)` as a more efficient alternative
+to detect the presence of *any* missing values.
+
+``` r
+lint(
+  text = "any(is.na(x), na.rm = TRUE)",
+  linters = any_is_na_linter()
+)
+#> :1:1: warning: [any_is_na_linter] anyNA(x) is better
+#> than any(is.na(x)).
+#> any(is.na(x), na.rm = TRUE)
+#> ^~~~~~~~~~~~~~~~~~~~~~~~~~~
+```
+
+`anyNA()` in R is more efficient than `any(is.na())` because it stops
+execution once a missing value is found, while `is.na()` evaluates the
+entire vector.
+
+``` r
+lint(
+  text = "anyNA(x)",
+  linters = any_is_na_linter()
+)
+```
+
+- **Readability**
+
+Coders spend significantly more time reading than writing code
+[@mcconnell2004code]. Thus, writing readable code makes the code more
+maintainable and reduces the possibility of introducing bugs stemming
+from a poor understanding of the code.
+
+`{lintr}` provides a number of linters that suggest more readable
+alternatives. For example, `comparison_negation_linter()` blocks usages
+like `!(x == y)` where a direct relational operator is appropriate.
+
+``` r
+lint(
+  text = "!x == 2",
+  linters = comparison_negation_linter()
+)
+#> :1:1: warning: [comparison_negation_linter] Use x !=
+#> y, not !(x == y).
+#> !x == 2
+#> ^~~~~~~
+```
+
+Note also the complicated operator precedence. The more readable
+alternative here uses `!=`:
+
+``` r
+lint(
+  text = "x != 2",
+  linters = comparison_negation_linter()
+)
+```
+
+- **Tidyverse style**
+
+`{lintr}` also provides linters to enforce the style used throughout the
+`{tidyverse}` [@Wickham2019] ecosystem of R packages. This style of
+coding has been outlined in the tidyverse style guide [@Wickham2023].
+
+For example, the style guide recommends using snake_case for
+identifiers:
+
+``` r
+lint(
+  text = "MyVar <- 1L",
+  linters = object_name_linter()
+)
+#> :1:1: style: [object_name_linter] Variable and
+#> function name style should match snake_case or symbols.
+#> MyVar <- 1L
+#> ^~~~~
+```
+
+``` r
+lint(
+  text = "my_var <- 1L",
+  linters = object_name_linter()
+)
+```
+
+- **Common mistakes**
+
+One category of linters helps you detect some common mistakes statically
+and provides early feedback.
+
+For example, duplicate arguments in function calls can sometimes cause
+run-time errors:
+
+``` r
+mean(x = 1:5, x = 2:3)
+#> Error in mean(x = 1:5, x = 2:3): formal argument "x" matched by multiple actual arguments
+```
+
+But `duplicate_argument_linter()` can check for this statically:
+
+``` r
+lint(
+  text = "mean(x = 1:5, x = 2:3)",
+  linters = duplicate_argument_linter()
+)
+#> :1:15: warning: [duplicate_argument_linter] Avoid
+#> duplicate arguments in function calls.
+#> mean(x = 1:5, x = 2:3)
+#>               ^
+```
+
+Even for cases where duplicate arguments are not an error, this linter
+explicitly discourages duplicate arguments.
+
+``` r
+lint(
+  text = "list(x = TRUE, x = FALSE)",
+  linters = duplicate_argument_linter()
+)
+#> :1:16: warning: [duplicate_argument_linter] Avoid
+#> duplicate arguments in function calls.
+#> list(x = TRUE, x = FALSE)
+#>                ^
+```
+
+This is because objects with duplicated names can be hard to
+work with programmatically and should typically be avoided.
+
+``` r
+l <- list(x = TRUE, x = FALSE)
+l["x"]
+#> $x
+#> [1] TRUE
+```
+
+``` r
+l[names(l) == "x"]
+#> $x
+#> [1] TRUE
+#>
+#> $x
+#> [1] FALSE
+```
+
+# Extensibility
+
+`{lintr}` is designed for extensibility by allowing users to easily
+create custom linting rules. There are two main ways to customize it:
+
+- Use additional arguments in existing linters. For example, although the
+  tidyverse style guide prefers snake_case for identifiers, if a
+  project's conventions require it, the relevant linter can be
+  customized to support it:
+
+``` r
+lint(
+  text = "my.var <- 1L",
+  linters = object_name_linter(styles = "dotted.case")
+)
+```
+
+- Create new linters (by leveraging functions like
+  `lintr::make_linter_from_xpath()`) tailored to match project- or
+  organization-specific coding standards.
+
+# Benefits of using `{lintr}`
+
+There are several benefits to using `{lintr}` to analyze and improve R
+code. One of the most obvious is that it can help users identify and fix
+problems in their code, which can save time and effort during the
+development process. By catching issues early on, `{lintr}` can help
+prevent bugs and other issues from creeping into code, which can save
+time and effort when it comes to debugging and testing.
+
+Another benefit of `{lintr}` is that it can help users write more
+readable and maintainable code. By enforcing a consistent style and
+highlighting potential issues, `{lintr}` can help users write code that
+is easier to understand and work with. This is especially important for
+larger projects or teams, where multiple contributors may be working on
+the same codebase and it is important to ensure that code is easy to
+follow and understand, particularly when frequently switching context
+among code primarily authored by different people.
+
+It can also be a useful tool for teaching and learning R. By providing
+feedback on code style and potential issues, it can help users learn
+good coding practices and improve their skills over time. This can be
+especially useful for beginners, who may not yet be familiar with all of
+the best practices for writing R code.
+
+Finally, `{lintr}` has had a large and active user community since its
+birth in 2014 which has contributed to its rapid development,
+maintenance, and adoption. At the time of writing, `{lintr}` is in a
+mature and stable state and therefore provides a reliable API that is
+unlikely to feature fundamental breaking changes.
+
+# Conclusion
+
+`{lintr}` is a valuable tool for R users to help improve the quality and
+reliability of their code. Its static code analysis capabilities,
+combined with its flexibility and ease of use, make it relevant and
+valuable for a wide range of applications.
+
+# Licensing and Availability
+
+`{lintr}` is licensed under the MIT License, with all source code openly
+developed and stored on GitHub (), along
+with a corresponding issue tracker for bug reporting and feature
+enhancements.
+
+# Conflicts of interest
+
+The authors declare no conflict of interest.
+
+# Funding
+
+This work was not financially supported by any of the affiliated
+institutions of the authors.
+
+# Acknowledgments
+
+`{lintr}` would not be possible without the immense work of the [R-core
+team](https://www.r-project.org/contributors.html) who maintain the R
+language and we are deeply indebted to them. We are also grateful to all
+contributors to `{lintr}`.
+
+# References {#references .unnumbered}
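
The Extensibility section of the paper in this patch mentions creating new linters with `lintr::make_linter_from_xpath()`. A minimal sketch of what that could look like follows, assuming a lintr release that exports this function. The linter name `no_t_f_linter`, its XPath expression, and its message are hypothetical and chosen only for illustration; they are not part of the patch.

```r
library(lintr)

# Hypothetical project rule (not from the paper): flag the abbreviations
# T and F, which are ordinary variables in R and can be reassigned.
no_t_f_linter <- make_linter_from_xpath(
  xpath = "//SYMBOL[text() = 'T' or text() = 'F']",
  lint_message = "Use TRUE/FALSE instead of T/F."
)

# make_linter_from_xpath() returns a linter factory, so it is called
# like the built-in linters when passed to lint().
lints <- lint(text = "x <- T", linters = no_t_f_linter())
print(lints)
```

The factory can then be used alongside built-in linters, for example in the `linters` argument of `lint()`, in the same way as the `object_name_linter(styles = "dotted.case")` customization shown in the paper.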