Merge pull request #5746 from mmaechler/master
R-devel NOTEs (forwarding URLs; "lost braces" in man/*.Rd)
jangorecki authored Nov 23, 2023
2 parents 6b9d559 + 514fd34 commit b41a653
Showing 9 changed files with 94 additions and 97 deletions.
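
The "lost braces" part of the commit title refers to the pattern changed throughout the man/*.Rd hunks below: inside `\itemize` and `\enumerate`, `\item{text}` becomes `\item text`, while the two-argument form `\item{name}{description}` used in argument lists is left alone. As a rough illustration only (this is not the check R-devel runs, and the function name `find_single_arg_items` is hypothetical), a brace-matching heuristic for spotting the one-argument form could be sketched like this:

```python
import re

def find_single_arg_items(rd_text):
    """Return 1-based line numbers of \\item{...} groups that are NOT
    immediately followed by a second {...} group (heuristic only)."""
    suspects = []
    for m in re.finditer(r"\\item\{", rd_text):
        depth = 1
        i = m.end()
        # Walk forward to the brace that closes the first argument.
        while i < len(rd_text) and depth > 0:
            ch = rd_text[i]
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
            i += 1
        # i is now just past the matching close brace (or at end of text).
        if depth == 0 and rd_text[i:i + 1] != "{":
            suspects.append(rd_text.count("\n", 0, m.start()) + 1)
    return suspects

sample = r"""\itemize{
  \item{\code{align} only \code{"right"}.}
  \item \code{fill} defaults to \code{NA}.
}
\item{by}{Column names.}
"""
print(find_single_arg_items(sample))
```

The sketch assumes the second argument, when present, follows the first with no whitespace in between, and it does not track which environment an `\item` sits in, so its output is a list of candidates to inspect rather than a verdict.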
4 changes: 2 additions & 2 deletions NEWS.md
@@ -691,7 +691,7 @@

1. In v1.13.0 (July 2020) native parsing of datetime was added to `fread` by Michael Chirico which dramatically improved performance. Before then datetime was read as type character by default which was slow. Since v1.13.0, UTC-marked datetime (e.g. `2020-07-24T10:11:12.134Z` where the final `Z` is present) has been read automatically as POSIXct and quickly. We provided the migration option `datatable.old.fread.datetime.character` to revert to the previous slow character behavior. We also added the `tz=` argument to control unmarked datetime; i.e. where the `Z` (or equivalent UTC postfix) is missing in the data. The default `tz=""` reads unmarked datetime as character as before, slowly. We gave you the ability to set `tz="UTC"` to turn on the new behavior and read unmarked datetime as UTC, quickly. R sessions that are running in UTC by setting the TZ environment variable, as is good practice and common in production, have also been reading unmarked datetime as UTC since v1.13.0, much faster. Note 1 of v1.13.0 (below in this file) ended `In addition to convenience, fread is now significantly faster in the presence of dates, UTC-marked datetimes, and unmarked datetime when tz="UTC" is provided.`.

-At `rstudio::global(2021)`, Neal Richardson, Director of Engineering at Ursa Labs, compared Arrow CSV performance to `data.table` CSV performance, [Bigger Data With Ease Using Apache Arrow](https://www.rstudio.com/resources/rstudioglobal-2021/bigger-data-with-ease-using-apache-arrow/). He opened by comparing to `data.table` as his main point. Arrow was presented as 3 times faster than `data.table`. He talked at length about this result. However, no reproducible code was provided and we were not contacted in advance in case we had any comments. He mentioned New York Taxi data in his talk which is a dataset known to us as containing unmarked datetime. [Rebuttal](https://twitter.com/MattDowle/status/1360073970498875394).
+At `rstudio::global(2021)`, Neal Richardson, Director of Engineering at Ursa Labs, compared Arrow CSV performance to `data.table` CSV performance, [Bigger Data With Ease Using Apache Arrow](https://posit.co/resources/videos/bigger-data-with-ease-using-apache-arrow/). He opened by comparing to `data.table` as his main point. Arrow was presented as 3 times faster than `data.table`. He talked at length about this result. However, no reproducible code was provided and we were not contacted in advance in case we had any comments. He mentioned New York Taxi data in his talk which is a dataset known to us as containing unmarked datetime. [Rebuttal](https://twitter.com/MattDowle/status/1360073970498875394).

`tz=`'s default is now changed from `""` to `"UTC"`. If you have been using `tz=` explicitly then there should be no change. The change to read UTC-marked datetime as POSIXct rather than character already happened in v1.13.0. The change now is that unmarked datetimes are now read as UTC too by default without needing to set `tz="UTC"`. None of the 1,017 CRAN packages directly using `data.table` are affected. As before, the migration option `datatable.old.fread.datetime.character` can still be set to TRUE to revert to the old character behavior. This migration option is temporary and will be removed in the near future.
@@ -2136,7 +2136,7 @@ When `j` is a symbol (as in the quanteda and xgboost examples above) it will con
2. Just to state explicitly: data.table does not now depend on or require OpenMP. If you don't have it (as on CRAN's Mac it appears but not in general on Mac) then data.table should build, run and pass all tests just fine.
-3. There are now 5,910 raw tests as reported by `test.data.table()`. Tests cover 91% of the 4k lines of R and 89% of the 7k lines of C. These stats are now known thanks to Jim Hester's [Covr](https://CRAN.R-project.org/package=covr) package and [Codecov.io](https://about.codecov.io/). If anyone is looking for something to help with, creating tests to hit the missed lines shown by clicking the `R` and `src` folders at the bottom [here](https://codecov.io/github/Rdatatable/data.table?branch=master) would be very much appreciated.
+3. There are now 5,910 raw tests as reported by `test.data.table()`. Tests cover 91% of the 4k lines of R and 89% of the 7k lines of C. These stats are now known thanks to Jim Hester's [Covr](https://CRAN.R-project.org/package=covr) package and [Codecov.io](https://about.codecov.io/). If anyone is looking for something to help with, creating tests to hit the missed lines shown by clicking the `R` and `src` folders at the bottom [here](https://app.codecov.io/github/Rdatatable/data.table?branch=master) would be very much appreciated.

4. The FAQ vignette has been revised given the changes in v1.9.8. In particular, the very first FAQ.

43 changes: 22 additions & 21 deletions man/data.table.Rd
@@ -62,13 +62,13 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
If \code{i} is a \code{data.table}, the columns in \code{i} to be matched against \code{x} can be specified using one of these ways:
\itemize{
-\item{\code{on} argument (see below). It allows for both \code{equi-} and the newly implemented \code{non-equi} joins.}
+\item \code{on} argument (see below). It allows for both \code{equi-} and the newly implemented \code{non-equi} joins.
-\item{If not, \code{x} \emph{must be keyed}. Key can be set using \code{\link{setkey}}. If \code{i} is also keyed, then first \emph{key} column of \code{i} is matched against first \emph{key} column of \code{x}, second against second, etc..
+\item If not, \code{x} \emph{must be keyed}. Key can be set using \code{\link{setkey}}. If \code{i} is also keyed, then first \emph{key} column of \code{i} is matched against first \emph{key} column of \code{x}, second against second, etc..
If \code{i} is not keyed, then first column of \code{i} is matched against first \emph{key} column of \code{x}, second column of \code{i} against second \emph{key} column of \code{x}, etc\ldots
-This is summarised in code as \code{min(length(key(x)), if (haskey(i)) length(key(i)) else ncol(i))}.}
+This is summarised in code as \code{min(length(key(x)), if (haskey(i)) length(key(i)) else ncol(i))}.
}
Using \code{on=} is recommended (even during keyed joins) as it helps understand the code better and also allows for \emph{non-equi} joins.
@@ -100,15 +100,15 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
\item{by}{ Column names are seen as if they are variables (as in \code{j} when \code{with=TRUE}). The \code{data.table} is then grouped by the \code{by} and \code{j} is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups. \code{by} accepts:
\itemize{
-\item{A single unquoted column name: e.g., \code{DT[, .(sa=sum(a)), by=x]}}
+\item A single unquoted column name: e.g., \code{DT[, .(sa=sum(a)), by=x]}
-\item{a \code{list()} of expressions of column names: e.g., \code{DT[, .(sa=sum(a)), by=.(x=x>0, y)]}}
+\item a \code{list()} of expressions of column names: e.g., \code{DT[, .(sa=sum(a)), by=.(x=x>0, y)]}
-\item{a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., \code{DT[, sum(a), by="x,y,z"]}}
+\item a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., \code{DT[, sum(a), by="x,y,z"]}
-\item{a character vector of column names: e.g., \code{DT[, sum(a), by=c("x", "y")]}}
+\item a character vector of column names: e.g., \code{DT[, sum(a), by=c("x", "y")]}
-\item{or of the form \code{startcol:endcol}: e.g., \code{DT[, sum(a), by=x:z]}}
+\item or of the form \code{startcol:endcol}: e.g., \code{DT[, sum(a), by=x:z]}
}
\emph{Advanced:} When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}), \code{DT[i, j, by=.EACHI]} evaluates \code{j} for the groups in `DT` that each row in \code{i} joins to. That is, you can join (in \code{i}) and aggregate (in \code{j}) simultaneously. We call this \emph{grouping by each i}. See \href{https://stackoverflow.com/a/27004566/559784}{this StackOverflow answer} for a more detailed explanation until we \href{https://github.com/Rdatatable/data.table/issues/944}{roll out vignettes}.
@@ -128,19 +128,19 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
\item{roll}{ When \code{i} is a \code{data.table} and its row matches to all but the last \code{x} join column, and its value in the last \code{i} join column falls in a gap (including after the last observation in \code{x} for that group), then:
\itemize{
-\item{\code{+Inf} (or \code{TRUE}) rolls the \emph{prevailing} value in \code{x} forward. It is also known as last observation carried forward (LOCF).}
-\item{\code{-Inf} rolls backwards instead; i.e., next observation carried backward (NOCB).}
-\item{finite positive or negative number limits how far values are carried forward or backward.}
-\item{"nearest" rolls the nearest value instead.}
+\item \code{+Inf} (or \code{TRUE}) rolls the \emph{prevailing} value in \code{x} forward. It is also known as last observation carried forward (LOCF).
+\item \code{-Inf} rolls backwards instead; i.e., next observation carried backward (NOCB).
+\item finite positive or negative number limits how far values are carried forward or backward.
+\item "nearest" rolls the nearest value instead.
}
Rolling joins apply to the last join column, generally a date but can be any variable. It is particularly fast using a modified binary search.
A common idiom is to select a contemporaneous regular time series (\code{dts}) across a set of identifiers (\code{ids}): \code{DT[CJ(ids,dts),roll=TRUE]} where \code{DT} has a 2-column key (id,date) and \code{\link{CJ}} stands for \emph{cross join}.}
\item{rollends}{ A logical vector length 2 (a single logical is recycled) indicating whether values falling before the first value or after the last value for a group should be rolled as well.
\itemize{
-\item{If \code{rollends[2]=TRUE}, it will roll the last value forward. \code{TRUE} by default for LOCF and \code{FALSE} for NOCB rolls.}
-\item{If \code{rollends[1]=TRUE}, it will roll the first value backward. \code{TRUE} by default for NOCB and \code{FALSE} for LOCF rolls.}
+\item If \code{rollends[2]=TRUE}, it will roll the last value forward. \code{TRUE} by default for LOCF and \code{FALSE} for NOCB rolls.
+\item If \code{rollends[1]=TRUE}, it will roll the first value backward. \code{TRUE} by default for NOCB and \code{FALSE} for LOCF rolls.
}
When \code{roll} is a finite number, that limit is also applied when rolling the ends.}
@@ -163,15 +163,16 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
\item{on}{ Indicate which columns in \code{x} should be joined with which columns in \code{i} along with the type of binary operator to join with (see non-equi joins below on this). When specified, this overrides the keys set on \code{x} and \code{i}. When \code{.NATURAL} keyword provided then \emph{natural join} is made (join on common columns). There are multiple ways of specifying the \code{on} argument:
\itemize{
-\item{As an unnamed character vector, e.g., \code{X[Y, on=c("a", "b")]}, used when columns \code{a} and \code{b} are common to both \code{X} and \code{Y}.}
-\item{\emph{Foreign key joins}: As a \emph{named} character vector when the join columns have different names in \code{X} and \code{Y}.
+\item As an unnamed character vector, e.g., \code{X[Y, on=c("a", "b")]}, used when columns \code{a} and \code{b} are common to both \code{X} and \code{Y}.
+\item \emph{Foreign key joins}: As a \emph{named} character vector when the join columns have different names in \code{X} and \code{Y}.
For example, \code{X[Y, on=c(x1="y1", x2="y2")]} joins \code{X} and \code{Y} by matching columns \code{x1} and \code{x2} in \code{X} with columns \code{y1} and \code{y2} in \code{Y}, respectively.
From v1.9.8, you can also express foreign key joins using the binary operator \code{==}, e.g. \code{X[Y, on=c("x1==y1", "x2==y2")]}.
-NB: shorthand like \code{X[Y, on=c("a", V2="b")]} is also possible if, e.g., column \code{"a"} is common between the two tables.}
-\item{For convenience during interactive scenarios, it is also possible to use \code{.()} syntax as \code{X[Y, on=.(a, b)]}.}
-\item{From v1.9.8, (non-equi) joins using binary operators \code{>=, >, <=, <} are also possible, e.g., \code{X[Y, on=c("x>=a", "y<=b")]}, or for interactive use as \code{X[Y, on=.(x>=a, y<=b)]}.}
+NB: shorthand like \code{X[Y, on=c("a", V2="b")]} is also possible if, e.g., column \code{"a"} is common between the two tables.
+\item For convenience during interactive scenarios, it is also possible to use \code{.()} syntax as \code{X[Y, on=.(a, b)]}.
+\item From v1.9.8, (non-equi) joins using binary operators \code{>=, >, <=, <} are also possible, e.g., \code{X[Y, on=c("x>=a", "y<=b")]}, or for interactive use as \code{X[Y, on=.(x>=a, y<=b)]}.
}
See examples as well as \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{\code{vignette("datatable-secondary-indices-and-auto-indexing")}}.
}
@@ -182,8 +183,8 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
\code{data.table} builds on base \R functionality to reduce 2 types of time:\cr
\enumerate{
-\item{programming time (easier to write, read, debug and maintain), and}
-\item{compute time (fast and memory efficient).}
+\item programming time (easier to write, read, debug and maintain), and
+\item compute time (fast and memory efficient).
}
The general form of data.table syntax is:\cr
16 changes: 8 additions & 8 deletions man/fread.Rd
@@ -88,15 +88,15 @@ On Windows, "French_France.1252" is tried which should be available as standard
When \code{quote} is a single character,
\itemize{
-\item{Spaces and other whitespace (other than \code{sep} and \code{\\n}) may appear in unquoted character fields, e.g., \code{\dots,2,Joe Bloggs,3.14,\dots}.}
+\item Spaces and other whitespace (other than \code{sep} and \code{\\n}) may appear in unquoted character fields, e.g., \code{\dots,2,Joe Bloggs,3.14,\dots}.
-\item{When \code{character} columns are \emph{quoted}, they must start and end with that quoting character immediately followed by \code{sep} or \code{\\n}, e.g., \code{\dots,2,"Joe Bloggs",3.14,\dots}.
+\item When \code{character} columns are \emph{quoted}, they must start and end with that quoting character immediately followed by \code{sep} or \code{\\n}, e.g., \code{\dots,2,"Joe Bloggs",3.14,\dots}.
In essence quoting character fields are \emph{required} only if \code{sep} or \code{\\n} appears in the string value. Quoting may be used to signify that numeric data should be read as text. Unescaped quotes may be present in a quoted field, e.g., \code{\dots,2,"Joe, "Bloggs"",3.14,\dots}, as well as escaped quotes, e.g., \code{\dots,2,"Joe \",Bloggs\"",3.14,\dots}.

If an embedded quote is followed by the separator inside a quoted field, the embedded quotes up to that point in that field must be balanced; e.g. \code{\dots,2,"www.blah?x="one",y="two"",3.14,\dots}.

-On those fields that do not satisfy these conditions, e.g., fields with unbalanced quotes, \code{fread} re-attempts that field as if it isn't quoted. This is quite useful in reading files that contains fields with unbalanced quotes as well, automatically.}
+On those fields that do not satisfy these conditions, e.g., fields with unbalanced quotes, \code{fread} re-attempts that field as if it isn't quoted. This is quite useful in reading files that contains fields with unbalanced quotes as well, automatically.
}
To read fields \emph{as is} instead, use \code{quote = ""}.
@@ -106,11 +106,11 @@ To read fields \emph{as is} instead, use \code{quote = ""}.
Currently, the \code{yaml} setting is somewhat inflexible with respect to incorporating metadata to facilitate file reading. Information on column classes should be stored at the top level under the heading \code{schema} and subheading \code{fields}; those with both a \code{type} and a \code{name} sub-heading will be merged into \code{colClasses}. Other supported elements are as follows:
\itemize{
-\item{ \code{sep} (or alias \code{delimiter}) }
-\item{ \code{header} }
-\item{ \code{quote} (or aliases \code{quoteChar}, \code{quote_char}) }
-\item{ \code{dec} (or alias \code{decimal}) }
-\item{ \code{na.strings} }
+\item \code{sep} (or alias \code{delimiter})
+\item \code{header}
+\item \code{quote} (or aliases \code{quoteChar}, \code{quote_char})
+\item \code{dec} (or alias \code{decimal})
+\item \code{na.strings}
}
\bold{File Download:}
36 changes: 18 additions & 18 deletions man/froll.Rd
@@ -64,9 +64,9 @@ frollapply(x, n, FUN, \dots, fill=NA, align=c("right", "left", "center"))
observation has its own corresponding rolling window width. Due to the logic
of adaptive rolling functions, the following restrictions apply:
\itemize{
-\item{ \code{align} only \code{"right"}. }
-\item{ if list of vectors is passed to \code{x}, then all
-vectors within it must have equal length. }
+\item \code{align} only \code{"right"}.
+\item if list of vectors is passed to \code{x}, then all
+vectors within it must have equal length.
}

When multiple columns or multiple windows width are provided, then they
@@ -93,21 +93,21 @@ frollapply(x, n, FUN, \dots, fill=NA, align=c("right", "left", "center"))
\code{zoo} might expect following differences in \code{data.table}
implementation.
\itemize{
-\item{ rolling function will always return result of the same length
-as input. }
-\item{ \code{fill} defaults to \code{NA}. }
-\item{ \code{fill} accepts only constant values. It does not support
-for \emph{na.locf} or other functions. }
-\item{ \code{align} defaults to \code{"right"}. }
-\item{ \code{na.rm} is respected, and other functions are not needed
-when input contains \code{NA}. }
-\item{ integers and logical are always coerced to double. }
-\item{ when \code{adaptive=FALSE} (default), then \code{n} must be a
-numeric vector. List is not accepted. }
-\item{ when \code{adaptive=TRUE}, then \code{n} must be vector of
-length equal to \code{nrow(x)}, or list of such vectors. }
-\item{ \code{partial} window feature is not supported, although it can
-be accomplished by using \code{adaptive=TRUE}, see examples. \code{NA} is always returned for incomplete windows. }
+\item rolling function will always return result of the same length as input.
+\item \code{fill} defaults to \code{NA}.
+\item \code{fill} accepts only constant values. It does not support
+for \emph{na.locf} or other functions.
+\item \code{align} defaults to \code{"right"}.
+\item \code{na.rm} is respected, and other functions are not needed
+when input contains \code{NA}.
+\item integers and logical are always coerced to double.
+\item when \code{adaptive=FALSE} (default), then \code{n} must be a
+numeric vector. List is not accepted.
+\item when \code{adaptive=TRUE}, then \code{n} must be vector of
+length equal to \code{nrow(x)}, or list of such vectors.
+\item \code{partial} window feature is not supported, although it can
+be accomplished by using \code{adaptive=TRUE}, see
+examples. \code{NA} is always returned for incomplete windows.
}

Be aware that rolling functions operates on the physical order of input.