diff --git a/NEWS.md b/NEWS.md index 025a7651b..52333e9b3 100644 --- a/NEWS.md +++ b/NEWS.md @@ -691,7 +691,7 @@ 1. In v1.13.0 (July 2020) native parsing of datetime was added to `fread` by Michael Chirico which dramatically improved performance. Before then datetime was read as type character by default which was slow. Since v1.13.0, UTC-marked datetime (e.g. `2020-07-24T10:11:12.134Z` where the final `Z` is present) has been read automatically as POSIXct and quickly. We provided the migration option `datatable.old.fread.datetime.character` to revert to the previous slow character behavior. We also added the `tz=` argument to control unmarked datetime; i.e. where the `Z` (or equivalent UTC postfix) is missing in the data. The default `tz=""` reads unmarked datetime as character as before, slowly. We gave you the ability to set `tz="UTC"` to turn on the new behavior and read unmarked datetime as UTC, quickly. R sessions that are running in UTC by setting the TZ environment variable, as is good practice and common in production, have also been reading unmarked datetime as UTC since v1.13.0, much faster. Note 1 of v1.13.0 (below in this file) ended `In addition to convenience, fread is now significantly faster in the presence of dates, UTC-marked datetimes, and unmarked datetime when tz="UTC" is provided.`. - At `rstudio::global(2021)`, Neal Richardson, Director of Engineering at Ursa Labs, compared Arrow CSV performance to `data.table` CSV performance, [Bigger Data With Ease Using Apache Arrow](https://www.rstudio.com/resources/rstudioglobal-2021/bigger-data-with-ease-using-apache-arrow/). He opened by comparing to `data.table` as his main point. Arrow was presented as 3 times faster than `data.table`. He talked at length about this result. However, no reproducible code was provided and we were not contacted in advance in case we had any comments. He mentioned New York Taxi data in his talk which is a dataset known to us as containing unmarked datetime. 
[Rebuttal](https://twitter.com/MattDowle/status/1360073970498875394). + At `rstudio::global(2021)`, Neal Richardson, Director of Engineering at Ursa Labs, compared Arrow CSV performance to `data.table` CSV performance, [Bigger Data With Ease Using Apache Arrow](https://posit.co/resources/videos/bigger-data-with-ease-using-apache-arrow/). He opened by comparing to `data.table` as his main point. Arrow was presented as 3 times faster than `data.table`. He talked at length about this result. However, no reproducible code was provided and we were not contacted in advance in case we had any comments. He mentioned New York Taxi data in his talk which is a dataset known to us as containing unmarked datetime. [Rebuttal](https://twitter.com/MattDowle/status/1360073970498875394). `tz=`'s default is now changed from `""` to `"UTC"`. If you have been using `tz=` explicitly then there should be no change. The change to read UTC-marked datetime as POSIXct rather than character already happened in v1.13.0. The change now is that unmarked datetimes are now read as UTC too by default without needing to set `tz="UTC"`. None of the 1,017 CRAN packages directly using `data.table` are affected. As before, the migration option `datatable.old.fread.datetime.character` can still be set to TRUE to revert to the old character behavior. This migration option is temporary and will be removed in the near future. @@ -2136,7 +2136,7 @@ When `j` is a symbol (as in the quanteda and xgboost examples above) it will con 2. Just to state explicitly: data.table does not now depend on or require OpenMP. If you don't have it (as on CRAN's Mac it appears but not in general on Mac) then data.table should build, run and pass all tests just fine. -3. There are now 5,910 raw tests as reported by `test.data.table()`. Tests cover 91% of the 4k lines of R and 89% of the 7k lines of C. 
These stats are now known thanks to Jim Hester's [Covr](https://CRAN.R-project.org/package=covr) package and [Codecov.io](https://about.codecov.io/). If anyone is looking for something to help with, creating tests to hit the missed lines shown by clicking the `R` and `src` folders at the bottom [here](https://codecov.io/github/Rdatatable/data.table?branch=master) would be very much appreciated. +3. There are now 5,910 raw tests as reported by `test.data.table()`. Tests cover 91% of the 4k lines of R and 89% of the 7k lines of C. These stats are now known thanks to Jim Hester's [Covr](https://CRAN.R-project.org/package=covr) package and [Codecov.io](https://about.codecov.io/). If anyone is looking for something to help with, creating tests to hit the missed lines shown by clicking the `R` and `src` folders at the bottom [here](https://app.codecov.io/github/Rdatatable/data.table?branch=master) would be very much appreciated. 4. The FAQ vignette has been revised given the changes in v1.9.8. In particular, the very first FAQ. diff --git a/man/data.table.Rd b/man/data.table.Rd index a5da7ebc4..502595d7c 100644 --- a/man/data.table.Rd +++ b/man/data.table.Rd @@ -62,13 +62,13 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac If \code{i} is a \code{data.table}, the columns in \code{i} to be matched against \code{x} can be specified using one of these ways: \itemize{ - \item{\code{on} argument (see below). It allows for both \code{equi-} and the newly implemented \code{non-equi} joins.} + \item \code{on} argument (see below). It allows for both \code{equi-} and the newly implemented \code{non-equi} joins. - \item{If not, \code{x} \emph{must be keyed}. Key can be set using \code{\link{setkey}}. If \code{i} is also keyed, then first \emph{key} column of \code{i} is matched against first \emph{key} column of \code{x}, second against second, etc.. + \item If not, \code{x} \emph{must be keyed}. Key can be set using \code{\link{setkey}}. 
If \code{i} is also keyed, then first \emph{key} column of \code{i} is matched against first \emph{key} column of \code{x}, second against second, etc.. If \code{i} is not keyed, then first column of \code{i} is matched against first \emph{key} column of \code{x}, second column of \code{i} against second \emph{key} column of \code{x}, etc\ldots - This is summarised in code as \code{min(length(key(x)), if (haskey(i)) length(key(i)) else ncol(i))}.} + This is summarised in code as \code{min(length(key(x)), if (haskey(i)) length(key(i)) else ncol(i))}. } Using \code{on=} is recommended (even during keyed joins) as it helps understand the code better and also allows for \emph{non-equi} joins. @@ -100,15 +100,15 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac \item{by}{ Column names are seen as if they are variables (as in \code{j} when \code{with=TRUE}). The \code{data.table} is then grouped by the \code{by} and \code{j} is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups. 
\code{by} accepts: \itemize{ - \item{A single unquoted column name: e.g., \code{DT[, .(sa=sum(a)), by=x]}} + \item A single unquoted column name: e.g., \code{DT[, .(sa=sum(a)), by=x]} - \item{a \code{list()} of expressions of column names: e.g., \code{DT[, .(sa=sum(a)), by=.(x=x>0, y)]}} + \item a \code{list()} of expressions of column names: e.g., \code{DT[, .(sa=sum(a)), by=.(x=x>0, y)]} - \item{a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., \code{DT[, sum(a), by="x,y,z"]}} + \item a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., \code{DT[, sum(a), by="x,y,z"]} - \item{a character vector of column names: e.g., \code{DT[, sum(a), by=c("x", "y")]}} + \item a character vector of column names: e.g., \code{DT[, sum(a), by=c("x", "y")]} - \item{or of the form \code{startcol:endcol}: e.g., \code{DT[, sum(a), by=x:z]}} + \item or of the form \code{startcol:endcol}: e.g., \code{DT[, sum(a), by=x:z]} } \emph{Advanced:} When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}), \code{DT[i, j, by=.EACHI]} evaluates \code{j} for the groups in `DT` that each row in \code{i} joins to. That is, you can join (in \code{i}) and aggregate (in \code{j}) simultaneously. We call this \emph{grouping by each i}. See \href{https://stackoverflow.com/a/27004566/559784}{this StackOverflow answer} for a more detailed explanation until we \href{https://github.com/Rdatatable/data.table/issues/944}{roll out vignettes}. 
@@ -128,10 +128,10 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac \item{roll}{ When \code{i} is a \code{data.table} and its row matches to all but the last \code{x} join column, and its value in the last \code{i} join column falls in a gap (including after the last observation in \code{x} for that group), then: \itemize{ - \item{\code{+Inf} (or \code{TRUE}) rolls the \emph{prevailing} value in \code{x} forward. It is also known as last observation carried forward (LOCF).} - \item{\code{-Inf} rolls backwards instead; i.e., next observation carried backward (NOCB).} - \item{finite positive or negative number limits how far values are carried forward or backward.} - \item{"nearest" rolls the nearest value instead.} + \item \code{+Inf} (or \code{TRUE}) rolls the \emph{prevailing} value in \code{x} forward. It is also known as last observation carried forward (LOCF). + \item \code{-Inf} rolls backwards instead; i.e., next observation carried backward (NOCB). + \item finite positive or negative number limits how far values are carried forward or backward. + \item "nearest" rolls the nearest value instead. } Rolling joins apply to the last join column, generally a date but can be any variable. It is particularly fast using a modified binary search. @@ -139,8 +139,8 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac \item{rollends}{ A logical vector length 2 (a single logical is recycled) indicating whether values falling before the first value or after the last value for a group should be rolled as well. \itemize{ - \item{If \code{rollends[2]=TRUE}, it will roll the last value forward. \code{TRUE} by default for LOCF and \code{FALSE} for NOCB rolls.} - \item{If \code{rollends[1]=TRUE}, it will roll the first value backward. \code{TRUE} by default for NOCB and \code{FALSE} for LOCF rolls.} + \item If \code{rollends[2]=TRUE}, it will roll the last value forward. 
\code{TRUE} by default for LOCF and \code{FALSE} for NOCB rolls. + \item If \code{rollends[1]=TRUE}, it will roll the first value backward. \code{TRUE} by default for NOCB and \code{FALSE} for LOCF rolls. } When \code{roll} is a finite number, that limit is also applied when rolling the ends.} @@ -163,15 +163,16 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac \item{on}{ Indicate which columns in \code{x} should be joined with which columns in \code{i} along with the type of binary operator to join with (see non-equi joins below on this). When specified, this overrides the keys set on \code{x} and \code{i}. When \code{.NATURAL} keyword provided then \emph{natural join} is made (join on common columns). There are multiple ways of specifying the \code{on} argument: \itemize{ - \item{As an unnamed character vector, e.g., \code{X[Y, on=c("a", "b")]}, used when columns \code{a} and \code{b} are common to both \code{X} and \code{Y}.} - \item{\emph{Foreign key joins}: As a \emph{named} character vector when the join columns have different names in \code{X} and \code{Y}. + \item As an unnamed character vector, e.g., \code{X[Y, on=c("a", "b")]}, used when columns \code{a} and \code{b} are common to both \code{X} and \code{Y}. + \item \emph{Foreign key joins}: As a \emph{named} character vector when the join columns have different names in \code{X} and \code{Y}. For example, \code{X[Y, on=c(x1="y1", x2="y2")]} joins \code{X} and \code{Y} by matching columns \code{x1} and \code{x2} in \code{X} with columns \code{y1} and \code{y2} in \code{Y}, respectively. From v1.9.8, you can also express foreign key joins using the binary operator \code{==}, e.g. \code{X[Y, on=c("x1==y1", "x2==y2")]}. 
- NB: shorthand like \code{X[Y, on=c("a", V2="b")]} is also possible if, e.g., column \code{"a"} is common between the two tables.} - \item{For convenience during interactive scenarios, it is also possible to use \code{.()} syntax as \code{X[Y, on=.(a, b)]}.} - \item{From v1.9.8, (non-equi) joins using binary operators \code{>=, >, <=, <} are also possible, e.g., \code{X[Y, on=c("x>=a", "y<=b")]}, or for interactive use as \code{X[Y, on=.(x>=a, y<=b)]}.} + NB: shorthand like \code{X[Y, on=c("a", V2="b")]} is also possible if, e.g., column \code{"a"} is common between the two tables. + + \item For convenience during interactive scenarios, it is also possible to use \code{.()} syntax as \code{X[Y, on=.(a, b)]}. + \item From v1.9.8, (non-equi) joins using binary operators \code{>=, >, <=, <} are also possible, e.g., \code{X[Y, on=c("x>=a", "y<=b")]}, or for interactive use as \code{X[Y, on=.(x>=a, y<=b)]}. } See examples as well as \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{\code{vignette("datatable-secondary-indices-and-auto-indexing")}}. } @@ -182,8 +183,8 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac \code{data.table} builds on base \R functionality to reduce 2 types of time:\cr \enumerate{ - \item{programming time (easier to write, read, debug and maintain), and} - \item{compute time (fast and memory efficient).} + \item programming time (easier to write, read, debug and maintain), and + \item compute time (fast and memory efficient). 
} The general form of data.table syntax is:\cr diff --git a/man/fread.Rd b/man/fread.Rd index cc96062de..78c8a7628 100644 --- a/man/fread.Rd +++ b/man/fread.Rd @@ -88,15 +88,15 @@ On Windows, "French_France.1252" is tried which should be available as standard When \code{quote} is a single character, \itemize{ - \item{Spaces and other whitespace (other than \code{sep} and \code{\\n}) may appear in unquoted character fields, e.g., \code{\dots,2,Joe Bloggs,3.14,\dots}.} + \item Spaces and other whitespace (other than \code{sep} and \code{\\n}) may appear in unquoted character fields, e.g., \code{\dots,2,Joe Bloggs,3.14,\dots}. - \item{When \code{character} columns are \emph{quoted}, they must start and end with that quoting character immediately followed by \code{sep} or \code{\\n}, e.g., \code{\dots,2,"Joe Bloggs",3.14,\dots}. + \item When \code{character} columns are \emph{quoted}, they must start and end with that quoting character immediately followed by \code{sep} or \code{\\n}, e.g., \code{\dots,2,"Joe Bloggs",3.14,\dots}. In essence quoting character fields are \emph{required} only if \code{sep} or \code{\\n} appears in the string value. Quoting may be used to signify that numeric data should be read as text. Unescaped quotes may be present in a quoted field, e.g., \code{\dots,2,"Joe, "Bloggs"",3.14,\dots}, as well as escaped quotes, e.g., \code{\dots,2,"Joe \",Bloggs\"",3.14,\dots}. If an embedded quote is followed by the separator inside a quoted field, the embedded quotes up to that point in that field must be balanced; e.g. \code{\dots,2,"www.blah?x="one",y="two"",3.14,\dots}. - On those fields that do not satisfy these conditions, e.g., fields with unbalanced quotes, \code{fread} re-attempts that field as if it isn't quoted. 
This is quite useful in reading files that contains fields with unbalanced quotes as well, automatically.} + On those fields that do not satisfy these conditions, e.g., fields with unbalanced quotes, \code{fread} re-attempts that field as if it isn't quoted. This is quite useful in reading files that contains fields with unbalanced quotes as well, automatically. } To read fields \emph{as is} instead, use \code{quote = ""}. @@ -106,11 +106,11 @@ To read fields \emph{as is} instead, use \code{quote = ""}. Currently, the \code{yaml} setting is somewhat inflexible with respect to incorporating metadata to facilitate file reading. Information on column classes should be stored at the top level under the heading \code{schema} and subheading \code{fields}; those with both a \code{type} and a \code{name} sub-heading will be merged into \code{colClasses}. Other supported elements are as follows: \itemize{ - \item{ \code{sep} (or alias \code{delimiter}) } - \item{ \code{header} } - \item{ \code{quote} (or aliases \code{quoteChar}, \code{quote_char}) } - \item{ \code{dec} (or alias \code{decimal}) } - \item{ \code{na.strings} } + \item \code{sep} (or alias \code{delimiter}) + \item \code{header} + \item \code{quote} (or aliases \code{quoteChar}, \code{quote_char}) + \item \code{dec} (or alias \code{decimal}) + \item \code{na.strings} } \bold{File Download:} diff --git a/man/froll.Rd b/man/froll.Rd index 090b397a9..d6cb75067 100644 --- a/man/froll.Rd +++ b/man/froll.Rd @@ -64,9 +64,9 @@ frollapply(x, n, FUN, \dots, fill=NA, align=c("right", "left", "center")) observation has its own corresponding rolling window width. Due to the logic of adaptive rolling functions, the following restrictions apply: \itemize{ - \item{ \code{align} only \code{"right"}. } - \item{ if list of vectors is passed to \code{x}, then all - vectors within it must have equal length. } + \item \code{align} only \code{"right"}. 
+ \item if list of vectors is passed to \code{x}, then all + vectors within it must have equal length. } When multiple columns or multiple windows width are provided, then they @@ -93,21 +93,21 @@ frollapply(x, n, FUN, \dots, fill=NA, align=c("right", "left", "center")) \code{zoo} might expect following differences in \code{data.table} implementation. \itemize{ - \item{ rolling function will always return result of the same length - as input. } - \item{ \code{fill} defaults to \code{NA}. } - \item{ \code{fill} accepts only constant values. It does not support - for \emph{na.locf} or other functions. } - \item{ \code{align} defaults to \code{"right"}. } - \item{ \code{na.rm} is respected, and other functions are not needed - when input contains \code{NA}. } - \item{ integers and logical are always coerced to double. } - \item{ when \code{adaptive=FALSE} (default), then \code{n} must be a - numeric vector. List is not accepted. } - \item{ when \code{adaptive=TRUE}, then \code{n} must be vector of - length equal to \code{nrow(x)}, or list of such vectors. } - \item{ \code{partial} window feature is not supported, although it can - be accomplished by using \code{adaptive=TRUE}, see examples. \code{NA} is always returned for incomplete windows. } + \item rolling function will always return result of the same length as input. + \item \code{fill} defaults to \code{NA}. + \item \code{fill} accepts only constant values. It does not support + for \emph{na.locf} or other functions. + \item \code{align} defaults to \code{"right"}. + \item \code{na.rm} is respected, and other functions are not needed + when input contains \code{NA}. + \item integers and logical are always coerced to double. + \item when \code{adaptive=FALSE} (default), then \code{n} must be a + numeric vector. List is not accepted. + \item when \code{adaptive=TRUE}, then \code{n} must be vector of + length equal to \code{nrow(x)}, or list of such vectors. 
+ \item \code{partial} window feature is not supported, although it can + be accomplished by using \code{adaptive=TRUE}, see + examples. \code{NA} is always returned for incomplete windows. } Be aware that rolling functions operates on the physical order of input. diff --git a/man/fsort.Rd b/man/fsort.Rd index 6c11022d2..0eba047a1 100644 --- a/man/fsort.Rd +++ b/man/fsort.Rd @@ -20,9 +20,9 @@ fsort(x, decreasing = FALSE, na.last = FALSE, internal=FALSE, verbose=FALSE, \do Process will raise error if \code{x} contains negative values. Unless \code{x} is already sorted \code{fsort} will redirect processing to slower single threaded \emph{order} followed by \emph{subset} in following cases: \itemize{ - \item{data type other than \emph{double} (\emph{numeric})} - \item{data having \code{NA}s} - \item{\code{decreasing==FALSE}} + \item data type other than \emph{double} (\emph{numeric}) + \item data having \code{NA}s + \item \code{decreasing==FALSE} } } \value{ diff --git a/man/fwrite.Rd b/man/fwrite.Rd index ba6eb4751..a4fcf788e 100644 --- a/man/fwrite.Rd +++ b/man/fwrite.Rd @@ -37,18 +37,18 @@ fwrite(x, file = "", append = FALSE, quote = "auto", \item{col.names}{Should the column names (header row) be written? The default is \code{TRUE} for new files and when overwriting existing files (\code{append=FALSE}). Otherwise, the default is \code{FALSE} to prevent column names appearing again mid-file when stacking a set of \code{data.table}s or appending rows to the end of a file.} \item{qmethod}{A character string specifying how to deal with embedded double quote characters when quoting strings. 
\itemize{ - \item{"escape" - the quote character (as well as the backslash character) is escaped in C style by a backslash, or} - \item{"double" (default, same as \code{write.csv}), in which case the double quote is doubled with another one.} + \item "escape" - the quote character (as well as the backslash character) is escaped in C style by a backslash, or + \item "double" (default, same as \code{write.csv}), in which case the double quote is doubled with another one. }} \item{logical01}{Should \code{logical} values be written as \code{1} and \code{0} rather than \code{"TRUE"} and \code{"FALSE"}?} \item{logicalAsInt}{Deprecated. Old name for `logical01`. Name change for consistency with `fread` for which `logicalAsInt` would not make sense.} \item{scipen}{ \code{integer} In terms of printing width, how much of a bias should there be towards printing whole numbers rather than scientific notation? See Details. } \item{dateTimeAs}{ How \code{Date}/\code{IDate}, \code{ITime} and \code{POSIXct} items are written. \itemize{ - \item{"ISO" (default) - \code{2016-09-12}, \code{18:12:16} and \code{2016-09-12T18:12:16.999999Z}. 0, 3 or 6 digits of fractional seconds are printed if and when present for convenience, regardless of any R options such as \code{digits.secs}. The idea being that if milli and microseconds are present then you most likely want to retain them. R's internal UTC representation is written faithfully to encourage ISO standards, stymie timezone ambiguity and for speed. An option to consider is to start R in the UTC timezone simply with \code{"$ TZ='UTC' R"} at the shell (NB: it must be one or more spaces between \code{TZ='UTC'} and \code{R}, anything else will be silently ignored; this TZ setting applies just to that R process) or \code{Sys.setenv(TZ='UTC')} at the R prompt and then continue as if UTC were local time.} - \item{"squash" - \code{20160912}, \code{181216} and \code{20160912181216999}. 
This option allows fast and simple extraction of \code{yyyy}, \code{mm}, \code{dd} and (most commonly to group by) \code{yyyymm} parts using integer div and mod operations. In R for example, one line helper functions could use \code{\%/\%10000}, \code{\%/\%100\%\%100}, \code{\%\%100} and \code{\%/\%100} respectively. POSIXct UTC is squashed to 17 digits (including 3 digits of milliseconds always, even if \code{000}) which may be read comfortably as \code{integer64} (automatically by \code{fread()}).} - \item{"epoch" - \code{17056}, \code{65536} and \code{1473703936.999999}. The underlying number of days or seconds since the relevant epoch (1970-01-01, 00:00:00 and 1970-01-01T00:00:00Z respectively), negative before that (see \code{?Date}). 0, 3 or 6 digits of fractional seconds are printed if and when present.} - \item{"write.csv" - this currently affects \code{POSIXct} only. It is written as \code{write.csv} does by using the \code{as.character} method which heeds \code{digits.secs} and converts from R's internal UTC representation back to local time (or the \code{"tzone"} attribute) as of that historical date. Accordingly this can be slow. All other column types (including \code{Date}, \code{IDate} and \code{ITime} which are independent of timezone) are written as the "ISO" option using fast C code which is already consistent with \code{write.csv}.} + \item "ISO" (default) - \code{2016-09-12}, \code{18:12:16} and \code{2016-09-12T18:12:16.999999Z}. 0, 3 or 6 digits of fractional seconds are printed if and when present for convenience, regardless of any R options such as \code{digits.secs}. The idea being that if milli and microseconds are present then you most likely want to retain them. R's internal UTC representation is written faithfully to encourage ISO standards, stymie timezone ambiguity and for speed. 
An option to consider is to start R in the UTC timezone simply with \code{"$ TZ='UTC' R"} at the shell (NB: it must be one or more spaces between \code{TZ='UTC'} and \code{R}, anything else will be silently ignored; this TZ setting applies just to that R process) or \code{Sys.setenv(TZ='UTC')} at the R prompt and then continue as if UTC were local time. + \item "squash" - \code{20160912}, \code{181216} and \code{20160912181216999}. This option allows fast and simple extraction of \code{yyyy}, \code{mm}, \code{dd} and (most commonly to group by) \code{yyyymm} parts using integer div and mod operations. In R for example, one line helper functions could use \code{\%/\%10000}, \code{\%/\%100\%\%100}, \code{\%\%100} and \code{\%/\%100} respectively. POSIXct UTC is squashed to 17 digits (including 3 digits of milliseconds always, even if \code{000}) which may be read comfortably as \code{integer64} (automatically by \code{fread()}). + \item "epoch" - \code{17056}, \code{65536} and \code{1473703936.999999}. The underlying number of days or seconds since the relevant epoch (1970-01-01, 00:00:00 and 1970-01-01T00:00:00Z respectively), negative before that (see \code{?Date}). 0, 3 or 6 digits of fractional seconds are printed if and when present. + \item "write.csv" - this currently affects \code{POSIXct} only. It is written as \code{write.csv} does by using the \code{as.character} method which heeds \code{digits.secs} and converts from R's internal UTC representation back to local time (or the \code{"tzone"} attribute) as of that historical date. Accordingly this can be slow. All other column types (including \code{Date}, \code{IDate} and \code{ITime} which are independent of timezone) are written as the "ISO" option using fast C code which is already consistent with \code{write.csv}. } The first three options are fast due to new specialized C code. 
The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for \code{Date} and \code{IDate} is [0000-03-01, 9999-12-31]. Every one of these 3,652,365 dates have been tested and compared to base R including all 2,790 leap days in this range. \cr \cr This option applies to vectors of date/time in list column cells, too. \cr \cr @@ -73,17 +73,17 @@ To save space, \code{fwrite} prefers to write wide numeric values in scientific The following fields will be written to the header of the file and surrounded by \code{---} on top and bottom: \itemize{ - \item{ \code{source} - Contains the R version and \code{data.table} version used to write the file } - \item{ \code{creation_time_utc} - Current timestamp in UTC time just before the header is written } - \item{ \code{schema} with element \code{fields} giving \code{name}-\code{type} (\code{class}) pairs for the table; multi-class objects (e.g. \code{c('POSIXct', 'POSIXt')}) will have their first class written. } - \item{ \code{header} - same as \code{col.names} (which is \code{header} on input) } - \item{ \code{sep} } - \item{ \code{sep2} } - \item{ \code{eol} } - \item{ \code{na.strings} - same as \code{na} } - \item{ \code{dec} } - \item{ \code{qmethod} } - \item{ \code{logical01} } + \item \code{source} - Contains the R version and \code{data.table} version used to write the file + \item \code{creation_time_utc} - Current timestamp in UTC time just before the header is written + \item \code{schema} with element \code{fields} giving \code{name}-\code{type} (\code{class}) pairs for the table; multi-class objects (e.g. \code{c('POSIXct', 'POSIXt')}) will have their first class written. 
+ \item \code{header} - same as \code{col.names} (which is \code{header} on input) + \item \code{sep} + \item \code{sep2} + \item \code{eol} + \item \code{na.strings} - same as \code{na} + \item \code{dec} + \item \code{qmethod} + \item \code{logical01} } } diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index 71e469ed7..df942009c 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -37,18 +37,18 @@ Internally parallelized code is used in the following places: \itemize{ - \item{\file{between.c} - \code{\link{between}()}} - \item{\file{cj.c} - \code{\link{CJ}()}} - \item{\file{coalesce.c} - \code{\link{fcoalesce}()}} - \item{\file{fifelse.c} - \code{\link{fifelse}()}} - \item{\file{fread.c} - \code{\link{fread}()}} - \item{\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related} - \item{\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family} - \item{\file{fwrite.c} - \code{\link{fwrite}()}} - \item{\file{gsumm.c} - GForce in various places, see \link{GForce}} - \item{\file{nafill.c} - \code{\link{nafill}()}} - \item{\file{subset.c} - Used in \code{\link[=data.table]{[.data.table}} subsetting} - \item{\file{types.c} - Internal testing usage} + \item \file{between.c} - \code{\link{between}()} + \item \file{cj.c} - \code{\link{CJ}()} + \item \file{coalesce.c} - \code{\link{fcoalesce}()} + \item \file{fifelse.c} - \code{\link{fifelse}()} + \item \file{fread.c} - \code{\link{fread}()} + \item \file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related + \item \file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family + \item \file{fwrite.c} - \code{\link{fwrite}()} + \item \file{gsumm.c} - GForce in various places, see \link{GForce} + \item \file{nafill.c} - \code{\link{nafill}()} + \item \file{subset.c} - Used in \code{\link[=data.table]{[.data.table}} subsetting + \item \file{types.c} - Internal testing usage } } \examples{ 
diff --git a/man/setops.Rd b/man/setops.Rd index 395cdab33..dfa2572c7 100644 --- a/man/setops.Rd +++ b/man/setops.Rd @@ -23,16 +23,12 @@ fsetequal(x, y, all = TRUE) \arguments{ \item{x, y}{\code{data.table}s.} \item{all}{Logical. Default is \code{FALSE} and removes duplicate rows on the result. When \code{TRUE}, if there are \code{xn} copies of a particular row in \code{x} and \code{yn} copies of the same row in \code{y}, then: - \itemize{ - - \item{\code{fintersect} will return \code{min(xn, yn)} copies of that row.} - - \item{\code{fsetdiff} will return \code{max(0, xn-yn)} copies of that row.} - - \item{\code{funion} will return \code{xn+yn} copies of that row.} - - \item{\code{fsetequal} will return \code{FALSE} unless \code{xn == yn}.} - } + \itemize{ + \item \code{fintersect} will return \code{min(xn, yn)} copies of that row. + \item \code{fsetdiff} will return \code{max(0, xn-yn)} copies of that row. + \item \code{funion} will return \code{xn+yn} copies of that row. + \item \code{fsetequal} will return \code{FALSE} unless \code{xn == yn}. + } } } \details{ diff --git a/man/special-symbols.Rd b/man/special-symbols.Rd index c96cbef5c..9fb3cb45a 100644 --- a/man/special-symbols.Rd +++ b/man/special-symbols.Rd @@ -19,12 +19,12 @@ These symbols used in \code{j} are defined as follows. \itemize{ - \item{\code{.SD} is a \code{data.table} containing the \bold{S}ubset of \code{x}'s \bold{D}ata for each group, excluding any columns used in \code{by} (or \code{keyby}).} - \item{\code{.BY} is a \code{list} containing a length 1 vector for each item in \code{by}. This can be useful when \code{by} is not known in advance. The \code{by} variables are also available to \code{j} directly by name; useful for example for titles of graphs if \code{j} is a plot command, or to branch with \code{if()} depending on the value of a group variable.} - \item{\code{.N} is an integer, length 1, containing the number of rows in the group. 
This may be useful when the column names are not known in advance and for convenience generally. When grouping by \code{i}, \code{.N} is the number of rows in \code{x} matched to, for each row of \code{i}, regardless of whether \code{nomatch} is \code{NA} or \code{NULL}. It is renamed to \code{N} (no dot) in the result (otherwise a column called \code{".N"} could conflict with the \code{.N} variable, see FAQ 4.6 for more details and example), unless it is explicitly named; e.g., \code{DT[,list(total=.N),by=a]}.} - \item{\code{.I} is an integer vector equal to \code{seq_len(nrow(x))}. While grouping, it holds for each item in the group, its row location in \code{x}. This is useful to subset in \code{j}; e.g. \code{DT[, .I[which.max(somecol)], by=grp]}. If used in \code{by} it corresponds to applying a function rowwise. } - \item{\code{.GRP} is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.} - \item{\code{.NGRP} is an integer, length 1, containing the number of groups. } + \item \code{.SD} is a \code{data.table} containing the \bold{S}ubset of \code{x}'s \bold{D}ata for each group, excluding any columns used in \code{by} (or \code{keyby}). + \item \code{.BY} is a \code{list} containing a length 1 vector for each item in \code{by}. This can be useful when \code{by} is not known in advance. The \code{by} variables are also available to \code{j} directly by name; useful for example for titles of graphs if \code{j} is a plot command, or to branch with \code{if()} depending on the value of a group variable. + \item \code{.N} is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by \code{i}, \code{.N} is the number of rows in \code{x} matched to, for each row of \code{i}, regardless of whether \code{nomatch} is \code{NA} or \code{NULL}. 
It is renamed to \code{N} (no dot) in the result (otherwise a column called \code{".N"} could conflict with the \code{.N} variable, see FAQ 4.6 for more details and example), unless it is explicitly named; e.g., \code{DT[,list(total=.N),by=a]}. + \item \code{.I} is an integer vector equal to \code{seq_len(nrow(x))}. While grouping, it holds for each item in the group, its row location in \code{x}. This is useful to subset in \code{j}; e.g. \code{DT[, .I[which.max(somecol)], by=grp]}. If used in \code{by} it corresponds to applying a function rowwise. + \item \code{.GRP} is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc. + \item \code{.NGRP} is an integer, length 1, containing the number of groups. } \code{.EACHI} is defined as \code{NULL} but its value is not used. Its usage is \code{by=.EACHI} (or \code{keyby=.EACHI}) which invokes grouping-by-each-row-of-i; see \code{\link{data.table}}'s \code{by} argument for more details.