Friday, 1 July 2016

R data.table Tips

Cited from Introduction to data.table

Within the frame of a data.table, columns can be referred to as if they are variables.

We can use “-” on character columns inside order(), within the frame of a data.table, to sort in decreasing order.
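As a rough illustration, the sketch below sorts a toy table by one column ascending and by a character column descending. The table dt and its values are made up for this example; they only stand in for the flights data used in the vignette.

```r
library(data.table)

# Toy data (hypothetical; not the real flights dataset)
dt <- data.table(origin = c("JFK", "LGA", "EWR", "JFK"),
                 dest   = c("LAX", "ORD", "SFO", "BOS"))

# origin ascending, then dest descending; "-" is applied to a character column
ans <- dt[order(origin, -dest)]

ans$origin  # "EWR" "JFK" "JFK" "LGA"
ans$dest    # "SFO" "LAX" "BOS" "ORD"
```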

We wrap the variables (column names) within list(), which ensures that a data.table is returned. In case of a single column name, not wrapping with list() returns a vector instead.

data.table also allows wrapping columns with .(). It is an alias for list(); they both mean the same thing. Feel free to use whichever you prefer.
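A small sketch of the difference between a bare column name, list() and .(), using a made-up table dt (the column names mirror the vignette, the values are invented):

```r
library(data.table)

# Toy data (hypothetical values)
dt <- data.table(arr_delay = c(5L, -3L, 12L),
                 dep_delay = c(2L, 0L, 9L))

v  <- dt[, arr_delay]        # bare column name: a plain vector
d1 <- dt[, list(arr_delay)]  # wrapped in list(): a data.table
d2 <- dt[, .(arr_delay)]     # .() behaves the same as list()

is.data.table(v)             # FALSE
is.data.table(d1)            # TRUE
isTRUE(all.equal(d1, d2))    # TRUE
```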

Since .() is just an alias for list(), we can name columns as we would while creating a list.

For example,
ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

The special symbol .N is a built-in variable that holds the number of observations in the current group.
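For instance, .N can count rows per group or rows matching a filter. The table below is invented for illustration:

```r
library(data.table)

# Toy data (hypothetical)
dt <- data.table(origin = c("JFK", "LGA", "JFK", "EWR", "JFK"))

# Number of rows in each origin group; groups appear in order of first occurrence
counts <- dt[, .N, by = origin]

# Number of rows matching a condition (no grouping, so the "group" is the subset)
n_jfk <- dt[origin == "JFK", .N]

counts$origin  # "JFK" "LGA" "EWR"
counts$N       # 3 1 1
n_jfk          # 3
```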

Setting with=FALSE disables the ability to refer to columns as if they are variables.

We can also deselect columns using - or !.
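A minimal sketch of both points, assuming a made-up table dt and a hypothetical character vector cols holding column names. with = FALSE is spelled out on every call here, as in the vignette at the time; newer data.table versions may not need it for deselection.

```r
library(data.table)

# Toy data (hypothetical)
dt <- data.table(a = 1:3, b = 4:6, c = 7:9)

# Column names stored in a variable are looked up as names, not evaluated as columns
cols <- c("a", "b")
sel <- dt[, cols, with = FALSE]

# Deselect columns with ! or -
drop_bang  <- dt[, !c("a", "b"), with = FALSE]
drop_minus <- dt[, -c("a", "b"), with = FALSE]

names(sel)         # "a" "b"
names(drop_bang)   # "c"
names(drop_minus)  # "c"
```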

Changing 'by' to 'keyby' automatically orders the result by the grouping variables in increasing order.
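To see the ordering difference, compare the same aggregation with by and with keyby on a toy table (values invented for this sketch):

```r
library(data.table)

# Toy data (hypothetical)
dt <- data.table(origin = c("LGA", "JFK", "EWR", "JFK"),
                 delay  = c(10, 20, 30, 40))

# by keeps groups in order of first appearance
by_res <- dt[, .(mean_delay = mean(delay)), by = origin]

# keyby sorts the result by the grouping column
keyby_res <- dt[, .(mean_delay = mean(delay)), keyby = origin]

by_res$origin     # "LGA" "JFK" "EWR"
keyby_res$origin  # "EWR" "JFK" "LGA"
```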

The special symbol .SD stands for Subset of Data. It is itself a data.table that holds the data for the current group defined using by.

By default, .SD contains all the columns other than the grouping variables.

The .SDcols argument restricts which columns .SD contains. It accepts either column names or column indices. For example, .SDcols = c("arr_delay", "dep_delay") ensures that .SD contains only these two columns for each group.
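A short sketch of .SD with and without .SDcols, on an invented table whose column names follow the vignette:

```r
library(data.table)

# Toy data (hypothetical values)
dt <- data.table(origin    = c("JFK", "JFK", "LGA"),
                 arr_delay = c(10, 20, 30),
                 dep_delay = c(1, 3, 5),
                 distance  = c(100, 200, 300))

# Without .SDcols, .SD holds every non-grouping column
all_means <- dt[, lapply(.SD, mean), by = origin]

# With .SDcols, .SD holds only the listed columns
delay_means <- dt[, lapply(.SD, mean), by = origin,
                  .SDcols = c("arr_delay", "dep_delay")]

names(all_means)    # "origin" "arr_delay" "dep_delay" "distance"
names(delay_means)  # "origin" "arr_delay" "dep_delay"
```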

######################################
Cited from Keys and fast binary search based subset

We can set keys on multiple columns, and the columns can be of different types. Uniqueness is not enforced.

Setting a key does two things:
  1. reorders the rows of the data.table by the column(s) provided, by reference and always in increasing order.
  2. marks those columns as key columns by setting an attribute called sorted on the data.table.
Since the rows are reordered, a data.table can have at most one key because it cannot be sorted in more than one way.

setkey() and setkeyv() modify the input data.table by reference. They return the result invisibly.
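The points above can be sketched on a toy table (invented values) with a character and an integer key column and a repeated key value:

```r
library(data.table)

# Toy data (hypothetical); "JFK" appears twice, so the key is not unique
dt <- data.table(origin = c("LGA", "JFK", "JFK", "EWR"),
                 month  = c(2L, 3L, 1L, 5L))

# Key on two columns of different types; dt itself is reordered, no assignment needed
setkey(dt, origin, month)
key_two <- key(dt)
sorted_attr <- attr(dt, "sorted")
origin_after <- dt$origin
month_after  <- dt$month

# setkeyv() takes column names as a character vector; the new key replaces the old one
setkeyv(dt, "month")
key_one <- key(dt)

key_two       # "origin" "month"
origin_after  # "EWR" "JFK" "JFK" "LGA"
month_after   # 5 1 3 2
key_one       # "month"
```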

In data.table, the := operator and all the set* functions (e.g., setkey, setorder, setnames, etc.) are the only ones which modify the input object by reference.

In addition to ordering, keyby also sets the key column.
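A quick check of this on an invented table: the result of a keyby aggregation reports the grouping column as its key, while the by result has none.

```r
library(data.table)

# Toy data (hypothetical)
dt <- data.table(origin = c("LGA", "JFK", "EWR"), delay = c(10, 20, 30))

with_by    <- dt[, .(total = sum(delay)), by = origin]
with_keyby <- dt[, .(total = sum(delay)), keyby = origin]

key(with_by)     # NULL
key(with_keyby)  # "origin"
```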

######################################
Cited from Reference semantics

:= returns the result invisibly. Sometimes it might be necessary to see the result after the assignment. We can accomplish that by adding an empty [] at the end of the query, like flights[hour == 24L, hour := 0L][].
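The same pattern on a small made-up table (the hour column imitates the one in flights):

```r
library(data.table)

# Toy data (hypothetical)
dt <- data.table(hour = c(23L, 24L, 1L))

# := updates dt in place; the trailing [] prints the updated table
dt[hour == 24L, hour := 0L][]

dt$hour  # 23 0 1
```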

The copy() function deep copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object.
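A minimal sketch, with invented table names, showing that an update by reference on the copy leaves the original untouched:

```r
library(data.table)

# Toy data (hypothetical)
original <- data.table(x = 1:3)

# Deep copy, then add a column to the copy by reference
duplicate <- copy(original)
duplicate[, y := x * 2L]

names(original)   # "x"
names(duplicate)  # "x" "y"
```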

######################################
Cited from Efficient reshaping using data.tables

By default, the variable column returned by melt() is of type factor. Set the variable.factor argument to FALSE if you’d like it returned as a character vector instead.
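As a sketch, melting a tiny wide table (made up for this example) with and without variable.factor = FALSE:

```r
library(data.table)

# Toy wide table (hypothetical)
wide <- data.table(id = 1:2, a = c(1.5, 2.5), b = c(3.5, 4.5))

# Default: the variable column is a factor
long_factor <- melt(wide, id.vars = "id")

# variable.factor = FALSE: the variable column is character
long_char <- melt(wide, id.vars = "id", variable.factor = FALSE)

is.factor(long_factor$variable)   # TRUE
is.character(long_char$variable)  # TRUE
```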
