Thursday 1 June 2017

R ggplot2 Book Reading Notes

Cited from the book "ggplot2 - Elegant Graphics for Data Analysis"

Grammar
"In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars)." Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that makes up a graphic.

"qplot" Function
qplot() accepts functions of variables as arguments, for example:
qplot(log(carat), log(price), data = diamonds)

Aesthetic
For every aesthetic attribute, there is a function, called a scale, which maps data values to valid values for that aesthetic. 

You can also manually set the aesthetics using I(), e.g., colour = I("red") or size = I(2).
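
A minimal sketch of the difference between mapping and setting, assuming ggplot2 is installed and using its bundled diamonds data. qplot() is deprecated in recent ggplot2 releases, so it may emit a warning; the names p_mapped and p_set are only illustrative.

```r
library(ggplot2)

# Mapping: colour varies with the 'cut' variable, and a legend is drawn.
p_mapped <- qplot(carat, price, data = diamonds, colour = cut)

# Setting: I() marks the value as literal, so every point is drawn in red
# and no scale (or legend) is involved.
p_set <- qplot(carat, price, data = diamonds, colour = I("red"))
```

Printing either object (for example print(p_set)) draws the plot.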

Aesthetics and arguments for the following plots:
jittered and scatter plots: "size", "colour" and "shape";
boxplots: outline "colour", the internal "fill" colour and the "size" of the lines;
density plots: the "adjust" argument controls the degree of smoothness (high values of adjust produce smoother plots);
histogram: the "binwidth" argument controls the amount of smoothing by setting the bin size. (Break points can also be specified explicitly, using the breaks argument.)
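
The effect of adjust and binwidth can be sketched as follows. This uses ggplot() with geom_density() and geom_histogram() rather than qplot(), and the particular values (0.5, 2, 0.01, 0.5) are arbitrary choices for illustration.

```r
library(ggplot2)

# Larger adjust gives a smoother density curve (the default is adjust = 1).
p_rough  <- ggplot(diamonds, aes(carat)) + geom_density(adjust = 0.5)
p_smooth <- ggplot(diamonds, aes(carat)) + geom_density(adjust = 2)

# Smaller binwidth gives more, narrower bars and so less smoothing.
p_fine   <- ggplot(diamonds, aes(carat)) + geom_histogram(binwidth = 0.01)
p_coarse <- ggplot(diamonds, aes(carat)) + geom_histogram(binwidth = 0.5)
```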
 
Mapping a categorical variable to an aesthetic will automatically split up the geom by that variable.

The bar geom counts the number of instances of each class so that you don't need to tabulate your values beforehand, as with bar charts in base R. If the data has already been tabulated or if you'd like to tabulate class members in some other way, such as by summing up a continuous variable, you can use the weight aesthetic.
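
As a rough illustration of the weight aesthetic, assuming the diamonds data that ships with ggplot2 (p_count and p_weight are illustrative names):

```r
library(ggplot2)

# Default: bar height is the number of diamonds of each colour.
p_count <- ggplot(diamonds, aes(color)) + geom_bar()

# weight = carat: bar height is the total carat of each colour instead.
p_weight <- ggplot(diamonds, aes(color, weight = carat)) +
  geom_bar() +
  ylab("carat")
```

The second plot should show the same totals as tapply(diamonds$carat, diamonds$color, sum).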

Line and path plots are typically used for time series data. Line plots join the points from left to right, while path plots join them in the order that they appear in the dataset (a line plot is just a path plot of the data sorted by x value).
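
A small sketch of the difference, using a made-up three-row data frame whose rows are deliberately not sorted by x:

```r
library(ggplot2)

# Hypothetical data, listed out of x order.
df <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))

# Path: joins the points in row order, (3,1) -> (1,2) -> (2,3).
p_path <- ggplot(df, aes(x, y)) + geom_path()

# Line: joins the points from left to right, (1,2) -> (2,3) -> (3,1).
p_line <- ggplot(df, aes(x, y)) + geom_line()

# Sorting by x first makes the path plot look like the line plot.
p_sorted <- ggplot(df[order(df$x), ], aes(x, y)) + geom_path()
```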

To draw a separate line (or other geom) for each group, map the group aesthetic to a variable encoding the group membership of each observation.
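
A sketch of the group aesthetic, assuming the Orange dataset from base R's datasets package (growth of five orange trees), which is not the dataset used in the book:

```r
library(ggplot2)

# Without group: all observations are joined into one zig-zag line.
p_single <- ggplot(Orange, aes(age, circumference)) + geom_line()

# With group = Tree: one line per tree.
p_grouped <- ggplot(Orange, aes(age, circumference, group = Tree)) + geom_line()
```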

Faceting
Faceting is specified with a faceting formula of the form row_var ~ col_var.

To facet on only one of columns or rows, use '.' as a placeholder. For example, 'row_var ~ .' produces a single column with multiple rows.

qplot(carat, ..density.., data = diamonds, facets = color ~ ., geom = "histogram", binwidth = 0.1, xlim = c(0, 3))

Using ..density.. tells ggplot2 to map the density to the y-axis instead of the default use of count.
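
For comparison, here is a sketch of the same plot written with ggplot() instead of qplot(). It assumes ggplot2 3.3.0 or later, where after_stat(density) is the replacement spelling for ..density..; the name p_facet is illustrative.

```r
library(ggplot2)

p_facet <- ggplot(diamonds, aes(carat, after_stat(density))) +
  geom_histogram(binwidth = 0.1) +
  facet_grid(color ~ .) +   # one row of panels per diamond colour
  xlim(0, 3)                # values outside 0..3 are dropped with a warning
```

Because each panel shows a density rather than a count, the panels can be compared even though the colours have very different numbers of diamonds.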

Other Options for qplot
log: a character vector indicating which (if any) axes should be logged. For example, log="x" will log the x-axis, log="xy" will log both. 
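
Logging an axis is not the same as logging a variable. The sketch below contrasts the two with ggplot(). As far as I know, log = "xy" in qplot() corresponds to adding log10 position scales, but treat that equivalence as an assumption.

```r
library(ggplot2)

# Log-scaled axes: data are plotted on a log10 scale,
# but the tick labels stay in the original units.
p_scaled <- ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

# Logged variables: the values themselves are transformed (natural log),
# so the tick labels are in log units.
p_logged <- ggplot(diamonds, aes(log(carat), log(price))) + geom_point()
```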

Scaling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cited from the book "R for Data Science"

ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Scaling is the process of converting data units to physical units that a computer can display. It is performed by scales.

Colours are represented by a six-digit hexadecimal string (such as "#FF0000" for red), sizes by a number and shapes by an integer.

Scaling positions horizontally and vertically is the process of mapping the range of the data to [0, 1]. [0, 1] is used instead of exact pixels because the drawing system that ggplot2 uses, grid, takes care of that final conversion for us.
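
The idea of mapping a data range onto [0, 1] can be sketched in base R. The helper rescale01() is a hypothetical name written for this note, not a ggplot2 function, and ggplot2's real position scales also handle details such as axis expansion that are ignored here.

```r
# Linearly map x so that its minimum becomes 0 and its maximum becomes 1.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(10, 15, 20))
# 0.0 0.5 1.0
```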

The coordinate system (Cartesian coordinates, polar coordinates or others) determines how the two positions (x and y) are combined to form the final location on the plot.

With colours, scaling involves mapping the data values to points in a three-dimensional colour space. A categorical variable is mapped to evenly spaced hues on the colour wheel.
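
Evenly spaced hues can be sketched with base R's hcl() function. The helper name even_hues() is made up for this note, and the starting hue (15), chroma (100) and luminance (65) are assumptions about ggplot2's defaults rather than something stated in the book.

```r
# Pick n hues evenly spaced around the 360-degree colour wheel
# and convert them to hexadecimal colour strings.
even_hues <- function(n, start = 15, chroma = 100, luminance = 65) {
  hues <- seq(start, start + 360, length.out = n + 1)[seq_len(n)]
  list(hues = hues, colours = hcl(h = hues, c = chroma, l = luminance))
}

three <- even_hues(3)
three$hues     # 15 135 255: hues 120 degrees apart
three$colours  # three "#RRGGBB" strings
```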

Plot Grammar
To create a complete plot we need to combine graphical objects from three sources: the data, represented by the point geom; the scales and coordinate system, which generate axes and legends so that we can read values from the graph; and plot annotations, such as the background and plot title.

After mapping the data to aesthetics, the data is passed to a statistical transformation, or stat, which manipulates the data in some useful way. Stats include 1d and 2d binning, group means, quantile regression and contouring.

Scale transformation occurs before statistical transformation so that statistics are computed on the scale-transformed data.

After the statistics are computed, each scale is trained on every dataset from all the layers and facets. The training operation combines the ranges of the individual datasets to get the range of the complete data. Without this step, scales could only make sense locally and we wouldn’t be able to overlay different layers because their positions wouldn’t line up. Sometimes we do want to vary position scales across facets (but never across layers). 
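
Training can be pictured as taking the range over the ranges of every dataset. This is a base R sketch with made-up numbers, not ggplot2's actual implementation:

```r
# Hypothetical x values from two layers of the same plot.
layer_a <- c(2, 5, 9)
layer_b <- c(-1, 4)

# Each layer alone would suggest a different axis range ...
range(layer_a)  # 2 9
range(layer_b)  # -1 4

# ... so the scale is trained on both to get one shared range.
trained <- range(range(layer_a), range(layer_b))
trained         # -1 9
```

With the shared range, a value of 4 lands at the same horizontal position in both layers.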

Finally the scales map the data values into aesthetic values. This is a local operation: the variables in each dataset are mapped to their aesthetic values producing a new dataset that can then be rendered by the geoms.

Together, the data, mappings, stat, geom and position adjustment form a layer.

Whereas scale transformation occurs before statistical transformation, coordinate transformation occurs afterward.

A plot object is a list with components data, mapping (the default aesthetic mappings), layers, scales, coordinates, facet and options.


Layer
In the ggplot() function, pairs of aesthetic attribute and variable name are wrapped in the aes() function.

The ggplot() function creates a plot object. In the version of ggplot2 the book describes, the plot object can't be displayed until at least one layer is added.

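p <- ggplot(diamonds, aes(carat, price, colour = cut))
# ggplot2 2.0.0 and later require stat and position to be given explicitly in layer()
p <- p + layer(geom = "point", stat = "identity", position = "identity")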

The "+" operator adds a layer to the plot.

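A more fully specified layer can take any or all of these arguments (as listed in the book; ggplot2 2.0.0 and later merge geom_params and stat_params into a single params list and require geom, stat and position):
layer(geom, geom_params, stat, stat_params, data, mapping, position)

p <- ggplot(diamonds, aes(x = carat))
p <- p + layer(
  geom = "bar",
  stat = "bin",
  position = "stack",
  # fill goes to the geom, binwidth to the stat
  params = list(fill = "steelblue", binwidth = 2)
)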

Every geom is associated with a default statistic and position, and every statistic with a default geom. This means that you only need to specify one of stat or geom to get a completely specified layer.
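
Because of these defaults, the same histogram layer can be reached from either side. A sketch, assuming the diamonds data (p_geom and p_stat are illustrative names):

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat))

# Start from the geom: geom_histogram() uses the bin stat by default.
p_geom <- p + geom_histogram(binwidth = 2, fill = "steelblue")

# Start from the stat: stat_bin() drawn with the bar geom.
p_stat <- p + stat_bin(geom = "bar", binwidth = 2, fill = "steelblue")
```

Both layers should compute the same bin counts, so the two plots are expected to look alike.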

geom_XXX(mapping, data, ..., stat, position)
stat_XXX(mapping, data, ..., geom, position)

...: Parameters for the geom or stat, such as bin width in the histogram or bandwidth for a loess smoother. You can also use aesthetic properties as parameters. When you do this you set the property to a fixed value, not map it to a variable in the dataset.
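
The difference between setting and mapping an aesthetic can be sketched with the mtcars data from base R:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(mpg, wt))

# Setting: colour is a fixed parameter, so every point is dark blue.
p_set <- p + geom_point(colour = "darkblue")

# Mapping: "darkblue" is treated as a data value with a single level,
# which the colour scale then maps to its own default colour
# (and a legend is added).
p_map <- p + geom_point(aes(colour = "darkblue"))
```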

The %+% operator replaces the default dataset of an existing plot with a new data frame:
p %+% newdataframe






















