Shortcuts in Science: June 2017

Friday 30 June 2017

Function vs Method in Python

Cited from the book "Data Structure and Algorithms in Python"

The general term function describes a traditional, stateless function that is invoked without the context of a particular class or an instance of that class.

The term method describes a member function that is invoked upon a specific object using an object-oriented message passing syntax.
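A minimal sketch of the distinction, using the built-in sorted() function and the list sort() method (the variable names are only illustrative):

```python
data = [3, 1, 2]

# Function: called on its own, with the data passed as an argument.
ordered = sorted(data)   # returns a new list

# Method: invoked on a specific object with dot syntax.
data.sort()              # sorts the list in place and returns None
```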

Python Set Object

Cited from the book "Data Structure and Algorithms in Python"

Only instances of immutable types can be added to a Python set.
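A short sketch of this rule. Strictly speaking, Python checks that an element is hashable, which holds for the built-in immutable types; the variable names below are illustrative.

```python
s = set()
s.add(42)          # int is immutable
s.add("text")      # str is immutable
s.add((1, 2))      # a tuple of immutable items is accepted

try:
    s.add([1, 2])  # a list is mutable (unhashable), so this raises TypeError
    added_list = True
except TypeError:
    added_list = False
```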

Python Unicode Characters

Cited from the book "Data Structure and Algorithms in Python"

Unicode characters can be included, such as '20\u20AC' for the string 20€.
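For instance, the escape \u20AC is the code point of the euro sign, so the following sketch builds a three-character string (the variable name is illustrative):

```python
price = '20\u20AC'   # '\u20AC' is a single character, the euro sign
print(price)
```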

Tuple of Length One in Python

Cited from the book "Data Structure and Algorithms in Python"

To express a tuple of length one as a literal, a comma must be placed after the element, but within the parentheses. For example, (17,) is a one-element tuple. The reason for this requirement is that, without the trailing comma, the expression (17) is viewed as a simple parenthesized numeric expression.
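A quick check of the difference, with illustrative variable names:

```python
one = (17,)    # trailing comma: a tuple with one element
number = (17)  # no comma: just the integer 17 in parentheses

print(type(one).__name__, type(number).__name__)
```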

Python list() Function

Cited from the book "Data Structure and Algorithms in Python"

The constructor list() will accept any parameter that is of an iterable type (list, tuple, string, set and dictionary).
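A few illustrative calls (the variable names are made up for this sketch). Note that iterating over a dictionary yields its keys:

```python
from_string = list('hello')             # one element per character
from_tuple = list((1, 2, 3))
from_dict = list({'a': 1, 'b': 2})      # keys only, in insertion order
```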

Python int() Function

Cited from the book "Data Structure and Algorithms in Python"

If conversion from a different base is desired, that base can be indicated as a second, optional, parameter. For example, the expression int('7f', 16) evaluates to the integer 127.
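A small sketch of the two-argument form; the first argument must be a string, and the variable names are illustrative:

```python
from_hex = int('7f', 16)      # hexadecimal digits, base 16
from_binary = int('1011', 2)  # binary digits, base 2
```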

Binary, Octal and Hexadecimal Integer in Python

Cited from the book "Data Structure and Algorithms in Python"

In some contexts, it is convenient to express an integral value using binary, octal, or hexadecimal. That can be done by using a prefix of the number 0 and then a character to describe the base. Examples of such literals are respectively 0b1011, 0o52, and 0x7f.
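The literals above are ordinary integers; the built-in bin(), oct() and hex() functions go the other way and return prefixed strings. A brief sketch with illustrative names:

```python
values = (0b1011, 0o52, 0x7f)   # binary, octal and hexadecimal literals

# Convert the integers back to prefixed strings.
as_text = (bin(values[0]), oct(values[1]), hex(values[2]))
```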

Bool() Function

Cited from the book "Data Structure and Algorithms in Python"

bool(var): converts a nonboolean value (here the variable named "var") to a boolean. The same interpretation applies when a nonboolean value is used as a condition in a control structure: the condition is treated as bool(var).
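A minimal sketch of this interpretation: zero and empty containers count as False, everything else below counts as True. The variable names are illustrative.

```python
values = [0, 3.5, '', 'text', [], [0]]
truthiness = [bool(v) for v in values]

items = []
# The condition is evaluated as bool(items); an empty list is False.
if items:
    message = 'nonempty'
else:
    message = 'empty'
```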

Mutability of a Class in Python

Cited from the book "Data Structure and Algorithms in Python"

A class is immutable if each object of that class has a fixed value upon instantiation that cannot subsequently be changed.

Python Accessor or Mutator Methods

Cited from the book "Data Structure and Algorithms in Python"

Some methods return information about the state of an object, but do not change that state. These are known as accessors.

Other methods, such as the sort method of the list class, do change the state of an object. These methods are known as mutators or update methods.
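As an illustration with the list class, count() is an accessor and sort() is a mutator (the variable names are made up for this sketch):

```python
scores = [70, 95, 70, 82]

n = scores.count(70)   # accessor: reports state, list is unchanged
before = list(scores)  # copy taken to show that count() changed nothing

scores.sort()          # mutator: reorders the list in place
```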

Python Member Functions/Methods

Cited from the Book "Data Structure and Algorithms in Python"

Python’s classes may also define one or more methods (also known as member functions), which are invoked on a specific instance of a class using the dot (“.”).

Python Format

Cited from https://www.python.org/dev/peps/pep-3101/

"The story of {0}, {1}, and {c}".format(a, b, c=d)

    Within a format string, each positional argument is identified
    with a number, starting from zero, so in the above example, 'a' is
    argument 0 and 'b' is argument 1.  Each keyword argument is
    identified by its keyword name, so in the above example, 'c' is
    used to refer to the third argument.
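A runnable version of the example, with placeholder values assumed for a, b and d:

```python
a, b, d = 'Alice', 'Bob', 'Carol'   # placeholder values for illustration

# {0} and {1} pick positional arguments; {c} picks the keyword argument c.
story = "The story of {0}, {1}, and {c}".format(a, b, c=d)
print(story)
```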

Useful Functions in Python

python -i script.py

Developing Python Package

The Hitchhiker’s Guide to Python

Python Package

Thursday 29 June 2017

Pipes in Xargs

piping commands after a piped xargs

Wednesday 28 June 2017

Installing ESS in GNU

Installing ESS on Ubuntu

Promoting transcription over long distances

Tuesday 27 June 2017

Management of R Memory and Nested expressions

Cited from options {base}

options(expressions=int); valid int values are 25...500000.

##################################
Cited from Memory {base}

--max-ppsize=int; Currently the maximum int value accepted is 500000.

How Much Memory Is Currently Used By R?

Monday 26 June 2017

Prerelease

A prerelease tells users that the software is not ready for production, but that they can still download and test it.

Semantic Versioning Scheme

Cited from Semantic Versioning 2.0.0

A version number consists of "MAJOR.MINOR.PATCH".

• Increment the MAJOR version when you make incompatible API changes.
• Increment the MINOR version when you add functionality in a backwards-compatible manner.
• Increment the PATCH version when you make backwards-compatible bug fixes.
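The three rules can be sketched as a small helper. The function name bump() is hypothetical, and the sketch assumes a plain "MAJOR.MINOR.PATCH" string without prerelease or build tags; lower-order parts are reset to 0 as the specification requires.

```python
def bump(version, part):
    """Return the next version after incrementing the given part."""
    major, minor, patch = (int(x) for x in version.split('.'))
    if part == 'major':
        return f'{major + 1}.0.0'            # incompatible API change
    if part == 'minor':
        return f'{major}.{minor + 1}.0'      # backwards-compatible feature
    if part == 'patch':
        return f'{major}.{minor}.{patch + 1}'  # backwards-compatible bug fix
    raise ValueError(f'unknown part: {part}')

print(bump('1.4.2', 'minor'))
```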

Saturday 24 June 2017

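Chongqing to New York Direct Flight

The Chongqing to New York route is scheduled to open officially on October 20, with two flights per week. Outbound flights depart Chongqing Jiangbei International Airport every Wednesday and Friday at 22:00 Beijing time and arrive at New York John F. Kennedy International Airport at 00:50 local time the next day. Return flights depart New York John F. Kennedy International Airport every Thursday and Saturday at 02:50 local time and arrive at Chongqing Jiangbei International Airport at 06:35 Beijing time the next day.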

Thursday 22 June 2017

Markdown Resources

Instant Markdown

Markdown Cheatsheet

Learning Notes of Github Essentials

Cited from the Book "Github Essentials"

git config [--global] user.email "email"
git config [--global] user.name "username"
git log
git show

The Raw, Blame and History Buttons
The Raw button, like the name suggests, opens the file in a raw form, meaning that any HTML formatting disappears.

The Blame button makes use of Git's blame function. Basically, for each line of a file, Git informs you about who modified that line and when that line was modified.
git blame filename

The History button is nothing more than Git's log function for a particular file.

The Watch, Star and Fork Buttons

The Watch button manages the level of subscription in a repository. GitHub notifies you with an e-mail whenever an action takes place in a repository you follow and, at the same time, lists them in the notifications area where you can later mark them as read. (https://github.com/notifications)

The Star button is a way to show your appreciation to a repository and its creator. It depicts the popularity of a project. Whenever you star a repository, it gets added to the list of your starred ones. You can see all your starred repositories at https://github.com/stars.

The main use of the Fork button is when one wants to contribute to a project. When you fork a repository, it gets copied into your own namespace, so you have full ownership of that copy and are able to modify anything you want.

Markdown is a text-to-HTML conversion tool: you write text that contains structural information, and it is automatically converted to valid HTML.

Bash Special Variables

What are the special dollar sign shell variables?

Tuesday 20 June 2017

HTML To Text

html2text.py

Monday 19 June 2017

Regular Expression

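# Pattern: a semicolon, then non-underscore characters, then an underscore
VAR=';[^_]+_'
if [[ $A =~ $VAR ]]; then
    # Keep the text before the semicolon plus the final _suffix
    OUT=$(echo "$A" | perl -pe 's|(.+);.+(_[^_]+)$|$1$2|g')
elif [[ $A =~ ';' ]]; then
    # Otherwise replace every semicolon with an underscore
    OUT=$(echo "$A" | perl -pe 's|;|_|g')
else
    OUT=$A
fi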
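The same three-way logic can be sketched in Python with the re module. The function name normalize_id() and the sample inputs are hypothetical; the patterns are copied from the shell snippet.

```python
import re

def normalize_id(a):
    """Mirror the shell logic: collapse or replace semicolons in an ID."""
    if re.search(r';[^_]+_', a):
        # Keep the text before the semicolon plus the final _suffix.
        return re.sub(r'(.+);.+(_[^_]+)$', r'\1\2', a)
    if ';' in a:
        # Replace every semicolon with an underscore.
        return a.replace(';', '_')
    return a

print(normalize_id('geneA;geneB_peak1'))
```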

Thursday 15 June 2017

Insulation Score

Cited from the review Insulated Neighborhoods: Structural and Functional Units of Mammalian Gene Control

The insulation score of a neighborhood is calculated as the percentage of enhancer-promoter interactions that are fully contained within the neighborhood.
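A rough sketch of this calculation, not the review's implementation. It assumes each interaction is an (enhancer_position, promoter_position) pair, that the neighborhood is a closed interval, and that the denominator is the set of interactions with at least one anchor inside the neighborhood; the function name and the coordinates are made up for illustration.

```python
def insulation_score(interactions, start, end):
    """Percentage of interactions with both anchors inside [start, end]."""
    def inside(pos):
        return start <= pos <= end

    # Assumption: only interactions touching the neighborhood are counted.
    touching = [(e, p) for e, p in interactions if inside(e) or inside(p)]
    if not touching:
        return None  # score is undefined without any interactions
    contained = sum(1 for e, p in touching if inside(e) and inside(p))
    return 100.0 * contained / len(touching)

# Toy data: two contained interactions, one crossing the boundary, one outside.
interactions = [(120, 180), (150, 400), (130, 160), (500, 600)]
score = insulation_score(interactions, 100, 200)
```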

Cohesin and CTCF

Cited from News & Review Genome Organization: Cohesin on the Move

Cohesin complexes co-localize with CCCTC binding factor (CTCF) at most of their binding sites in mammalian genomes. CTCF is a sequence-specific DNA-binding protein involved in the formation of chromatin loops, while cohesin associates with chromatin independently of the DNA sequence.

Computational Pipeline

Nextflow: Data-driven computational pipelines

A Brief Introduction To Scientific Workflows (Nextflow Demo)

OMICS

Bazel

Awesome Pipeline

Wednesday 14 June 2017

Motif Formats

Motif Conversion Utilities

Convert HOMER motif matrix to pfm?

Tuesday 13 June 2017

Running Singularity in SGE

Running Containers under SGE

Singularity Wiki

Monday 12 June 2017

Ubuntu Version

R.Version()

Thursday 8 June 2017

R Useful Functions

mtcars <- transform(mtcars, mpg = mpg ^ 2)

col(x) or row(x): returns a matrix of integers giving the column or row index of each element of a matrix-like object x (a matrix or a data.frame).

file.info(dout)$isdir: logical value indicating whether the path dout is a directory (NA if the path does not exist).

I(): A copy of the object with class "AsIs" prepended to the class(es).

gl() function generates factors by specifying the pattern of their levels.

interaction() computes a factor which represents the interaction of the given factors. The result of interaction is always unordered.

dput() and dget()

dplyr::near(float.num1, float.num2): compares two floating-point numbers for equality within a tolerance.

dplyr::select_(data, "year", "month", "day")
col_vector <- c("year", "month", "day")
dplyr::select_(data, .dots = col_vector)
select_(data, 'year:day')
select_(data, 'year:day', '-month')
select_(data, '-(year:day)')
select_(data, 'starts_with("arr")')
select_(data, '-ends_with("time")')

select_(data, .dots = c('starts_with("arr")', '-ends_with("time")'))

ggplot2::cut_width: makes groups of width ‘width’

ggplot2::cut_interval: makes ‘n’ groups with equal range

ggplot2::cut_number: makes ‘n’ groups with (approximately) equal numbers of observations

geom_freqpoly(): an alternative to geom_histogram(). It performs the same calculation as geom_histogram(), but displays the counts with lines instead of bars.

dplyr::between(vector, left, right): a shortcut for ‘x >= left & x <= right’

stats::reorder(): treats its first argument as a categorical variable, and reorders its levels based on the values of a second variable, usually numeric.

relevel(): is a special case of simply calling factor(x, levels = levels(x)[....])

dplyr::count() is a short-hand for group_by() + tally()

geom_count
geom_tile

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins.

modelr: modelling functions that work with the pipe.

as_tibble(): converts a data.frame to a tibble. Compared with data.frame(), tibble() does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.

tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~).

lubridate::now()

lubridate::today()

With tibble, you can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf will display all columns.

options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.
Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.

You can browse the package help with package?xml2.

parse_logical

parse_integer

parse_date

charToRaw("Hadley") #> [1] 48 61 64 6c 65 79
Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on.
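The same byte view can be reproduced outside R. As a cross-check, this Python sketch encodes the string as ASCII and prints each byte as two hexadecimal digits (the variable names are illustrative):

```python
raw = 'Hadley'.encode('ascii')               # bytes object, one byte per character
as_hex = ' '.join(f'{b:02x}' for b in raw)   # two hex digits per byte
print(as_hex)
```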

jsonlite::fromJSON
jsonlite::toJSON

gather() converts data from wide format to long format.

spread() converts data from long format to wide format.

tidyr::separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears.

tidyr::unite() takes a data frame, the name of the new variable to create, and a set of columns to combine.

tidyr::complete() turns implicit missing values into explicit missing values. This is a wrapper around ‘expand()’, ‘left_join()’ and ‘replace_na()’ that's useful for completing missing combinations of data. It takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.

tidyr::fill(). It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).

writeLines() to see the raw content of a string.

?"'" to see the complete list of special characters.

Tuesday 6 June 2017

Convert Homer Motif Format to MEME Motif Format

MoVRs_Motif2meme.R

Monday 5 June 2017

Background Model for Motif Analysis

Background Models for MEME

MEME Background Model Format

Improving analysis of transcription factor binding sites within ChIP-Seq data based on topological motif enrichment (Paper to assess the effect of background on motif discovery)

BiasAway Bias Matched Background Generator

Saturday 3 June 2017

Recipes for Families and Friends 04/06/2017

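Braised oxtail

Five-flavor squid

Chicken congee

Minced pork with dried tofu skin and glass noodles

Seaweed and egg drop soup

Scrambled eggs with Chinese chives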

Thursday 1 June 2017

R ggplot2 Book Reading Notes

Cited from the book "ggplot2 - Elegant Graphics for Data Analysis"

Grammar?
"In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars)." Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic.

"qplot" Function
qplot() accepts functions of variables as arguments. qplot(log(carat), log(price), data = diamonds)

Aesthetic
For every aesthetic attribute, there is a function, called a scale, which maps data values to valid values for that aesthetic.

You can also manually set the aesthetics using I(), e.g., colour = I("red") or size = I(2).

Aesthetics for the following plots
jittered and scatter plots: "size", "colour" and "shape";
boxplots: outline "colour", the internal "fill" colour and the "size" of the lines;
density plots: "adjust" argument controls the degree of smoothness (high values of adjust produce smoother plots);
histogram: the "binwidth" argument controls the amount of smoothing by setting the bin size. (Break points can also be specified explicitly, using the breaks argument.).

Mapping a categorical variable to an aesthetic will automatically split up the geom by that variable.

The bar geom counts the number of instances of each class so that you don’t need to tabulate your values beforehand, as with barchart in base R. If the data has already been tabulated or if you’d like to tabulate class members in some other way, such as by summing up a continuous variable, you can use the weight aesthetic.

Line and path plots are typically used for time series data. Line plots join the points from left to right, while path plots join them in the order that they appear in the dataset (a line plot is just a path plot of the data sorted by x value).

Map the group aesthetic to a variable encoding the group membership of each observation.

Faceting
Faceting is specified with a faceting formula of the form row_var ~ col_var.

To facet on only one of columns or rows, use '.' as a placeholder. For example, 'row_var ~ .'.

qplot(carat, ..density.., data = diamonds, facets = color ~ ., geom = "histogram", binwidth = 0.1, xlim = c(0, 3))

Using ..density.. tells ggplot2 to map the density to the y-axis instead of the default use of count.

Other Options for Qplot

log: a character vector indicating which (if any) axes should be logged. For example, log="x" will log the x-axis, log="xy" will log both.

Scaling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cited from the book "R for Data Science"

ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Scaling is the process of converting data units to physical units that a computer can display, and it is performed by 'scales'.

Colours are represented by a six-letter hexadecimal string, sizes by a number and shapes by an integer.

Scaling positions horizontally and vertically is the process of mapping the range of the data to [0, 1]. [0, 1] is used instead of exact pixels because the drawing system that ggplot2 uses, grid, takes care of that final conversion for us.
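The position mapping described here is a linear rescaling. A minimal sketch in Python, not ggplot2's own code; the function name rescale01() is hypothetical and the sketch assumes the data contain at least two distinct values:

```python
def rescale01(values):
    """Linearly map values so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo; a constant vector would divide by zero.
    return [(v - lo) / (hi - lo) for v in values]

positions = rescale01([2, 4, 6])
```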

The coordinate system (Cartesian coordinates, polar coordinates or others) determines how the two positions (x and y) are combined to form the final location on the plot.

With colors, scaling involves mapping the data values to points in a three-dimensional colour space. A categorical variable is mapped to evenly spaced hues on the colour wheel.

Plot Grammar
To create a complete plot we need to combine graphical objects from three sources: the data, represented by the point geom; the scales and coordinate system, which generate axes and legends so that we can read values from the graph; and plot annotations, such as the background and plot title.

After mapping the data to aesthetics, the data is passed to a statistical transformation, or stat, which manipulates the data in some useful way. Stats include 1d and 2d binning, group means, quantile regression and contouring.

Scale transformation occurs before statistical transformation so that statistics are computed on the scale-transformed data.

After the statistics are computed, each scale is trained on every dataset from all the layers and facets. The training operation combines the ranges of the individual datasets to get the range of the complete data. Without this step, scales could only make sense locally and we wouldn’t be able to overlay different layers because their positions wouldn’t line up. Sometimes we do want to vary position scales across facets (but never across layers).

Finally the scales map the data values into aesthetic values. This is a local operation: the variables in each dataset are mapped to their aesthetic values producing a new dataset that can then be rendered by the geoms.

Together, the data, mappings, stat, geom and position adjustment form a layer.

Additionally, scaling is performed before statistical transformation, while coordinate transformations occur afterward.

A plot object is a list with components data, mapping (the default aesthetic mappings), layers, scales, coordinates, facet and options.

Layer

In the ggplot() function, pairs of aesthetic attribute and variable name are wrapped in the aes() function.

The ggplot() function creates a plot object. The plot object can't be displayed until a layer is added.

p <- ggplot(diamonds, aes(carat, price, colour = cut))

p <- p + layer(geom = "point")

"+" is to add a layer to the plot.

A more fully specified layer can take any or all of these arguments:

layer(geom, geom_params, stat, stat_params, data, mapping, position)

p <- ggplot(diamonds, aes(x = carat))
p <- p + layer(
geom = "bar",
geom_params = list(fill = "steelblue"),
stat = "bin",
stat_params = list(binwidth = 2)
)

Every geom is associated with a default statistic and position, and every statistic with a default geom. This means that you only need to specify one of stat or geom to get a completely specified layer.

geom_XXX(mapping, data, ..., geom, position)
stat_XXX(mapping, data, ..., stat, position)

...: Parameters for the geom or stat, such as bin width in the histogram or bandwidth for a loess smoother. You can also use aesthetic properties as parameters. When you do this you set the property to a fixed value, not map it to a variable in the dataset.

p %+% newdataframe

How to Run XeLatex

Cited from Compile XeLaTeX tex file with latexmk

latexmk -pdf -e '$pdflatex=q/xelatex %O %S/' sth.tex

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to run TeX from the command line interface in Linux

Latex '-interaction' Parameter

Cited from Run Pdflatex Quietly

Execute latex with either the -interaction=nonstopmode or -interaction=batchmode switches for non-halting behaviour even in the case of a syntax error. nonstopmode will print all usual lines, it just won't stop. batchmode will suppress all but a handful of declarative lines ("this is pdfTeX v3.14...").