Shortcuts in Science: July 2016

Tuesday 26 July 2016

Generate Background Sequences Using Markov Model

Cited from GimmeMotifs documentation

"
generate_background_sequences.py

Generate random sequences according to one of two methods: random or matched_genomic. With the argument type set to random, and an input file in FASTA format, this script will generate sequences with the same dinucleotide distribution as the input sequences according to a 1st order Markov model trained on the input sequences. The -n options is set to 10 by default. The length distribution of the sequences in the output file will be similar as the inputfile. The Markov model can be changed with option -m. If the type is specified as matched_genomic the inputfile needs to be in BED format, and the script will select genomic regions with a similar distribution relative to the transcription start of genes as the input file. Make sure to select the correct genome. The length of the sequences in the output file will be set to the median of the features in the input file.
"
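The "random" mode quoted above can be sketched in a few lines of base R. This is only an illustration of a 1st order Markov background, not the GimmeMotifs implementation: the function names train_markov1 and simulate_markov1, the pseudocount of 1, and the three input sequences are all made up here, and only one background sequence is drawn per input sequence.

```r
# First-order Markov background generator (base R only).
# Hypothetical helper names; this is not GimmeMotifs code.
train_markov1 <- function(seqs, alphabet = c("A", "C", "G", "T")) {
  # Assumption: pseudocount of 1 so unseen transitions keep nonzero probability.
  counts <- matrix(1, nrow = length(alphabet), ncol = length(alphabet),
                   dimnames = list(alphabet, alphabet))
  start <- setNames(rep(1, length(alphabet)), alphabet)
  for (s in seqs) {
    chars <- strsplit(toupper(s), "")[[1]]
    # Simplification: symbols outside the alphabet (e.g. N) are dropped.
    chars <- chars[chars %in% alphabet]
    if (length(chars) == 0) next
    start[chars[1]] <- start[chars[1]] + 1
    if (length(chars) > 1) {
      for (i in seq_len(length(chars) - 1)) {
        counts[chars[i], chars[i + 1]] <- counts[chars[i], chars[i + 1]] + 1
      }
    }
  }
  # Each row of trans is P(next base | current base).
  list(start = start / sum(start), trans = counts / rowSums(counts))
}

simulate_markov1 <- function(model, len) {
  alphabet <- names(model$start)
  out <- character(len)
  out[1] <- sample(alphabet, 1, prob = model$start)
  if (len > 1) {
    for (i in 2:len) {
      out[i] <- sample(alphabet, 1, prob = model$trans[out[i - 1], ])
    }
  }
  paste(out, collapse = "")
}

# Made-up input sequences standing in for a FASTA file.
input <- c("ACGTACGTAACCGGTT", "GGGCGCGCATATAT", "TTTTACGACGACG")
model <- train_markov1(input)

set.seed(1)
# One background sequence per input sequence, with the same length.
background <- vapply(nchar(input),
                     function(n) simulate_markov1(model, n),
                     character(1))
print(background)
```

Because each base is drawn conditional on the previous one, the dinucleotide frequencies of the output approach those of the input as the sequences get longer.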

What Is The Appropriate Order For A Background Model In Motif Searches?

Implementation of fasta-get-markov on GUI

Motif Analysis

Using Weeder, Pscan, and PscanChIP for the Discovery of Enriched Transcription Factor Binding Site Motifs in Nucleotide Sequences

A review of ensemble methods for de novo motif discovery in ChIP-Seq data

THiCweed: fast sensitive motif finding via clustering of big data sets

Monday 25 July 2016

Python Classes Explained

Modules, Classes, and Objects

Friday 22 July 2016

ChIA-PET Experiments and Analysis Information

ChIA-PET Tools

MICC: an R package for identifying chromatin interactions from ChIA-PET data

A statistical model of ChIA-PET data for accurate detection of chromatin 3D interactions

Review of ChIA-PET Experiments and Analyses

PolII ChIA-PET by Yijun Ruan

ChIA-PET by Richard Young

ChIA-PET Explained in Wiki

Wednesday 20 July 2016

Atom Editor Init Scripts

Exploring the Power of Atom Init Scripts

TomTom Motif Comparison Tools

Algorithms of MEME

Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling

EXTREME: an online EM algorithm for motif discovery

Saturday 16 July 2016

Perl: Oneliner in Shell

Perl One Liners

Change Case in Perl Regular Expression

Friday 8 July 2016

Line Ending Conversion in Bash

Finessing Excel's stupid line endings

Wednesday 6 July 2016

Evaluation of Differential ChIP-seq Tools

A comprehensive comparison of tools for differential ChIP-seq analysis

Tuesday 5 July 2016

R Data Table .I and J()

Cited from Understanding .I in data.table in r

.I is a vector representing the row numbers

Using .I to return row numbers with data.table package
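A small sketch of .I on a made-up table (the columns id and value are illustrative only). It assumes the data.table package is installed. To get row numbers relative to the full table, the condition is applied inside j, or which = TRUE is used.

```r
library(data.table)

# Toy table, made up for illustration.
dt <- data.table(id = c("a", "b", "a", "c", "b"),
                 value = c(5, 3, 9, 1, 7))

# Row numbers (in dt) of the rows where value > 4.
rows_j <- dt[, .I[value > 4]]
rows_which <- dt[value > 4, which = TRUE]

# Within groups, .I still refers to row positions in the whole table:
# here, the row holding the largest value for each id.
max_rows <- dt[, .I[which.max(value)], by = id]

print(rows_j)
print(max_rows)
```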

######################################
Cited from How is J() function implemented in data.table?

J(.) is deprecated and simply replaced with list(.).
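A minimal sketch of what that means in practice, assuming the data.table package is installed and that J() is still accepted inside [ ] in the installed version. The keyed toy table is made up. All three calls ask for the same keyed lookup.

```r
library(data.table)

# Toy keyed table, made up for illustration.
dt <- data.table(id = c("a", "b", "c"), value = 1:3, key = "id")

with_J    <- dt[J("b")]     # older spelling
with_list <- dt[list("b")]  # what J() is replaced with
with_dot  <- dt[.("b")]     # .() is an alias for list()

print(with_list)
```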

Friday 1 July 2016

R data.table Tips

Cited from Introduction to data.table

Within the frame of a data.table, columns can be referred to as if they are variables.

We can use “-” on a character column within the frame of a data.table to sort in decreasing order.

We wrap the variables (column names) within list(), which ensures that a data.table is returned. In case of a single column name, not wrapping with list() returns a vector instead.

data.table also allows using .() to wrap columns with. It is an alias to list(); they both mean the same. Feel free to use whichever you prefer.

Since .() is just an alias for list(), we can name columns as we would while creating a list.

For example,
ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]
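The points above can be tried without the flights data set. The sketch below uses a made-up four-row table called flights_toy (a hypothetical stand-in that borrows three column names) and assumes the data.table package is installed.

```r
library(data.table)

# Made-up stand-in for the flights data used in the vignette.
flights_toy <- data.table(
  origin    = c("JFK", "LGA", "JFK", "EWR"),
  arr_delay = c(10L, -5L, 30L, 0L),
  dep_delay = c(12L, -2L, 25L, 3L)
)

# Columns are referred to as variables; "-" sorts a character column
# in decreasing order.
sorted_desc <- flights_toy[order(-origin)]

# A bare column name returns a vector, list() or .() returns a data.table.
as_vector <- flights_toy[, arr_delay]
as_table  <- flights_toy[, list(arr_delay)]

# Columns can be renamed inside .() just as in list().
renamed <- flights_toy[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

print(sorted_desc)
print(renamed)
```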

The special symbol .N is a built-in variable that holds the number of observations in the current group.
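A short sketch of .N with and without grouping, on a made-up table and assuming data.table is installed:

```r
library(data.table)

# Toy table, made up for illustration.
dt <- data.table(origin = c("JFK", "LGA", "JFK", "EWR"),
                 arr_delay = c(10L, -5L, 30L, 0L))

# Without by, the whole table is one group, so .N is the row count.
total <- dt[, .N]

# With by, .N is the number of rows in each group.
counts <- dt[, .N, by = origin]

print(total)
print(counts)
```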

Setting with=FALSE disables the ability to refer to columns as if they are variables.

We can also deselect columns using - or !.
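A sketch of both points on a made-up table, assuming data.table is installed. With with = FALSE, j is treated as a set of column names rather than as an expression over the columns.

```r
library(data.table)

# Toy table, made up for illustration.
dt <- data.table(origin = c("JFK", "LGA", "JFK", "EWR"),
                 arr_delay = c(10L, -5L, 30L, 0L),
                 dep_delay = c(12L, -2L, 25L, 3L))

# Column names held in a character vector.
cols <- c("arr_delay", "dep_delay")
selected <- dt[, cols, with = FALSE]

# Deselect columns with - or !.
dropped_minus <- dt[, -c("origin"), with = FALSE]
dropped_bang  <- dt[, !c("origin"), with = FALSE]

print(names(selected))
print(names(dropped_bang))
```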

Changing 'by' to 'keyby' automatically orders the result by the grouping variables in increasing order.
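The difference is easy to see on a made-up table (data.table assumed installed): by keeps groups in order of first appearance, while keyby sorts them and also sets the key on the result.

```r
library(data.table)

# Toy table, made up for illustration.
dt <- data.table(origin = c("JFK", "LGA", "JFK", "EWR"),
                 arr_delay = c(10L, -5L, 30L, 0L))

by_res    <- dt[, .(mean_delay = mean(arr_delay)), by = origin]
keyby_res <- dt[, .(mean_delay = mean(arr_delay)), keyby = origin]

print(by_res)          # groups in order of first appearance
print(keyby_res)       # groups sorted in increasing order
print(key(keyby_res))  # the grouping column is now the key
```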

The special symbol .SD stands for Subset of Data. It is itself a data.table that holds the data for the current group defined using by.

.SD would contain all the columns other than the grouping variables by default.

The .SDcols argument accepts either column names or column indices. For example, .SDcols = c("arr_delay", "dep_delay") ensures that .SD contains only these two columns for each group.
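A sketch of .SD and .SDcols on a made-up table (the carrier column is an invented extra, and data.table is assumed installed). The first call shows that .SD holds every non-grouping column by default. The second restricts it to two columns before taking group means.

```r
library(data.table)

# Toy table, made up for illustration.
dt <- data.table(origin    = c("JFK", "LGA", "JFK", "EWR"),
                 carrier   = c("AA", "UA", "DL", "UA"),
                 arr_delay = c(10L, -5L, 30L, 0L),
                 dep_delay = c(12L, -2L, 25L, 3L))

# First row of each group: all columns other than the grouping variable.
first_rows <- dt[, head(.SD, 1), by = origin]

# Group means over only the two delay columns.
means <- dt[, lapply(.SD, mean), by = origin,
            .SDcols = c("arr_delay", "dep_delay")]

print(first_rows)
print(means)
```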

######################################
Cited from Keys and fast binary search based subset

We can set keys on multiple columns, and the columns can be of different types. Uniqueness is not enforced.

Setting a key does two things:

reorders the rows of the data.table by the column(s) provided by reference, always in increasing order.
marks those columns as key columns by setting an attribute called sorted to the data.table.

Since the rows are reordered, a data.table can have at most one key because it can not be sorted in more than one way.

setkey() and setkeyv() modify the input data.table by reference. They return the result invisibly.

In data.table, the := operator and all the set* functions (e.g., setkey, setorder, setnames) are the only ones which modify the input object by reference.

In addition to ordering, keyby also sets the key column.
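The key behaviour described in this part can be checked on a made-up table (data.table assumed installed): a two-column key of mixed types, duplicate key values, the sorted attribute, and a second setkeyv() call replacing the first key.

```r
library(data.table)

# Toy table, made up for illustration; (JFK, 1) appears twice.
dt <- data.table(origin = c("LGA", "JFK", "EWR", "JFK"),
                 month  = c(2L, 1L, 3L, 1L),
                 delay  = c(4L, 8L, 15L, 16L))

# Modifies dt by reference: rows are reordered and the key is recorded.
setkey(dt, origin, month)
print(key(dt))
print(attr(dt, "sorted"))

# Keyed subset; both duplicate rows come back.
hits <- dt[.("JFK", 1L)]
print(hits)

# A table has at most one key, so a new key replaces the old one.
setkeyv(dt, "month")
print(key(dt))
```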

######################################
Cited from Reference semantics

:= returns the result invisibly. Sometimes it might be necessary to see the result after the assignment. We can accomplish that by adding an empty [] at the end of the query, like flights[hour == 24L, hour := 0L][].

The copy() function deep copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object.
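A sketch of both notes on a made-up one-column table (data.table assumed installed). A plain assignment only creates another name for the same object, so it sees the update, while copy() does not.

```r
library(data.table)

# Toy table, made up for illustration.
dt <- data.table(hour = c(24L, 5L, 24L))

snapshot <- copy(dt)  # deep copy, independent of dt
alias    <- dt        # another name for the same object

# Update by reference; the trailing [] prints the result.
dt[hour == 24L, hour := 0L][]

print(alias$hour)     # changed along with dt
print(snapshot$hour)  # unchanged
```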

######################################
Cited from Efficient reshaping using data.tables

By default, the variable column produced by melt() is of type factor. Set the variable.factor argument to FALSE if you’d like to return a character vector instead.
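A short check of this on a made-up wide table (the column names are invented, and data.table is assumed installed):

```r
library(data.table)

# Toy wide table, made up for illustration.
wide <- data.table(id = 1:2,
                   height = c(1.7, 1.8),
                   weight = c(60, 75))

long_factor <- melt(wide, id.vars = "id")
long_char   <- melt(wide, id.vars = "id", variable.factor = FALSE)

print(class(long_factor$variable))
print(class(long_char$variable))
```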
