Friday 11 December 2015

Divide Each Row of a Matrix by Elements of a Vector in R

How to divide each row of a matrix by elements of a vector in R

Tuesday 1 December 2015

Stem Cell States: Naive to Primed Pluripotency

Stem cell states: naive to primed pluripotency

Pre-implantation development and Blastocyst

Pre-implantation development

Cited from the book "Human Embryology and Developmental Biology"

The subdivision of the inner cell mass ultimately results in an embryonic body that contains the three primary embryonic germ layers: the ectoderm (outer layer), mesoderm (middle layer), and endoderm (inner layer). The process by which the germ layers are formed through cell movements is called gastrulation.

Monday 23 November 2015

DNMT3A and DNMT3B

  1. DNMTs add a methyl group to cytosine in the context of a C-G dinucleotide.
  2. DNMTs are associated with chromatin remodelling.
  3. DNMT3A and DNMT3B are de novo DNA methyltransferases.
  4. Double-knockout cells lose differentiation capacity with passage.

Thursday 19 November 2015

R plot function

Cited from How to Change Plot Options in R

"bty" is the plot function parameter that specifies the type of b round the plot area, use the option bty (box type):
  • "o": The default value draws a complete rectangle around the plot.
  • "n": Draws nothing around the plot.
=====================================
Cited from 15 Questions All R Users Have About Plots

Setting "xaxt" and "yaxt" parameter values equal to "n" removes the axis values of a plot. Any other character set for these arguments specifies the x-axis or y-axis values to be plotted.

Setting "ann" to "FALSE" removes the plotting of axes titles from plotting.

=====================================
Cited from Graphics with R

By default, the ranges given by "xlim" and "ylim" are extended by 4% at each end, so that values do not sit on the edges of the plot. This is the behaviour when "xaxs" and "yaxs" have their default value of "r" ("regular"). In contrast, setting "xaxs" and "yaxs" to "i" ("internal") places the limits exactly at the edges of the plot.
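As a rough sketch of the difference between the two styles, the extension step can be written out by hand. This is not R code: the function names are made up for illustration, the fraction is a parameter whose default of 0.04 per end is the figure given in R's par() help page for style "r", and the search for pretty tick labels that R performs afterwards is left out.

```python
def regular_axis_range(lower, upper, fraction=0.04):
    """Extend (lower, upper) by `fraction` of its width at each end."""
    pad = (upper - lower) * fraction
    return lower - pad, upper + pad


def internal_axis_range(lower, upper):
    """Keep the requested limits exactly, as with the "i" style."""
    return lower, upper


print(regular_axis_range(0, 10))
print(internal_axis_range(0, 10))
```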

Monday 16 November 2015

SCDE: Q&A

Cited from Definition of columns of scde diff DE matrix

These should provide estimates of the limma values.

logFC == mle

P.value == pnorm(Z)

adj.P.val == pnorm(cZ)

=====================================
Cited from Normalized read counts

scde.expression.magnitude returns FPM (not normalized by transcript length).

=====================================
p.self.fail=scde.failure.probability(models=o.ifm,counts=cd)




Saturday 14 November 2015

Managing Bash Processes

Cited from "Bioinformatics Data Skills Reproducible and Robust Research with Open Source Tools"

What is a shell process?

"When we run programs through the Unix shell, they become processes until they successfully finish or terminate with an error."

Background Processes

To run a program in the background, an ampersand (&) can be appended to the end of the command.

To check which processes are running in the background, run the "jobs" command.


Tee Command

Cited from "Bioinformatics Data Skills Reproducible and Robust Research with Open Source Tools"

"The Unix program tee diverts a copy of your pipeline’s standard output stream to an intermediate file while still passing it through its standard output."

For example,
program1 input.txt | tee intermediate-file.txt | program2 > results.txt
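The same idea can be sketched outside the shell. The snippet below is only an illustration of what tee does to a stream, not how the Unix program is implemented; the in-memory buffers stand in for program1's output and for intermediate-file.txt, and the upper-casing step is a hypothetical stand-in for program2.

```python
import io


def tee(source, copy_to):
    """Yield each line from `source` unchanged while also writing it to `copy_to`."""
    for line in source:
        copy_to.write(line)
        yield line


upstream = io.StringIO("a\nb\nc\n")   # stands in for the output of program1
intermediate = io.StringIO()          # stands in for intermediate-file.txt
# The list comprehension plays the role of program2 consuming the stream.
results = [line.upper() for line in tee(upstream, intermediate)]
```

After the pipeline has run, `intermediate` holds an untouched copy of the stream while `results` holds the downstream output.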









Pandoc Commands

Pandoc: a universal document converter

Markdown Syntax

Cited from Markdown Tutorial

Italic and Bold

To make a phrase italic in Markdown, the words can be surrounded by underscores ("_"), such as "_italic_".

To make phrases bold in Markdown, the words can be surrounded with two asterisks ("**"), such as "**bold**".

Headers

There are six levels of headers, in decreasing size. The number of hash marks (#) before a header sets its level: one hash mark gives the largest header and six give the smallest.

A header cannot be made bold, but certain words in it can be italicized.

Links to Websites

To create an inline link, wrap the link text in brackets "[ ]" and the link itself in parentheses "( )". For example, "[Visit GitHub!](www.github.com)".

The reference link is a reference to another place in the document. For example, "[text][reference]" in the text, and at the bottom of the markdown document "[reference]:url".

Images

To create an inline image, use "![alt text](url)".

Blockquote

A blockquote is a sentence or paragraph that has been specially formatted to draw the reader's attention.

To create a blockquote, preface a paragraph, or each of several paragraphs, with the "greater than" caret (>).

Lists

To create an unordered list, each item in the list must be prefaced with an asterisk and space ("* "). Each item must be listed in its own line.

An ordered list is prefaced with numbers, instead of asterisks.

With a nested list, the sub-item must be indented one space more compared to the preceding item.

Paragraphs

If every line is forced apart by making it a paragraph of its own, the togetherness of the text is broken. This is a hard break. Adding two spaces at the end of each line creates a soft break instead, which keeps the lines together in one paragraph.




Thursday 12 November 2015

Data Type in Java

The set of values for each data type is known as the domain of that type.
Cited from "HKUSTx: COMP102.1x Introduction to Java Programming - Part 1"

Adding the keyword "final" to the declaration of a variable makes the variable constant. For example, "final double bodyWeight;". A value can be assigned to a "final" variable only once.

Camel Case in Java

Cited from "HKUSTx COMP102.1x Introduction to Java Programming - Part 1"

Lower camelCase for names of variables and methods. For example, "double areaOfCircle".

Upper CamelCase for names of classes. For example, "public class HelloWorld".

 

Expectation-Maximization Algorithm Lecture Slides

Harvard Stat 211: Statistical Computing and Visualization

MIT OCW Machine Learning

JavaScript Library D3.js

Cited from Data-Driven Documents

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS.

Wednesday 11 November 2015

R bquote Function Explained

Cited from An R bquote example

When a plot is annotated with mathematics symbols in R, the use of expression may be required.

For example,
text(0, height[2], labels=expression(Y[med] ~ "=" ~ B*x^2), cex=3)

Contents surrounded by square ("[" and "]") brackets appear in subscript.

The tilde "~" operates as a separator, and does not show up in a plot.

If we wish to introduce variables in the annotation along with the mathematics symbols, the bquote function may be used.

For example,
text(0, height[i], labels=bquote(Y[.(z2[i])] ~ "=" ~ .(z1[i])*x^2), cex=3)

.(variable_name) retrieves the value stored in the variable and places it inside the expression.


R ggplot2 Violin Plot

ggplot2 violin plot : Easy function for data visualization using ggplot2 and R software

Word Cloud Fundamentals in R

Text mining and word cloud fundamentals in R : 5 simple steps you should know

Friday 6 November 2015

Bayesian Inference Basics

Cited from the book "Statistical Rethinking: A Bayesian Course with Examples in R and Stan"

Maximum a posteriori (MAP) is the mode of the posterior distribution.
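A minimal sketch of finding the MAP by grid approximation follows. The data (6 successes in 9 trials), the flat prior, and the function name grid_map are all assumptions chosen for illustration; with a flat prior the posterior is proportional to the binomial likelihood, so its mode is simply the grid value where that likelihood peaks.

```python
def grid_map(successes, trials, grid_size=1001):
    """Return the grid value of p with the highest posterior density.

    Assumes a flat prior, so the unnormalised posterior is
    p**successes * (1 - p)**(trials - successes).
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    posterior = [p ** successes * (1 - p) ** (trials - successes) for p in grid]
    best = max(range(grid_size), key=posterior.__getitem__)
    return grid[best]


map_estimate = grid_map(successes=6, trials=9)
print(map_estimate)
```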

Binomial,Geometric,Hypergeometric,Poisson,NegB Distributions

Cited from the YouTube video Overview of Some Discrete Probability Distributions (Binomial,Geometric,Hypergeometric,Poisson,NegB)

Binomial, negative binomial and geometric distributions depend on the assumption of independent Bernoulli trials.

Binomial distribution:
The number of trials is fixed, and the number of successes is the random variable.

Bernoulli distribution:
A special case of the binomial distribution: the number of trials is fixed at 1, and the number of successes is the random variable.

Negative binomial distribution:
The number of successes is fixed, and the number of trials is the random variable.

Geometric distribution:
A special case of negative binomial distribution, the number of successes is fixed as 1, and the number of trials is the random variable.

Hypergeometric distribution depends on the assumption of non-independent trials. The drawing is without replacement from a source that contains a certain number of successes and a certain number of failures.

Hypergeometric distribution:
Similar to binomial distributions, the number of trials is fixed, and the number of successes is the random variable.

If objects were sampled from a large population without replacement, the inter-dependence has a small effect. Then the binomial distribution closely approximates the hypergeometric distribution.
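The approximation can be checked numerically. The sketch below uses made-up numbers (10 draws from a population of 10,000 in which 30% are successes) and hypothetical helper names; it compares the exact hypergeometric probability with the binomial one.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(k successes) when drawing `draws` items without replacement."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def binom_pmf(k, n, p):
    """P(k successes) in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Hypothetical numbers: the sample is tiny relative to the population.
exact = hypergeom_pmf(3, population=10_000, successes=3_000, draws=10)
approx = binom_pmf(3, n=10, p=0.3)
print(exact, approx)
```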

Poisson distribution:
The Poisson distribution models the number of events (the random variable) in a given time, length, area or volume, etc, if these events occur randomly and independently.

The Poisson distribution approximates the Binomial distribution, when the number of trials (n) is large, and p the probability of successes is very small.
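A short numerical check of this approximation, under assumed values n = 1000 and p = 0.002 (so the Poisson rate is n * p = 2); the helper names are illustrative only.

```python
from math import comb, exp, factorial


def binom_pmf(k, n, p):
    """P(k successes) in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(k events) for a Poisson distribution with the given rate."""
    return rate ** k * exp(-rate) / factorial(k)


n, p = 1000, 0.002   # hypothetical: many trials, small success probability
largest_gap = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, n * p))
                  for k in range(11))
print(largest_gap)
```

The largest pointwise difference over k = 0..10 stays small, which is what the approximation claims for large n and small p.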



Zero-Inflated Models Explained

Do We Really Need Zero-Inflated Models?

Zero-Inflated Poisson Regression An Introduction to ZIP Regression

Tuesday 3 November 2015

Differential ChIP-seq

ChIPComp: A novel statistical method for quantitative comparison of multiple ChIP-seq datasets

Tutorial of Downloading SRA Data with Aspera

Download SRA data with Aspera command line utility

BioMart Tutorial

Some basics of biomaRt

Bash Tutorial

Better Bash Scripting in 15 Minutes

The Accurate Estimation of RNA Concentration from RNA-Seq Data

Mix² – A software tool for the accurate estimation of RNA concentration from RNA-Seq data

Comparison of GENCODE and RefSeq Gene Annotation

Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction

Variant and Pathogenicity

SSCM: A method to analyze and predict the pathogenicity of sequence variants

Transposon Quantification in RNA-seq

TEtranscripts – A package for including transposable elements in differential expression analysis of RNA-Seq datasets

RPKM, FPKM and TPM

RPKM, FPKM and TPM, clearly explained

Analysis of DNA Methylation MeRIP-seq Data Overview

FET-HMM – for spatially enhanced detection of differentially methylated region from MeRIP-Seq data

Hadoop Tutorial

Hadoop Tutorial For Beginners

Apache Spark

Beginners Guide: Apache Spark Machine Learning Scenario With A Large Input Dataset

Omics Analysis of Time Course Data

A Linear Mixed Model Spline Framework for Analysing Time Course ‘Omics’ Data

FunPat – function-based pattern analysis on RNA-seq time series data

Hi-C Analysis Review

Analysis methods for studying the 3D architecture of the genome

Monday 5 October 2015

Axel and Prozilla

Linux ultra fast command line download accelerator

Parallel: Installation and Tutorial

Cited from http://git.savannah.gnu.org/cgit/parallel.git/tree/README

"Full installation of GNU Parallel is as simple as:
wget http://ftpmirror.gnu.org/parallel/parallel-20150922.tar.bz2
bzip2 -dc parallel-20150922.tar.bz2 | tar xvf -
cd parallel-20150922
./configure && make && sudo make install"

===================================

Parallel Tutorial

Tool: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

Wednesday 16 September 2015

Irreproducible discovery rate (IDR)

Chip-seq data analysis: from quality check to motif discovery and more

homer-idr

Irreproducible Discovery Rate (IDR) in Python3

Bash: -depth

-depth (an option of the find command): process each directory’s contents before the directory itself.
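The same contents-before-directory ordering is what makes recursive renaming safe, because a directory is only touched after everything beneath it has been handled. As a loose analogue (not a replacement for the shell command), Python's os.walk with topdown=False visits directories in that order; the temporary directory layout below is invented for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "a", "b"))
    # topdown=False yields a directory only after all of its subdirectories.
    visited = [os.path.relpath(path, top)
               for path, _dirs, _files in os.walk(top, topdown=False)]

print(visited)
```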

Recursively Renaming Directories

Recursively rename directories in bash

Bash: shopt -s globstar

Cited from The Shopt Builtin

shopt: change shell optional behaviour.

-s set the specified option.
-u disable the specified option.

option:

globstar

"If set, the pattern ‘**’ used in a filename expansion context will match all files and zero or more directories and subdirectories. If the pattern is followed by a ‘/’, only directories and subdirectories match."



Saturday 29 August 2015

Histone Modification Code

Cited from the paper "Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease"

H3K4me3 (associated primarily with active promoters); H3K4me1 (enhancers); H3K27ac (enhancer/promoter activation); H3K27me3 (Polycomb repression); H3K36me3 and H4K20me1 (transcription); and H3K9me3 (heterochromatin).

Ren Lab: Mouse Encode Data

Mouse Encode Project at Ren Lab

Thursday 27 August 2015

Monday 24 August 2015

FeatureCounts: Strandedness

From the source code of featureCounts

   """
    0: unstranded 1: stranded 2: reverse stranded
    """
    strand_flag = {"unstranded": "0",
                   "firststrand": "2",
                   "secondstrand": "1"}
    stranded =  get_in(config, ("algorithm", "strandedness"),
                       "unstranded").lower()

Tuesday 18 August 2015

XSLT: Output "&"

"&& \" produces " && \".

XSLT: Mode in xsl:template and xsl:apply-templates

Modal XSLT

XSLT: Rule Execution Order

Cited from XSLT Tutorial - Basics

"""
You have to understand that XSLT works down "depth-first" the XML tree, i.e.
  • it first deals with the rule for the root element,
  • then with the first instruction within this rule.
  • If the first instruction says "find other rules" it will then apply the first rule found for the first child element and so forth...
  • The rule of the root element is also the last one be finished (since it must deal step-by-step with everything that is found inside) !!!
"""

"By default the first one is applied. Since the XSLT processor only will apply one rule per element and also the most complex one."

Monday 17 August 2015

RNA-seq, ChIP-seq, ATAC-seq Paper

Chromatin state dynamics during blood formation

Tissue-Resident Macrophage Enhancer Landscapes Are Shaped by the Local Microenvironment

Sunday 16 August 2015

Make: Empty Command

Cited from Commands

"Empty commands are most often used to prevent a pattern rule from matching the target and executing commands you don’t want."

Make: Multiline Macro

Cited from Commands

"When a multiline macro is expanded, each line is inserted into the command script with a leading tab and make treats each line independently. The lines of the macro are not executed in a single subshell. So you willneed to pay attention to command-line continuation in macros as well."

Saturday 15 August 2015

Differences Between Fork and Exec

Cited from Differences between exec and fork

"A process is an execution environment that consists of instruction, user-data, and system-data segments, as well as lots of other resources acquired at runtime, whereas a program is a file containing instructions and data that are used to initialize the instruction and user-data segments of a process."

"""
  • fork() creates a duplicate of the current process
  • exec() replaces the program in the current process with another program
"""

Archive File

Cited from Managing Modularity: Makefiles and Libraries 

"When we have a collection of functions which often use, it is convenient to collect their compiled versions into a library archive file."

Friday 14 August 2015

Make: Eval Function

Cited from Functions

"Using eval resolves the parsing issue because eval handles the multiline macro expansion and itself expands to zero lines."

"The argument to eval is expanded twice: once when when make pre-pares the argument list for eval, and once again by eval."

Make: Export Multiple Target-specific Variables

all: export A=TEST
all: export B=OK

all:
    @echo A is $$A
    @echo B is $$B

Bash: Remove Non-printable ASCII Characters From a File

Remove non-printable ASCII characters from a file with this simple Unix command

Thursday 13 August 2015

Make: Environment Variables

Cited from Variables from the Environment

"Variables in make can come from the environment in which make is run. Every environment variable that make sees when it starts up is transformed into a make variable with the same name and value. However, an explicit assignment in the makefile, or with a command argument, overrides the environment. (If the ‘-e’ flag is specified, then values from the environment override assignments in the makefile. See Summary of Options. But this is not recommended practice.)"

=======================================
Cited from The Basics: Getting environment variables into GNU Make

"The override directive beats the command line which beats environment overrides (-e option) which beats macros defined in a Makefile file which beats the original environment."

 

Bash: Shell Globbing

Globs

Linux: Date Command

7 Linux Date Command Examples to Display and Set System Date Time

Bash: Multiple Commands in One Line

Cited from Which one is better: using ; or && to execute multiple commands in one line?

"
A; B = Run A and then B, regardless of success of A
A && B = Run B if A succeeded
A || B = Run B if A failed
A & = Run A in background.
"


Wednesday 12 August 2015

make -f-

Cited from make

"-f makefile
Use the description file makefile. If the pathname is the dash character (-), the standard input is used. If there are multiple instances of this option, they are processed in the order specified."

For example,
make -f- FOO=bar <<< 'goal:;@echo $(MAKECMDGOALS)'

====================================
Cited from Variables and Macros

'The stdin is redirected from a command-line string using bash's here string, “<<<”, syntax.'

Make: MAKEFILE_LIST

Cited from Variables and Macros

"A makefile can always determine its own name by examining the lastword of the list stored in the variable of MAKEFILE_LIST."

Make: Set Default Goal Using .DEFAULT_GOAL

Cited from Other Special Variables

".DEFAULT_GOAL: Sets the default goal to be used if no targets were specified on the command line. Note that assigning more than one target name to .DEFAULT_GOAL is invalid and will result in an error."

Make: Goal

Cited from Arguments to Specify the Goals

"The goals are the targets that make should strive ultimately to update. Other targets are updated as well if they appear as prerequisites of goals, or prerequisites of prerequisites of goals, etc."

"By default, the goal is the first target in the makefile (not counting targets that start with a period). Therefore, makefiles are usually written so that the first target is for compiling the entire program or programs they describe. If the first rule in the makefile has several targets, only the first target in the rule becomes the default goal, not the whole list. You can manage the selection of the default goal from within your makefile using the .DEFAULT_GOAL variable"

"You can also specify a different goal or goals with command line arguments to make. Use the name of the goal as an argument. If you specify several goals, make processes each of them in turn, in the order you name them."

"Make will set the special variable MAKECMDGOALS to the list of goals you specified on the command line."


Make: Phony Targets

Cited from Phony Targets

"A phony target should not be a prerequisite of a real target file; if it is, its recipe will be run every time make goes to update that file. As long as a phony target is never a prerequisite of a real target, the phony target recipe will be executed only when the phony target is a specified goal."

For example, in the makefile below, running make without arguments updates the default goal "listfile"; the phony target "clean" is not a specified goal and is therefore not executed.

din:=/home/cornell/

.PHONY : listfile clean

listfile: $(din)
    ls -lt $^

clean :
    -rm ./test/test.txt

Phony targets can have prerequisites. For example, when the prerequisites are individual programs, the call to an overall phony target will cause the execution of individual programs.

For example, both rules of action1 and action2 will be executed.

d1:=/home/cornell/
d2:=/home/cornell/test

all: action1 action2
.PHONY : all

action1: $(d1)
    ls -lt $^

action2: $(d2)
    rm -rf $^

Tuesday 11 August 2015

Make: Pattern Matching Stem

Cited from How Patterns Match

"When the target pattern does not contain a slash (and it usually does not), directory names in the file names are removed from the file name before it is compared with the target prefix and suffix. After the comparison of the file name to the target pattern, the directory names, along with the slash that ends them, are added on to the prerequisite file names generated from the pattern rule’s prerequisite patterns and the file name. The directories are ignored only for the purpose of finding an implicit rule to use, not in the application of that rule. Thus, ‘e%t’ matches the file name src/eat, with ‘src/a’ as the stem. When prerequisites are turned into file names, the directories from the stem are added at the front, while the rest of the stem is substituted for the ‘%’. The stem ‘src/a’ with a prerequisite pattern ‘c%r’ gives the file name src/car."

Friday 31 July 2015

Notes from Managing Projects with GNU Make

Cited from the book "Managing Projects with GNU Make"

"The target is the file or thing that must be made. The prerequisites or dependants are those files that must exist before the target can be successfully created. And the commands are those shell commands that will exist before the target can be successfucreate the target from the prerequisites."

"When make is asked to evaluate a rule, it begins by finding the files indicated by the prerequisites and target. If any of the prerequisites has an associated rule, make attempts to update those first. Next, the target file is considered. If any prerequisite is newer than the target, the target is remade by executing the commands."

" To update a line: different target (or to update more than one target) include the target name with make. such as make target"

" --just-print (or -n) tells make to display the commands it would execute for a particular target without actually executing them."

" To set almost any makefile variable on the command line to override the default value or the value set in the makefile. For example:
make mytarget FOO=BAR"

"If no prerequisites are listed to the right, then only the target(s) that do not exist are updated."

"Each command must begin with a tab character. This (obscure) syntax tells make that the characters that follow the tab are to be passed to a subshell for execution. If you accidentally insert a tab as the first character of a noncommand line, make will interpret the following text as a command under most circumstances."

=================================================
Explicit Rules:

"Pattern rules use wildcards instead of explicit filenames. get file matching the pattern needs to updated. Implicit rule."

"Implicit rules are either pattern rules or suffix built-in database of rules makes writing makefile."

"A variable is either a dollar sign followed by a single character or a dollar sign followed by a word in."

Wildcards:

"Make's wildcards are identical to the Bourne shell's: ~, *, ?, [...], and [^...]."

================================================= 
the automatic variable $?: the set of prerequisites that are newer than the target.

$@: the name of the current target.
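The timestamp comparison behind $? can be sketched as follows. The function names and the dictionary of modification times are hypothetical; the sketch covers a single rule only and does not first update prerequisites that have rules of their own, as make would.

```python
def newer_prerequisites(target, prerequisites, mtimes):
    """Return the prerequisites modified later than the target (the $? set).

    `mtimes` maps file name to modification time. A missing target counts
    as older than everything, so every prerequisite is returned.
    """
    if target not in mtimes:
        return list(prerequisites)
    return [p for p in prerequisites if mtimes[p] > mtimes[target]]


def needs_remake(target, prerequisites, mtimes):
    """True if the target is missing or any prerequisite is newer than it."""
    return target not in mtimes or bool(
        newer_prerequisites(target, prerequisites, mtimes))


mtimes = {"prog": 10, "a.o": 12, "b.o": 5}   # made-up timestamps
print(newer_prerequisites("prog", ["a.o", "b.o"], mtimes))
```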

"You can look at make's default set of rules (and variables) by running make --print-data-base."

"The percent character can be placed anywhere within the pattern but can occur only once."















XSLT Notes

Cited from the book "XSLT for Dummies"

"apply-templates doesn’t include the tags of the element—only what’s inside the tags"

"Namespaces were developed to avoid this name collision by linking a namespace identifier with a URI (Uniform Resource Identifier)."

"the primary purpose of xsl:copy is to carry over the element tags. However, if you combine it with xsl:apply-templates, you copy both the tags and its content"

"xsl:copy-of duplicates everything inside the current node. "

 "The select attribute of the xsl:copy-of element determines what is copied to the result tree."

"<xsl:value-of select="expression"/>"

==================================================
matching element nodes
<xsl:template match="*|/">
<xsl:apply-templates/>
</xsl:template>

matching text and attribute nodes

<xsl:template match="text()|@*">
<xsl:value-of select="."/>
</xsl:template>

matching processing instructions and comments
<xsl:template match="processing-instruction()|comment()"/>

"An XPath expression for matching a namespace node doesn’t exist."
==================================================
child axis: child:: or omitted by default.

attribute axis: attribute:: or @ by shorthand.

node() is a node test that matches any node whatever kind it is.

chapter[position()=1]

chapter[last()]
==================================================







Monday 20 July 2015

Python: Static Method, Class Method and Instance Method

Cited from Really Understanding Python @staticmethod and @classmethod

Python: __slots__

Cited from Python __slots__

"The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation."

Python: *args and **kwargs

*args and **kwargs in python explained

Python Underscore "_" in Method and Variable Names

Cited from Python: Why do some functions have underscores “__” before and after the function name?

One underline in the beginning:

"Python doesn't have real private methods, so one underline in the start of a method or attribute means you shouldn't access this method."

Two underlines in the beginning:

"So, when you create a method starting with __ it means that you don't want to anyone can override it, it will be accessible only from inside the own class."

Two underlines in the beginning and in the end:

"When we see a method like __this__, don't call it. Because it means it's a method which Python calls, not by you."


Friday 17 July 2015

ChIP-seq Plots

Metaseq

Data Structure in Python

Cited from What Are Linear Structures?

"What distinguishes one linear structure from another is the way in which items are added and removed, in particular the location where these additions and removals occur."

=================================================
Cited from What is a Stack?

'A stack (sometimes called a “push-down stack”) is an ordered collection of items where the addition of new items and the removal of existing items always takes place at the same end. This end is commonly referred to as the “top.” The end opposite the top is known as the "base."'

'The base of the stack is significant since items stored in the stack that are closer to the base represent those that have been in the stack the longest. The most recently added item is the one that is in position to be removed first.'
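A minimal stack along these lines, assuming a Python list as the underlying storage with the end of the list serving as the top; the method names are a common choice, not the only one.

```python
class Stack:
    """A minimal stack; the end of the underlying list is the top."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def peek(self):
        return self._items[-1]

    def is_empty(self):
        return not self._items

    def size(self):
        return len(self._items)


stack = Stack()
for item in (1, 2, 3):
    stack.push(item)
print(stack.pop())   # the most recently added item comes off first
```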

=================================================


 

Python: Pass Command Line Options

Argparse Tutorial

argparse option for passing a list as option

ArgumentParser

Friday 10 July 2015

Configure Custom Connection Options for SSH Client

Cited from How To Configure Custom Connection Options for your SSH Client

"The only difference is that depending on the option and value, using the equal sign with no spaces can allow you to specify an option on the command line without quoting."

Sunday 28 June 2015

Tuesday 23 June 2015

SAM File 1-based Leftmost Mapping Position in Relation to Strand

Cited from Re: [Samtools-help] Questions about SAM format

"It's always the smaller of the two "end"-coordinates, on the positive strand (the strand that is given in your reference fasta). So, in a 100bp reference, if your 25bp read came from / is mapped to the negative strand right up against its 5'-end, the position in the SAM line would be 76. If you have another read that came from the positive strand right up against its 3'-end, the position in the SAM line would *also* be 76. Use the strand flag to distinguish between the two cases."

Parse a BAM File Using Perl

How to parse a BAM file using Perl

Identify Ambiguously Mapped Reads in SAM/BAM

Cited from Wiki of PoPOOLationWalkthrough

"Filtering by a mapping qualiy of 20 removes the ambiguously mapped reads

samtools view -q 20 -b -S dmel.sam"

Monday 22 June 2015

Make Manual

Cited from 2.2 A Simple Makefile

When a target is a file, it needs to be recompiled or relinked if any of its prerequisites change.

Targets that do not refer to files but are just actions are called phony targets.
====================================================================
Cited from 2.3 How make Processes a Makefile

By default, make starts with the first target. This is called the default goal.

====================================================================
Cited from 2.6 Another Style of Makefile

When the objects of a makefile are created only by implicit rules, an alternative style of makefile is possible. In this style of makefile, you group entries by their prerequisites instead of by their targets.
====================================================================
Cited from 3.1 What Makefiles Contain

Makefiles contain five kinds of things: explicit rules, implicit rules, variable definitions, directives, and comments.

An implicit rule says when and how to remake a class of files based on their names. It describes how a target may depend on a file with a name similar to the target and gives a recipe to create or update such a target.

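A directive is an instruction for make to do something special while reading the makefile. These include reading another makefile, deciding (based on the values of variables) whether to use or ignore a part of the makefile, and defining a variable from a verbatim string containing multiple lines.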
====================================================================
Cited from 3.1.1 Splitting Long Lines

The way in which backslash/newline combinations are handled depends on whether the statement is a recipe line or a non-recipe line. Handling of backslash/newline in a recipe line is discussed later (see Splitting Recipe Lines).

====================================================================
Cited from 3.2 What Name to Give Your Makefile

By default, when make looks for the makefile, it tries the following names, in order: GNUmakefile, makefile and Makefile.

====================================================================
Cited from 3.3 Including Other Makefiles

The include directive tells make to suspend reading the current makefile and read one or more other makefiles before continuing. The directive is a line in the makefile that looks like this:

include filenames…

filenames can contain shell file name patterns. If filenames is empty, nothing is included and no error is printed. If the file names contain any variable or function references, they are expanded.

If the specified name does not start with a slash, and the file is not found in the current directory, several other directories are searched. First, any directories you have specified with the ‘-I’ or ‘--include-dir’ option are searched (see Summary of Options). Then the following directories (if they exist) are searched, in this order: prefix/include (normally /usr/local/include), /usr/gnu/include, /usr/local/include, /usr/include.

If you want make to simply ignore a makefile which does not exist or cannot be remade, with no error message, use the -include directive instead of include, like this:

-include filenames…

For compatibility with some other make implementations, sinclude is another name for -include.

====================================================================
Cited from 3.7 How make Reads a Makefile

Conditional directives are parsed immediately. This means, for example, that automatic variables cannot be used in conditional directives, as automatic variables are not set until the recipe for that rule is invoked. If you need to use automatic variables in a conditional directive you must move the condition into the recipe and use shell conditional syntax instead.

====================================================================
Cited from 3.8 Secondary Expansion

If that special target is defined then in between the two phases mentioned above, right at the end of the read-in phase, all the prerequisites of the targets defined after the special target .SECONDEXPANSION are expanded a second time.


Makefile Assignment Operators

The variants of GNU Makefile assignment operators

Export Parameters to Makefile

Passing additional variables from command line to make

The Basics: Getting environment variables into GNU Make

Sunday 21 June 2015

Creating Makefile from Json

A Python implementation is aLib/webForm/json2make.py; the aLib package is described in "aLib a sets of software tools to do basic analysis of Illumina sequencers".

A JSON and jsvelocity utility for creating makefiles and, separately, an XML and XSL-based makefile creation pipeline are described in "XML+XSLT = #Makefile -based #workflows for #bioinformatics".


GitHub md File

What file uses .md extension and how should I edit them?

JSON Format

Excerpts from JSON Tutorial

"JSON is a syntax for storing and exchanging data.
JSON is language independent."

==============================================================
Excerpts from JSON Syntax

"JSON syntax is part of JavaScript syntax:
  • Data is in name/value pairs
  • Data is separated by commas
  • Curly braces hold objects
  • Square brackets hold arrays
JSON data is written as name/value pairs. A name/value pair consists of a field name (in double quotes), followed by a colon, followed by a value."

"JSON values can be:
  • A number (integer or floating point)
  • A string (in double quotes)
  • A Boolean (true or false)
  • An array (in square brackets)
  • An object (in curly braces)
  • null"
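The value types listed above can be checked with a short Python sketch. The sample document and its field names (`sample`, `reads`, `lanes` and so on) are made up for illustration; only the standard-library `json` module is used.

```python
import json

# Hypothetical JSON text covering each value type: number, string, Boolean,
# array, object and null.
text = """
{
  "sample": "S1",
  "reads": 1250000,
  "gc_fraction": 0.41,
  "paired": true,
  "lanes": [1, 2, 3],
  "reference": {"name": "hg19", "patch": null}
}
"""

record = json.loads(text)

# JSON objects become dicts, arrays become lists and null becomes None.
for name, value in record.items():
    print(f"{name}: {value!r} ({type(value).__name__})")
```

Parsing with `json.loads` is also a quick way to confirm that a hand-written file follows the syntax rules quoted above, since malformed input raises `json.JSONDecodeError`.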

Bash: Make and Makefile

The Makefile

Bash: Configuration Files in Shell Scripting

Using Configuration Files With Shell Scripts

Thursday 11 June 2015

Conversion Between Objects of GRanges, RangedData, RangesList, RleList or RleViewsList Classes

GRanges objects

Retrieve Archived Ensembl Dataset Using R biomaRt

biomaRt: load archived Ensembl Genes database anymore

Peak Annotation With biomaRt using Chippeakanno

Custom Annotation With Chippeakanno

ChIPpeakAnno Biomart annotation

ChIPpeakAnno Analysis Pipeline

R biomaRt

Cited from The biomaRt user’s guide

  1. A first step is to check which BioMart web services are available. The function listMarts will display all available BioMart web services.
  2. The useMart function can now be used to connect to a specified BioMart database; this must be a valid name given by listMarts.
  3. BioMart databases can contain several datasets; for Ensembl, every species is a different dataset. In a next step, the datasets available in the selected BioMart database can be visualised using the function listDatasets.
  4. To select a dataset we can update the Mart object using the function useDataset.
  5. Alternatively, if the dataset one wants to use is known in advance, one can select a BioMart database and dataset in one step with useMart("database",dataset="dataset").
  6. The getBM function is the main query function in biomaRt. For some frequently used queries to Ensembl, wrapper functions are available: getGene and getSequence. getBM has four main arguments:
  • attributes: a vector of attributes that one wants to retrieve (= the output of the query).
  • filters: a vector of filters that one will use as input to the query.
  • values: a vector of values for the filters. In case multiple filters are in use, the values argument requires a list of values where each position in the list corresponds to the position of the filters in the filters argument.
  • mart: an object of class Mart, which is created by the useMart function.
=============================================================

Sunday 31 May 2015

Differences Between Ensembl, Gencode, RefSeq and UCSC

Cited from "Question: Difference Between Ensembl Databases In Ucsc Table Browser"

"Vega is a browser of the manually curated Havana gene set. Ensembl also perform automatic annotation of genes using protein and nucleotide sequence databases, such as EMBL, Uniprot and RefSeq. Ensembl use the GENCODE gene set, which is made up of the Havana and Ensembl automatic gene set. Genes within the GENCODE set are labelled as being either Ensembl (from the automatic annotation), Havana (from the manual annotation) or merged (exact match between the automatic and manual annotation)."
=======================================================================

Monday 18 May 2015

R EOF

Cited from R Command Line Processing

cat > printargs.R << EOF
args = commandArgs()
print(args)
q()
EOF

R --no-save < printargs.R

****************************************************************************

Sunday 17 May 2015

CLIP-seq Analysis Tools

An integrative resource of CLIP-seq studies

CLIPper-Home

RIP-seq/CLIP-seq software tools

CLIP-seq Analysis Example Papers

Transcriptome-wide identification of RNA binding sites by CLIP-seq

Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data

PIPE-CLIP: a comprehensive online tool for CLIP-seq data analysis

HITS-CLIP yields genome-wide insights into brain alternative RNA processing

Antagonistic regulation of mRNA expression and splicing by CELF and MBNL proteins


Bash PS Command

The ps Command

Bash the Difference Between PATH and LD_LIBRARY_PATH

What is the difference between PATH and LD_LIBRARY_PATH?

Bash Indirect Expansion

bash: indirect expansion, please explain?

Bash: Substring Removal

Cited from "How do I parse command line arguments in bash?"

To better understand ${i#*=} search for "Substring Removal" in this guide. It is functionally equivalent to `sed 's/[^=]*=//' <<< "$i"` which calls a needless subprocess or `echo "$i" | sed 's/[^=]*=//'` which calls two needless subprocesses.

*******************************************************************

Bash Command: Xargs

10 Xargs Command Examples in Linux / UNIX

Safe and powerful use of xargs with bash and grep

Things you (probably) didn’t know about xargs

Tuesday 5 May 2015

Message Passing

Cited from OO field guide

"With message-passing, messages (methods) are sent to objects and the object determines which function to call."

Monday 4 May 2015

Function Names With Leading Dots

Cited from What does the dot mean in R – personal preference, naming convention or more?

"Function names with leading dots are somewhat hidden from general view. Functions that are meant to be purely internal to a package sometimes use this.

In this context, "somewhat hidden" simply means that the variable (or function) won't normally show up when you list objects with ls(). To force ls to show these variables, use ls(all.names=TRUE). By using a dot as first letter of a variable, you change the scope of the variable itself."

Thursday 30 April 2015

Complements and Intersections of VCF Files




Cited from bcftools

bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz […]

Creates intersections, unions and complements of VCF files. Depending on the options, the program can output records from one (or more) files which have (or do not have) corresponding records with the same position in the other files.

Bash Dereference Concatenated Variable Name

Cited from Dereference concatenated variable name

FRUITS="BANANA APPLE ORANGE"

BANANA_COLOUR="Yellow"
APPLE_COLOUR="Green or Red"
ORANGE_COLOUR="Blue"

for fruit in $FRUITS ;do
    eval echo $fruit is \$${fruit}_COLOUR
done

'The eval simply tells bash to make a second evaluation of the following statement (i.e. one more than its normal evaluation). The \$ survives the first evaluation as $, and the next evaluation then treats this $ as the start of a variable name, which resolves to "Yellow", etc.'

The Execution of Pipe with a Bash Find Command

How to use pipe within -exec in find

Looping through files with spaces in the names?

Tuesday 21 April 2015

Identifying De-novo Mutations From Trio Data

Trio-Analysis Pipeline Script

Demultiplexing Fastq According to Barcodes

How to demultiplex fastq files with a dedicated, separate barcode file

Regular Expression: The Order of Lookaheads

Cited from The Order of Lookaheads Doesn't Matter… Almost

"While the order of lookaheads doesn't matter on a logical level, keep in mind that it may matter for matching speed. If one lookahead is more likely to fail than the other two, it makes little sense to place it in third position and expend a lot of energy checking the first two conditions. Make it first, so that if we're going to fail, we fail early—an application of the design to fail principle from the regex style guide."

"The negative lookbehind (?<!.) asserts that what precedes the current position is not any character—therefore the position must be the beginning of the string."
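As a rough illustration of both quotes, the Python sketch below applies the same three lookahead conditions in two different orders and then uses `(?<!.)` as a start-of-string test. The password-style rule and the sample strings are assumptions chosen for the example, not taken from the cited page.

```python
import re

# Three independent conditions, each written as a lookahead at the start:
# at least one digit, one lowercase letter and one uppercase letter.
ORDER_A = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$")
ORDER_B = re.compile(r"^(?=.*[A-Z])(?=.*\d)(?=.*[a-z]).{8,}$")

samples = ["Passw0rdX", "password1", "SHORT1a", "NoDigitsHere"]
results_a = [bool(ORDER_A.match(s)) for s in samples]
results_b = [bool(ORDER_B.match(s)) for s in samples]
print(results_a == results_b)  # the order changes speed at most, not the outcome

# (?<!.) succeeds only where no character precedes the current position.
# The test string has no line breaks, so "." can match every character in it.
START = re.compile(r"(?<!.)abc")
print(START.findall("abcabc"))
```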


Regular Expression: DOTALL mode

Cited from DOTALL (Dot Matches Line Breaks): s (with exceptions)

"By default, the dot . doesn't match line break characters such as line feeds and carriage returns. If you want patterns such as BEGIN .*? END to match across lines, we need to turn that feature on."

"This mode is sometimes called single-line (hence the s) because as far as the dot is concerned, it turns the whole string into one big line—.* will match from the first character to the last, no matter how many line breaks stand in between."

"In Perl, apart from the (?s) inline modifier, Perl lets you add the s flag after your pattern's closing delimiter. For instance, you can use:
if ($the_subject =~ m/BEGIN .*? END/s) { … }"
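The same behaviour can be sketched in Python, where the mode is switched on with the `re.DOTALL` flag or the inline `(?s)` modifier. The two-line sample text is an assumption made up for the example.

```python
import re

text = "BEGIN first line\nsecond line END"

# Default mode: the dot stops at the line feed, so the pattern cannot reach END.
default = re.search(r"BEGIN .*? END", text)

# DOTALL mode, as a flag and as an inline modifier: the dot crosses line breaks.
with_flag = re.search(r"BEGIN .*? END", text, re.DOTALL)
with_inline = re.search(r"(?s)BEGIN .*? END", text)

print(default)
print(with_flag.group(0) == text, with_inline.group(0) == text)
```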

Regular Expression: Non-greedy Matching

Cited from Regular Expression Tutorial Part 5: Greedy and Non-Greedy Quantification


To make the quantifier non-greedy you simply follow it with a '?' symbol:

my $string = 'bcdabdcbabcd';

$string =~ m/^(.*?)ab/;
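For comparison, here is the same string matched in Python with the greedy and the non-greedy form of the quantifier. This is only a sketch of the idea; the variable names are arbitrary.

```python
import re

string = "bcdabdcbabcd"

# Greedy: .* grabs as much as possible, so the match ends at the LAST "ab".
greedy = re.match(r"^(.*)ab", string)

# Non-greedy: .*? grabs as little as possible, so the match ends at the FIRST "ab".
lazy = re.match(r"^(.*?)ab", string)

print(greedy.group(1))
print(lazy.group(1))
```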

Regular Expression Possessive: Don't Give Up Characters

Cited from Possessive: Don't Give Up Characters

"As you'll see in the table below, a quantifier is made possessive by appending a + plus sign to it. Therefore, A++ is possessive—it matches as many characters as needed and never gives any of them back."
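A small Python sketch of what "never gives any of them back" means in practice. Native possessive quantifiers are only available in newer Python versions, so this example emulates `a++` with a capturing lookahead followed by a backreference, a common workaround; that substitution is my assumption and not part of the cited page.

```python
import re

text = "aaab"

# Greedy a+ first takes "aaa", then gives one "a" back so that "ab" can match.
greedy = re.search(r"a+ab", text)

# Emulated possessive a++: the lookahead captures the longest run of "a",
# the backreference consumes exactly that run, and the engine does not
# re-enter the lookahead to hand a character back. "ab" can then never match.
possessive = re.search(r"(?=(a+))\1ab", text)

print(greedy.group(0))
print(possessive)
```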


Monday 20 April 2015

Regular Expression Anchors

"Regex anchors force the regex engine to start or end a match at an absolute position. The start of string anchor (\A) dictates that any match must start at the beginning of the string."

"The end of line string anchor (\Z) requires that a match end at the end of a line within the string."

"The word boundary anchor (\b) matches only at the boundary between a word character (\w) and a non-word character (\W)."

Cited from Regular Expressions and Matching

##################################################
✽ In .NET, Perl and Ruby, \Z is allowed to match before a final line feed. Therefore, e\Z will match the final e in the string "apple\norange\n".

Cited from Regex Anchors
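The anchors can be tried out in Python, with one caveat that the sketch makes visible: Python's `\Z` matches only at the very end of the string, so it does not reproduce the .NET, Perl and Ruby behaviour noted above, while `$` does match before a final line feed. The sample strings are assumptions for illustration.

```python
import re

text = "apple\norange\n"

# Python's \Z: absolute end of string only, so the trailing line feed blocks it.
strict_end = re.search(r"e\Z", text)

# $ (without re.MULTILINE) also matches just before a final line feed.
before_final_newline = re.search(r"e$", text)

# \A pins the match to the start of the string.
at_start = re.search(r"\Aapple", text)
not_at_start = re.search(r"\Aorange", text)

# \b matches only between a word character and a non-word character.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(strict_end, before_final_newline.start(), whole_words)
```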

 


Regular Expression: The Use of (?

Named capture in Perl:

'Perl uses (?<NAME>pattern) to specify named captures. You have to use the %+ hash to retrieve them.

$variable =~ /(?<count>\d+)/;
print "Count is $+{count}";'

Cited from Can I use named groups in a Perl regex to get the results in a hash?
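For comparison, a minimal Python sketch of the same idea. Python spells the named group `(?P<name>pattern)` and returns the captures through the match object rather than a special hash; the input string is an assumption for the example.

```python
import re

match = re.search(r"(?P<count>\d+)", "Found 42 peaks")

if match:
    # Retrieve the capture by name, or all named captures as a dict.
    print("Count is", match.group("count"))
    print(match.groupdict())
```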
##################################################
"The normal capturing (pattern) has the property of capturing and grouping. Capturing means that the text that matches the pattern inside will be captured so that you can use it with back-reference, in matching or replacement. The non-capturing group (?:pattern) doesn't have the capturing property."

"Atomic grouping (?>pattern) also has the non-capturing property, so the position of the text matched inside will not be captured." 

Cited from Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?



Tuesday 14 April 2015

The European Molecular Biology Open Software Suite (EMBOSS)

EMBOSS

Paper "Charting a dynamic DNA methylation landscape of the human genome"

Excerpts from Charting a dynamic DNA methylation landscape of the human genome

"Most cell types, except germ cells and pre-implantation embryos [3, 4, 5], display relatively stable DNA methylation patterns, with 70–80% of all CpGs being methylated."

CpG Island and Shores

Excerpts from "CpG site"

'"CpG" is shorthand for "—C—phosphate—G—", that is, cytosine and guanine separated by only one phosphate; phosphate links any two nucleosides together in DNA. The "CpG" notation is used to distinguish this linear sequence from the CG base-pairing of cytosine and guanine.'

#################################################
Excerpts from "Question: Find Cpg Islands"

"CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria: GC content of 50% or greater, length greater than 200 bp, ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment.

The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al. (1987)): Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)   where N = length of sequence."
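The quantities in this excerpt can be written down as a short Python sketch. It only evaluates the per-segment statistics and the three thresholds quoted above; the scoring step that finds maximally scoring segments is not reproduced, and the function names are my own.

```python
def cpg_stats(seq):
    """Return CpG count, GC content, percentage CpG and Obs/Exp CpG for seq."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return {
        "length": n,
        "cpg_count": cpg,
        "gc_content": (c + g) / n if n else 0.0,
        # Ratio of CpG nucleotide bases (twice the CpG count) to the length.
        "pct_cpg": 2 * cpg / n if n else 0.0,
        # Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
        "obs_exp": cpg * n / (c * g) if c and g else 0.0,
    }


def passes_island_criteria(seq):
    """Apply the three thresholds from the excerpt to one candidate segment."""
    s = cpg_stats(seq)
    return s["gc_content"] >= 0.5 and s["length"] > 200 and s["obs_exp"] > 0.6


stats = cpg_stats("CGCGCGATATCG")
print(stats)
print(passes_island_criteria("CG" * 150))
```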
##################################################
Excerpts from "What is a CpG shore and how to I get them all?"

"CpG shores are the regions immediately flanking and up to 2 kbp away from CpG islands. These regions are interesting because they are variably methylated in cancer and development."
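Given island coordinates, the shores follow directly from this definition. The Python sketch below assumes 0-based, half-open coordinates and a known chromosome length, and it does not merge shores of neighbouring islands or subtract overlapping islands; the function name is hypothetical.

```python
def cpg_shores(island_start, island_end, chrom_length, flank=2000):
    """Return the (start, end) shores flanking one island, clipped to the chromosome."""
    upstream = (max(0, island_start - flank), island_start)
    downstream = (island_end, min(chrom_length, island_end + flank))
    # Drop empty intervals, e.g. when the island touches a chromosome end.
    return [shore for shore in (upstream, downstream) if shore[0] < shore[1]]


print(cpg_shores(10_000, 11_000, 1_000_000))
print(cpg_shores(500, 1_500, 2_000))
```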








Monday 13 April 2015

Paper "Targeted disruption of DNMT1, DNMT3A and DNMT3B in human embryonic stem cells"

Excerpts from "Targeted disruption of DNMT1, DNMT3A and DNMT3B in human embryonic stem cells"

"Human ESC methylation patterns are most unique at hypomethylated regulatory elements that are enriched for binding of pluripotency-associated master regulators, such as OCT4, SOX2 and NANOG."

Hemimethylated DNA

Excerpts from What is hemimethylated DNA?

"DNA-hemimethylation is when only one of two (complementary) strands is methylated. A hemi-methylated site is a single CpG that is methylated on one strand, but not on the other. This is not the same thing as allele-specific methylation, which is common in imprinting. In hemi-methylation, we’re talking about 2 strands from the same parent. Hemimethylation is important because it directly identifies de novo methylation events, allowing you to differentiate between de novo vs. maintenance factors. Because DNA methylation is faithfully propagated during DNA replication (by DNMT1), any hemimethylated sites must have arisen during the last replication round, either because of: 1) failure to faithfully propagate a parental methylation signal; or, 2) a de novo methylation event. You can differentiate between the two if you know the methylation status of the parent: if the parent strand was entirely methylated, then hemimethylation indicates failure of maintenance. Vice versa, if the parent strand was unmethylated, hemimethylation indicates de novo methylation."

Friday 3 April 2015

TMM Normalisation

Excerpts from "NormalizationAndDifferentialExpression"

library(edgeR)  # provides calcNormFactors()

# geneCounts.dgelist is assumed to be a DGEList of raw gene counts
tmm <- calcNormFactors(geneCounts.dgelist)

# equation from the edgeR documentation for estimating normalized absolute expression from their scaling factors
tmmScaleFactors <- geneCounts.dgelist$samples$lib.size * tmm$samples$norm.factors
tmmExp <- round(t(t(tmm$counts)/tmmScaleFactors) * mean(tmmScaleFactors))

#################################################
Excerpts from "Question: After Getting Normalization Factor Via Edger, What To Do For Normalization?"

The TMM counts are: count / (library size * normalization factor)

Then multiply that by a million to get CPM.

Not count / normalization factor
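The arithmetic in this answer can be sketched in a few lines of Python. The numbers are invented for illustration; in practice the library size and normalization factor would come from edgeR.

```python
def tmm_cpm(count, lib_size, norm_factor):
    """Counts per million using the effective library size (lib_size * norm_factor)."""
    effective_lib_size = lib_size * norm_factor
    return count * 1e6 / effective_lib_size


# Hypothetical gene: 50 reads in a library of 1,000,000 reads, factor 1.25.
print(tmm_cpm(50, 1_000_000, 1.25))

# The shortcut the answer warns against gives a different number.
print(50 / 1.25)
```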

And DESeq doesn't just do a simple division by library size. It takes the median of the ratio of the count to the geometric mean of the expression values as the scaling factor for each library. 
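The median-of-ratios idea described here can be sketched in plain Python as follows. This is a simplified reading of the sentence above, not DESeq's actual code: genes with a zero count in any library are skipped (their geometric mean would be zero), and the toy count matrix is invented.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of per-gene rows, each holding one count per library."""
    n_libs = len(counts[0])
    ratios = [[] for _ in range(n_libs)]
    for row in counts:
        if min(row) <= 0:
            continue  # geometric mean would be zero; skip this gene
        geo_mean = math.exp(sum(math.log(c) for c in row) / len(row))
        for lib, c in enumerate(row):
            ratios[lib].append(c / geo_mean)
    # One scaling factor per library: the median ratio across genes.
    return [median(r) for r in ratios]


# Toy data: library 2 was sequenced exactly twice as deeply as library 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(toy_counts)
print(factors)
```

Dividing each library's counts by its factor puts the libraries on a common scale, which is why the factors in this toy case differ by a ratio of two.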

Thursday 26 March 2015

Circos Plots: Tick Marks - Basics

Excerpts from  Tick Marks, Grids and Labels

"Ticks, tick labels and grids are defined in the <ticks> block, which can contain any number of <tick> blocks, each defining ticks with a different spacing."

"Ticks refers to the radial lines that show progression of distance along the ideogram. Tick labels are the accompanying text elements that mark the position of the tick."

"The radius specifies the radial position of the tick marks, which you generally want to set to the outer ideogram radius."

"The label multiplier is the constant used to multiply the tick value to obtain the tick label. For example, if the multiplier is 1e-6, then the tick mark at position 10,000,000 will have a label of 10. The multiplier is applied to the raw tick value, regardless of the value of chromosomes_unit."

"The orientation controls whether the ticks and labels face out (orientation=out) or in (orientation=in)."

"By referencing the position relative to the image, and not the ideogram, you decouple the position of the tick from the position of the ideogram. This absolute placement is useful if you know you want the ticks at a specific image position, regardless of the position of the ideograms. radius=dims(image,radius)-25p."

"Typically, one defines several sets of ticks by using <tick> blocks. Each set defines the display of ticks at a given spacing. For example, one could have three sets of ticks spaced at 1Mb, 5Mb and 10Mb, respectively, and formatted so that the 1Mb ticks are small and without labels whereas the 5Mb and 10Mb be larger and with labels. The 10Mb ticks might use a bolder font, for example, to give them greater visual weight."

"Unless force_display is set for a tick set, ticks at smaller spacing are not drawn at a position that already has another tick. In other words, the formatting of a tick mark is defined by the block associated with the spacing value that defines the largest divisor of the tick value."
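The "largest divisor" rule can be illustrated with a small Python sketch. It only models the rule as stated in the quote, ignores `force_display`, and uses the 1Mb, 5Mb and 10Mb spacings from the earlier example; the function name is hypothetical and this is not Circos code.

```python
def formatting_spacing(position, spacings):
    """Return the spacing whose <tick> block formats the tick at `position`."""
    divisors = [s for s in spacings if position % s == 0]
    # The block with the largest spacing that divides the position wins.
    return max(divisors) if divisors else None


spacings = [1_000_000, 5_000_000, 10_000_000]
for pos in (3_000_000, 15_000_000, 20_000_000, 1_500_000):
    print(pos, formatting_spacing(pos, spacings))
```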

"When tick size is expressed in relative terms, the comparator is the thickness of the ideogram. Therefore ticks with size=0.1r will have a length that is 1/10th of the ideogram thickness. Tick thickness, on the other hand, uses the tick size as the comparator. Thus, ticks with thickness=0.1r will have a width that is 1/10th the size of their length. Similarly, if tick label size is defined relatively, it will be scaled by tick size."

"When chromosomes_display_default=yes, you do not need to define which ideograms ticks appear on because tick mark visibility is on by default and you only need to define where tick marks are not shown. If chromosomes_display_default=no, then things get a little bit more complicated, because you now need to define where tick marks will be shown and these definitions can contain regions of exclusion."

Reporting Unwanted Sexual Behaviour in a Black Cab, Minicab, or on Public Transport in the UK

Quoted from an online source.

"If you would like to report any unwanted sexual behaviour in a black cab, minicab, or on public transport, please report it by calling 101 or texting 61016.

For further information or support please follow the links or call the numbers for the charities below.

Rape Crisis (England & Wales)
Website: www.rapecrisis.org.uk
Telephone Number: 08088029999

Victim Support
Website: https://www.victimsupport.org.uk/
Telephone Number: 08081689111

hollaback
Website: http://www.ihollaback.org/about/

In an emergency always call 999."

Monday 16 March 2015

Circos Plots

################################################
#  chromosomes_units
Excerpts from "Drawing Ideograms"

"For example,
chromosomes_units = 1000000
chromosomes = hs1:0-100;hs2:50-150;hs3:50-100;hs4;hs5;hs6;hs7;hs8

Will draw all 8 chromosomes, but only 0-100 Mb of hs1, 50-150Mb of hs2 and 50-100 Mb of hs3. The start and end ranges are given in units of chromosomes_units."
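To make the unit scaling concrete, the Python sketch below converts the example `chromosomes` string into base-pair ranges. It handles only the two forms shown in the excerpt (`name` and `name:start-end`); the real Circos syntax supports more, and the function name is hypothetical.

```python
def parse_chromosomes(spec, units):
    """Map each ideogram name to a (start, end) range in bases, or None for 'whole'."""
    regions = {}
    for item in spec.split(";"):
        name, _, span = item.partition(":")
        if span:
            start, end = (int(x) * units for x in span.split("-"))
            regions[name] = (start, end)
        else:
            regions[name] = None  # no range given: draw the full chromosome
    return regions


regions = parse_chromosomes(
    "hs1:0-100;hs2:50-150;hs3:50-100;hs4;hs5;hs6;hs7;hs8", 1_000_000
)
for name, span in regions.items():
    print(name, span)
```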

################################################
# karyotype file
Excerpts from "Karyotypes"

"The karyotype file defines the axes. In biological context, these are typically chromosomes, sequence contigs or clones.

Each axis (e.g. chromosome) is defined by unique identifier (referenced in data files), label (text tag for the ideogram seen in the image), size and color."


"Chromosome definitions are formatted as follows
chr - ID LABEL START END COLOR"

'The first two fields are always "chr", indicating that the line defines a chromosome, and "-". The second field defines the parent structure and is used only for band definitions.'

"Consider using the conventional chromosome color scheme as defined in the etc/color.conf configuration file. Colors are defined for each human chromosome and are named similarly: chr1, chr2, ... chrx, chry, chrun. Colors must be in lowercase."
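The field layout of a chromosome definition can be checked with a small Python sketch. The parser and the `Chromosome` type are hypothetical helpers, and the example line is an illustrative entry written in the `chr - ID LABEL START END COLOR` format quoted above.

```python
from typing import NamedTuple


class Chromosome(NamedTuple):
    parent: str  # "-" for chromosomes; only used for band definitions
    ident: str   # unique identifier referenced in data files
    label: str   # text tag shown on the ideogram
    start: int
    end: int
    color: str   # lowercase color name


def parse_chr_line(line):
    """Parse one 'chr - ID LABEL START END COLOR' karyotype line."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "chr":
        raise ValueError(f"not a chromosome definition: {line!r}")
    _, parent, ident, label, start, end, color = fields
    return Chromosome(parent, ident, label, int(start), int(end), color)


example = parse_chr_line("chr - hs1 1 0 249250621 chr1")
print(example)
```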


################################################
# external imports
 Excerpts from "Configuration Files - Syntax, Colors, Fonts and Units"

"Two files should always be imported from etc/ in the Circos distribution. These are
# colors, fonts and fill patterns
<<include etc/colors_fonts_patterns.conf>>
# system and debug parameters
<<include etc/housekeeping.conf>>"

#################################################
# <image> block
Excerpts from "PNG Output"

"I suggest that you always import the default image settings.
<image>
# import defaults from Circos distribution
<<include etc/image.conf>>
</image>

The settings define the output file to be 3,000 x 3,000 pixels, with white background, named circos.png, which will be placed in the current directory."

"If you would like to overwrite any of these parameters, use the * suffix syntax.
# circos.conf
<image>
<<include etc/image.conf>>
file* = myfile.png
radius* = 1000p
</image>"


"Output image directory and filename are defined in the dir and file parameters of the <image> block. The produced image is always square, and its size set by the radius parameter (this is the size of the inscribed circle). If radius=1500p, then the image will be 3,000 x 3,000 pixels in size."

#################################################
# Ticks & Labels
 Excerpts from "Ticks & Labels"

'The radial position of the labels can be adjusted using label_radius. The quantity used as the reference for relative units depends on which parameter is defined. It is usually defined as the "parent container" of the element. For example, when defining ideogram position, the reference is image radius. When using track position, the reference is ideogram radius. As a result, when the parent element is moved (e.g. ideogram), all other elements move with it (e.g. data tracks).'

"Ticks are defined by group. You can have absolute or relatively spaced ticks, as well as ticks at specific positions. The primary parameter in each <tick> block is spacing. This defines the distance between adjacent ticks in this group. Typically, this value is defined in terms of chromosomes_units parameter — the suffix u is used for this — to keep the number legible. If a tick belongs to multiple groups, the group with largest spacing is preferred. Thus, the tick at 50 Mb will take its formatting from the spacing=25u group, not the spacing=5u group."




Empty the Contents of a File

Empty the contents of a file

.bashrc and .bash_profile

Excerpts from "What is the purpose of .bashrc and how does it work?"

".bashrc is a shell script that Bash runs whenever it is started interactively. You can put any command in that file that you could type at the command prompt. You put commands here to set up the shell for use in your particular environment, or to customize things to your preferences."

"Contrast .bash_profile and .profile which are only run at the start of a new login shell. (bash -l) You choose whether a command goes in .bashrc vs .bash_profile depending on whether you want it to run once or for every interactive shell start."

Friday 13 March 2015

ENCODE Tier 1, Tier 2 and Tier 3 Cells

Excerpts from "ENCODE Cell Types 2007 - 2012"

"Tier1 cells are of higher priority, and should be used within experiments before Tier2 cells. Additional cell types beyond the designated Tier1 and Tier2 could be used for ENCODE production; these are selected at the discretion of individual data production groups, and are designated Tier3."

===============================================
Excerpts from "ENCODE Project Common Cell Types"

"These common cell types include both cell lines and primary cell types, and plans are being made to explore the use of primary tissues and embryonic stem (ES) cells.

Cell types were selected largely for practical reasons, including their wide availability, the ability to grow them easily, and their capacity to produce sufficient numbers of cells for use in all technologies being used by ENCODE investigators. Secondary considerations were the diversity in tissue source of the cells, germ layer lineage representation, the availability of existing data generated using the cell type, and coordination with other ongoing projects. Effort was also made to select at least some cell types that have a relatively normal karyotype."

Detailed descriptions of Tier 1 and Tier 2 cells are included in the link above.

PRO-seq

Excerpts from "Precise Maps of RNA Polymerase Reveal How Promoters Direct Initiation and Pausing"

"PRO-seq uses biotin-labeled ribonucleotide triphosphate analogs (biotin-NTP) for nuclear run-on reactions, allowing the efficient affinity purification of nascent RNAs for high throughput sequencing from their 3’ ends (Figs. 1A, S1A). Supplying only one of the four biotin-A/C/G/UTP restricts Pol II to incorporate a single or at most a few identical bases, resulting in sequence reads that have the same 3’ end base within each library (table S1). Moreover, the incorporation of the first biotin-base inhibits further transcript elongation, ensuring base-pair resolution (fig. S2)."
===============================================
Excerpts from "Genome-Wide Control of RNA Polymerase II Activity by Cohesin"

"PRO-seq varies from GRO-seq in that biotin-labeled ribonucleotides are used to allow run-on for a nucleotide or two, instead of the longer run-on with BrUTP used in GRO-seq. PRO-seq, like GRO-seq [17], is highly sensitive, and unlike ChIP, does not depend on crosslinking efficiency or antibody specificity, and detects elongation-competent Pol II regardless of the phosphorylation status. Nuclei were isolated under conditions of ribonucleotide depletion to halt transcription, but leave Pol II transcriptionally engaged. The nascent RNA transcripts produced upon restart of transcription were used to generate a cDNA library for high-throughput sequencing. Inclusion of sarkosyl in the run-on transcription reaction prevents new transcription initiation, so that only Pol II that is already transcriptionally engaged is detected, and gene body and promoter paused Pol II are detected with equal efficiency [17]"