Sunday, 25 December 2016

Vi Editor: Exit ^X Mode

Ctrl+q

Thursday, 22 December 2016

Modify Python matplotlib Backend Settings

To find the location of the configuration file:
>>> import matplotlib
>>> matplotlib.matplotlib_fname()

Edit the matplotlib configuration file:
Modify "backend : tkagg" to "backend : Agg", for example.

More of Customizing matplotlib.



Friday, 16 December 2016

Detecting hierarchical 3-D genome domain reconfiguration with network modularity

Impact of regulatory variation across human iPSCs and differentiated cells

Zynda et al. SOFTWARE Repliscan: a tool for classifying replication timing regions

Mixture modeling of single-cell RNA-seq data to identify genes with differential distributions

CRISPRi-based genome-scale identification of functional long noncoding RNA loci in human cells

dbSUPER: a database of super-enhancers in mouse and human genome

Tuesday, 6 December 2016

Bash Sponge Command

Combining many files columnwise, use first column only once

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cited from Can bash wildcards specify negative matches?

If the extglob shell option is enabled using the shopt builtin, several
extended pattern matching operators are recognized. In the following 
description, a pattern-list is a list of one or more patterns separated 
by a |.  Composite patterns may be formed using one or more of the following
sub-patterns:

          ?(pattern-list)
                 Matches zero or one occurrence of the given patterns
          *(pattern-list)
                 Matches zero or more occurrences of the given patterns
          +(pattern-list)
                 Matches one or more occurrences of the given patterns
          @(pattern-list)
                 Matches one of the given patterns
          !(pattern-list)
                 Matches anything except one of the given patterns

Thursday, 10 November 2016

S-adenosyl-L-methionine (SAMe)

Cited from SAMe Safety

"SAMe is likely safe when taken by mouth in doses of 400-600 milligrams daily for up to two years; when taken by mouth at doses of 800-1,600 milligrams daily for up to 42 days; and when given through IV in doses up to 800 milligrams daily for up to 21 days."

Read Command

Cited from Getting User Input Via Keyboard

While Read Loop

Cited from For and Read-While Loops in Bash

while read line
do
  echo "$line"
done < list-of-dirs.txt

or 

while read line
do
  echo "$line"
done < <(cat list-of-dirs.txt)

Cited from While loop

while IFS= read -r line
do
  command1 on $line
  command2 on $line
  ..
  ....
  commandN
done < "/path/to/filename"

or

# IFS= is omitted here: an empty IFS would stop read from splitting the line into fields
while read -r field1 field2 field3 ... fieldN
do
  command1 on $field1
  command2 on $field1 and $field3
  ..
  ....
  commandN on $field1 ... $fieldN
done < "/path/to dir/file name with space"

"IFS is used to set field separator (default is white space). The -r option to read command disables backslash escaping (e.g., \n, \t). This is failsafe while read loop for reading text files."


Differences Between $@ and $* as Positional Parameters

Cited from $IFS
  • "$@" expands to "$1" "$2" "$3" ... "$n", i.e. each positional parameter stays a separate word.
  • "$*" expands to "$1y$2y$3y...$n", where y is the first character of the IFS variable, i.e. "$*" is one long string and that character acts as the separator.

IFS

Cited from Bash: Show IFS value

To show IFS value,
printf %q "$IFS"

What is the meaning of IFS=$'\n' in bash scripting?

Cited from Getting User Input Via Keyboard

cat -etv <<<"$IFS"

Sample outputs:
  ^I$
$

Where,
  • $ - end of line i.e. newline 
  • ^I$ - tab and newline

Arguments for "Format"

Cited from The printf command

%f      Interpret and print the associated argument as a floating point number
%e      Interpret the associated argument as double, and print it in <N>±e<N> format
%g      Interpret the associated argument as double, but print it like %f or %e

Friday, 21 October 2016

RNA-seq 5-prime to 3-prime Bias

Source of 3-prime Bias in PolyA-enriched RNA-seq

Cited from Figure 8: The use of 3′ bias as a quality control assay for cDNA.

"Total (bulk) RNA derived from tissue is confirmed to have a high RIN score before isolation of nuclei. Partial degradation of the RNA might occur during the preparation of nuclei by Dounce homogenization (nuclei prep) or FACS of the individual nuclei. If the mRNA is degraded by hydrolysis, shearing or RNases, truncated mRNA species could be created, and those containing the polyA sequence at the 3′ end of the transcripts might produce cDNA. This would generate greater RNA-seq coverage of the 3′ end of transcripts (3′-bias) compared with the high-quality bulk RNA."

======================================

Cited from NGS Quality Control in RNA Sequencing- Some Free Tools

To visualise 5-prime or 3-prime bias, use the tools of Picard or RSeQC.

======================================

Cited from How to understand median 5 prime to 3 prime bias ratio from Picard?

It is up to the analysis software to deal with 5-prime or 3-prime bias.

======================================

Cited from  Salmon Doc

"--seqBias" to learn sequence bias

Salmon uses a variable-length Markov Model (VLMM) to model the sequence specific biases at both the 5’ and 3’ end of sequenced fragments. This methodology generally follows that of Roberts et al. [2], though some details of the VLMM differ.


Bash Parameter Substitution

Cited from How to tell if a string is not defined in a bash shell script?
  • ${var+blahblah}: if var is defined, 'blahblah' is substituted for the expression, else null is substituted
  • ${var-blahblah}: if var is defined, it is itself substituted, else 'blahblah' is substituted
  • ${var?blahblah}: if var is defined, it is substituted, else the shell exits with 'blahblah' as an error message.

Thursday, 13 October 2016

Wednesday, 7 September 2016

Coefficient of Variation as a Measure for Transcriptional Stability

"To measure transcriptional stability, we computed the coefficient of variation for gene expression over 12 developmental time points."

Tuesday, 6 September 2016

Mini Review "Cohesin Loading and Sliding"

Cohesin Loading and Sliding

MS2 Tagging

Cited from In Vivo RNA Visualization in Plants Using MS2 Tagging

"This technique involves the tagging of the RNA of interest with repeats of an RNA stem-loop (SL) that is derived from the origin of assembly of the bacteriophage MS2 and recruits the MS2 coat protein (MCP). Thus, expression of MCP fused to a fluorescent marker allows the specific visualization of the SL-carrying RNA."


Saturday, 3 September 2016

Condensins, Topoisomerases and Cohesion

Cited from the review "Chromosome Condensation and Cohesion"

"Early research on chromosome structure demonstrated the existence of a nonhistone protein scaffold that runs along the chromatids. This scaffold is composed mainly of two proteins topoisomerase IIa and the condensin subunit SMC2."

"Topoisomerases modify the topology of DNA by transiently introducing nicks in a single strand to allow relaxation of supercoils (topoisomerases I and III) or breaking a double strand to allow passage of another DNA duplex through the opening (topoisomerase II). The latter reaction allows the catenation or decatenation of two DNA molecules and is essential for chromosome individualisation and condensation, as well as for sister-chromatid resolution and segregation."

"Condensins contain two Structural Maintenance of Chromosomes (SMC) proteins, SMC2 and SMC4. They form long coiled-coil rods joined by a hinge region and containing an adenosine triphosphatase (ATPase) head at the free end."

"Eukaryotic SMCs are found in three types of complexes. Condensins consisting of the SMC2/4 heterodimer are involved in chromosome condensation. The cohesin complex containing SMC1/3 mediates sister-chromatid cohesion. The SMC5/6 complex is involved in DNA repair and telomere maintenance."

"Two forms of condensin complexes exist: condensins I and II. Both complexes are pentamers that contain SMC2 and SMC4, but differ in their non-SMC subunits. Condensin I contains the non-SMC subunits CAP-D2, CAP-G and CAP-H, whereas condensin II contains CAP-D3, CAP-G2 and CAP-H2."

"These two complexes might play different roles in the condensation process, since depletion of non-SMC subunits of condensin I results in ‘puffed’ chromosomes while depletion of those in condensin II leads to ‘curly’ chromosomes."

"It is generally accepted that both condensin and topoisomerase IIa (Topo IIa) are important for chromosome condensation."

"Both Topo IIa and condensin associate with chromosomes in late G2 primarily at centromeres. Topo IIa decorates the chromosome scaffold during prophase, but condensin enrichment occurs later, in prometaphase."

"Condensin II is present in the nucleus during interphase while condensin I is cytoplasmic and comes into contact with chromosomes only after nuclear envelope breakdown. Selective depletion of condensin II, but not condensin I, by depleting their non-SMC subunits showed delayed chromosome condensation in prophase."

"Interestingly, cells depleted of both condensins I and II were still able to condense their chromosomes."

"Maintenance of chromosome condensation, therefore, seems to rely more on condensin."

"Cohesin binds to chromatin during early G1 and before DNA replication, however, suggesting that cohesin binding to chromatin does not equal sister-chromatid cohesion."

"Approximately 90% of cohesin can be depleted from human cells without substantial defects in sister chromatid cohesion."

"In higher eukaryotes, cohesin removal occurs through two pathways. In the prophase pathway, Plk1-mediated phosphorylation of SA1/2 triggers its removal by the Wapl–Pds5 complex. This pathway removes most cohesin from the chromosome arms. Centromeric cohesin is protected from the prophase pathway by the Sgo1–PP2A complex. In the metaphase pathway, the Scc1 subunit of the centromeric pool of cohesin is cleaved by separase to allow anaphase onset."

"Centromeric cohesion is actively protected during mitosis by two mechanisms: protection against cohesin removal by the Sgo1–PP2A complex and inhibition of separase activation by the spindle checkpoint."

"Thus, Topo IIa is required not only for chromosome individualisation and condensation during early mitosis, but also for sister chromatid separation during anaphase."

"Cleavage of centromeric cohesin by separase promotes DNA decatenation by Topo IIa, presumably because cohesin removal increases the access of Topo IIa to catenated DNA."




Mitosis-specific Histone Modifications

Cited from the review "Chromosome Condensation and Cohesion"

"H3-S10 phosphorylation is mediated by the Aurora B kinase. It initiates at centromeres in late G2 and extends to the whole chromosome by early mitosis."

"Phosphorylation of H3-S10 dissociates chromatin-bound proteins such as heterochromatin protein 1 (HP1) and splicing factors SRp20 and ASF/SF2 during mitosis, suggesting that this phosphorylation event might contribute to chromosome condensation by removing chromatin-bound proteins."

"In addition to histone H3, the linker histone H1 is also heavily phosphorylated during mitosis, which has been implicated in chromosome condensation."



Tuesday, 23 August 2016

Co-localization of Interval Sets in ChIP-seq

ColoWeb: a resource for analysis of colocalization of genomic features

LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor

Cited from COPS: Detecting Co-Occurrence and Spatial Arrangement of Transcription Factor Binding Motifs in Genome-Wide Datasets

"In order to compare the in vivo overlap with the expected (background) overlap, an overlap analysis was performed for the frequent motif patterns for which genome-wide data was available. The expected overlap was measured by randomly permuting (1000 times) the same number of regions bound by one TF through the genome and the mean overlap was subsequently calculated. The significance of the observed compared to the expected overlap was calculated by assuming that the overlap follows a Poisson distribution."

CGATOxford Tools

Monday, 1 August 2016

Fluorescence Recovery After Photobleaching (FRAP)

YouTube Video on Fluorescence recovery after photobleaching (FRAP)

HaloTagging Protein for Purification, Interactions and Imaging

Cited from HaloTag Technology for Protein Purification, Protein Interactions and Imaging

  1. The HaloTag protein can be fused with a protein of interest.
  2. A family of HaloTag ligands comes with different functionalities.
  3. Ligands consist of two parts: a reactive linker and a functional group, such as a fluorescent dye or biotin.
  4. Binding of the ligand to the HaloTag protein is rapid and irreversible.
  5. The HaloTag protein is a genetically modified hydrolase that covalently binds hydrolase substrates such as the HaloTag ligands.

Chromatin Digestion by Micrococcal Nuclease

Cited from the paper "Assays of nucleosome assembly and the inhibition of histone acetyltransferase activity. (11) Digestion of chromatin; and (12) Purification and characterization of DNA after digestion of chromatin"

"Digestion of chromatin by micrococcal nuclease (MNase) provides a relatively simple method for obtaining information about the locations of nucleosomes along DNA strands. When nuclei in permeabilized cells are exposed to MNase in the presence of a divalent cation, the enzyme makes double-stranded cuts between nucleosomes. Treatment of chromatin substrates with very high concentrations of MNase yields mononucleosome-length DNA predominantly, while lower concentrations of the enzyme generate one double-stranded cut at intervals of 10 to 50 nucleosomes, depending on the concentration of the enzyme and the substrate. MNase can also make single-stranded DNA cuts at the sites of histone octamers, and, thus, attempts to map the positions of nucleosomes are usually performed with native double-stranded DNA."

Cell Line: G1E ER4

Cited from the Paper "Tissue-Specific Mitotic Bookmarking by Hematopoietic Transcription Factor GATA1"

"To monitor GATA1 localization on a global scale in living, unsynchronized erythroid cells, GATA1-YFP fusion constructs were stably introduced into G1E cells."

"G1E cells are erythroid precursors that lack GATA1 and consequently fail to mature (Weiss et al., 1997). Introduction of a conditional form of GATA1 (GATA1 fused to the ligand binding domain of the estrogen receptor [ER]) conveys estradiol (E2)-dependent erythroid maturation in a manner faithfully reproducing that of normal erythroid cells."

"GATA1-ER target gene occupancy and expression closely match that of endogenous GATA1 in primary erythroblasts, providing a physiological assay for GATA1 function. Both N-terminal and C-terminal YFP fusions of GATA1-ER were generated to account for potential effects of YFP on GATA1-ER function. YFP-GATA1-ER and GATA1-ER-YFP were expressed at levels similar to endogenous GATA1 and were equally capable of inducing erythroid differentiation when compared to wild-type GATA1."


Tuesday, 26 July 2016

Generate Background Sequences Using Markov Model

Cited from GimmeMotifs documentation

"
generate_background_sequences.py

Generate random sequences according to one of two methods: random or matched_genomic. With the argument type set to random, and an input file in FASTA format, this script will generate sequences with the same dinucleotide distribution as the input sequences according to a 1st order Markov model trained on the input sequences. The -n options is set to 10 by default. The length distribution of the sequences in the output file will be similar as the inputfile. The Markov model can be changed with option -m. If the type is specified as matched_genomic the inputfile needs to be in BED format, and the script will select genomic regions with a similar distribution relative to the transcription start of genes as the input file. Make sure to select the correct genome. The length of the sequences in the output file will be set to the median of the features in the input file.
"

What Is The Appropriate Order For A Background Model In Motif Searches?

Implementation of fasta-get-markov on GUI

Motif Analysis

Using Weeder, Pscan, and PscanChIP for the Discovery of Enriched Transcription Factor Binding Site Motifs in Nucleotide Sequences

A review of ensemble methods for de novo motif discovery in ChIP-Seq data

THiCweed: fast sensitive motif finding via clustering of big data sets

Tuesday, 5 July 2016

R Data Table .I and J()

Cited from Understanding .I in data.table in r

.I is a vector representing the row numbers

Using .I to return row numbers with data.table package

######################################
Cited from How is J() function implemented in data.table?

J(.) is deprecated and simply replaced with list(.).

Friday, 1 July 2016

R data.table Tips

Cited from Introduction to data.table

Within the frame of a data.table, columns can be referred to as if they are variables.

We can use “-” on character columns within the frame of a data.table to sort in decreasing order.

We wrap the variables (column names) within list(), which ensures that a data.table is returned. In case of a single column name, not wrapping with list() returns a vector instead.

data.table also allows using .() to wrap columns with. It is an alias to list(); they both mean the same. Feel free to use whichever you prefer.

Since .() is just an alias for list(), we can name columns as we would while creating a list.

For example,
ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

The special symbol .N is a built-in variable that holds the number of observations in the current group.

Setting with=FALSE disables the ability to refer to columns as if they are variables.

We can also deselect columns using - or !.

Changing 'by' to 'keyby' automatically orders the result by the grouping variables in increasing order.

Special symbol .SD. It stands for Subset of Data. It by itself is a data.table that holds the data for the current group defined using by.

.SD would contain all the columns other than the grouping variables by default.

Using the argument .SDcols. It accepts either column names or column indices. For example, .SDcols = c("arr_delay", "dep_delay") ensures that .SD contains only these two columns for each group.

######################################
Cited from Keys and fast binary search based subset

We can set keys on multiple columns and the columns can be of different types. Uniqueness is not enforced.

Setting a key does two things:
  1. reorders the rows of the data.table by the column(s) provided by reference, always in increasing order.
  2. marks those columns as key columns by setting an attribute called sorted to the data.table.
Since the rows are reordered, a data.table can have at most one key because it can not be sorted in more than one way.

setkey() and setkeyv() modify the input data.table by reference. They return the result invisibly.

In data.table, the := operator and all the set* functions (e.g., setkey, setorder, setnames) are the only ones which modify the input object by reference.

In addition to ordering, keyby also sets the key column.

######################################
Cited from Reference semantics

:= returns the result invisibly. Sometimes it might be necessary to see the result after the assignment. We can accomplish that by adding an empty [] at the end of the query, like flights[hour == 24L, hour := 0L][].

The copy() function deep copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object.

######################################
Cited from Efficient reshaping using data.tables

By default, variable column is of type factor. Set variable.factor argument to FALSE if you’d like to return a character vector instead.

Thursday, 30 June 2016

Hi-C Terminology

Cited from Paper "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping"

'We define the ‘‘matrix resolution’’ of a Hi-C map as the locus size used to construct a particular contact matrix and the ‘‘map resolution’’ as the smallest locus size such that 80% of loci have at least 1,000 contacts. The map resolution is meant to reflect the finest scale at which one can reliably discern local features.'
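The "map resolution" definition can be turned into a short search over candidate bin sizes. The sketch below is only an illustration: it assumes contact ends are given as integer positions on one chromosome, counts every end that falls in a bin as one contact for that locus, and uses made-up names and toy data.

```python
def map_resolution(positions, chrom_length, bin_sizes, min_contacts=1000, min_fraction=0.8):
    """Return the smallest bin size at which min_fraction of bins have >= min_contacts.

    positions: integer coordinates of contact ends on a single chromosome.
    Returns None if no candidate bin size satisfies the criterion.
    """
    for bin_size in sorted(bin_sizes):
        n_bins = -(-chrom_length // bin_size)  # ceiling division
        counts = [0] * n_bins
        for pos in positions:
            counts[pos // bin_size] += 1
        covered = sum(count >= min_contacts for count in counts)
        if covered / n_bins >= min_fraction:
            return bin_size
    return None


# Toy data: a 1,000 bp "chromosome" with exactly 10 contact ends at every position.
toy_positions = [p for p in range(1000) for _ in range(10)]

resolution = map_resolution(toy_positions, 1000, [10, 50, 100, 200])
print(resolution)
```

In this toy case a 50 bp bin holds only 500 contacts, so the first bin size that reaches 1,000 contacts per locus is 100 bp.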

'We began by probing the 3D partitioning of the genome. In our earlier experiments at 1 Mb map resolution (Lieberman-Aiden et al., 2009), we saw large squares of enhanced contact frequency tiling the diagonal of the contact matrices. These squares partitioned the genome into 5–20 Mb intervals, which we call ‘‘megadomains.’’We also found that individual 1 Mb loci could be assigned to one of two long-range contact patterns, which we called compartments A and B, with loci in the same compartment showing more frequent interaction. Megadomains—and the associated squares along the diagonal—arise when all of the 1 Mb loci in an interval exhibit the same genome-wide contact pattern. Compartment A is highly enriched for open chromatin; compartment B is enriched for closed chromatin.'

'Two of the five interaction patterns are correlated with loci in compartment A (Figure S4E). We label the loci exhibiting these patterns as belonging to subcompartments A1 and A2. Both A1 and A2 are gene dense, have highly expressed genes, harbor activating chromatin marks such as H3K36me3, H3K79me2, H3K27ac, and H3K4me1 and are depleted at the nuclear lamina and at nucleolus-associated domains (NADs) (Figures 2D, 2E, and S4I; Table S3). While both A1 and A2 exhibit early replication times, A1 finishes replicating at the beginning of S phase, whereas A2 continues replicating into the middle of S phase. A2 is more strongly associated with the presence of H3K9me3 than A1, has lower GC content, and contains longer genes (2.4-fold).'

'The other three interaction patterns (labeled B1, B2, and B3) are correlated with loci in compartment B (Figure S4E) and show very different properties. Subcompartment B1 correlates positively with H3K27me3 and negatively with H3K36me3, suggestive of facultative heterochromatin (Figures 2D and 2E). Replication of this subcompartment peaks during the middle of S phase. Subcompartments B2 and B3 tend to lack all of the above-noted marks and do not replicate until the end of S phase (see Figure 2D). Subcompartment B2 includes 62% of pericentromeric heterochromatin (3.8-fold enrichment) and is enriched at the nuclear lamina (1.8-fold) and at NADs (4.6-fold). Subcompartment B3 is enriched at the nuclear lamina (1.6-fold), but strongly depleted at NADs (76-fold).'

'Upon closer visual examination, we noticed the presence of a sixth pattern on chromosome 19 (Figure 2F). Our genome-wide clustering algorithm missed this pattern because it spans only 11 Mb, or 0.3% of the genome. When we repeated the algorithm on chromosome 19 alone, the additional pattern was detected. Because this sixth pattern correlates with the Compartment B pattern, we labeled it B4. Subcompartment B4 comprises a handful of regions, each of which contains many KRAB-ZNF superfamily genes. (B4 contains 130 of the 278 KRAB-ZNF genes in the genome, a 65-fold enrichment). As noted in previous studies (Vogel et al., 2006; Hahn et al., 2011), these regions exhibit a highly distinctive chromatin pattern, with strong enrichment for both activating chromatin marks, such as H3K36me3, and heterochromatin-associated marks, such as H3K9me3 and H4K20me3.'

Definition of In situ Hi-C

Cited from Paper "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping"

In situ Hi-C: DNA-DNA proximity ligation is performed in intact nuclei.

Monday, 20 June 2016

Multiscale analysis of genome-wide replication timing profiles using a wavelet-based signal-processing algorithm

"Replication starts from a set of initiation loci, called replication origins, where two replication forks are assembled and begin replicating DNA while proceeding in opposite directions, away from the loci; fork progression continues until two converging forks 'collide' at a terminus of replication."

"The DNA replication program in a cell is defined as the temporal sequence of locus replication events during the S phase. The program depends on the locations of the replication origins, their activation times and the speed at which replication forks move along the DNA double helix."




DNA Replication Timing

  1. Replication of eukaryotic chromosomes takes place in segments.
  2. The rate of elongation of replication forks varies little throughout S phase.
  3. It is the temporal order of replication, not the sites of initiation, that is conserved among species.
  4. In multicellular but not unicellular organisms, early replication is correlated with transcriptional activity and is developmentally regulated.
  5. Large-scale chromatin folding is important in the regulation of replication timing in both yeasts and mammals.

Friday, 3 June 2016

hESNet: Human Embryonic Stem Cell Transcription Network

http://wanglab.ucsd.edu/star/hESnet/

Differences Between Epiblast and Embryonic Stem Cells

Cited from Collection: Naive Pluripotency

"Ground-state naive pluripotency is established in the epiblast of the mature blastocyst and may be captured in vitro in the form of embryonic stem cells. Although rodent cells can exist in both primed and naive pluripotent states, establishing a naive state in human cells has been difficult to obtain."

Cell Identity Markers

Gata4, a primitive endoderm marker (Paper: Control of ground-state pluripotency by allelic regulation of Nanog)
Fgf4, a pluripotency-associated gene (Paper: Control of ground-state pluripotency by allelic regulation of Nanog)
Pecam1, a non-pluripotency transmembrane protein on the cell surface, expressed in mouse embryonic stem cells.
Bmp4, a non-pluripotency factor expressed in mouse embryonic stem cells; a member of the bone morphogenetic protein family, which is part of the transforming growth factor-beta superfamily.



Naive Epiblast Explanation

Cited from the paper "Nanog Is the Gateway to the Pluripotent Ground State"

" After fertilization, mammalian zygotes follow a program of cleavage divisions and elaborate two extraembryonic lineages, trophoblast and hypoblast (Selwood and Johnson, 2006). This preparatory phase of development culminates in creation of the embryo founder tissue, a population of unrestricted pluripotent cells known as the epiblast (Gardner and Beddington, 1988 and Nichols and Smith, 2009). The epiblast proliferates to provide the substrate for axis formation, germlayer specification, and gastrulation. Naive early epiblast cells can be immortalized in culture in the form of embryonic stem (ES) cells (Brook and Gardner, 1997, Evans and Kaufman, 1981 and Martin, 1981). Pluripotent cells can also be created outside the embryo by reprogramming somatic cells, either by fusion with pre-existing pluripotent cells (Miller and Ruddle, 1976, Tada et al., 1997, Tada et al., 2001 and Takagi et al., 1983) or, more compellingly, by transfection with regulatory transcription factors (Takahashi and Yamanaka, 2006)."

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cited from the paper "Control of ground-state pluripotency by allelic regulation of Nanog"

"The ICM (inner cell mass) of the late blastocyst contains two lineages: the extra-embryonic primitive endoderm, and the ‘ground-state’ pluripotent epiblast, which gives rise to the embryo. Inner cells expressing Nanog biallelically also express Oct4 but not Gata4, a primitive endoderm marker, and therefore are epiblast cells."

Wednesday, 1 June 2016

How to Import narrowPeak into R

How to import narrowPeak files

GenomicRanges: 'queryHits' and 'subjectHits' in findOverlaps

Cited from IRanges - minoverlap

> hits = findOverlaps(ir, minoverlap=100L)

It returns an object that tells which queries overlap which subjects, where query and subject are in effect the same ranges.
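What a self-overlap query with minoverlap reports can be mimicked in a few lines of Python. This is a conceptual sketch, not the IRanges implementation: it assumes 1-based closed intervals as in IRanges, keeps self-hits, and returns plain (queryHit, subjectHit) index pairs.

```python
def find_self_overlaps(ranges, minoverlap=1):
    """Return (query, subject) index pairs whose shared width is >= minoverlap.

    ranges: list of (start, end) pairs, 1-based and inclusive.
    Indices in the result are 1-based to mirror R.
    """
    hits = []
    for q, (q_start, q_end) in enumerate(ranges, start=1):
        for s, (s_start, s_end) in enumerate(ranges, start=1):
            shared = min(q_end, s_end) - max(q_start, s_start) + 1
            if shared >= minoverlap:
                hits.append((q, s))
    return hits


# Hypothetical ranges: 1-200 and 101-300 share 100 positions; 101-300 and 250-400 share 51.
toy_ranges = [(1, 200), (101, 300), (250, 400)]
toy_hits = find_self_overlaps(toy_ranges, minoverlap=100)
print(toy_hits)
```

Each pair lists a query index first and a subject index second; because both come from the same set, every sufficiently wide range also hits itself.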

Saturday, 28 May 2016

Gene Targeting

Cited from Gene targeting

"Gene targeting requires the creation of a specific vector for each gene of interest. However, it can be used for any gene, regardless of transcriptional activity or gene size."

"To target genes in mice, this construct is then inserted into mouse embryonic stem cells in culture. After cells with the correct insertion have been selected, they can be used to contribute to a mouse's tissue via embryo injection. Finally, chimeric mice where the modified cells made up the reproductive organs are selected for via breeding. After this step the entire body of the mouse is based on the previously selected embryonic stem cell."

RNAi siRNA and shRNA

Cited from RNAi (RNA interference) defined

  1. Synthetic (siRNA) or single stranded RNA (ssRNA) containing two complementary sequences separated by a non-complementary sequence, which folds back on itself to form a synthetic short hairpin RNA (shRNA).
  2. Expressed from a DNA construct which encodes an shRNA molecule. This is the dd (DNA-directed) RNAi approach.
siNT: non-target siRNA

Cited from the paper "A Novel Multiplex Cell Viability Assay for High-Throughput RNAi Screening"


"Nucleus or DNA stain using fluorescent molecules, such as Hoechst 33342, Hoechst 33258, DAPI or other dyes have been long-serving and commonly applied indicators of cellular viability."

Friday, 20 May 2016

Pausing an R Script for User Input

Cited from Pausing an R script: a generic pause function

pause = function(){
    if (interactive()) {
        invisible(readline(prompt = "Press <Enter> to continue..."))
    }
    else {
        cat("Press <Enter> to continue...")
        invisible(readLines(file("stdin"), 1))
    }
}

Thursday, 19 May 2016

Define Bivalent Promoters Computationally

Cited from Loss of the Polycomb Mark from Bivalent Promoters Leads to Activation of Cancer-Promoting Genes in Colorectal Tumors

A promoter was defined as bivalent if it contained overlapping H3K4me3 and H3K27me3 peaks at expanded promoter areas (2.4 kb < TSS < 0.6 kb).
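A rough sketch of this definition in code. The window bounds are an assumption: the inequality above is read here as 2.4 kb upstream to 0.6 kb downstream of the TSS, taking strand into account. Peaks are half-open (start, end) intervals on one chromosome, and all names and coordinates are hypothetical.

```python
def promoter_window(tss, strand, upstream=2400, downstream=600):
    """Expanded promoter window around a TSS (assumed: 2.4 kb upstream, 0.6 kb downstream)."""
    if strand == "+":
        return (max(0, tss - upstream), tss + downstream)
    return (max(0, tss - downstream), tss + upstream)


def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def is_bivalent(tss, strand, k4me3_peaks, k27me3_peaks):
    """True if an H3K4me3 peak and an H3K27me3 peak overlap each other inside the window."""
    window = promoter_window(tss, strand)
    for k4 in k4me3_peaks:
        if not overlaps(k4, window):
            continue
        for k27 in k27me3_peaks:
            if overlaps(k27, window) and overlaps(k4, k27):
                return True
    return False


# Hypothetical peaks near a plus-strand TSS at position 10,000.
print(is_bivalent(10000, "+", [(9000, 9500)], [(9400, 10200)]))
```

For genome-wide work the same logic is usually expressed with interval tools (for example bedtools intersect or GenomicRanges) rather than nested loops.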

Thursday, 12 May 2016

High-Throughput (HT) SELEX combines SELEX (Systematic Evolution of Ligands by EXponential Enrichment)

Cited from SELEX experiments: new prospects, applications and data analysis in inferring regulatory pathways

"Systematic Evolution of Ligands by EXponential enrichment (SELEX) is an experimental procedure that allows extraction, from an initially random pool of oligonucleotides, of the oligomers with a desired binding affinity for a given molecular target."

####################################
Cited from Large scale analysis of the mutational landscape in HT-SELEX improves aptamer discovery

Aptamers: short (20–100 nucleotides), synthetic, single-stranded (ribo)-nucleic molecules.

####################################
Cited from Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities

We describe here a high-throughput method for analyzing transcription factor binding specificity that is based on systematic evolution of ligands by exponential enrichment (SELEX) and massively parallel sequencing.



Nature Reviews Molecular Cell Biology: Naive to Primed Pluripotency

Stem cell states: naive to primed pluripotency

MyLabStocks

Monday, 9 May 2016

TF Target Genes Overlap with Active/Repressive Histone Marks

Cited from Supplementary Figure 4 of "Concerted genomic targeting of H3K27 demethylase REF6 and chromatin-remodeling ATPase BRM in Arabidopsis".

The Axoneme of Cilia

Cited from The Axoneme of Cilia

A cilium, like a flagellum, is composed of a central core (the axoneme), which contains two central microtubules that are surrounded by an outer ring of nine pairs of microtubules. 

Sunday, 1 May 2016

FUCCI and Nocodazole in Studying Cell Cycle

FUCCI Cell Cycle Sensor

Cdt1 is a DNA replication factor. It licenses the formation of the pre-replication complex.

Geminin is a DNA replication inhibitor.

############################################################################
Cited from Aphidicolin

Aphidicolin is an antiviral and antimitotic antibiotic produced by a fungus as a secondary metabolite. It reversibly inhibits DNA polymerases α and δ in eukaryotic cells and therefore inhibits eukaryotic nuclear DNA replication. It arrests cells in early S phase.

############################################################################
Cited from G1 versus G2 cell cycle arrest after adriamycin-induced damage in mouse Swiss3T3 cells

Adriamycin is a DNA-damaging agent that intercalates into DNA. It is known to arrest cells in G1 or G2 phase.

############################################################################
Cited from Nocodazole

Nocodazole interferes with the polymerization of microtubules, and cells treated with nocodazole arrest in G2 or M phase. Prolonged nocodazole treatment leads to apoptosis.





Monday, 25 April 2016

The Differences Between "<-" and "="

Cited from Difference between assignment operators in R

To reduce ambiguity, we should use either <- or = as the assignment operator, and only use = as the named-parameter specifier for functions.

In conclusion, for better readability of R code, I suggest that we only use <- for assignment and = for specifying named parameters.

Friday, 15 April 2016

Inferring Direct DNA Binding From ChIP-seq

Inferring Direct DNA Binding From ChIP-seq

Position Weight Matrix (PWM)

Position Weight Matrix Wiki

Position Weight Matrix Tutorial

Position Weight Matrix From Sequence Alignment

MEME Suite

MEME Online Suite

Individual Tools in MEME Suite Explained

Latex Beamer Display "Sections"

Sections and subsections

Tissue Specificity and Shannon entropy

Promoter features related to tissue specificity as measured by Shannon entropy

Mini-tutorial on Shannon Entropy

Dave Tang's Journal Club Wiki

Dave Tang's Journal Club Wiki

Single-Cell Omics and Chromatin Topology

Genome Biology: Single-Cell Omics

Genome Biology: The three dimensional organization of the nucleus

LaTeX Figures Side by Side

LaTeX figures side by side

Wednesday, 13 April 2016

Monday, 11 April 2016

Non-standard Evaluation in R (Meta-programming)

Cited from Non-standard evaluation

substitute() looks at a function argument and instead of seeing the value, it sees the code used to compute the value. substitute() returns an expression.

substitute() works because function arguments are represented by a special type of object called a promise. A promise captures the expression needed to compute the value and the environment in which to compute it.

substitute() is often paired with deparse(). That function takes the result of substitute(), an expression, and turns it into a character vector.

One important feature of deparse() to be aware of when programming is that it can return multiple strings if the input is too long.

eval() takes an expression and evaluates it in the specified environment.

quote(). It captures an unevaluated expression like substitute(), but doesn’t do any of the advanced transformations that can make substitute() confusing. quote() always returns its input as is.

So if you only provide one argument, it will evaluate the expression in the current environment. This makes eval(quote(x)) exactly equivalent to x, regardless of what x is.

eval()’s second argument need not be limited to an environment: it can also be a list or a data frame.

=====================================
Cited from Tips on non-standard evaluation in R

In fact, eval(expr, envir, enclos) basically follows the following logic to evaluate a quoted expression:
  1. If envir is an environment, then evaluate expr in envir by looking for symbols all the way along envir and its parent environments until found.
  2. If envir is a list, then evaluate expr given the symbols defined in the list; whenever a symbol is not found in the list, the function will go to the enclos environment and search along the chain until it is found.
  3. If a symbol is not found until the empty environment (the only environment having no parent) is reached, an error occurs. 
=====================================
Non standard evaluation from another function in R
 





Thursday, 7 April 2016

Tab "\t" in Bash

Cited from Bash Join Command


Use $'\t' for the tab character, not just -t \t. Bash does not interpret \t as a tab unless it is inside $'...' quotes.
join -t $'\t' ...
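
A minimal sketch of the point above, using two small made-up tab-separated files (the file names and contents are purely illustrative). Both inputs are already sorted on the join field, which join expects.

```shell
# Work in a scratch directory so nothing existing is touched.
tmpdir=$(mktemp -d)

# Two tab-separated files sharing a key in column 1 (illustrative data).
printf 'gene1\t10\ngene2\t20\n' > "$tmpdir/a.tsv"
printf 'gene1\tup\ngene2\tdown\n' > "$tmpdir/b.tsv"

# Bash expands $'\t' to a literal tab before join ever sees the argument.
joined=$(join -t $'\t' "$tmpdir/a.tsv" "$tmpdir/b.tsv")
printf '%s\n' "$joined"

rm -r "$tmpdir"
```

Each output line holds the shared key followed by the remaining columns of both files, still separated by tabs.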

Saturday, 2 April 2016

What does "canonical" mean in biology?

Most likely, "canonical" in biology means "consensus".

Cited from Canonical sequence

"A canonical sequence is a sequence of DNA, RNA, or amino acids that reflects the most common choice of base or amino acid at each position."

Monday, 28 March 2016

Gawk: Count The Number of Upper or Lower Cases in a String

To count the number of upper case letters in a string,
echo 'ERica' | gawk '{print gsub("[A-Z]", "",$0)}'
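
The same idea extends to lower case. The sketch below counts both at once; it assumes plain awk is enough (gsub() returning the number of substitutions is standard awk behaviour) and pins the locale to C so that the ranges [A-Z] and [a-z] cover only ASCII letters.

```shell
# gsub() returns the number of substitutions made, i.e. the match count.
# Each count works on its own copy (u, l) so the record itself stays intact.
counts=$(echo 'ERica' | LC_ALL=C awk '{ u = $0; l = $0;
    print gsub(/[A-Z]/, "", u), gsub(/[a-z]/, "", l) }')
echo "$counts"   # upper-case count, then lower-case count
```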

Replacement Text Case Conversion in Regular Expression

Replacement Text Case Conversion

For example,
to change the '\2' to the uppercase,
nd=`dirname $f | perl -pe "s|(.+/)([^/]+)/?$|\1\U\2|g"`

Wget (The Non-interactive Network Downloader) Options

--content-disposition

If this is set to on, experimental (not fully-functional) support for "Content-Disposition" headers is enabled. This can currently result in extra round-trips to the server for a "HEAD" request, and is known to suffer from a few bugs, which is why it is not currently enabled by default.

This option is useful for some file-downloading CGI programs that use "Content-Disposition" headers to describe what the name of a downloaded file should be.

--no-check-certificate

Don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate.

As of Wget 1.10, the default is to verify the server's certificate against the recognized certificate authorities, breaking the SSL handshake and aborting the download if the verification fails. Although this provides more secure downloads, it does break interoperability with some sites that worked with previous Wget versions, particularly those using self-signed, expired, or otherwise invalid certificates. This option forces an "insecure" mode of operation that turns the certificate verification errors into warnings and allows you to proceed.

If you encounter "certificate verification" errors or ones saying that "common name doesn't match requested host name", you can use this option to bypass the verification and proceed with the download. Only use this option if you are otherwise convinced of the site's authenticity, or if you really don't care about the validity of its certificate. It is almost always a bad idea not to check the certificates when transmitting confidential or important data.

Friday, 25 March 2016

R ggplot2 vjust and hjust

What do hjust and vjust do when making a plot using ggplot?

Imagine that the text is bordered within a box.

hjust=0 places the left side of the box at the reference position. hjust=n (n>0) shifts the box to the left by n*(box width) relative to the reference position. hjust=n (n<0) shifts the box to the right by |n|*(box width) relative to the reference position.

vjust=0 places the bottom side of the box at the reference position. vjust=n (n>0) shifts the box down by n*(box height) relative to the reference position. vjust=n (n<0) shifts the box up by |n|*(box height) relative to the reference position.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cited from the book "ggplot2 Elegant Graphics for Data Analysis"

Justification of a string (or legend) defines the location within the string that is placed at the given position. There are two values for horizontal and vertical justification. The values can be:
  • A string: "left", "right", "centre", "center", "bottom", and "top".
  • A number between 0 and 1, giving the position within the string (from bottom-left corner).

Thursday, 24 March 2016

Deconvolute R Package UpSetR

Functions located in Helper.funcs.R:

## Finds the columns that represent the sets
FindStartEnd

## Finds the n largest sets if the user hasn't specified any sets
FindMostFreq

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Functions located in MainBar.R:
Counter
Make_main_bar

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Functions located in Matrix.R
Create_matrix
Create_layout
MakeShading

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Functions located in SizeBar.R
FindSetFreqs

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Functions located in General.query.funcs.R
General.query.funcs.R

Monday, 21 March 2016

Perl Debugger

Perl Debugger Tutorial: 10 Easy Steps to Debug Perl Program

Perl: the Input Record Separator

Cited from slurp mode - reading a file in one step

The $/ variable is the Input Record Separator in Perl. When we put the read-line operator in scalar context, for example by assigning to a scalar variable $x = <$fh>, perl will read from the file up-to and including the Input Record Separator which is, by default, the new-line \n.

What we did here is we assigned undef to $/. So the read-line operator will read the file up-till the first time it encounters undef in the file. That never happens so it reads till the end of the file. This is what is called slurp mode, because of the sound the file makes when we read it.

Perl: The Difference Between My and Local Variables

Cited from The difference between my and local

'local' temporarily changes the value of the variable, but only within the scope it exists in.

'my' creates a variable that does not appear in the symbol table, and does not exist outside of the scope that it appears in.

$::a refers to $a in the 'global' namespace.

use local when:
  • you want to amend a special Perl variable, eg $/ when reading in a file. my $/; throws a compile-time error

Perl Repetition Operator "x"

Cited from How can I repeat a string N times in Perl?

Binary "x" is the repetition operator. In scalar context or if the left operand is not enclosed in parentheses, it returns a string consisting of the left operand repeated the number of times specified by the right operand. In list context, if the left operand is enclosed in parentheses or is a list formed by "qw/STRING/", it repeats the list. If the right operand is zero or negative, it returns an empty string or an empty list, depending on the context.

use feature 'say';   # say() is only available once this feature is enabled
say '-' x 80;   # print row of dashes
my @ones = (1) x 80; # a list of 80 1's
@ones = (5) x @ones;        # set all elements to 5



Perl qw() Function

Cited from Using the Perl qw() function

Any non-alphanumeric, non-whitespace delimiter can be used to surround the qw() string argument.

The following are equivalent:
@names = qw(Kernighan Ritchie Pike);
@names = qw/Kernighan Ritchie Pike/;
@names = qw'Kernighan Ritchie Pike';
@names = qw{Kernighan Ritchie Pike};

No interpolation is possible in the string you pass to qw().

Thursday, 10 March 2016

Git Commands

Cited from Ry’s Git Tutorial

git --version

to turn a directory into a Git repository

cd [dirname]; git init

A .git directory stores all the tracking data for our repository.

An untracked file is one that is not under version control.

You should only track source files and omit anything that can be generated from those files.

git add command tells Git to add the file to the repository.

A snapshot represents the state of your project at a given point in time.

Git’s term for preparing a snapshot is staging; committing then records the staged snapshot in the project history.

The git status command will only show us uncommitted changes. To view our project history, git log.

To tell Git who we are,
git config --global user.name "Your Name"
git config --global user.email your.email@example.com

The --global flag tells Git to use this configuration as a default for all of your repositories. Omitting it lets you specify different user information for individual repositories.

Another useful option is to pass a filename to git log (git log filename) to display file-specific history.

git checkout <commit-id>
View a previous commit.

Tags are convenient references to official releases and other significant milestones in a software project. They let developers easily browse and check out important revisions. For example, we can now use the v1.0 tag to refer to the third commit instead of its random ID. To view a list of existing tags, execute git tag without any arguments.

git tag -a v1.0 -m "message"

Never make changes directly to a previous revision.

When using git revert, remember to specify the commit that you want to undo—not the stable commit that you want to return to. It helps to think of this command as saying “undo this commit” rather than “restore this version.”

In Git, a branch is an independent line of development.

The HEAD is Git’s internal way of indicating the snapshot that is currently checked out.

To create a new branch,
git branch branch-name

To checkout a branch,
git checkout branch-name

When the history of two branches diverges, a dedicated commit is required to combine the branches. This situation may also give rise to a merge conflict, which must be manually resolved before anything can be committed to the repository.

Conflicts occur when we try to merge branches that have edited the same content.
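
The notes above can be tied together in a small end-to-end sketch. It runs in a throwaway repository; the branch name (feature), file names and commit messages are illustrative and not from the tutorial. The two branches edit different files, so the merge needs a dedicated merge commit but raises no conflict.

```shell
# Scratch repository; feature, a.txt and b.txt are illustrative names.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Your Name"              # per-repository: no --global
git config user.email your.email@example.com

echo "first" > a.txt
git add a.txt                                 # stage the file
git commit -q -m "Add a.txt"                  # record the snapshot
main_branch=$(git symbolic-ref --short HEAD)  # name of the default branch

git branch feature                            # create a branch
git checkout -q feature                       # switch to it
echo "second" > b.txt
git add b.txt
git commit -q -m "Add b.txt on feature"

git checkout -q "$main_branch"
echo "third" >> a.txt
git add a.txt
git commit -q -m "Extend a.txt"

# The histories have diverged, so the merge creates a dedicated merge commit.
git merge -q -m "Merge feature" feature
commit_count=$(git --no-pager log --oneline | wc -l | tr -d ' ')
parent_count=$(git --no-pager log -1 --pretty=%p | wc -w | tr -d ' ')
echo "$commit_count commits; the merge commit has $parent_count parents"

cd - > /dev/null
rm -rf "$repo"
```

Had both branches edited the same lines of a.txt, git merge would instead have stopped with a conflict to resolve by hand before committing.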

###################################################################
 







Friday, 19 February 2016

R gtable Package (Top,Bottom,Left and Right Extent)

Cited from Index Position in the gtable

"tlrb" refers to the index position in the gtable (think of it as a matrix): t=2, b=5 means that the grob will be placed from the second to the fifth row (inclusive).

R gtable Wiki

Constructing a gtable

R Grid Coordinates

Cited from grid Graphics

Each viewport has a number of coordinate systems available. There are four main types: absolute coordinates (e.g., "inches", "cm") allow locations and sizes in terms of physical coordinates -- there is no dependence on the size of the page; normalised coordinates (e.g., "npc") allow locations and sizes as a proportion of the page size (or the current viewport); relative coordinates (i.e., "native") allow locations and sizes relative to a user-defined set of x- and y-ranges; referential coordinates (e.g., "strwidth") where locations and sizes are based on the size of some other graphical object.

"R Graphics" Book R Code

"R Graphics" R Code

R readPNG Function

readPNG (png package)

R Grid Package Introduction

Cited from the paper "Fun with the R Grid Package"

By default, the lower left corner of a viewport has coordinates (0, 0), and the upper right corner has coordinates (1, 1).

upViewport(2)

The argument in brackets determines the number of generations to move up the viewport tree.

The use of col=NA prevents the outlines from being drawn.

Setting clip="off" makes it possible to “spill” a graphic object outside the viewport region; clip="on" clips drawing to the viewport.

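Two ways to interact with a grob (graphic object):
Directly (the grob is drawn immediately),
grid."shape"(), e.g. grid.rect()

Indirectly (the grob is created but not drawn),
"shape"Grob(), e.g. rectGrob()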

If we want to draw a grob created by one of these functions, we can use the grid.draw() function. We can modify a grob by using the functions grid.edit() and editGrob().

The function gList() allows us to create a list of grobs. It facilitates the construction of several items in one plotting region together.

The function gTree() creates a tree-structure which can be used to organise the components of more complicated graphic objects. Such a tree-structure contains several grobs nested together. In a tree-structure, a grob can contain other grobs. The "children" argument specifies the components of the gTree. The children component is usually a list, constructed by gList.





Tuesday, 16 February 2016

Thursday, 11 February 2016

R: Environment and Frame

Cited from "R in a Nutshell"

"An environment is an R object that contains the set of symbols available in a given context, the objects associated with those symbols, and a pointer to a parent environment. The symbols and associated objects are called a frame."

"The parent environment of a function is the environment in which the function was created."

======================================
Cited from How R Searches and Finds Stuff




R: Rle or RleList Objects

Cited from http://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_Rle.html

The Rle (run length encoding) class in R is intended for representing genome-wide sequence coverage.

The Wig and BigWig files are used to store coverage data.

The run-length-encoded representation of a vector represents the vector as a set of distinct runs, each with its own value. This class is integrated in the IRanges package. A base class called "rle" implements much less functionality.

The runLength(), runValue() and as.numeric() functions each take an "Rle" class object.

RleList represents a list of Rles. It stores a genome wide coverage track where each element of the list is a different chromosome.

======================================
Cited from IRanges and GenomicRanges An introduction

aggregate() allows you to apply a function to the parts of an Rle that fall within the ranges of an IRanges object.

aggregate(Rle_object, IRange_object, FUN=func_name)







Wednesday, 10 February 2016

How to Check Whether a Folder Is Empty Using a Shell Script

Cited from

if [ -z "$(ls -A "$DIR" 2> /dev/null)" ];
then
    # The directory is empty (or does not exist or cannot be read)
    echo "$DIR is empty"
fi
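
As a sketch, the same test can be wrapped in a small function. The name is_empty_dir is hypothetical (it is not from the cited post), and the extra -d check is an assumption added here so that a missing path is not reported as empty.

```shell
# is_empty_dir is a hypothetical helper name, not from the cited post.
is_empty_dir() {
    # Succeeds only if $1 is a directory and ls -A lists nothing in it.
    [ -d "$1" ] && [ -z "$(ls -A "$1" 2> /dev/null)" ]
}

scratch=$(mktemp -d)
if is_empty_dir "$scratch"; then before="empty"; else before="not empty"; fi
touch "$scratch/.hidden"          # ls -A also lists dotfiles
if is_empty_dir "$scratch"; then after="empty"; else after="not empty"; fi
echo "before: $before, after: $after"
rm -r "$scratch"
```

Because ls -A includes hidden entries (but not . and ..), a directory holding only dotfiles counts as non-empty.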