Shortcuts in Science: June 2014

Monday 30 June 2014

Sunday 29 June 2014

R ggplot2 with correlaiton coefficient as a text

ggplot2: Adding Regression Line Equation and R2 on graph

Wednesday 25 June 2014

Simple Module Tutorial

What is @ISA array in perl?

Exporting symbols

Creating Perl Module

=================================================
# cited from the book "Perl Cookbook"

One of the differences between require and use is that use performs an implict import on the included module's package. An imported symbol no longer have to use the fully qualified by package name (or declared with our or the oder use vars if a variable, or with use subs if a subroutine).You can use importe dviarables as though they were part of your package. If you imported $English::OUTPUT_AUTOFLUSH in the current package, you could refer to it as $OUTPUT_AUTOFLUSH.

If the module name itself contains any double colons, these are translated into your system's directory separator. That means that the file File::Find module resides in the file File/Find.pm under most file systems.

# The following 2 lines need to be enclosed in each module.
use Exporter;
@ISA=("Exporter");

The Exporter module manages the module's public interface. The special, per-package array @ISA was initialized to contain the word "Exporter".

When a user says use Cards::Poker, Perl implicitly calls a special method, Cards::Poker->import( ). You don't have an import method in your package, but that's ok, because the Exporter package does, and you are inheriting it from it, because of the assignment to @ISA. Perl look at the package's @ISA for resolution of undefined methods.

@EXPORT = qw (&shuffle @card_deck);

The list ('&shuffle', '@card_deck') was assigned to the special, per-package array @EXPORT. When someone imports this module, variables and functions listed in that array are aliased into the caller's own package. That way they don't have to call the function Cards::Poke::shuffle(23) after the import. They can just write shuffle(23) instead. This won't happen if they load Cards::Poker with require Cards::Poker; only a use imports.

If the last evaluated expression in the module does not produce a true value, an exception will be raised. Therefore, the last of a module is a simple 1.

A library is a collection of loosely related functions designed to be used by other programs. It lacks the rigorous semantics of a Perl module. The file extension .pl indicates that it is a Perl library file.

Perl libraries -- or in fact, any arbitrary file with Perl code in it -- can be loaded in using do "file.pl" or with require "file.pl". The latter is preferred in most situations, because unlike do, require does implicit error checking.

A .ph is used for C header files that have been translated into Perl libraries using the h2ph tool. A .xs indicate an augmented C source file, possibly created by the h2xs tool, which will be complied by the xsubpp tool and your C compiler into native machine code.

An object-orientated module seldom uses the import-export mechanism at all. Instead, it provides an object-orientated interface full of constructors, destructors, methods, inheritance, and operator overloading.

Saturday 21 June 2014

Perl: the difference beween "our" and "my" variable declaration

http://stackoverflow.com/questions/845060/what-is-the-difference-between-my-and-our-in-perl

Thursday 19 June 2014

R: S4 class OOP

setClass(): Create a new class
setMethod(): Create a new method
setGeneric(): Create a new generic function
new(): Generate a new object for a given class
getClass(): Get the class denition
getMethod(): Get the method denition
getSlots(): Get the name and class of each slot
@: Get or replace the contents of a slot
validObject(): Test the validity of an object

Cited from http://www.pitt.edu/~njc23/Lecture4.pdf

An example of OOP in GNU R using S4 Classes

The S4 object system

Wednesday 18 June 2014

DiffReps

Perl Packages contained in the diffReps Collection

Perl Library: MyShortRead

MyShortRead::SRBed object:
sub new{
    my $class = shift;
    my $self = {};
    $self->{file_name} = undef;
    $self->{bin_size} = 1000;
    $self->{window_size} = 1000;
    $self->{frag_size} = 100;
    $self->{chr_list} = {};    # hash table: chrom -> ChromBed
    bless($self, $class);
    # User supplied bed file name.
    if(@_ >= 1) {$self->{file_name} = $_[0];}
    if(@_ >= 2) {$self->{bin_size} = $_[1];}
    if(@_ >= 3){$self->{frag_size} = $_[2];}
    return $self;
}

MyShortRead::ChromBed object:
sub new{
    my $class = shift;
    my $self = {};
    $self->{chrom_name} = undef;
    $self->{chrom_len} = undef;
    $self->{file_name} = undef;    # chromosome file name.
    $self->{bin_num} = undef;    # number of bins for the chromosome.
    $self->{bin_count} = [];    # vector for bin read count.
    $self->{bin_count_F} = [];    # bin Forward strand read count.
    $self->{bin_count_R} = [];    # bin Reverse strand read count.
    $self->{window_num} = undef;    # number of sliding windows.
    $self->{window_count} = [];    # vector for window read count.
    $self->{window_count_F} = [];    # window read count for Forward strand.
    $self->{window_count_R} = [];    # window read count for Reverse strand.
    bless($self, $class);
    # User supplied chromosome name, length and file name.
    if(@_ >= 3){
        $self->{chrom_name} = $_[0];
        $self->{chrom_len} = $_[1];
        $self->{file_name} = $_[2];
    }
    return $self;
}
my $signRegion = new diffReps::ChromaModSite;

diffReps::SlideWindow object:
sub new{
    my $class = shift;
    my $self = {
        chrom => undef,
        winN => undef,        # window number to retrieve count from SRBed class.
        maxN => undef,        # maximum window number.
        retrieved => 0,        # boolean tag: whether count data have been retrieved?
        rep_cnt => [],        # normalized replicate counts.
        rep_cnt_ => [],        # normalized treatment counts using #reads.
        brep_cnt => [],        # normalized background treatment counts.
        rsumRep => undef,    # sum of raw replicate counts.
        dirn => undef,        # change direction: 'Up' or 'Down'.
        logFC => undef,        # log2 fold change.
        pval => undef        # P-value.
    };
    bless $self, $class;
    return $self;
}
my $slideWindow = new diffReps::SlideWindow;

diffReps::RegionList object:
sub new{
    my $class = shift;
    my $self = {
        regList => [],    # region list.
        adjusted => 0    # bool tag for P-value adjustment.
    };
    bless $self, $class;
    return $self;
}
my $regList = new diffReps::RegionList;

WindowCoverage object defined in diffReps.pl:
sub new{
   die "Not enough arguments to set up window coverage class! Stop.\n" if @_ < 3;
   my($class,$winSize,$step) = @_;
   my $self = {
       movingCov => 0,
       windowCov => 0,
       winSize => $winSize,
       step => $step
   };
   bless $self, $class;
   return $self;
}

diffReps::ChromaModSite object:
sub new{
    my $class = shift;
    my $self = {
       chrom => undef,
       start => undef,
       end => undef,
       siteCenter => undef,
       retrieved => 0,       # boolean tag: whether count data have been retrieved?
       rep_cnt => [],       # region normalized replicate counts.
       rep_cnt_ => [],       # normalized replicate counts using #reads.
       brep_cnt => [],       # normalized background replicate counts.
       #repEnr => undef,       # replicate enrichment vs. background.
       rsumRep => undef,   # sum of raw replicate counts.
       dirn => undef,       # change direction: 'Up' or 'Down'.
       logFC => undef,       # log2 fold change.
       winSta => undef,   # best window start.
       #winRep => [],       # window normalized replicate counts.
       winFC => undef,       # window log2 fold change.
       winP => undef,       # window P-value.
       winQ => undef       # window adjusted P-value.
    };
    bless $self, $class;
    return $self;
}

Perl: parsing command line parameters with Getopt::Long

http://www.perlhowto.com/parsing_command_line_parameters_with_getopt_long

Linux: extract a range of rows from a text files

To extract row 1 and 2, use the following command:

sed -n 1~2p filename

Tuesday 17 June 2014

Whole-genome DNA sequencing: reference genome sequence for alignment

Use 1000 Genomes reference assembly (http://www.1000genomes.org/category/reference; "phase2_reference_assembly_sequence" leads to "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/"; download "hs37d5.fa.gz" ) to circumvent the possible unmapped contigs as illustrated below.

##### ERROR MESSAGE: Input files /project/jferrer/EL/Gencode/gatk/dbsnp_138.b37.vcf and reference have incompatible contigs: No overlapping contigs found.
##### ERROR ~/EL/Gencode/gatk/dbsnp_138.b37.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
##### ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, GL877870.2, GL877872.1, GL383535.1, JH159133.1, KB663606.1, KB021647.1, KE332497.1, JH159131.1, GL949745.1, JH720447.1, GL582968.1, JH720446.1, JH591181.2, JH720443.2, JH159134.2, JH636052.4, JH636053.3, JH636054.1, JH636056.1, JH636058.1, JH636057.1, KE332505.1, JH720451.1, JH720452.1, JH720453.1, JH720454.3, JH159136.1, JH806587.1, JH806588.1, JH806589.1, JH806590.2, JH806591.1, JH806592.1, JH806593.1, JH806594.1, JH806595.1, JH806596.1, JH806597.1, JH720448.1, JH806598.1, JH806599.1, JH806600.2, JH806601.1, JH806602.1, JH806573.1, JH806574.2, JH806575.1, JH806580.1, JH806583.1, JH806584.1, JH806585.1, JH806603.1, JH159150.3, GL582969.1, JH806577.1, JH806578.1, JH806579.1, JH159137.1, KE332502.1, KB021645.1, KB663607.2, KE332500.1, KE332496.1, GL383558.1, GL582976.1, GL383523.1, KE332498.1, GL949743.1, KE332506.1, GL383536.1, JH591184.1, JH636061.1, JH806576.1, GL383524.1, GL582973.1, JH159138.1, KB021648.1, JH159139.1, JH159140.1, JH591182.1, JH159132.1, JH720449.1, JH591183.1, JH720444.2, JH159141.2, KB663604.1, JH720455.1, KB021646.2, JH159142.2, JH159143.1, JH806582.2, JH159135.2, KE332499.1, GL877877.2, JH806586.1, GL582979.2, KB663605.1, GL582975.1, GL949744.1, GL383543.1, GL877871.1, GL582967.1, JH159149.1, GL582977.2, GL582970.1, GL383559.2, JH159144.1, JH591186.1, GL383560.1, GL339450.1, GL582971.1, GL582974.1, JH806581.1, JH636060.1, JH591185.1, JH159145.1, GL877873.1, KB663608.1, GL582972.1, KB663603.1, KE332495.1, JH636059.1, JH720445.1, KE332501.1, GL383561.2, GL949741.1, GL383562.1, GL383525.1, GL383544.1, GL383548.1, GL383537.1, GL383538.1, GL383516.1, GL383517.1, GL383545.1, GL383546.1, GL383547.1, GL877875.1, GL383549.1, GL383550.1, GL383551.1, GL877876.1, GL383552.1, GL383553.2, GL383554.1, GL383555.1, GL383556.1, GL383557.1, GL000258.1, GL383563.2, GL383564.1, GL383565.1, GL383566.1, JH159146.1, JH159147.1, JH159148.1, GL383567.1, GL383568.1, GL383569.1, GL383570.1, GL383571.1, GL383572.1, GL949746.1, GL949747.1, GL949748.1, GL949749.1, GL949750.1, GL949751.1, GL949752.1, GL949753.1, GL383573.1, GL383574.1, GL383575.2, GL383576.1, GL383518.1, GL383519.1, GL383520.1, GL383577.1, GL383578.1, GL383579.1, GL383580.1, GL383581.1, GL383582.2, GL383583.1, KB663609.1, GL383521.1, GL383522.1, GL582966.2, JH636055.1, GL383526.1, GL000257.1, GL383527.1, GL383528.1, GL383529.1, GL339449.2, GL383530.1, GL383531.1, GL383532.1, GL949742.1, GL383533.1, KB021644.1, GL000250.1, GL000251.1, GL000252.1, GL000253.1, GL000254.1, GL000255.1, GL000256.1, GL383534.2, GL383539.1, GL383540.1, GL383541.1, GL383542.1, GL000191.1, GL000192.1, GL000193.1, GL000194.1, GL000195.1, GL000196.1, GL000197.1, GL000198.1, GL000199.1, GL000200.1, GL000201.1, GL000202.1, GL000203.1, GL000204.1, GL000205.1, GL000206.1, GL000207.1, GL000208.1, GL000209.1, GL000210.1, GL000211.1, GL000212.1, GL000213.1, GL000214.1, GL000215.1, GL000216.1, GL000217.1, GL000218.1, GL000219.1, GL000220.1, GL000221.1, GL000222.1, GL000223.1, GL000224.1, GL000225.1, GL000226.1, GL000227.1, GL000228.1, GL000229.1, GL000230.1, GL000231.1, GL000232.1, GL000233.1, GL000234.1, GL000235.1, GL000236.1, GL000237.1, GL000238.1, GL000239.1, GL000240.1, GL000241.1, GL000242.1, GL000243.1, GL000244.1, GL000245.1, GL000246.1, GL000247.1, GL000248.1, GL000249.1]

Monday 16 June 2014

GATK: need to run a step with FixMateInformation after realignment step?

The indel realigner fixes mates itself, so the file is valid after realignment. A previous version of the realigner did require you to run two passes, but that was before the magic code was written.

Cited from http://gatkforums.broadinstitute.org/discussion/1562/need-to-run-a-step-with-fixmateinformation-after-realignment-step

Selections

Cited from "Human Evolutionary Genetics: Origins, Peoples&Disease"

Selections can occur at
a) survival into reproductive age - viability and mortaility;
b) success in attracting a mate - sexual selection;
c) ability to fertilize - fertility and gamete selection;
d) number of progeny - fecundity.

The sum of these is the ability of an individual genotype to survive and reproduce, its fitness, which which partly dependent on the environment.

The relative fitness of a genotype compared to other genotypes competing for the same resources is measured by a selection coefficient (s), which compares a genotype to the fittest genotype in the population. A selection coefficient of 0.1 represents a 10% decrease in fitness compared to the fittest genotype.

Mutation that reduce the fitness of the carrier are subject to negative selection, also known as purifying selection, whereas mutations that increase fitness undergo positive selection.

To understand the dynamics of selection at diploid loci, we must consider the impact of mutants on the fitness of the genotypes, and not on the individual alleles.

Codominant selection: a novel deleterious allele will be eliminated more rapidly from the population if it reduces the fitness of a heterozygote as well as the homozygote.

Overdominant selection: a new allele may increase the fitness of a heterozygote relative to that of both homozygotes. The two homozygous genotypes may exhibit different reduction in fitness (s1 and s2). This creates a balanced polymorphism.

Underdominant selection: new alleles reduce the fitness of the heterozygote alone.

Overdominant selection is no the only mechanism by which balanced polymorphisms can be generated, but is one of a number of processes described collectively as balancing selection. One example of an alternative mechanism for balancing selection is that of frequency-dependent selection., whereby the the frequency of a genotype determines its fitness. If a genotype has a higher fitness at low frequencies but lower fitness at higher frequencies, an intermediate equilibrium value will be reached over time.

Neutral theory of molecular evolution: The theory that the majority of mutations do not influence the fitness of their carriers.

Neutrality test: A statistical method for exploring whether observed genetic diversity is compatible with neutral evolutionary processes.

The power of neutrality tests to detect selection depends on, amongst other factors, the characteristic of the selective regime:
a) the type of selection operating;
b) the strength of selection;
c) the period during which selection occurred or is occurring.
The results of a neutrality test will often depend on whether the region being analyzed is itself under selection or is linked to a region under selection.

GATK: download dbsnp_138.b37.vcf.gz

The login information is provided by http://www.broadinstitute.org/gatk/guide/article?id=1215.

"ftp ftp.broadinstitute.org", and at prompt enter the username "gsapubftp-anonymous".

At the "/bundle/2.8/b37" directory, use "get dbsnp_138.b37.vcf.gz" to download the latest database of known polymorphic sites to the current local working directory.

This file is used in BaseRecalibrator to supply the parameter "-knownSites" (full details available from http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html).

List of RNA-seq bioinformatics tools

http://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools

Tool to find out if fastq is in Sanger or phred64 encoding

https://www.biostars.org/p/63225/

Sunday 15 June 2014

Histone marks

Activating (H3K4me3, H3K4me1, H3K27ac) and repressive histone marks (H3K27me3).

H3K4me3, a hallmark of active promoters (Hon et al. 2009). Two states at active proximal regions are further differentiated by the presence or absence of the H3K27me3 histone modification as (1) active promoters (no H3K27me3) and (2) bivalent promoters (with H3K27me3).

Cited from "tissue-specific SMARCA4 binding at active and repressed regulatory elements during embryogenesis" doi: 10.1101/gr.168930.113

===================================
Cited from the paper "The diverse functions of Dot1 and H3K79 methylation"

"One of the covalent histone modifications is methylation, which takes place on both lysine (K) and arginine (R) residues (Zhang and Reinberg 2001; Martin and Zhang 2005). Lysine methylation exists in mono, di, and tri states, while arginine methylation only occurs in mono and di states. One way that these methyl marks contribute to transcriptional regulation is to serve as a platform for the recruitment of effector proteins. The well-studied lysine methylation residues include K4, K9, K27, K36, and K79 of histone H3, and K20 of histone H4. In general, methylation at H3K9, H3K27, and H4K20 correlates with transcriptional repression, while methylation at H3K4, H3K36, and H3K79 correlates with gene transcription (Kouzarides 2002; Peterson and Laniel 2004; Martin and Zhang 2005). In addition to its role in transcriptional regulation, methylation has also been linked to X inactivation, cell fate determination, terminal differentiation, and the spatiotemporal patterning of Hox genes (Cavalli 2006; Minard et al. 2009). Moreover, aberrant histone methylation has also been linked to various human cancers (Feinberg et al. 2002; Handel et al. 2010)."

===================================
"Replacement of canonical histones for H3.3/H2A.Z must be dependent on the recruitment of specific histone chaperones, perhaps by DNA sequence-specific bound transcription factors."

Enhancer function: new insights into the regulation of tissue-specific gene expression

H3K27 Modification

"Embryonic stem cells reflect an early stage of mammalian development and possess a unique chromatin landscape in which 'bivalent' domains of trimethylated histone H3 Lys4 (H3K4me3) and Lys27 (H3K27me3) mark key lineage-specific genes1, 2. Histone H3K4me3 and H3K27me3 are normally associated with active and repressed genes, respectively; however, the promoters of the genes carrying both of these marks simultaneously, 'bivalent promoters', are transcribed at very low levels and are considered to be poised for activation by developmental signals3, 4."

The Mll2 branch of the COMPASS family regulates bivalent promoters in mouse embryonic stem cells

Promoter-proximal and distal regulatory elements

Regions within 1 kb of an annotated RefSeq or UCSC transcription start site (TSS) were classified
as promoter-proximal and the remaining sites as distal.

Cited from "tissue-specific SMARCA4 binding at active and repressed regulatory elements during embryogenesis" doi: 10.1101/gr.168930.113

Saturday 14 June 2014

Evolutionary Quantitative Genetics: Doc88

http://www.doc88.com/p-273181819919.html

Cufflinks

-g will assemble novel transcripts.
-G only considers transcripts defined in the GTF.

find command

Use: find . -type f -size +4096c
to find files bigger than 4096 bytes.

Use : find . -type f -size -4096c
to find files smaller than 4096 bytes.

Use : find . -type f -size 4096c
to find files exactly equal to 4096 bytes.

The -size switch explained: -size n[cwbkMG]. File uses n units of space. The following suffixes can be used:

`b' for 512-byte blocks (this is the default if no suffix is used);

`c' for bytes

`w' for two-byte words

`k' for Kilobytes (units of 1024 bytes)

`M' for Megabytes (units of 1048576 bytes)

`G' for Gigabytes (units of 1073741824 bytes)

The size does not count indirect blocks, but it does count blocks in sparse files that are not actually allocated. Bear in mind that the `%k' and `%b' format specifiers of -printf handle sparse files differently. The `b' suffix always denotes 512-byte blocks and never 1 Kilobyte blocks, which is different to the behaviour of -ls.

Cited from http://superuser.com/questions/204564/how-can-i-find-files-that-are-bigger-smaller-than-x-bytes

Friday 13 June 2014

GATK error: appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of \d+

When looking at the GATK forums, it looks like there are at least 2 likely solutions:

1) Add "--fix_misencoded_quality_scores -fixMisencodedQuals" to the commands (see http://gatkforums.broadinstitute.org/discussion/1991/version-highlights-for-gatk-version-2-3)

2) Add "-allowPotentiallyMisencodedQuals" to the commands (see http://gatkforums.broadinstitute.org/discussion/2500/depthofcoverage-producing-extremely-high-quality-score-error or http://gatkforums.broadinstitute.org/discussion/2335/realignment-with-high-base-qualities)

Cited from https://www.biostars.org/p/94637/

Sort fastq file by identifiers

http://edwards.sdsu.edu/labsite/index.php/robert/399-sorting-fastq-files-by-their-sequence-identifiers

Alternatively,

Pesort – Paired-End Sorter
Pesort takes one or two input FASTA or FASTQ files containing paired end reads (or single reads) and sort them so that the read pairs occur in the same order. It also sorts out which reads that don’t have a pair and outputs them to a separate file.
Cited from http://microbiology.se/software/petkit/

Or,
https://www.biostars.org/p/12576/

Thursday 12 June 2014

Sort

A key specification like -k2 means to take all the fields from 2 to the end of the line into account. The first comparison in sort -k2 -k1 is enough to discriminate these two lines, and the second sort key -k1 is not invoked.

To sort by a single column, use -k2,2 as the key specification. This means to use the fields from #2 to #2, i.e. only the second field.

sort -k2 -k3 <txt> is redundant: it's equivalent to sort -k2 <txt>. Run sort -k2,2 -k1,1 <txt>.

Cited from http://unix.stackexchange.com/questions/52762/trying-to-sort-on-two-fields-second-then-first

#######################################
Sort Files Like A Master With The Linux Sort Command (Bash)

#######################################
sort manual

bedGraphToBigWig usage

bedGraphToBigWig v 4 - Convert a bedGraph file to bigWig format.
usage:
   bedGraphToBigWig in.bedGraph chrom.sizes out.bw
where in.bedGraph is a four column file in the format:
      <chrom> <start> <end> <value>
and chrom.sizes is two column: <chromosome name> <size in bases>
and out.bw is the output indexed big wig file.
Use the script: fetchChromSizes to obtain the actual chrom.sizes information
from UCSC, please do not make up a chrom sizes from your own information.
The input bedGraph file must be sorted, use the unix sort command:
sort -k1,1 -k2,2n unsorted.bedGraph > sorted.bedGraph
options:
   -blockSize=N - Number of items to bundle in r-tree. Default 256
   -itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024
   -unc - If set, do not use compression.

Bedtools: genome file

What is a “genome” file?

Some of the BEDTools (e.g., genomeCoverageBed, complementBed, slopBed) need to know the size of the chromosomes for the organism for which your BED files are based. When using the UCSC Genome Browser, Ensemble, or Galaxy, you typically indicate which species / genome build you are working.

The way you do this for BEDTools is to create a “genome” file, which simply lists the names of the
chromosomes (or scaffolds, etc.) and their size (in basepairs).

Genome files must be tab-delimitedand are structured as follows (this is an example for C. elegans):
chrI 15072421
chrII 15279323
...
chrX 17718854
chrM 13794

BEDTools includes predefined genome files for human and mouse in the /genomesdirectory included
in the BEDTools distribution. Additionally, the “chromInfo” files/tables available from the UCSC
Genome Browser website are acceptable. For example, one can download the hg19 chromInfo file here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz

Cited from http://bedtools.googlecode.com/files/BEDTools-User-Manual.v3.pdf

Recursive file finder algorithm

import java.io.File;

public class Filewalker {

   public void walk( String path ) {

       File root = new File( path );
       File[] list = root.listFiles();

       if (list == null) return;

       for ( File f : list ) {
           if ( f.isDirectory() ) {
               walk( f.getAbsolutePath() );
               System.out.println( "Dir:" + f.getAbsoluteFile() );
           }
           else {
               System.out.println( "File:" + f.getAbsoluteFile() );
           }
       }
   }

   public static void main(String[] args) {
       Filewalker fw = new Filewalker();
       fw.walk("c:\\" );
   }
}

Cited from http://stackoverflow.com/questions/2056221/recursively-list-files-in-java

BWA and GATK: read groups

"My understanding is that a read group means, roughly, "a set of reads that were all the product of a single sequencing run on one lane". If you have multiplexed samples in a single lane, you will get multiple samples in a single read group. If you sequenced the same sample in several lanes, you will have multiple read groups for the same sample."

Cited from https://www.biostars.org/p/43897/

The meaning of the standard read group fields can be found on http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files

Compose the read group identifier in the following format:
@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
where the \t stands for the tab character.

Cited from https://www.broadinstitute.org/gatk/guide/tagged?tag=bwa

Example of Read Group usage
Support we have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an illumina hiseq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, we would create 12 BAM files, with the following @RG fields in the header:
Dad's data:
@RG ID:FLOWCELL1.LANE1 PL:illumina LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:illumina LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:illumina LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:illumina LB:LIB-DAD-2 SM:DAD PI:400
Mom's data:
@RG ID:FLOWCELL1.LANE5 PL:illumina LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:illumina LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:illumina LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:illumina LB:LIB-MOM-2 SM:MOM PI:400
Kid's data:
@RG ID:FLOWCELL2.LANE1 PL:illumina LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE2 PL:illumina LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE3 PL:illumina LB:LIB-KID-2 SM:KID PI:400
@RG ID:FLOWCELL2.LANE4 PL:illumina LB:LIB-KID-2 SM:KID PI:400

Cited from http://toolshed.g2.bx.psu.edu/repository/display_tool?repository_id=c45d6c51a4fcfc6c&tool_config=database%2Fcommunity_files%2F000%2Frepo_259%2Fpicard_AddOrReplaceReadGroups.xml&changeset_revision=bf1c3f9f8282

Compose the read group identifier in the following format:
@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
where the \t stands for the tab character.
Cited from https://www.broadinstitute.org/gatk/guide/tagged?tag=bwa

Naming conventions of Illumina sequence reads and fastq files

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R the unique instrument name
6 flowcell lane
73 tile number within the flowcell lane
941 'x'-coordinate of the cluster within the tile
1973 'y'-coordinate of the cluster within the tile
#0 index number for a multiplexed sample (0 for no indexing)
/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)

Cited from http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers

The naming conventions of fastq files are inconsistent across different sources.

According to http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm

Illumina FASTQ files use the following naming scheme:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits>.fastq.gz

For example, the following is a valid FASTQ file name:

NA10831_ATCACG_L002_R1_001.fastq.gz

However, according to http://physiology.med.cornell.edu/faculty/mason/lab/r-make/usage.html

The Illumina fastq files are also encoded as
<sample name>_<flowcell id>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number >_<set number (0-padded to 3 digits>.fastq.gz

Wednesday 11 June 2014

GATK reference preparation

http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference

Java virtual machine parameters

-64
The -d64 option means to use the 64-bit version of the JVM. The JVM has three basic configurations: -client, -server (both for 32-bit JVMs) and -d64. In order to use -d64, you have to separately install the 64-bit JVM for your platform (see, e.g., http://java.sun.com/javase/6/webnotes/install/system-configurations.html for supported platforms; the option itself is documented in the online man pages for the java tool).
Cited from https://www.java.net/node/674019

-Xms and -Xmx
The flag Xmx specifies maximum memory allocation pool for a Java Virtual Machine (JVM), while Xms is specifies the initial memory allocation pool.

That means your JVM will be started with Xms amount of memory and will be able to use a maximum of Xmx amount of memory.

For example, starting a JVM like so will start it with 256MB of memory, and will allow the process to use up to 2048MB of memory: java -Xmx2048m -Xms256m

Cited from http://stackoverflow.com/questions/14763079/what-are-the-xms-and-xmx-parameters-when-starting-jvms

Tuesday 10 June 2014

GATK best practice analysis software download and installation

http://www.broadinstitute.org/gatk/guide/tagged?tag=install

NGS analysis tools: overview

http://drduanehassane.com/blog/sequencing-resources

Cufflinks input SAM/BAM file: sorted or not?

The SAM file supplied to Cufflinks must be sorted by reference position. If you aligned your reads with TopHat, your alignments will be properly sorted already. If you used another tool, you may want to make sure they are properly sorted.

BWA and GATK analysis pipeline

http://www.broadinstitute.org/gatk/guide/best-practices#dnaseq-ovw-ovw

https://wikis.utexas.edu/display/bioiteam/Variant+calling+with+GATK

BWA: whole genome sequencing

Create the index for the reference genome

bwa index -a bwtsw reference.fa

Usage:   bwa index [-a bwtsw|is] [-c] <in.fasta>

Options: -a STR    BWT construction algorithm: bwtsw or is [auto]
         -p STR    prefix of the index [same as fasta name]
         -6        index files named as <in.fasta>.64.* instead of <in.fasta>.*

Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
         `-a div' do not work not for long genomes. Please choose `-a'
         according to the length of the genome.

Cited from http://gatkforums.broadinstitute.org/discussion/2798/howto-prepare-a-reference-for-use-with-bwa-and-gatk;
http://icb.med.cornell.edu/wiki/index.php/Elementolab/BWA_tutorial

Mapping short reads to the reference genome
bwa mem ref.fa reads.fq > aln-se.sam
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam

Input files can be gz compressed.

Cited from http://bio-bwa.sourceforge.net/bwa.shtml

Detect if variable is an array

$ v=(foo bar) # v is an array

$ declare -p v | grep -q '^declare \-a' && echo array || echo no array
array

$ v='declare -a' # v is a string

$ declare -p v | grep -q '^declare \-a' && echo array || echo no array
no array

$ declare -p non-existing-var | grep -q '^declare \-a' && echo array || echo no array
bash: declare: non-existing-var: not found
no array

$ # The `2> dev/null' protects against the `declare: not found' error message above

$ declare -p non-existing-var 2> /dev/null | grep -q '^declare \-a' && echo array || echo no array
no array

Cited from http://www.fvue.nl/wiki/Bash:_Detect_if_variable_is_an_array

Sunday 8 June 2014

Plant: Peperomia rotundifolia

Thursday 5 June 2014

Genome-locus-based sequence alignment counts

http://crazyhottommy.blogspot.co.uk/search/label/count%20reads%20at%20features

ChIP-seq: normalisation of signals

To normalize for differences in sequencing depth among different experiments, the number of tags per genomic position in each ChIP-Seq library was first rescaled by the total number of mapped tags.

Cited from http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001442#s4

This is done as
RPM(reads per million)=number of reads aligning to a position per million aligned reads.

"As the signal-to-noise ratio in comparative ChIP-seq experiments may differ, we recommend following one of two strategies. First, if samples from different experimental conditions are compared and the same antibody is used32, we recommend performing the series of experiments side by side as this minimizes differences in signal-to-noise ratios due to experimental variability. By using this strategy, the ChIP-seq data do not need to be normalized to each other (other than by the total library read count) and differences in overall binding enrichments can be detected. Second, if this strategy is not possible because different species or antibodies are used, quantile normalization can be used to adjust for differences in signal-to-noise ratios between samples if one can reasonably assume that the overall binding of the factor is similar (e.g., if the factor is well conserved and is expressed at similar levels in the same tissue across species). If this assumption is not justified, it is still possible to identify qualitative differences between samples while being aware that conclusions on the overall binding strength cannot be made. "

Cited from A computational pipeline for comparative ChIP-seq analyse

align2rawsignal (aka. WIGGLER .. because it generates wiggle files) reads in a set of tagAlign/BAM files, filters out multi-mapping tags and creates a consolidated genome-wide signal file using various tag-shift and smoothing parameters as well as various normalization schemes

The method accounts for the following:
1） depth of sequencing
2） mappabilty of the genome (based on read length and ambiguous bases)
3） differentiates between positions that shown 0 signal simply because they are unmappable vs positions that are mappable by have no reads. The former are not represented in the output wiggle or bedgraph files while the latter are represented as 0s.
4) different tag shifts for the different datasets being combined

Signal values generated by Wiggler should NOT be used as-is for the foll. applications directly

1) In general, if you are comparing signal values across multiple experiments, you should not use signal values from wiggler directly. This is because even though wiggler normalizes for sequencing depth, mappability etc. it cannot account for basic data quality (signal-to-noise ratio) differences between experiments. You will want to use some type of explicit cross-experiment normalization.

2) For differential analysis across experiments (e.g. you have ChIP-seq data for a TF in 2 conditions), it is much better to use statistical methods explicitly designed to capture this. e.g. DE-seq, EdgeR or other differential analysis methods. If you are using norm=5 (fold-change) as the normalization method withing Wiggler, it is recommended to use an asinh() or log() transform on the values for statistical correlative analyses.

Cited from https://code.google.com/p/align2rawsignal/

Normalization by sequencing depth (i.e. total read count) is probably the simplest approach but is widely used. There are some drawbacks with this approach. For example, if the signal-to-noise ratio is very different between two libraries, one library is going to contain more background reads than the other. However, these background reads are taken into consideration when you calculate total read counts. This will certainly cause bias in your estimation.

In diffReps, a better approach is taken to do normalization. Basically, the low count regions are first removed from consideration, then a normalization ratio is calculated for each library and each of the regions left. Finally, the medians of the ratios are used as normalization factors. This way, a relatively unbiased, robust estimate can be used for normalization.

Cited from How To Do Normalization Of Two Chip-Seq Data

Other relevant articles:
A signal–noise model for significance analysis of ChIP-seq with negative control
MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets

Wednesday 4 June 2014

The usage of chown and chgrp

sudo chown -R username:group directory
will change ownership (both user and group) of all files and directories inside of directory and directory itself.

sudo chown username:group directory
will only change the permission of the folder directory but will leave the files and folders inside the directory alone.

chown user: file (Note the left-out group), it will use the default group for that user.

Cited from http://askubuntu.com/questions/6723/change-folder-permissions-and-ownership

You can change the group ownership of a file or directory with the command:

chgrp group_name file/directory_name You must be a member of the group to which you are changing ownership to.

Cited from http://unixhelp.ed.ac.uk/tasks/change_own.html

Tuesday 3 June 2014

Run a shell script in background

for i in `seq 0 15`;
do
nohup ./getNCBI.sh $i 1>out.$i 2>err.$i &
done

PBS cluster: job dependencies

https://docs.loni.org/wiki/PBS_Job_Chains_and_Dependencies

http://stackoverflow.com/questions/18442224/wait-for-all-jobs-of-a-user-to-finish-before-submitting-subsequent-jobs-to-a-pbs?lq=1

Installing Latex in Ubuntu

Cited from http://linuxandfriends.com/install-latex-in-ubuntu-linux/

In Ubuntu Linux, you have to first install LaTeX to use it. This is how LaTeX is installed in Ubuntu.$ sudo apt-get install texlive

The above command will install a basic subset of TeX Live’s functionality. To install all the packages in the LaTeX distribution, you have to run the following command.$ sudo apt-get install texlive-full

Gedit has a plugin for LaTeX which converts Gedit into a LaTeX editor. You can install the Gedit LaTeX plugin as follows :
$ sudo apt-get install gedit-latex-plugin

To install additional latex packages such as Prosper
$ sudo apt-get install Prosper

ChIP-seq: strand shifts

Cited from the book "Tag-based Next Generation Sequencing" page 146-147.

For single-end ChIP-seq, there should be an enrichment of sequence reads on the positive and negative strand flanking the location of a true protein interaction site. The sequence reads from the positive and negative strands can be shifted or extended to construct a combined read density profile.

In a single-end ChIP-seq experiment, sequence reads from the positive and negative strands are usually shifted or extended towards its 3' end by a certain number of bases (usually half the estimated average fragment length) such that the location with the highest read density corresponds to the center of the ChIP protected region. One way to estimate the optimal "shift distance" is by analyzing the strand cross-correlation profile.

Installing picard-tools

To install picard-tools:
sudo apt-get install picard-tools

To uninstall picard-tools:
sudo apt-get remove picard-tools

Install Oracle Java8

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Cited from http://www.enqlu.com/2014/03/how-to-install-oracle-java-78-jdk-and.html

Monday 2 June 2014

Conversion of bed to bigwig

For me wiggle representation is generally for the coverage, so I'll just compute the coverage first using bedtools utility as genomeCoverageBed –i file.bed -bg –g my.genome > sample.cov, where file.bed can be obtained using bamToBed utility bamToBed -i file.bam > file.bed and then would use bedGraphToBigWig utility for ucsc to convert the coverage file to bigwig, which is better than wig as you can place on the local server and pass the text file containing links of these bigwigs to collaborators etc. for viewing with UCSC and its safe as only you having rights can change the file content. Another option would be using pyicos convert tool as pyicos convert my_experiment.bam my_experiment.wig -f SAM -F variable_wig -O, this will give you wig directly.

Cited from https://www.biostars.org/p/2699/

bedGraphToBigWig is downloadable from http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.v287/

wigCorrelate

Download:

wigCorrelate is downloadable from https://github.com/adamlabadorf/ucsc_tools/blob/master/executables/wigCorrelate

wigCorrelate needs to be made executable for calling it.

Principle:

wigCorrelate, finds the overlapping features between two wig and calculate base-to-base Pearson's Coefficient.

Cited from https://www.biostars.org/p/85006/

wigCorrelate - Produce a table that correlates all pairs of wigs.
usage:
wigCorrelate one.wig two.wig ... n.wig
This works on bigWig as well as wig files.
The output is to stdout
options:
-clampMax=N - values larger than this are clipped to this value

It works by finding items that overlap in the different wig files.
Within the overlap it considers each base a separate
observation and calculates Pearson's R based on that. Here's two
relevant snippets of the code from lib/correlate.c

void correlateNext(struct correlate c, double x, double y)
/ Add next sample to correlation. */ {
c->sumX += x;
c->sumXX += x*x;
c->sumXY += x*y;
c->sumY += y;
c->sumYY += y*y;
c->n += 1;
}

double correlateResult(struct correlate c)
/ Returns correlation (aka R) */ {
double r = 0;
if (c->n > 0) {
double sp = c->sumXY - c->sumX*c->sumY/c->n;
double ssx = c->sumXX - c->sumX*c->sumX/c->n;
double ssy = c->sumYY - c->sumY*c->sumY/c->n;
double q = ssx*ssy;
if (q != 0)
r = sp/sqrt(q);
}
return r;
}

It has a little optimization that lets it work faster than this when it
knows it has a multiple-base window where x and y are constant that
keeps the base-by-base approach from actually getting expensive when
it's not really needed.

Unfortunately it doesn't do anything particularly sensible with the
regions where there is data in one wig but not another. It's treated the
same as something that wasn't covered by either wig.

Cited from http://redmine.soe.ucsc.edu/forum/index.php?t=msg&goto=4215&S=793c2bd15fc7a2af0bf8351cb46e32b4

Install XML and RCurl packages

The installation of XML and RCurl packages using install.packages() function does not work.
The following commands at the terminal are effective.

sudo apt-get update sudo apt-get install libxml2-dev
sudo apt-get install r-cran-xml
sudo apt-get install r-cran-rcurl