Bootstrap:
Re-sampling with replacement from the original random sample. Each resample is the same size as the original random sample.
The original sample represents the population from which it was drawn, so resamples from this sample represent what we would get if we took many samples from the population. The bootstrap distribution of a statistic, based on many resamples, represents the sampling distribution of the statistic, based on many samples.
The bootstrap standard error of a statistic is the standard deviation of the bootstrap distribution of that statistic.
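As a rough illustration of these definitions, the following awk sketch estimates the bootstrap standard error of the sample mean; the file name values.txt, the single-column layout and the 1000-resample count are assumptions for the example:
awk 'BEGIN { srand() }
     { x[NR] = $1 }
     END {
       B = 1000                            # number of bootstrap resamples (assumed)
       for (b = 1; b <= B; b++) {
         s = 0
         for (i = 1; i <= NR; i++)         # resample with replacement, same size as the original sample
           s += x[int(rand() * NR) + 1]
         m = s / NR                        # mean of this resample
         sum += m; sumsq += m * m
       }
       mean = sum / B                      # the bootstrap SE is the SD of the bootstrap distribution of the mean
       printf "bootstrap SE of the mean = %.4f\n", sqrt(sumsq / B - mean * mean)
     }' values.txt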
Saturday, 31 May 2014
Statistics: Central Limit Theorem and Law of Large Numbers
The Central Limit Theorem describes the characteristics of the "population of the means" which has been created from the means of an infinite number of random population samples of size (N), all of them drawn from a given "parent population". The Central Limit Theorem predicts that regardless of the distribution of the parent population:
[1] The mean of the population of means is always equal to the mean of the parent population from which the population samples were drawn.
[2] The standard deviation of the population of means is always equal to the standard deviation of the parent population divided by the square root of the sample size (N).
[3] [And the most amazing part!!] The distribution of means will increasingly approximate a normal distribution as the size N of samples increases.
Cited from http://www.chem.uoa.gr/applets/appletcentrallimit/appl_centrallimit2.html
The law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
Cited from http://en.wikipedia.org/wiki/Law_of_large_numbers
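A quick way to see the law of large numbers at work is to average many draws from awk's rand(), which is uniform on [0,1) with expected value 0.5; this is only a sketch, and the seed and the number of draws are arbitrary choices:
awk 'BEGIN { srand(42); for (n = 1; n <= 100000; n++) { sum += rand(); if (n % 20000 == 0) printf "n = %d, running mean = %.4f (expected value 0.5)\n", n, sum / n } }'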
Ubuntu: install openssh
Use the following commands to install the utility that enables SSH/WinSCP access:
sudo apt-get update
sudo apt-get install openssh-server
Details can be found at http://askubuntu.com/questions/51925/how-do-i-configure-a-new-ubuntu-installation-to-accept-ssh-connections
Ubuntu: set up ssh/WinSCP with no passphrase for login
Enter the following command at the command prompt:
ssh-keygen -t rsa
At the prompt for entering file in which to save the key, press "Enter".
At the prompt for entering passphrase, press "Enter".
At the prompt for entering same passphrase again, press "Enter".
Screen output: "Your identification has been saved in xxx", "Your public key has been saved in xxx".
Type the following command:
cat ~/.ssh/id_rsa.pub >>~/.ssh/authorized_keys
Download ~/.ssh/id_rsa to the local machine.
Open PuTTYgen, use "Load" to load an existing private key file, and save the private key.
In PuTTY, load a session or set up a new session by filling in the server address and session name, then go to Connection -> SSH -> Auth and browse to select the .ppk file under "Private key file for authentication".
In WinSCP, load the private key under "Private key file" and save the session.
In Linux, download the private key id_rsa from the server to ~/.ssh/ on the local machine, and use the command "ssh-add" to add the private key identity to the authentication agent.
To save typing the user name and the host name, append the following lines to ~/.ssh/config on the local machine and save:
Host [abbreviated name]
HostName [dev.example.com]
Port 22000 (optional; can be omitted)
User [fooey]
Then it is done!
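For example, with a hypothetical entry like the one below in ~/.ssh/config (the alias dev, the host name and the user are placeholders), the long form ssh -p 22000 fooey@dev.example.com shortens to ssh dev, and scp picks up the same settings:
Host dev
    HostName dev.example.com
    Port 22000
    User fooey
ssh dev
scp localfile dev:/tmp/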
Ubuntu: enable the access of VNC Viewer to the machine
Configure the correct settings in "Desktop Sharing".
In dconf-editor:
dconf > org > gnome > desktop > remote-access > require-encryption - uncheck
dconf > org > gnome > desktop > remote-access > vnc-password - set the correct password
More details are available from http://discourse.ubuntu.com/t/remote-desktop-sharing-in-ubuntu-14-04/1640/3
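The same two keys can also be set from a terminal with the dconf command-line tool; this is only a sketch assuming the remote-access path above, the password value is a placeholder, and Vino expects it base64-encoded:
dconf write /org/gnome/desktop/remote-access/require-encryption false
dconf write /org/gnome/desktop/remote-access/vnc-password "'$(echo -n 'mypassword' | base64)'"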
Ubuntu: dconf
dconf is a low-level configuration system. Its main purpose is to provide a backend to GSettings on platforms that don't already have configuration storage systems.
Cited from https://wiki.gnome.org/action/show/Projects/dconf?action=show&redirect=dconf#Introduction
dconf is the GNOME technology used to store application settings, and it is not specific to Ubuntu.
To install dconf-editor,
sudo apt-get install dconf-tools
To run it, enter the following in the terminal or in the quick-launch menu:
dconf-editor
See http://askubuntu.com/questions/22313/what-is-dconf-what-is-its-function-and-how-do-i-use-it for how to use dconf.
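Besides the graphical dconf-editor, the dconf command-line tool can list, read and watch keys directly; a minimal sketch using the remote-access path from the previous post:
dconf list /org/gnome/desktop/remote-access/
dconf read /org/gnome/desktop/remote-access/require-encryption
dconf watch /     # print every settings change as it happens; stop with Ctrl-C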
Wednesday, 28 May 2014
STAR: ultrafast universal RNA-seq aligner
To align single-end reads:
star --runThreadN 12 --outFilterMultimapNmax 1 --outFilterMismatchNmax 10 --outSAMstrandField intronMotif --genomeLoad LoadAndKeep --outSAMattributes All --genomeDir $TMPDIR/star --readFilesCommand zcat --readFilesIn $TMPDIR/${fq_id}_1.fastq.gz --outFileNamePrefix $TMPDIR/$fq_id/ 2>$TMPDIR/$fq_id/$fq_id.log
To align paired-end reads:
star --runThreadN 12 --outFilterMultimapNmax 1 --outFilterMismatchNmax 10 --outSAMstrandField intronMotif --genomeLoad LoadAndKeep --outSAMattributes All --genomeDir $TMPDIR/star --readFilesCommand zcat --readFilesIn $TMPDIR/${fq_id}_1.fastq.gz $TMPDIR/${fq_id}_2.fastq.gz --outFileNamePrefix $TMPDIR/$fq_id/ 2>$TMPDIR/$fq_id/$fq_id.log
outFilterMismatchNmax 10
int: alignment will be output only if it has fewer mismatches than this value.
If you have un-stranded RNA-seq data, and wish to run Cufflinks/Cuffdiff on STAR alignments, you will need to run STAR with the --outSAMstrandField intronMotif option, which will generate the XS strand attribute for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e. containing only non-canonical junctions) will be suppressed.
With the default --outSAMattributes Standard option, the following SAM attributes will be generated:
Column 12: NH: number of loci a read (pair) maps to
Column 13: HI: alignment index for all alignments of a read
Column 14: AS: alignment score
Column 15: nM: number of mismatches (does not include indels)
If --outSAMattributes All option is used, the following additional attributes will be output:
Column 16: jM:B:c,M1,M2,… Intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.
Column 17: jI:B:I,Start1,End1,Start2,End2,… Start and End of introns for all junctions (1-based)
Note that samtools 0.1.18 or later has to be used with these extra attributes.
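To check that the extra attributes were actually written, the SAM output can be inspected directly; a sketch assuming STAR's default Aligned.out.sam name under the --outFileNamePrefix used above:
grep -v '^@' $TMPDIR/$fq_id/Aligned.out.sam | grep 'jM:B:c' | head -5     # alignments carrying the jM/jI junction tags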
Tuesday, 27 May 2014
ChIP-seq workshops
Usage of bowtie1:
http://cgrlucb.wikispaces.com/file/view/ChIPseq_workshop_spring2013.pdf/415628696/ChIPseq_workshop_spring2013.pdf
http://cgrlucb.wikispaces.com/file/view/ChIP-seq-analysis.pdf/263706166/ChIP-seq-analysis.pdf
http://www.epigenesys.eu/images/stories/protocols/pdf/20120529112503_p56.pdf
ChIP peak calling:
http://epigenie.com/guide-peak-calling-for-chip-seq/
Workflow of ChIP-seq analysis:
https://sites.google.com/site/princetonhtseq/tutorials/chip-seq
http://haemcode.stemcells.cam.ac.uk/doc_pipeline.php
Bowtie1 command specification
bowtie -q -n 1 -m 1 -p 12 --phred64-quals ${dir_bt1}GRCh37.p13.genome $TMPDIR/$fastq $TMPDIR/$mapf > $TMPDIR/$logf
-q
The input file is in FASTQ format (this is also the default setting)
-p
Number of processors to use
-n
Maximum number of mismatches permitted in the seed. This may be 0, 1, 2 or 3. Default is 2.
-m
Suppress all alignments for a read if more than <int> reportable alignments exist for it.
-S
Print alignments in the SAM format
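Since -S prints SAM, the same command can be piped straight into samtools to get a sorted BAM; this sketch reuses the variables from the command above and assumes the old samtools 0.1.x sort syntax:
bowtie -q -S -n 1 -m 1 -p 12 --phred64-quals ${dir_bt1}GRCh37.p13.genome $TMPDIR/$fastq | samtools view -bS - | samtools sort - $TMPDIR/${fastq}_sorted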
Friday, 23 May 2014
Awk: Patterns
The whole line is stored in $0, the first field in $1, and the second field in $2 and so on.
awk 'Pattern {Action}' InputFile
Patterns include relational expression patterns, range patterns and pattern-matching expressions.
The action is always enclosed in braces. The whole awk statement, consisting of pattern and action, must be enclosed in single quotes. You could treat more than one file at once by giving several filenames.
An awk editing statement must consist of a pattern, an action, or both. If the pattern is omitted, the default pattern is used, which matches every line. If the action is omitted, the default action is to print the whole line.
You can change the field separator by supplying it immediately after the -F option, enclosed in double quotes.
$ awk -F":" '/Freddy/ {print $1 " uses " $7}' /etc/passwd
Freddy uses /bin/bash
### Here the content of fields 1 and 7, represented by the variables $1 and $7, respectively, separated by some text that must be enclosed in double quotes, is printed.
With awk, you can write very complex programs. Usually, programs are saved in a file that can be called by the option -f.
Except for the patterns BEGIN and END, patterns can be combined with the Boolean operators ||, && and !.
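For example, two expressions can be joined with && or negated with ! (a sketch against the enzyme.txt file used in the examples below; the field choices are arbitrary):
$ awk '$2 > 2 && $2 <= 10' enzyme.txt     ### second field between 2 and 10
$ awk '!($2 > 2)' enzyme.txt              ### negation of the first condition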
Regular Expressions
Regular expressions must be enclosed in slashes (/.../). If you append an exclamation mark (!) before the first slash, all records not matching the regular expression will be chosen.
$ awk '!/xxx/ {print $1}' enzyme.txt
Pattern-Matching Expressions
Sometimes you will need to ask the question if a regular expression matches a field. Thus, you do not wish to see whether a pattern matches a record but a specified field of this record. In this case, you would use a pattern-matching expression. $n stands for any field variable like $1, $2 or so.
$n~/re/ ### Is true if the field $n matches the regular expression re.
$n!~/re/ ### Is true if the field $n does not match the regular expression re.
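For instance, assuming enzyme.txt has enzyme names in its first field:
$ awk '$1 ~ /ase$/ {print $1}' enzyme.txt     ### print the first field of records whose first field ends in "ase"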
Relational Character Expressions
It is often useful to check if a character string (a row of characters) in a certain field fulfills specified conditions. You might, for example, want to check whether a certain string is contained within another string. In these cases, you would use relational character expressions. The character string s must always be enclosed in double quotes.
$n=="s"
$n!="s"
$n<"s" ### Character-by-character comparison of $n and s. The first character of each string is compared, then the second character and so on. The result is true if $n is lexicographically smaller than s: "flag<fly" and "abc<abcd"
$n<="s"
$n>"s"
Uppercase characters are lexicographically less than lowercase characters, and number characters are less than alphabetic characters.
numbers<uppercase<lowercase
"==" requires a perfect match of strings and not a partial match.
Relational Number Expressions
Similar to relational character expressions are relational number expressions, except that they compare the value of numbers.
$n==v
$n!=v
$n<v
$n<=v
$n>v
$n>=v
Any numerical value is allowed on the left side of the relation.
$ awk 'length($1)>6' enzyme.txt
Conversion of Numbers and Characters
If you need to force a number to be converted to a string, concatenate that number with the empty string "".
A string (that contains numerical characters) can be converted to a number by adding 0 to that string.
$ awk '($2+0)>2' enzyme.txt
Ranges
A range of records can be specified using two patterns separated by a comma.
$ awk '/En/,/Hy/' enzyme.txt
### In this example, all records between and including the first line matching the regular expression "En" and the first line matching the regular expression "Hy" are printed.
The range works like a switch. Upon the first occurrence of the first pattern, all lines are treated by the action until the second pattern matches or becomes true. When it becomes switched off, no line matches, until it becomes turned on again. If the "off switch", which is the second pattern, is not found, all records down to the end of the file match.
If both range patterns are the same, the switch will be turned on and off at each record. Thus, the statement
awk '/AUTHOR/,/AUTHOR/' structure.pdb
prints only one line.
BEGIN and END
BEGIN and END do not deal with the input file at all. These patterns allow for initialization and cleanup actions, respectively. Both BEGIN and END must have actions and these must be enclosed in braces. There is no default action.
The BEGIN block is executed before the first line (record) of the input file is read. Likewise, the END block is executed after the last record has been read. Both blocks are executed only once.
We can assign the field separator either with the option -F or by assigning the variable FS in the BEGIN block.
The special patterns BEGIN and END cannot be used in ranges or with any operators. However, an awk program can have multiple BEGIN and END blocks. They will be executed in the order in which they appear.
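A small sketch that uses both blocks: FS is set in BEGIN instead of with -F, and END prints a summary after all records have been read:
$ awk 'BEGIN { FS = ":" } $7 == "/bin/bash" { n++ } END { print n " users with bash" }' /etc/passwd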
Quit cat
Press [Enter] to go to an empty line, and use the [Ctrl]-[D] keys to quit cat.
Thursday, 22 May 2014
Ubuntu--Displaying the Screen on a Projector
System Settings --> Monitors, and then select the projector.
Tuesday, 20 May 2014
Linux Touch Command
The touch command is the easiest way to create new, empty files. It is also used to change the timestamps (i.e., dates and times of the most recent access and modification) on existing files and directories.
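For example (the file name is a placeholder; -d is the GNU option for setting an explicit timestamp):
touch notes.txt                          # create notes.txt if it does not exist, otherwise update its timestamps
touch -d '2014-05-20 09:00' notes.txt    # set the access and modification times explicitly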
Linux Shell Programming
Cited from the book "Beginning Linux Programming"
"File descriptor 0 is the standard input to a program, file descriptor 1 is the standard output, and file descriptor 2 is the standard error output.
To direct error and standard output to the same file, using
kill -1 1234 >killouterr.txt 2>&1
Never use the same filename twice in a string of commands.
In globbing, ? matches a single character, while [set] allows any one of a set of single characters to be matched; [^set] negates the set.
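For example (the file names are hypothetical):
ls draft?.txt       # ? matches exactly one character, e.g. draft1.txt or draftA.txt
ls draft[12].txt    # [set] matches one character from the set: draft1.txt or draft2.txt
ls draft[^12].txt   # [^set] matches one character not in the set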
To check whether a file is a script, use the "file" command, for example "file xxx.pl".
To make a file executable, use "chmod +x bbb.sh".
The shell environment variable PATH may not be set to look in the current directory for commands to execute. To change this, either type PATH=$PATH:. on the command line, or edit your .bash_profile file to add this command to the end of the file; then log out and log back in again.
If the command is just for yourself, you could create a bin directory in your home directory and add that to your path. If you want the script to be executable by others, you could use /usr/local/bin or another system directory as a convenient location for adding new programs.
chown root /usr/local/bin/first
chgrp root /usr/local/bin/first
By default, all variables are considered and stored as strings, even when they are assigned numeric values. The shell and some utilities will convert numeric strings to their values in order to operate on them as required.
We can assign user input to a variable by using the read command. This takes one parameter, the name of the variable to be read into, and then waits for the user to enter some text.
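A minimal sketch:
echo "What is your name?"
read yourname
echo "Hello $yourname"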
$ IFS=''
$ set foo bar bam
$ echo "$@"
foo bar bam
$ echo "$*"
foobarbam
$ unset IFS
$ echo "$*"
foo bar bam
The "["or "test" command is the shell’s Boolean check.
If you prefer putting thenon the same line as if, you must add a semicolon to separate the "test" from the "then".
String and arithmetic comparison is illustrated in page 33.
elif=>else if
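Putting these pieces together, a minimal sketch (the paths tested are only examples):
if [ -f /etc/fstab ]; then
    echo "fstab exists"
elif [ -d /etc ]; then
    echo "no fstab, but /etc is a directory"
else
    echo "neither found"
fi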
If you want the echo command to delete the trailing new line, the most portable choice is to use the printf command rather than the echo command. Some shells use echo -e, but that's not supported on all systems. bash allows echo -n to suppress the new line, so if you are confident your script needs to work only on bash, you can use that syntax.
The * wildcard expression doesn’t work within quotes.
The AND list construct allows us to execute a series of commands, executing the next command only if all the previous commands have succeeded. The syntax is statement1 && statement2 && statement3 && ...
if [ -f file_one ] && echo "hello" && [ -f file_two ] && echo " there"
then
    echo "in if"
else
    echo "in else"
fi
"echo" always returns true.
The OR list construct allows us to execute a series of commands until one succeeds, then not execute any more. The syntax is statement1 || statement2 || statement3 || ...
"File descriptor 0 is the standard input to a program, file descriptor 1 is the standard output, and file descriptor 2 is the standard error output.
To direct error and standard output to the same file, using
kill -1 1234 >killouterr.txt 2>&1
Never use the same filename twice in a string of commands.
In globing, single-character wildcards using ?, while [set] allows any of a number of single characters to be checked. [^set] negates the set.
To check if files are scripts or not is to use the "file" command, for example, "file xxx.pl".
To make file executable, "chmod +x bbb.sh".
If the shell environment variable PATH isn’t set to look in the current directory for commands to execute. To change this, either type PATH=$PATH:.on the command line or edit your .bash_profile file to add this command to the end of the file; then log out and log back in again.
If the command is just for yourself, you could create a bin directory in your home directory and add that to your path. If you want the script to be executable by others, you could use /usr/local/bin or another system directory as a convenient location for adding new programs.
chown root /usr/local/bin/first
chgrp root /usr/local/bin/first
By default, all variables are considered and stored as strings, even when they are assigned numeric values. The shell and some utilities will convert numeric strings to their values in order to operate on them as required.
We can assign user input to a variable by using the read command. This takes one parameter, the name of the variable to be read into, and then waits for the user to enter some text.
$ IFS=’’
$ set foo bar bam
$ echo “$@”
foo bar bam
$ echo “$*”
foobarbam
$ unset IFS
$ echo “$*”
foo bar bam
The "["or "test" command is the shell’s Boolean check.
If you prefer putting thenon the same line as if, you must add a semicolon to separate the "test" from the "then".
String and arithmetic comparison is illustrated in page 33.
elif=>else if
If you want the echo command to delete the trailing new line, the most portable choice is to use the printf command rather than the echo command. Some shells use echo –e, but that’s not supported on all systems. bash allows echo -n to suppress the new line, so if you are confident your script needs to work only on bash, you can use that syntax.
The * wildcard expression doesn’t work within quotes.
The AND list construct allows us to execute a series of commands, executing the next command only if all the previous commands have succeeded. The syntax is statement1 &&statement2 &&statement3 &&...
if [ -f file_one ] && echo “hello” && [ -f file_two ] && echo “ there”
then
echo “in if”
else
echo “in else”
fi
"echo" always returns true.
The OR list construct allows us to execute a series of commands until one succeeds, then not execute any more. The syntax is statement1 ||statement2 ||statement3 ||...
Monday, 19 May 2014
Fastx command specifications
FASTQ Quality Filter
$ fastq_quality_filter -h
usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h]         = This helpful help screen.
[-q N]       = Minimum quality score to keep.
[-p N]       = Minimum percent of bases that must have [-q] quality.
[-z]         = Compress output with GZIP.
[-i INFILE]  = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-v]         = Verbose - report number of sequences. If [-o] is specified, report will be printed to STDOUT. If [-o] is not specified (and output goes to STDOUT), report will be printed to STDERR.
Cited from Fastx toolkit
FASTX-Toolkit can only output files in compressed format, not use them as input.
According to the documentation:
Some tools can compress the output with GZIP (-z).
Only in one reported case can the input be gzipped: fasta_clipping_histogram.pl
Its help page includes this: INPUT_FILE.FA = input file (in FASTA format, can be GZIPped)
Cited from https://www.biostars.org/p/83237/
FastQC command specifications
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN
DESCRIPTION
FastQC reads a set of sequence files and produces from each one a quality control report consisting of a number of different modules, each one of which will help to identify a different potential type of problem in your data.
If no files to process are specified on the command line then the program will start as an interactive graphical application. If files are provided on the command line then the program will run with no user interaction required. In this mode it is suitable for inclusion into a standardised analysis pipeline.
The options for the program are as follows:
-h --help          Print this help file and exit
-v --version       Print the version of the program and exit
-o --outdir        Create all output files in the specified output directory. Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed.
--casava           Files come from raw casava output. Files in the same sample group (differing only by the group number) will be analysed as a set rather than individually. Sequences with the filter flag set in the header will be excluded from the analysis. Files must have the same names given to them by casava (including being gzipped and ending with .gz) otherwise they won't be grouped together correctly.
--extract          If set then the zipped output file will be uncompressed in the same directory after it has been created. By default this option will be set if fastqc is run in non-interactive mode.
-j --java          Provides the full path to the java binary you want to use to launch fastqc. If not supplied then java is assumed to be in your path.
--noextract        Do not uncompress the output file after creating it. You should set this option if you do not wish to uncompress the output when running in non-interactive mode.
--nogroup          Disable grouping of bases for reads >50bp. All reports will show data for every base in the read. WARNING: Using this option will cause fastqc to crash and burn if you use it on really long reads, and your plots may end up a ridiculous size. You have been warned!
-f --format        Bypasses the normal sequence file format detection and forces the program to use the specified format. Valid formats are bam, sam, bam_mapped, sam_mapped and fastq
-t --threads       Specifies the number of files which can be processed simultaneously. Each thread will be allocated 250MB of memory so you shouldn't run more threads than your available memory will cope with, and not more than 6 threads on a 32 bit machine
-c --contaminants  Specifies a non-default file which contains the list of contaminants to screen overrepresented sequences against. The file must contain sets of named contaminants in the form name[tab]sequence. Lines prefixed with a hash will be ignored.
-k --kmers         Specifies the length of Kmer to look for in the Kmer content module. Specified Kmer length must be between 2 and 10. Default length is 5 if not specified.
-q --quiet         Supress all progress messages on stdout and only report errors.
Cited from Wiki: FastQC
Sunday, 18 May 2014
PBS learning about resource profiles of queues
qsub                 # submit a job, see man qsub
qdel -p jobid        # will force purge the job if it is not killed by qdel
qstat                # list information about queues and jobs
qstat -q             # list all queues on system
qstat -Q             # list queue limits for all queues
qstat -a             # list all jobs on system
qstat -s             # list all jobs with status comments
qstat -r             # list all running jobs
qstat -f jobid       # list full information known about jobid
qstat -Qf queueid    # list all information known about queueid
qstat -B             # list summary information about the PBS server
qstat -iu userid     # get info for queued jobs of userid
qstat -u userid      # get info for all the jobs of userid
qstat -n -1 jobid    # will list nodes on which jobid is running in one line
checkjob jobid       # will list job details
Source (A few basic PBS commands)
Information regarding the screen output is explained here
(PBS Configuration Instructions for the NorduGrid Sites).
Thursday, 15 May 2014
PBS Job Script
qsub -v var1=value1,var2=value2 PBS-jobscript
In the PBS job script, you can get the parameters by $var1 and $var2.
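A minimal sketch of such a job script (the resource requests and the commands are placeholders; $PBS_O_WORKDIR is the directory from which the job was submitted):
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
echo "var1 is $var1"
echo "var2 is $var2"
Submitted, for example, with: qsub -v var1=hello,var2=world jobscript.pbs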
Sunday, 11 May 2014
Linux Cluster Commands
lscpu - display information about the CPU architecture
uniq - report or omit repeated lines; -c, --count prefix lines by the number of occurrences
PBS:
pbsnodes -a (display information of cluster nodes).
Thursday, 8 May 2014
Convert sra files to fastq files
fastq-dump --split-files SRA_filename
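To convert every .sra file in the current directory in one go (a sketch; --split-files writes _1.fastq and _2.fastq for paired-end runs):
for f in *.sra; do fastq-dump --split-files "$f"; done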
Bash script to download SRA files
#!/bin/bash
#shell script to download SRA files
#
#Author: Minghui Wang, minghui.wang@mssm.edu
#Date: 8 May, 2014
#
cd $WORK_DIR #you must set this directory
#the ftp address to the sequence experiment on SRA (you must change this to the real ftp)
link=ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX016/SRX016366/
#list all the samples shown under the link
sampleLists=`curl $link 2>/dev/null | awk '{print $NF}'`
#download *.sra files
for f in $sampleLists; do wget $link${f}/${f}.sra 2>/dev/null; done
echo "Job completed!"