Wednesday, 26 November 2014

Bioinformatic Algorithm Tutorials

Tuesday, 25 November 2014

VCF File Info and Format Field Abbreviations

Format Field:

FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">

FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
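As a minimal sketch of how these FORMAT keys are used, the snippet below pairs the colon-separated FORMAT column of a VCF record with one sample column. The function name decode_sample and the sample values are hypothetical and chosen only for illustration; it also assumes that AD contains no missing per-allele values.

```python
def decode_sample(format_column, sample_column):
    """Pair each FORMAT key with the matching value in a sample column."""
    keys = format_column.split(":")
    values = sample_column.split(":")
    record = dict(zip(keys, values))
    # AD holds one depth per allele (ref first, then each alt), comma-separated.
    # Assumption: every per-allele depth is present (no "." entries).
    if "AD" in record and record["AD"] != ".":
        record["AD"] = [int(depth) for depth in record["AD"].split(",")]
    return record


# Hypothetical sample column for the FORMAT string "GT:AD:DP:GQ".
sample = decode_sample("GT:AD:DP:GQ", "0/1:12,9:21:99")
print(sample["GT"], sample["AD"], sample["DP"], sample["GQ"])
```

The other values are left as strings here; a fuller parser would convert them according to the Type declared in the header.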


Info Field:
INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">

INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">

INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">

INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">

INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">

INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">

INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">

INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">

INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">

INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">

INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">

INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">

INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">

INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">

INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">

INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">

INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
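The definitions above all share the same key=value layout, so they can be turned into a lookup table with a few lines of code. The following is only a sketch: the function name parse_definition is hypothetical, and it assumes that quoted descriptions contain no embedded double quotes. It accepts the lines with or without the leading "##" used in real VCF headers.

```python
import re


def parse_definition(line):
    """Parse one INFO/FORMAT header definition into a dictionary."""
    match = re.match(r'^(?:##)?(INFO|FORMAT)=<(.*)>$', line.strip())
    if match is None:
        raise ValueError("not an INFO/FORMAT definition: " + line)
    kind, body = match.groups()
    # A value is either a quoted string (which may contain commas)
    # or a run of characters up to the next comma.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', body)
    fields = {key: value.strip('"') for key, value in pairs}
    fields["kind"] = kind
    return fields


ac = parse_definition(
    'INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in '
    'genotypes, for each ALT allele, in the same order as listed">'
)
print(ac["ID"], ac["Number"], ac["Type"])
```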


Monday, 10 November 2014

MicroRNA Analysis Software

miRseqViewer – Multi-panel visualization of sequence, structure and expression for analysis of microRNA sequencing data

IsomiRage: From Functional Classification to Differential Expression of miRNA Isoforms

TruSeq Exome Targeted Regions

The Illumina website (http://support.illumina.com/sequencing/sequencing_kits/truseq_exome_enrichment_kit/downloads.html) provides the file "TruSeq Exome Targeted Regions BED file"; the direct link is as follows.

http://support.illumina.com/content/dam/illumina-support/documents/myillumina/5dfd7e70-c4a5-405a-8131-33f683414fb7/truseq_exome_targeted_regions.hg19.bed.chr.gz

Stranded-RNA seq

Visualising stranded RNA-seq data with Gviz/Bioconductor

Perl: the Eval Function

Excerpts from Perl Eval Function Examples – Regex, Error Handling, Require, Timeout, Dynamic Code

Regular Expressions Handling with Eval
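use strict;
use warnings;

my $line = <>;
chomp $line;    # drop the trailing newline before matching
my %hash = ( number    => qr/^[0-9]+$/,
             alphabets => qr/^[a-zA-Z]+$/
);

while( my ($key,$value) = each(%hash) )
{
    # The block form of eval traps errors without compiling the input as code.
    if(eval { $line =~ $value }) { print "$key\n"; }
}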

Trapping Errors

During the execution of a subroutine, the program might die because of an error or an explicit call to the die function. If the block of Perl code is executed inside an eval, the program continues to run after the error or the die, and the error message is captured in $@.

Zero Divide Error:
eval { $average = $total / $count }; print "Error captured : $@\n";

Sunday, 9 November 2014

RNA-seq Quality Control Software

RSeQC: An RNA-seq Quality Control Package

Reverse engineering of molecular regulatory networks

birta: the R Bioconductor package

qpgraph: the R Bioconductor package

Interface to BioMart databases

biomaRt: the R Bioconductor package

Target Enrichment Quality Control

TEQC: the R Bioconductor package

Infer miRNA-mRNA interactions using paired expression data from a single sample

Roleswitch: the R Bioconductor package

R Interface to DAVID

DAVIDQuery: the R Bioconductor package

RDAVIDWebService: the R Bioconductor package

R Circos plots

OmicCircos: the R Bioconductor package

Gene Regulatory Network Inference Using Time Series

GRENITS: the R Bioconductor package

TDARACNE: the R Bioconductor package

Inference of differential exon usage in RNA-Seq

DEXSeq: the R Bioconductor package

ChIP-seq QC package

ChIPQC: the R Bioconductor package

ChIP-seq: calculate read-enrichment scores for each nucleotide position

CSAR: the R Bioconductor package

Allelic Imbalance: RNA-seq Strand-specific Analysis

AllelicImbalance: the R Bioconductor package

WASP: allele-specific software for robust discovery of molecular quantitative trait loci

The PeakAnnotator software: the Overlap Data Sets (ODS) subroutine

PeakAnalyzer main functions

ChIPpeakAnno: the R Bioconductor package

TSS plot using RNA-seq and ChIP-seq data

ngsplot

metaseq examples

ChIPseeker: the R Bioconductor package

the CEAS software

Thursday, 6 November 2014

abs() and fabs()

The function abs() takes an argument of type int and returns its absolute value as an int. Its function prototype is in stdlib.h.

fabs() takes an argument of type double and returns its absolute value as a double. Its function prototype is in math.h.


The Use of typedef

An excerpt from "A book on C"

The C language provides the typedef mechanism, which allows the programmer to explicitly associate a type with an identifier.

typedef char uppercase;

typedef int INCHES, FEET;

uppercase u;
INCHES length, width;

Assignment Operators

An excerpt from "A book on C".

The semantics is specified by

variable op= expression

which is equivalent to

variable=variable op (expression)


Assignment operators
=, +=, -=, *=, /=, %=, >>=, <<=, &=, ^=, |=.





Increment and Decrement Operators

An excerpt from "A book on C".

The expression ++i causes the stored value of i to be incremented first, with the expression then taking as its value the new stored value of i.

In contrast, the expression i++ has as its value the current value of i; then the expression causes the stored value of i to be incremented.

Files

Excerpt from "A book on C"
#include <stdio.h>

int main(int argc, char *argv[])
{
    .....
}

argc: argument count; its value is the number of arguments in the command line that was used to execute the program.

argv: argument vector; it is an array of pointers to char.

Functions

Function prototypes:

double pow(double x, double y);
or equivalently,
double pow(double, double);

Identifiers such as x and y that occur in parameter type lists in function prototypes are not used by the compiler. Their purpose is to provide documentation to the programmer and other readers of the code.

In C, arguments to functions are always passed by value.
In C, to get the effect of call-by-reference, pointers must be used.



scanf() and printf()

The function scanf() returns an int value that is the number of successful conversions accomplished or the system-defined end-of-file value (EOF).

The function printf() returns an int value that is the number of characters printed or a negative value in case of an error.


Wednesday, 5 November 2014

Book: "Bioinformatics Sequence and Genome Analysis"

Affine gap penalty is a gap penalty score that is a linear function of gap length, consisting of a gap opening penalty and a gap extension penalty multiplied by the length of the gap.
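The definition above can be written as a one-line formula, w(k) = g + r * k, for a gap of length k with opening cost g and extension cost r. The sketch below follows that wording directly; the function name and the numeric values are hypothetical, and some tools instead charge the extension cost only for positions after the first.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for one gap: opening cost plus extension cost times gap length."""
    if length <= 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * length


# Hypothetical costs: opening 10, extension 1 per gapped position.
for gap_length in (1, 3, 10):
    print(gap_length, affine_gap_penalty(gap_length, gap_open=10, gap_extend=1))
```

With these costs one long gap is cheaper than several short ones of the same total length, which is the usual reason for preferring an affine penalty.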

Alignment score is a computed score based on the number of matches, substitutions, and insertions/deletions (gaps) within an alignment. For DNA sequences, usually a match and mismatch score is chosen along with a gap penalty that will produce the most reasonable alignment.
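As an illustration only, the following sketch scores an already aligned pair of DNA sequences column by column. The function name and the scoring values (match +1, mismatch -1, gap -2 per gapped position) are assumptions made for the example, not values taken from the book.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores of two aligned rows; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # insertion or deletion
        elif a == b:
            score += match      # identical bases
        else:
            score += mismatch   # substitution
    return score


print(alignment_score("ACG-T", "ACGAT"))  # four matches and one gap
print(alignment_score("ACGT", "AGGT"))    # three matches and one mismatch
```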

BLOSUM scoring matrices are commonly used to align protein sequences.

Convergent evolution refers to the evolution of two genes to the same biological function. However, because they have different starting points, the resulting sequences are not similar.

Distance score between aligned sequences is a measure of the evolutionary distance between the sequences.

Dynamic programming algorithm solves the problem of finding the optimal alignment between sequences by breaking the alignment down into a series of sequential sub-alignments that can be readily computed.

PAM scoring matrix, or percent accepted mutation scoring matrix, is a table or matrix that describes the odds that a sequence position, e.g., an amino acid, has changed into a second one during a period of evolutionary time.

Smith-Waterman algorithm is a dynamic programming algorithm for locating the highest-scoring local alignments of sequences. The key feature is that all negative scores calculated in the dynamic programming matrix are changed to zero to avoid extending poorly scoring alignments and to assist in identifying local alignments starting and stopping anywhere in the matrix.

In a local alignment, the alignment stops at the ends of regions of strong similarity, and a much higher priority is given to finding these local regions than to extending the alignment to include more neighboring amino acid pairs.
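A compact sketch of the Smith-Waterman recurrence described above is given below. It uses a simple linear gap penalty rather than an affine one, and the function name, scoring values and test sequences are hypothetical. Note how every cell is clamped at zero, which is what lets a local alignment start and stop anywhere in the matrix.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of seq_a, aligned part of seq_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Negative scores are replaced by zero (the key local-alignment step).
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # match or substitution
                          H[i - 1][j] + gap,      # gap in seq_b
                          H[i][j - 1] + gap)      # gap in seq_a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(seq_a[i - 1]); bottom.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(seq_a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, local_a, local_b = smith_waterman("TTTACGTAAA", "GGACGTCC")
print(score, local_a, local_b)
```

In this example the flanking bases are left out of the alignment because extending into them would only lower the score.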

Odds ratio in sequence alignment is the ratio of the odds of obtaining the score between related sequences to the odds of obtaining the same score between unrelated sequences.


Tuesday, 4 November 2014

Paper "Strand-Specific RNA-Seq Provides Greater Resolution of Transcriptome Profiling"

It seems that antisense transcriptional ‘hot spots’ are located around nucleosome-free regions such as those associated with promoters, indicating that it is likely that antisense transcripts carry out important regulatory functions.

Furthermore, antisense transcripts have been documented that partner with active promoter sites or those that are in close proximity of transcription start sites [17, 22, 23]. While antisense transcripts occur at lower abundances than their sense transcripts, all evidence points to non-coding antisense transcripts playing a pivotal role in regulation of the transcriptome [19].

There exist a variety of pathways in which antisense transcripts can act as regulatory elements. It is possible to divide these pathways into three broad categories: transcription modulation, hybridization of sense-antisense RNA partners, and chromatin modification.

The act of antisense transcription, rather than the asRNA molecule itself, can modulate gene expression levels. During transcription, RNA polymerase binds to the promoter region of the gene and proceeds along the strand. If transcription occurs on the DNA sense strand and antisense strand simultaneously, it can result in the RNA polymerases colliding.

Splicing is controlled by the presence of exonic splicing enhancers/silencers and intronic enhancers/silencers; the ratios of these elements affect the splicing pattern [27]. These elements contain motifs that will recruit splicing machinery to the site. If sections of the transcript containing these elements are masked by hybridization with an antisense transcript, then the splicing patterns of the sense transcript will be changed.

RNA duplex formation in the cytoplasm may alter the ability of a transcript to be translated. It is possible that the duplex formation blocks the ability of the transcript to associate with the ribosome hence altering the efficiency of the translation machinery.

It has been suggested that long ncRNAs, such as those produced by antisense transcription, may interact with histone modifying enzymes via the formation of specific RNA secondary structures [36].

Monday, 3 November 2014

Alignment Methods

Bowtie2

# specified parameters

--n-ceil L,0,0.03
L: linear
Sets the maximum number of ambiguous characters allowed in a read as a function of read length; specifying L,0,0.03 sets the N-ceiling function f to f(x) = 0 + 0.03 * x, where x is the read length.

--score-min C,-14,0
C: constant
Sets a function governing the minimum alignment score needed for an alignment to be considered "valid" (i.e. good enough to report). In general this is a function of read length. For instance, specifying C,-14,0 sets the minimum-score function f to the constant f(x) = -14, where x is the read length.
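Both --n-ceil and --score-min take a function specification of the form type,constant,coefficient. The sketch below shows how such a specification could be evaluated for a given read length. It is not Bowtie2 code: the function name is hypothetical, and returning only the constant term for type C is an assumption (the two readings agree for C,-14,0 because the coefficient is 0).

```python
import math


def bowtie2_function(spec, read_length):
    """Evaluate a 'type,constant,coefficient' specification at read_length."""
    kind, constant, coefficient = spec.split(",")
    constant, coefficient = float(constant), float(coefficient)
    if kind == "C":   # constant (assumption: the coefficient is not used)
        return constant
    if kind == "L":   # linear
        return constant + coefficient * read_length
    if kind == "S":   # square root
        return constant + coefficient * math.sqrt(read_length)
    if kind == "G":   # natural log
        return constant + coefficient * math.log(read_length)
    raise ValueError("unknown function type: " + kind)


# For a 100-bp read:
print(bowtie2_function("L,0,0.03", 100))  # N ceiling from --n-ceil L,0,0.03
print(bowtie2_function("C,-14,0", 100))   # minimum score from --score-min C,-14,0
```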

--phred33
Input qualities are ASCII chars equal to the Phred quality plus 33. This is also called the "Phred+33" encoding, which is used by the very latest Illumina pipelines.
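The Phred+33 encoding can be decoded by subtracting 33 from each character code. The following is a small sketch with hypothetical function names and a made-up quality string; the error probability uses the standard Phred relation p = 10^(-Q/10).

```python
def phred33_to_scores(quality_string):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


scores = phred33_to_scores("!5I")   # made-up quality string
print(scores)
print([error_probability(s) for s in scores])
```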

-X 50000
The maximum fragment length for valid paired-end alignments. E.g. if -X 100 is specified and a paired-end alignment consists of two 20-bp alignments in the proper orientation with a 60-bp gap between them, that alignment is considered valid (as long as -I is also satisfied). A 61-bp gap would not be valid in that case. If trimming options -3 or -5 are also used, the -X constraint is applied with respect to the untrimmed mates, not the trimmed mates.
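The arithmetic in the -X example can be checked with a short sketch. The function below is hypothetical and only models the length test described above (fragment length = first mate + gap + second mate, compared against the -I and -X limits); it ignores orientation and trimming.

```python
def fragment_is_valid(mate1_len, gap, mate2_len, max_fragment, min_fragment=0):
    """Check a paired-end fragment length against the -I/-X style limits."""
    fragment_length = mate1_len + gap + mate2_len
    return min_fragment <= fragment_length <= max_fragment


# Two 20-bp mates with -X 100: a 60-bp gap fits, a 61-bp gap does not.
print(fragment_is_valid(20, 60, 20, max_fragment=100))
print(fragment_is_valid(20, 61, 20, max_fragment=100))
```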

-N 1
Sets the number of mismatches allowed in a seed alignment during multiseed alignment. Can be set to 0 or 1. Setting this higher makes alignment slower (often much slower) but increases sensitivity. Default: 0.

-q/--quiet
bowtie2-build is verbose by default. With this option bowtie2-build will print only error messages.

# default parameters

--ma <int>
Sets the match bonus. In --local mode <int> is added to the alignment score for each position where a read character aligns to a reference character and the characters match. Not used in --end-to-end mode. Default: 2.

qalter

Torque:
qalter <jobid> -W queue=<new queue name>


Other (unspecified) job scheduler:
qalter -q <new queue name> <jobid>