Shortcuts in Science: April 2015

Thursday 30 April 2015

Complements and Intersections of VCF Files

Cited from bcftools

bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz […]

Creates intersections, unions and complements of VCF files. Depending on the options, the program can output records from one (or more) files which have (or do not have) corresponding records with the same position in the other files.
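
The position-based logic can be pictured with plain sets. This is only a minimal Python sketch of the idea, not how bcftools works internally: the records are hypothetical and reduced to (chromosome, position) keys, whereas real VCF records carry many more fields.

```python
# Hypothetical records, reduced to (chromosome, position) keys.
a_sites = {("chr1", 100), ("chr1", 250), ("chr2", 40)}
b_sites = {("chr1", 250), ("chr2", 40), ("chr2", 900)}

shared = a_sites & b_sites        # intersection: positions present in both files
only_in_a = a_sites - b_sites     # complement: positions in A with no record in B
either = a_sites | b_sites        # union: positions present in at least one file

print(sorted(shared))
print(sorted(only_in_a))
print(len(either))
```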

Bash Dereference Concatenated Variable Name

Cited from Dereference concatenated variable name

FRUITS="BANANA APPLE ORANGE"

BANANA_COLOUR="Yellow"
APPLE_COLOUR="Green or Red"
ORANGE_COLOUR="Blue"

for fruit in $FRUITS; do
    # \$ is escaped so that eval's second pass expands e.g. $BANANA_COLOUR
    eval echo $fruit is \$${fruit}_COLOUR
done

'The eval simply tells bash to make a second evaluation of the following statement (i.e. one more than its normal evaluation). The \$ survives the first evaluation as $, and the next evaluation then treats this $ as the start of a variable name, which resolves to "Yellow", etc.'

The Execution of a Pipe with a Bash Find Command

How to use pipe within -exec in find

Looping through files with spaces in the names?

Wednesday 29 April 2015

UCSC chr_random Sequences

UCSC chr_random in genome and gtf files

Retrieve Genome File from UCSC Database

According to Downloading Data using MySQL, a genome file (chromosome names and sizes) can be retrieved from the UCSC public MySQL server. For example,

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from mm9.chromInfo" > mm9.genome

Tuesday 28 April 2015

Linux Command Cut

How to remove columns from CSV file based on column number using bash shell

GTF to BED

How To Convert Gencode Gtf Into Bed Format ?

Kent Command Utility

Extracting UCSC GTF: The Recommended Way

Create a '.gtf' annotation file from the UCSC table

Genes in gtf or gff format

Downloading UCSC GTF

Human Reference Genome Annotation GTF download

Friday 24 April 2015

De-novo Mutation Identification: Gemini

GEMINI: a flexible framework for exploring genome variation

Thursday 23 April 2015

Wednesday 22 April 2015

Quality Assessment of Sequence Reads

Quality assessment and quality control of NGS data

Tuesday 21 April 2015

Identifying De-novo Mutations From Trio Data

Trio-Analysis Pipeline Script

Demultiplexing Fastq According to Barcodes

How to demultiplex fastq files with a dedicated, separate barcode file

Regular Expression: The Order of Lookaheads

Cited from The Order of Lookaheads Doesn't Matter… Almost

"While the order of lookaheads doesn't matter on a logical level, keep in mind that it may matter for matching speed. If one lookahead is more likely to fail than the other two, it makes little sense to place it in third position and expend a lot of energy checking the first two conditions. Make it first, so that if we're going to fail, we fail early—an application of the design to fail principle from the regex style guide."

"The negative lookbehind (?<!.) asserts that what precedes the current position is not any character—therefore the position must be the beginning of the string."

Regular Expression: DOTALL mode

Cited from DOTALL (Dot Matches Line Breaks): s (with exceptions)

"By default, the dot . doesn't match line break characters such as line feeds and carriage returns. If you want patterns such as BEGIN .*? END to match across lines, we need to turn that feature on."

"This mode is sometimes called single-line (hence the s) because as far as the dot is concerned, it turns the whole string into one big line—.* will match from the first character to the last, no matter how many line breaks stand in between."

"In Perl, apart from the (?s) inline modifier, Perl lets you add the s flag after your pattern's closing delimiter. For instance, you can use:
if ($the_subject =~ m/BEGIN .*? END/s) { … }"
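
The same behaviour can be reproduced in Python, where the flag is re.DOTALL and the (?s) inline modifier is also accepted. A minimal sketch with a made-up two-line subject:

```python
import re

text = "BEGIN first line\nsecond line END"

# Without the flag, the dot stops at the line break, so there is no match.
without_flag = re.search(r"BEGIN .*? END", text)

# With DOTALL (or the inline (?s) modifier) the dot also matches "\n".
with_flag = re.search(r"BEGIN .*? END", text, re.DOTALL)
inline_flag = re.search(r"(?s)BEGIN .*? END", text)

print(without_flag)
print(with_flag.group(0) == text)
```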

Regular Expression: Non-greedy Matching

Cited from Regular Expression Tutorial Part 5: Greedy and Non-Greedy Quantification

To make the quantifier non-greedy you simply follow it with a '?' symbol:

my $string = 'bcdabdcbabcd';

$string =~ m/^(.*?)ab/;
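
For comparison, here is a Python sketch that runs the same pattern with and without the '?' on the same string. The variable names are chosen for illustration.

```python
import re

string = "bcdabdcbabcd"

lazy = re.match(r"^(.*?)ab", string)    # expands only until the first "ab"
greedy = re.match(r"^(.*)ab", string)   # takes everything, backtracks to the last "ab"

print(lazy.group(1))    # bcd
print(greedy.group(1))  # bcdabdcb
```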

Regular Expression Possessive: Don't Give Up Characters

Cited from Possessive: Don't Give Up Characters

"As you'll see in the table below, a quantifier is made possessive by appending a + plus sign to it. Therefore, A++ is possessive—it matches as many characters as needed and never gives any of them back."

Monday 20 April 2015

Regular Expression Anchors

"Regex anchors force the regex engine to start or end a match at an absolute position. The start of string anchor (\A) dictates that any match must start at the beginning of the string."

"The end of line string anchor (\Z) requires that a match end at the end of a line within the string."

"The word boundary anchor (\b) matches only at the boundary between a word character (\w) and a non-word character (\W)."

Cited from Regular Expressions and Matching

##################################################
✽ In .NET, Perl and Ruby, \Z is allowed to match before a final line feed. Therefore, e\Z will match the final e in the string "apple\norange\n".

Cited from Regex Anchors
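
Python differs from the Perl, .NET and Ruby behaviour quoted above: in Python's re module, \Z matches only at the very end of the string, while $ may match just before a final line feed. A short sketch using the same "apple\norange\n" subject; the sentence used for \b is made up.

```python
import re

subject = "apple\norange\n"

# \A: a match may only start at the very beginning of the string.
starts = [m.start() for m in re.finditer(r"\A\w+", subject)]

# In Python, \Z matches only at the very end, so e\Z finds nothing here.
strict_end = re.search(r"e\Z", subject)

# $ (without re.MULTILINE) may match just before a trailing line feed.
before_final_newline = re.search(r"e$", subject)

# \b: boundary between a word character and a non-word character.
words = re.findall(r"\bor\w*\b", "apple or orange, for sure")

print(starts, strict_end, before_final_newline.start(), words)
```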

Regular Expression: The Use of (?

Named capture in Perl:

'Perl uses (?<NAME>pattern) to specify named captures. You have to use the %+ hash to retrieve them.

$variable =~ /(?<count>\d+)/;
print "Count is $+{count}";'

Cited from Can I use named groups in a Perl regex to get the results in a hash?
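
For comparison, Python's re module spells the named group (?P<count>...) and returns the captures through the match object instead of a %+ hash. A minimal sketch with a made-up input string:

```python
import re

variable = "Found 42 matching records"
match = re.search(r"(?P<count>\d+)", variable)

if match:
    captures = match.groupdict()   # dictionary of all named captures
    print(f"Count is {match.group('count')}")
```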
##################################################
"The normal capturing (pattern) has the property of capturing and group. Capturing means that the text matches the pattern inside will be captured so that you can use it with back-reference, in matching or replacement. The non-capturing group (?:pattern) doesn't have the capturing property."

"Atomic grouping (?>pattern) also has the non-capturing property, so the position of the text matched inside will not be captured."

Cited from Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?
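
The difference between capturing and non-capturing groups is easy to see in a short Python sketch. The date string and the repeated-word sentence are made up for illustration, and atomic groups are left out because Python's re module only supports them from version 3.11.

```python
import re

date = "2015-04-20"

capturing = re.match(r"(\d{4})-(\d{2})-(\d{2})", date)
non_capturing = re.match(r"(\d{4})-(?:\d{2})-(\d{2})", date)

print(capturing.groups())      # ('2015', '04', '20')
print(non_capturing.groups())  # ('2015', '20')

# A captured group can be reused as a back-reference: \1 must repeat group 1.
repeated = re.search(r"\b(\w+) \1\b", "it is is fine")
print(repeated.group(1))
```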

Tuesday 14 April 2015

The European Molecular Biology Open Software Suite (EMBOSS)

EMBOSS

Paper "Charting a dynamic DNA methylation landscape of the human genome"

Excerpts from Charting a dynamic DNA methylation landscape of the human genome

"Most cell types, except germ cells and pre-implantation embryos^{3, 4, 5}, display relatively stable DNA methylation patterns, with 70–80% of all CpGs being methylated."

CpG Island and Shores

Excerpts from "CpG site"

""CpG" is shorthand for "—C—phosphate—G—", that is, cytosine and guanine separated by only one phosphate; phosphate links any two nucleosides together in DNA. The "CpG" notation is used to distinguish this linear sequence from the CG base-pairing of cytosine and guanine."

#################################################
Excerpts from "Question: Find Cpg Islands"

"CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria: GC content of 50% or greater, length greater than 200 bp, ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment.

The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al. (1987)): Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G) where N = length of sequence."
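
The quantities defined above can be computed for a single sequence as follows. This is a minimal Python sketch with hypothetical function names; it evaluates the criteria for one candidate segment and does not implement the dinucleotide scoring and segment search described in the first paragraph.

```python
def cpg_stats(seq: str) -> dict:
    """Compute the CpG statistics described above for one sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return {
        "length": n,
        "gc_content": (c + g) / n if n else 0.0,
        "cpg_count": cpg,
        "percent_cpg": 100 * 2 * cpg / n if n else 0.0,
        # Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
        "obs_exp": cpg * n / (c * g) if c and g else 0.0,
    }


def passes_island_criteria(seq: str) -> bool:
    """GC content >= 50%, length > 200 bp, Obs/Exp CpG > 0.6."""
    s = cpg_stats(seq)
    return s["gc_content"] >= 0.5 and s["length"] > 200 and s["obs_exp"] > 0.6


stats = cpg_stats("ACGCGTTCGA")
print(stats["cpg_count"], round(stats["obs_exp"], 3))
print(passes_island_criteria("CG" * 101))
```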

##################################################
Excerpts from "What is a CpG shore and how to I get them all?"

"CpG shores are the regions immediately flanking and up to 2 kbp away from CpG islands. These regions are interesting because methylation they are variably methylated in cancer and development."

Monday 13 April 2015

Paper "Targeted disruption of DNMT1, DNMT3A and DNMT3B in human embryonic stem cells"

Excerpts from "Targeted disruption of DNMT1, DNMT3A and DNMT3B in human embryonic stem cells"

"Human ESC methylation patterns are most unique at hypomethylated regulatory elements that are enriched for binding of pluripotency-associated master regulators, such as OCT4, SOX2 and NANOG."

Hemimethylated DNA

Excerpts from What is hemimethylated DNA?

"DNA-hemimethylation is when only one of two (complementary) strands is methylated. A hemi-methylated site is a single CpG that is methylated on one strand, but not on the other. This is not the same thing as allele-specific methylation, which is common in imprinting. In hemi-methylation, we’re talking about 2 strands from the same parent. Hemimethylation is important because it directly identifies de novo methylation events, allowing you to differentiation between de novo vs. maintenance factors. Because DNA methylation is faithfully propagated during DNA replication (by DNMT1), any hemimethylated sites must have arisen during the last replication round, either because: 1) failure to faithfully propagate a parental methylation signal; or, 2) a de novo methylation event. You can differentiate between the two if you know the methylation status of the parent: if the parent strand was entirely methylated, then hemimethylation indicates failure of maintenance. Vice versa, if the parent straned was unmethylated, hemimethylation indicates de novo methylation."

Friday 3 April 2015

TMM Normalisation

Excerpts from "NormalizationAndDifferentialExpression"

library(edgeR)  # provides calcNormFactors(); geneCounts.dgelist is assumed to be a DGEList

tmm <- calcNormFactors(geneCounts.dgelist)  # TMM is the default method

# equation from the edgeR documentation for estimating normalized absolute expression from their scaling factors
tmmScaleFactors <- geneCounts.dgelist$samples$lib.size * tmm$samples$norm.factors
tmmExp <- round(t(t(tmm$counts)/tmmScaleFactors) * mean(tmmScaleFactors))

#################################################
Excerpts from "Question: After Getting Normalization Factor Via Edger, What To Do For Normalization?"

The TMM counts are: count / (library size * normalization factor)

Then multiply that by a million to get CPM.

Not count / normalization factor
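
As a worked example of the arithmetic above, here is a minimal Python sketch. The function name and the numbers are made up for illustration.

```python
def tmm_cpm(count: float, lib_size: float, norm_factor: float) -> float:
    """count / (library size * normalization factor), scaled to one million."""
    effective_lib_size = lib_size * norm_factor
    return count / effective_lib_size * 1e6


value = tmm_cpm(count=150, lib_size=2_000_000, norm_factor=0.8)
print(round(value, 2))
```

With these numbers the effective library size is 1.6 million, so the result is about 93.75 CPM. Dividing the count by the normalization factor alone would give 187.5, which is not a CPM value.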

And DESeq doesn't just do a simple division by library size. It takes the median of the ratio of the count to the geometric mean of the expression values as the scaling factor for each library.
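
The median-of-ratios idea can be sketched in a few lines of Python. This follows the description above and is not DESeq's actual code. The function name, the tiny count matrix and the decision to skip genes containing a zero count are assumptions made for this example.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of genes, each a list of per-sample counts."""
    n_samples = len(counts[0])
    # Log of the geometric mean per gene; genes with any zero count are skipped.
    log_geo_means = [
        sum(math.log(c) for c in gene) / n_samples if all(c > 0 for c in gene) else None
        for gene in counts
    ]
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count to its geometric mean, on the log scale.
        log_ratios = [
            math.log(gene[j]) - lgm
            for gene, lgm in zip(counts, log_geo_means)
            if lgm is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(counts)
print([round(f, 4) for f in factors])
```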