Thursday, 30 April 2015
Complements and Intersections of VCF Files
Cited from bcftools
bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz […]
Creates intersections, unions and complements of VCF files. Depending on the options, the program can output records from one (or more) files which have (or do not have) corresponding records with the same position in the other files.
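As a rough illustration of what "corresponding records with the same position" means, the sketch below treats each VCF file as a set of (chromosome, position) keys and derives the intersection, union and complements with Python sets. It is only a conceptual model, not how bcftools works internally, and the records are made up.

```python
# Hypothetical records: (chromosome, position) keys standing in for VCF lines.
a_records = {("chr1", 100), ("chr1", 250), ("chr2", 40)}
b_records = {("chr1", 250), ("chr2", 40), ("chr2", 900)}

shared = a_records & b_records      # intersection: present in both files
either = a_records | b_records      # union: present in at least one file
only_in_a = a_records - b_records   # complement: in A but not in B
only_in_b = b_records - a_records   # complement: in B but not in A

for label, records in [("shared", shared), ("only in A", only_in_a), ("only in B", only_in_b)]:
    print(label, sorted(records))
```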
Bash Dereference Concatenated Variable Name
Cited from Dereference concatenated variable name
FRUITS="BANANA APPLE ORANGE"
BANANA_COLOUR="Yellow"
APPLE_COLOUR="Green or Red"
ORANGE_COLOUR="Blue"
for fruit in $FRUITS; do
    eval echo $fruit is \$${fruit}_COLOUR
done
'The eval simply tells bash to make a second evaluation of the following statement (i.e. one more than its normal evaluation). The \$ survives the first evaluation as $, and the next evaluation then treats this $ as the start of a variable name, which resolves to "Yellow", etc.'
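The two-pass idea can be mimicked outside bash. The sketch below is only an analogy, assuming Python's string.Template as a stand-in for shell expansion: "$$" plays the role of "\$", surviving the first substitution as a plain "$" that the second substitution then resolves. The helper name expand_twice is made up for illustration.

```python
from string import Template

colours = {
    "BANANA_COLOUR": "Yellow",
    "APPLE_COLOUR": "Green or Red",
    "ORANGE_COLOUR": "Blue",
}

# "$$" is Template's escape for a literal "$", like "\$" in the bash line.
line = Template("$fruit is $$${fruit}_COLOUR")

def expand_twice(fruit):
    first_pass = line.substitute(fruit=fruit)         # e.g. "BANANA is $BANANA_COLOUR"
    return Template(first_pass).substitute(colours)   # e.g. "BANANA is Yellow"

results = [expand_twice(fruit) for fruit in "BANANA APPLE ORANGE".split()]
```

In bash itself, indirect expansion (${!name}) is an alternative that avoids eval.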
Wednesday, 29 April 2015
Retrieve Genome File from UCSC Database
According to Downloading Data using MySQL, chromosome names and sizes can be queried directly from the UCSC public MySQL server.
For example,
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from mm9.chromInfo" > mm9.genome
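The resulting file should be tab-separated chrom/size pairs. A minimal sketch for loading it into a dictionary follows, assuming the mysql client wrote a header row with the column names (its usual batch-mode behaviour unless column names are suppressed). The sample text and its sizes are placeholders, not real mm9 values, and read_genome is a hypothetical helper.

```python
import csv
import io

# Placeholder content standing in for mm9.genome; the sizes are NOT real.
sample = "chrom\tsize\nchr1\t1000\nchr2\t800\n"

def read_genome(handle):
    """Return {chromosome: size} from a tab-separated chrom/size file."""
    sizes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] == "chrom":   # skip blank lines and the header row
            continue
        sizes[row[0]] = int(row[1])
    return sizes

genome = read_genome(io.StringIO(sample))
```

For a real file, pass open("mm9.genome") instead of the StringIO object.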
Tuesday, 21 April 2015
Regular Expression: The Order of Lookaheads
Cited from The Order of Lookaheads Doesn't Matter… Almost
"While the order of lookaheads doesn't matter on a logical level, keep in mind that it may matter for matching speed. If one lookahead is more likely to fail than the other two, it makes little sense to place it in third position and expend a lot of energy checking the first two conditions. Make it first, so that if we're going to fail, we fail early—an application of the design to fail principle from the regex style guide."
"The negative lookbehind (?<!.) asserts that what precedes the current position is not any character—therefore the position must be the beginning of the string."
"While the order of lookaheads doesn't matter on a logical level, keep in mind that it may matter for matching speed. If one lookahead is more likely to fail than the other two, it makes little sense to place it in third position and expend a lot of energy checking the first two conditions. Make it first, so that if we're going to fail, we fail early—an application of the design to fail principle from the regex style guide."
"The negative lookbehind (?<!.) asserts that what precedes the current position is not any character—therefore the position must be the beginning of the string."
Regular Expression: DOTALL mode
Cited from DOTALL (Dot Matches Line Breaks): s (with exceptions)
"By default, the dot . doesn't match line break characters such as line feeds and carriage returns. If you want patterns such as BEGIN .*? END to match across lines, we need to turn that feature on."
"This mode is sometimes called single-line (hence the s) because as far as the dot is concerned, it turns the whole string into one big line—.* will match from the first character to the last, no matter how many line breaks stand in between."
"In Perl, apart from the (?s) inline modifier, Perl lets you add the s flag after your pattern's closing delimiter. For instance, you can use:
if ($the_subject =~ m/BEGIN .*? END/s) { … }"
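The same behaviour can be tried in Python, where the flag is spelled re.DOTALL and the (?s) inline modifier is also accepted. The sample text is made up.

```python
import re

text = "BEGIN first line\nsecond line END"

default_match = re.search(r"BEGIN .*? END", text)           # dot stops at "\n"
flag_match = re.search(r"BEGIN .*? END", text, re.DOTALL)    # flag, like Perl's /s
inline_match = re.search(r"(?s)BEGIN .*? END", text)         # inline modifier
```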
"By default, the dot . doesn't match line break characters such as line feeds and carriage returns. If you want patterns such as BEGIN .*? END to match across lines, we need to turn that feature on."
"This mode is sometimes called single-line (hence the s) because as far as the dot is concerned, it turns the whole string into one big line—.* will match from the first character to the last, no matter how many line breaks stand in between."
"In Perl, apart from the (?s) inline modifier, Perl lets you add the s flag after your pattern's closing delimiter. For instance, you can use:
if ($the_subject =~ m/BEGIN .*? END/s) { … }"
Regular Expression: Non-greedy Matching
Cited from Regular Expression Tutorial Part 5: Greedy and Non-Greedy Quantification
To make the quantifier non-greedy you simply follow it with a '?' symbol:
my $string = 'bcdabdcbabcd';
$string =~ m/^(.*?)ab/;
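For comparison, here is the same string matched with and without the '?' in Python's re module, which uses the same quantifier syntax for this case.

```python
import re

string = "bcdabdcbabcd"
lazy = re.match(r"^(.*?)ab", string).group(1)    # stops at the first "ab"
greedy = re.match(r"^(.*)ab", string).group(1)   # runs to the last "ab"
```

Here lazy captures "bcd" while greedy captures "bcdabdcb".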
Regular Expression Possessive: Don't Give Up Characters
Cited from Possessive: Don't Give Up Characters
"As you'll see in the table below, a quantifier is made possessive by appending a + plus sign to it. Therefore, A++ is possessive—it matches as many characters as needed and never gives any of them back."
"As you'll see in the table below, a quantifier is made possessive by appending a + plus sign to it. Therefore, A++ is possessive—it matches as many characters as needed and never gives any of them back."
Monday, 20 April 2015
Regular Expression Anchors
"Regex anchors force the regex engine to start or end a match at an absolute position. The start of string anchor (\A) dictates that any match must start at the beginning of the string."
"The end of line string anchor (\Z) requires that a match end at the end of a line within the string."
"The word boundary anchor (\b) matches only at the boundary between a word character (\w) and a non-word character (\W)."
Cited from Regular Expressions and Matching
##################################################
✽ In .NET, Perl and Ruby, \Z is allowed to match before a final line feed. Therefore, e\Z will match the final e in the string "apple\norange\n".
Cited from Regex Anchors
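Anchor behaviour differs between flavours, so the following is a Python-specific sketch rather than a restatement of the Perl rules above. In Python's re module, \Z matches only at the very end of the string, and it is $ that is allowed to match before a final line feed. The "cat" sentence is a made-up sample.

```python
import re

text = "apple\norange\n"

start_only = re.findall(r"\A\w+", text)          # \A: only at the very start
absolute_end = re.search(r"e\Z", text)           # Python's \Z: very end only
before_final_newline = re.search(r"e$", text)    # Python's $: also before a final "\n"
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
```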
"The end of line string anchor (\Z) requires that a match end at the end of a line within the string."
"The word boundary anchor (\b) matches only at the boundary between a word character (\w) and a non-word character (\W)."
Cited from Regular Expressions and Matching
##################################################
✽ In .NET, Perl and Ruby, \Z is allowed to match before a final line feed. Therefore, e\Z will match the final e in the string "apple\norange\n".
Cited from Regex Anchors
Regular Expression: The Use of (?
Named capture in Perl:
'Perl uses (?<NAME>pattern) to specify named captures. You have to use the %+ hash to retrieve them.
$variable =~ /(?<count>\d+)/;
print "Count is $+{count}";'
Cited from Can I use named groups in a Perl regex to get the results in a hash?
##################################################
"The normal capturing (pattern) has the property of capturing and group. Capturing means that the text matches the pattern inside will be captured so that you can use it with back-reference, in matching or replacement. The non-capturing group (?:pattern) doesn't have the capturing property."
"Atomic grouping (?>pattern) also has the non-capturing property, so the position of the text matched inside will not be captured."
Cited from Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?
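For comparison, a Python sketch of named and non-capturing groups. Python spells the named group (?P<name>pattern), and the match object's groupdict() plays roughly the role of Perl's %+ hash. The sample strings are made up.

```python
import re

match = re.search(r"(?P<count>\d+)", "Count: 42 items")
count = match.group("count")       # counterpart of $+{count} in Perl
as_dict = match.groupdict()        # all named groups as a dictionary

# (?:...) groups without capturing, so only the second group is numbered.
m = re.match(r"(?:chr)(\d+)", "chr12")
captured = m.groups()
```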
Tuesday, 14 April 2015
Paper "Charting a dynamic DNA methylation landscape of the human genome"
Excerpts from Charting a dynamic DNA methylation landscape of the human genome
"Most cell types, except germ cells and pre-implantation embryos3, 4, 5, display relatively stable DNA methylation patterns, with 70–80% of all CpGs being methylated."
"Most cell types, except germ cells and pre-implantation embryos3, 4, 5, display relatively stable DNA methylation patterns, with 70–80% of all CpGs being methylated."
CpG Island and Shores
Excerpts from "CpG site"
""CpG" is shorthand for "—C—phosphate—G—", that is, cytosine and guanine separated by only one phosphate; phosphate links any two nucleosides together in DNA. The "CpG" notation is used to distinguish this linear sequence from the CG base-pairing of cytosine and guanine."
#################################################
Excerpts from "Question: Find Cpg Islands"
"CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria: GC content of 50% or greater, length greater than 200 bp, ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment.##################################################
The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al. (1987)): Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G) where N = length of sequence."
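The three criteria and the Obs/Exp formula can be written out as a short sketch. This covers only the evaluation step for a given segment, not the +17/-1 scoring search that finds candidate segments. The function names and the toy sequence are made up for illustration.

```python
def cpg_stats(sequence):
    """GC fraction, CpG count and observed/expected CpG ratio for a sequence."""
    seq = sequence.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")   # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c_count + g_count) / n if n else 0.0
    if c_count and g_count:
        # Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
        obs_exp = cpg_count * n / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_fraction, cpg_count, obs_exp

def passes_island_criteria(sequence):
    """GC content >= 50%, length > 200 bp, Obs/Exp CpG > 0.6."""
    gc_fraction, _, obs_exp = cpg_stats(sequence)
    return gc_fraction >= 0.5 and len(sequence) > 200 and obs_exp > 0.6

toy = "CGCGATCGTA"   # made-up sequence: 3 C, 3 G, 3 CG in 10 bases
```

For the toy sequence the ratio is 3 * 10 / (3 * 3), about 3.33, but it fails the criteria because it is far shorter than 200 bp.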
Excerpts from "What is a CpG shore and how to I get them all?"
"CpG shores are the regions immediately flanking and up to 2 kbp away from CpG islands. These regions are interesting because methylation they are variably methylated in cancer and development."
Monday, 13 April 2015
Paper "Targeted disruption of DNMT1, DNMT3A and DNMT3B in human embryonic stem cells"
Excerpts from "Targeted disruption of DNMT1, DNMT3A and DNMT3B in human embryonic stem cells"
"Human ESC methylation patterns are most unique at hypomethylated regulatory elements that are enriched for binding of pluripotency-associated master regulators, such as OCT4, SOX2 and NANOG."
"Human ESC methylation patterns are most unique at hypomethylated regulatory elements that are enriched for binding of pluripotency-associated master regulators, such as OCT4, SOX2 and NANOG."
Hemimethylated DNA
Excerpts from What is hemimethylated DNA?
"DNA-hemimethylation is when only one of two (complementary) strands is methylated. A hemi-methylated site is a single CpG that is methylated on one strand, but not on the other. This is not the same thing as allele-specific methylation, which is common in imprinting. In hemi-methylation, we’re talking about 2 strands from the same parent. Hemimethylation is important because it directly identifies de novo methylation events, allowing you to differentiation between de novo vs. maintenance factors. Because DNA methylation is faithfully propagated during DNA replication (by DNMT1), any hemimethylated sites must have arisen during the last replication round, either because: 1) failure to faithfully propagate a parental methylation signal; or, 2) a de novo methylation event. You can differentiate between the two if you know the methylation status of the parent: if the parent strand was entirely methylated, then hemimethylation indicates failure of maintenance. Vice versa, if the parent straned was unmethylated, hemimethylation indicates de novo methylation."
"DNA-hemimethylation is when only one of two (complementary) strands is methylated. A hemi-methylated site is a single CpG that is methylated on one strand, but not on the other. This is not the same thing as allele-specific methylation, which is common in imprinting. In hemi-methylation, we’re talking about 2 strands from the same parent. Hemimethylation is important because it directly identifies de novo methylation events, allowing you to differentiation between de novo vs. maintenance factors. Because DNA methylation is faithfully propagated during DNA replication (by DNMT1), any hemimethylated sites must have arisen during the last replication round, either because: 1) failure to faithfully propagate a parental methylation signal; or, 2) a de novo methylation event. You can differentiate between the two if you know the methylation status of the parent: if the parent strand was entirely methylated, then hemimethylation indicates failure of maintenance. Vice versa, if the parent straned was unmethylated, hemimethylation indicates de novo methylation."
Friday, 3 April 2015
TMM Normalisation
Excerpts from "NormalizationAndDifferentialExpression"
library(edgeR)
# geneCounts.dgelist is assumed to be an existing edgeR DGEList of raw gene counts
tmm <- calcNormFactors(geneCounts.dgelist)
# equation from the edgeR documentation for estimating normalized absolute expression from their scaling factors
tmmScaleFactors <- geneCounts.dgelist$samples$lib.size * tmm$samples$norm.factors
tmmExp <- round(t(t(tmm$counts)/tmmScaleFactors) * mean(tmmScaleFactors))
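To make the arithmetic of the last two R lines concrete, here is the same calculation on made-up numbers in plain Python. The counts, library sizes and normalization factors are invented for illustration; in practice the factors come from calcNormFactors.

```python
# Made-up values: 2 genes x 3 samples, library sizes and TMM factors.
counts = [[10, 20, 30],
          [100, 160, 300]]
lib_sizes = [1_000_000, 2_000_000, 4_000_000]
norm_factors = [1.0, 0.8, 1.25]

# Effective library size per sample = library size * TMM normalization factor.
scale = [size * factor for size, factor in zip(lib_sizes, norm_factors)]
mean_scale = sum(scale) / len(scale)

# Divide each column by its effective library size, then rescale by the mean
# so the values stay on a count-like scale (mirrors the R expression above).
tmm_exp = [[round(c / s * mean_scale) for c, s in zip(row, scale)] for row in counts]
```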
#################################################
Excerpts from "Question: After Getting Normalization Factor Via Edger, What To Do For Normalization?"
The TMM counts are: count / (library size * normalization factor)
Then multiply that by a million to get CPM.
Not count / normalization factor
And DESeq doesn't just do a simple division by library size. It takes the median of the ratio of the count to the geometric mean of the expression values as the scaling factor for each library.
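Both statements can be sketched in a few lines of Python on made-up counts. The cpm function follows the formula quoted above. The median_of_ratios function follows the description of DESeq's scaling factor, with one assumption added here: genes with a zero count in any sample are skipped so the geometric mean stays defined. Neither function is the actual edgeR or DESeq code.

```python
import math
from statistics import median

counts = [[10, 20, 40],      # made-up counts: rows are genes, columns are samples
          [100, 200, 400],
          [50, 100, 200]]

def cpm(counts, lib_sizes, norm_factors):
    """count / (library size * normalization factor) * 1e6, per sample column."""
    eff = [s * f for s, f in zip(lib_sizes, norm_factors)]
    return [[c / e * 1e6 for c, e in zip(row, eff)] for row in counts]

def median_of_ratios(counts):
    """Size factor per sample: median over genes of count / geometric mean."""
    usable = [row for row in counts if all(c > 0 for c in row)]  # assumption: skip zeros
    geo_means = [math.exp(sum(math.log(c) for c in row) / len(row)) for row in usable]
    n_samples = len(counts[0])
    return [median(row[j] / g for row, g in zip(usable, geo_means))
            for j in range(n_samples)]

size_factors = median_of_ratios(counts)
```

Because every gene here is in a 1:2:4 ratio across the samples, the size factors come out at roughly 0.5, 1 and 2.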