Bootstrap:
Re-sampling with replacement from the original random sample. Each resample is the same size as the original random sample.
The original sample represents the population from which it was drawn, so resamples from this sample represent what we would get if we took many samples from the population. The bootstrap distribution of a statistic, based on many resamples, represents the sampling distribution of the statistic, based on many samples.
The bootstrap standard error of a statistic is the standard deviation of the bootstrap distribution of that statistic.
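As a rough illustration of these definitions, the following awk sketch estimates the bootstrap standard error of the sample mean; the file name values.txt, the single-column layout and the 1000-resample count are assumptions for the example:
awk 'BEGIN { srand() }
     { x[NR] = $1 }
     END {
       B = 1000                            # number of bootstrap resamples (assumed)
       for (b = 1; b <= B; b++) {
         s = 0
         for (i = 1; i <= NR; i++)         # resample with replacement, same size as the original sample
           s += x[int(rand() * NR) + 1]
         m = s / NR                        # mean of this resample
         sum += m; sumsq += m * m
       }
       mean = sum / B                      # the bootstrap SE is the SD of the bootstrap distribution of the mean
       printf "bootstrap SE of the mean = %.4f\n", sqrt(sumsq / B - mean * mean)
     }' values.txt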
Saturday, 31 May 2014
Statistics: Central Limit Theorem and Law of Large Numbers
The Central Limit Theorem describes the characteristics of the "population of the means" which has been created from the means of an infinite number of random population samples of size (N), all of them drawn from a given "parent population". The Central Limit Theorem predicts that regardless of the distribution of the parent population:
[1] The mean of the population of means is always equal to the mean of the parent population from which the population samples were drawn.
[2] The standard deviation of the population of means is always equal to the standard deviation of the parent population divided by the square root of the sample size (N).
[3] [And the most amazing part!!] The distribution of means will increasingly approximate a normal distribution as the size N of samples increases.
Cited from http://www.chem.uoa.gr/applets/appletcentrallimit/appl_centrallimit2.html
The law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
Cited from http://en.wikipedia.org/wiki/Law_of_large_numbers
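A quick way to see the law of large numbers at work is to average many draws from awk's rand(), which is uniform on [0,1) with expected value 0.5; this is only a sketch, and the seed and the number of draws are arbitrary choices:
awk 'BEGIN { srand(42); for (n = 1; n <= 100000; n++) { sum += rand(); if (n % 20000 == 0) printf "n = %d, running mean = %.4f (expected value 0.5)\n", n, sum / n } }'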
Ubuntu: install openssh
Use the following commands to install the utility that enables SSH/WinSCP access:
sudo apt-get update
sudo apt-get install openssh-server
Details can be found at http://askubuntu.com/questions/51925/how-do-i-configure-a-new-ubuntu-installation-to-accept-ssh-connections
Ubuntu: set up ssh/WinSCP with no passphrase for login
Enter the following command at the command prompt:
ssh-keygen -t rsa
At the prompt for entering file in which to save the key, press "Enter".
At the prompt for entering passphrase, press "Enter".
At the prompt for entering same passphrase again, press "Enter".
Screen output: "Your identification has been saved in xxx", "Your public key has been saved in xxx".
Type the following command:
cat ~/.ssh/id_rsa.pub >>~/.ssh/authorized_keys
Download ~/.ssh/id_rsa to the local machine.
Open PuTTYgen, use "Load" to load an existing private key file, and save the private key.
In PuTTY, load a session or set up a new session by filling in the server address and session name, then go to Connection -> SSH -> Auth and browse to select the .ppk file under "Private key file for authentication".
In WinSCP, load the private key under "Private key file" and save the session.
In Linux, download the private key id_rsa from the server to ~/.ssh/ on the local machine, and use the command "ssh-add" to add the private key identity to the authentication agent.
To save typing the user name and the host name, append the following lines to ~/.ssh/config on the local machine and save:
Host [abbreviated name]
HostName [dev.example.com]
Port 22000 (optional; can be omitted)
User [fooey]
Then it is done!
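For example, with a hypothetical entry like the one below in ~/.ssh/config (the alias dev, the host name and the user are placeholders), the long form ssh -p 22000 fooey@dev.example.com shortens to ssh dev, and scp picks up the same settings:
Host dev
    HostName dev.example.com
    Port 22000
    User fooey
ssh dev
scp localfile dev:/tmp/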
Ubuntu: enable the access of VNC Viewer to the machine
Configure the correct settings in "Desktop Sharing".
In dconf-editor:
dconf > org > gnome > desktop > remote-access > require-encryption - uncheck
dconf > org > gnome > desktop > remote-access > vnc-password - set the correct password
More details are available from http://discourse.ubuntu.com/t/remote-desktop-sharing-in-ubuntu-14-04/1640/3
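The same two keys can also be set from a terminal with the dconf command-line tool; this is only a sketch assuming the remote-access path above, the password value is a placeholder, and Vino expects it base64-encoded:
dconf write /org/gnome/desktop/remote-access/require-encryption false
dconf write /org/gnome/desktop/remote-access/vnc-password "'$(echo -n 'mypassword' | base64)'"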
Ubuntu: dconf
dconf is a low-level configuration system. Its main purpose is to provide a backend to GSettings on platforms that don't already have configuration storage systems.
Cited from https://wiki.gnome.org/action/show/Projects/dconf?action=show&redirect=dconf#Introduction
dconf is the GNOME technology used to store application settings, and it is not specific to Ubuntu.
To install dconf-editor,
sudo apt-get install dconf-tools
To run it, enter the following in the terminal or in the quick-launch menu:
dconf-editor
See http://askubuntu.com/questions/22313/what-is-dconf-what-is-its-function-and-how-do-i-use-it for how to use dconf.
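Besides the graphical dconf-editor, the dconf command-line tool can list, read and watch keys directly; a minimal sketch using the remote-access path from the previous post:
dconf list /org/gnome/desktop/remote-access/
dconf read /org/gnome/desktop/remote-access/require-encryption
dconf watch /     # print every settings change as it happens; stop with Ctrl-C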
Wednesday, 28 May 2014
STAR: ultrafast universal RNA-seq aligner
To align single-end reads:
star --runThreadN 12 --outFilterMultimapNmax 1 --outFilterMismatchNmax 10 --outSAMstrandField intronMotif --genomeLoad LoadAndKeep --outSAMattributes All --genomeDir $TMPDIR/star --readFilesCommand zcat --readFilesIn $TMPDIR/${fq_id}_1.fastq.gz --outFileNamePrefix $TMPDIR/$fq_id/ 2>$TMPDIR/$fq_id/$fq_id.log
To align paired-end reads:
star --runThreadN 12 --outFilterMultimapNmax 1 --outFilterMismatchNmax 10 --outSAMstrandField intronMotif --genomeLoad LoadAndKeep --outSAMattributes All --genomeDir $TMPDIR/star --readFilesCommand zcat --readFilesIn $TMPDIR/${fq_id}_1.fastq.gz $TMPDIR/${fq_id}_2.fastq.gz --outFileNamePrefix $TMPDIR/$fq_id/ 2>$TMPDIR/$fq_id/$fq_id.log
outFilterMismatchNmax 10
int: alignment will be output only if it has fewer mismatches than this value.
If you have un-stranded RNA-seq data, and wish to run Cufflinks/Cuffdiff on STAR alignments, you will need to run STAR with the --outSAMstrandField intronMotif option, which will generate the XS strand attribute for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e. containing only non-canonical junctions) will be suppressed.
With the default --outSAMattributes Standard option, the following SAM attributes will be generated:
Column 12: NH: number of loci a read (pair) maps to
Column 13: HI: alignment index for all alignments of a read
Column 14: AS: alignment score
Column 15: nM: number of mismatches (does not include indels)
If --outSAMattributes All option is used, the following additional attributes will be output:
Column 16: jM:B:c,M1,M2,… Intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.
Column 17: jI:B:I,Start1,End1,Start2,End2,… Start and End of introns for all junctions (1-based)
Note that samtools 0.1.18 or later has to be used with these extra attributes.
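To check that the extra attributes were actually written, the SAM output can be inspected directly; a sketch assuming STAR's default Aligned.out.sam name under the --outFileNamePrefix used above:
grep -v '^@' $TMPDIR/$fq_id/Aligned.out.sam | grep 'jM:B:c' | head -5     # alignments carrying the jM/jI junction tags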
Tuesday, 27 May 2014
ChIP-seq workshops
Usage of bowtie1:
http://cgrlucb.wikispaces.com/file/view/ChIPseq_workshop_spring2013.pdf/415628696/ChIPseq_workshop_spring2013.pdf
http://cgrlucb.wikispaces.com/file/view/ChIP-seq-analysis.pdf/263706166/ChIP-seq-analysis.pdf
http://www.epigenesys.eu/images/stories/protocols/pdf/20120529112503_p56.pdf
ChIP peak calling:
http://epigenie.com/guide-peak-calling-for-chip-seq/
Workflow of ChIP-seq analysis:
https://sites.google.com/site/princetonhtseq/tutorials/chip-seq
http://haemcode.stemcells.cam.ac.uk/doc_pipeline.php
Bowtie1 command specification
bowtie -q -n 1 -m 1 -p 12 --phred64-quals ${dir_bt1}GRCh37.p13.genome $TMPDIR/$fastq $TMPDIR/$mapf > $TMPDIR/$logf
-q
The input file is in FASTQ format (this is also the default setting)
-p
Number of processors to use
-n
Maximum number of mismatches permitted in the seed. This may be 0, 1, 2 or 3. Default is 2.
-m
Suppress all alignments for a read if more than <int> reportable alignments exist for it.
-S
Print alignments in the SAM format
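Since -S prints SAM, the same command can be piped straight into samtools to get a sorted BAM; this sketch reuses the variables from the command above and assumes the old samtools 0.1.x sort syntax:
bowtie -q -S -n 1 -m 1 -p 12 --phred64-quals ${dir_bt1}GRCh37.p13.genome $TMPDIR/$fastq | samtools view -bS - | samtools sort - $TMPDIR/${fastq}_sorted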
Friday, 23 May 2014
Awk: Patterns
The whole line is stored in $0, the first field in $1, and the second field in $2 and so on.
awk 'Pattern {Action}' InputFile
Patterns include relational expression patterns, range patterns and pattern-matching expressions.
The action is always enclosed in braces. The whole awk statement, consisting of pattern and action, must be enclosed in single quotes. You could treat more than one file at once by giving several filenames.
An awk editing statement must consist of a pattern, an action, or both. If the pattern is omitted, the default pattern is used, which matches every line. If the action is omitted, the default action is to print the whole line.
You can change the field separator by supplying it immediately after the -F option, enclosed in double quotes.
$ awk -F":" '/Freddy/ {print $1 " uses " $7}' /etc/passwd
Freddy uses /bin/bash
### Here the content of fields 1 and 7, represented by the variables $1 and $7, respectively, separated by some text that must be enclosed in double quotes, is printed.
With awk, you can write very complex programs. Usually, programs are saved in a file that can be called by the option -f.
Except for the patterns BEGIN and END, patterns can be combined with the Boolean operators ||, && and !.
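For example, two expressions can be joined with && or negated with ! (a sketch against the enzyme.txt file used in the examples below; the field choices are arbitrary):
$ awk '$2 > 2 && $2 <= 10' enzyme.txt     ### second field between 2 and 10
$ awk '!($2 > 2)' enzyme.txt              ### negation of the first condition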
Regular Expressions
Regular expressions must be enclosed in slashes (/.../). If you append an exclamation mark (!) before the first slash, all records not matching the regular expression will be chosen.
$ awk '!/xxx/ {print $1}' enzyme.txt
Pattern-Matching Expressions
Sometimes you will need to ask the question if a regular expression matches a field. Thus, you do not wish to see whether a pattern matches a record but a specified field of this record. In this case, you would use a pattern-matching expression. $n stands for any field variable like $1, $2 or so.
$n~/re/ ### Is true if the field $n matches the regular expression re.
$n!~/re/ ### Is true if the field $n does not match the regular expression re.
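For instance, assuming enzyme.txt has enzyme names in its first field:
$ awk '$1 ~ /ase$/ {print $1}' enzyme.txt     ### print the first field of records whose first field ends in "ase"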
Relational Character Expressions
It is often useful to check if a character string (a row of characters) in a certain field fulfills specified conditions. You might, for example, want to check whether a certain string is contained within another string. In these cases, you would use relational character expressions. The character string s must always be enclosed in double quotes.
$n=="s"
$n!="s"
$n<"s" ### Character-by-character comparison of $n and s. The first character of each string is compared, then the second character and so on. The result is true if $n is lexicographically smaller than s: "flag<fly" and "abc<abcd"
$n<="s"
$n>"s"
Uppercase characters are lexicographically less than lowercase characters, and number characters are less than alphabetic characters.
numbers<uppercase<lowercase
"==" requires a perfect match of strings and not a partial match.
Relational Number Expressions
Similar to relational character expressions are relational number expressions, except that they compare the value of numbers.
$n==v
$n!=v
$n<v
$n<=v
$n>v
$n>=v
Any numerical value is allowed on the left side of the relation.
$ awk 'length($1)>6' enzyme.txt
Conversion of Numbers and Characters
If you need to force a number to be converted to a string, concatenate that number with the empty string "".
A string (that contains numerical characters) can be converted to a number by adding 0 to that string.
$ awk '($2+0)>2' enzyme.txt
Ranges
A range of records can be specified using two patterns separated by a comma.
$ awk '/En/,/Hy/' enzyme.txt
### In this example, all records between and including the first line matching the regular expression "En" and the first line matching the regular expression "Hy" are printed.
The range works like a switch. Upon the first occurrence of the first pattern, all lines are treated by the action until the second pattern matches or becomes true. When it becomes switched off, no line matches, until it becomes turned on again. If the "off switch", which is the second pattern, is not found, all records down to the end of the file match.
If both range patterns are the same, the switch will be turned on and off at each record. Thus, the statement
awk '/AUTHOR/,/AUTHOR/' structure.pdb
prints only one line.
BEGIN and END
BEGIN and END do not deal with the input file at all. These patterns allow for initialization and cleanup actions, respectively. Both BEGIN and END must have actions and these must be enclosed in braces. There is no default action.
The BEGIN block is executed before the first line (record) of the input file is read. Likewise, the END block is executed after the last record has been read. Both blocks are executed only once.
We can assign the field separator either with the option -F or by assigning the variable FS in the BEGIN block.
The special patterns BEGIN and END cannot be used in ranges or with any operators. However, an awk program can have multiple BEGIN and END blocks. They will be executed in the order in which they appear.
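A small sketch that uses both blocks: FS is set in BEGIN instead of with -F, and END prints a summary after all records have been read:
$ awk 'BEGIN { FS = ":" } $7 == "/bin/bash" { n++ } END { print n " users with bash" }' /etc/passwd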
Quit cat
Press [Enter] to go to an empty line, and use the [Ctrl]-[D] keys to quit cat.
Thursday, 22 May 2014
Ubuntu--Displaying the Screen on a Projector
System Settings --> Monitors, and then select the projector.
Tuesday, 20 May 2014
Linux Touch Command
The touch command is the easiest way to create new, empty files. It is also used to change the timestamps (i.e., dates and times of the most recent access and modification) on existing files and directories.
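For example (the file name is a placeholder; -d is the GNU option for setting an explicit timestamp):
touch notes.txt                          # create notes.txt if it does not exist, otherwise update its timestamps
touch -d '2014-05-20 09:00' notes.txt    # set the access and modification times explicitly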
Linux Shell Programming
Cited from the book "Beginning Linux Programming"
"File descriptor 0 is the standard input to a program, file descriptor 1 is the standard output, and file descriptor 2 is the standard error output.
To direct error and standard output to the same file, using
kill -1 1234 >killouterr.txt 2>&1
Never use the same filename twice in a string of commands.
In globbing, ? matches a single character, while [set] allows any one of a set of single characters to be matched; [^set] negates the set.
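For example (the file names are hypothetical):
ls draft?.txt       # ? matches exactly one character, e.g. draft1.txt or draftA.txt
ls draft[12].txt    # [set] matches one character from the set: draft1.txt or draft2.txt
ls draft[^12].txt   # [^set] matches one character not in the set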
To check whether a file is a script, use the "file" command, for example "file xxx.pl".
To make a file executable, use "chmod +x bbb.sh".
The shell environment variable PATH may not be set to look in the current directory for commands to execute. To change this, either type PATH=$PATH:. on the command line, or edit your .bash_profile file to add this command to the end of the file; then log out and log back in again.
If the command is just for yourself, you could create a bin directory in your home directory and add that to your path. If you want the script to be executable by others, you could use /usr/local/bin or another system directory as a convenient location for adding new programs.
chown root /usr/local/bin/first
chgrp root /usr/local/bin/first
By default, all variables are considered and stored as strings, even when they are assigned numeric values. The shell and some utilities will convert numeric strings to their values in order to operate on them as required.
We can assign user input to a variable by using the read command. This takes one parameter, the name of the variable to be read into, and then waits for the user to enter some text.
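A minimal sketch:
echo "What is your name?"
read yourname
echo "Hello $yourname"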
$ IFS=''
$ set foo bar bam
$ echo "$@"
foo bar bam
$ echo "$*"
foobarbam
$ unset IFS
$ echo "$*"
foo bar bam
The "["or "test" command is the shell’s Boolean check.
If you prefer putting thenon the same line as if, you must add a semicolon to separate the "test" from the "then".
String and arithmetic comparison is illustrated in page 33.
elif=>else if
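Putting these pieces together, a minimal sketch (the paths tested are only examples):
if [ -f /etc/fstab ]; then
    echo "fstab exists"
elif [ -d /etc ]; then
    echo "no fstab, but /etc is a directory"
else
    echo "neither found"
fi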
If you want the echo command to delete the trailing new line, the most portable choice is to use the printf command rather than the echo command. Some shells use echo -e, but that's not supported on all systems. bash allows echo -n to suppress the new line, so if you are confident your script needs to work only on bash, you can use that syntax.
The * wildcard expression doesn’t work within quotes.
The AND list construct allows us to execute a series of commands, executing the next command only if all the previous commands have succeeded. The syntax is statement1 && statement2 && statement3 && ...
if [ -f file_one ] && echo "hello" && [ -f file_two ] && echo " there"
then
    echo "in if"
else
    echo "in else"
fi
"echo" always returns true.
The OR list construct allows us to execute a series of commands until one succeeds, then not execute any more. The syntax is statement1 || statement2 || statement3 || ...
"File descriptor 0 is the standard input to a program, file descriptor 1 is the standard output, and file descriptor 2 is the standard error output.
To direct error and standard output to the same file, using
kill -1 1234 >killouterr.txt 2>&1
Never use the same filename twice in a string of commands.
In globing, single-character wildcards using ?, while [set] allows any of a number of single characters to be checked. [^set] negates the set.
To check if files are scripts or not is to use the "file" command, for example, "file xxx.pl".
To make file executable, "chmod +x bbb.sh".
If the shell environment variable PATH isn’t set to look in the current directory for commands to execute. To change this, either type PATH=$PATH:.on the command line or edit your .bash_profile file to add this command to the end of the file; then log out and log back in again.
If the command is just for yourself, you could create a bin directory in your home directory and add that to your path. If you want the script to be executable by others, you could use /usr/local/bin or another system directory as a convenient location for adding new programs.
chown root /usr/local/bin/first
chgrp root /usr/local/bin/first
By default, all variables are considered and stored as strings, even when they are assigned numeric values. The shell and some utilities will convert numeric strings to their values in order to operate on them as required.
We can assign user input to a variable by using the read command. This takes one parameter, the name of the variable to be read into, and then waits for the user to enter some text.
$ IFS=’’
$ set foo bar bam
$ echo “$@”
foo bar bam
$ echo “$*”
foobarbam
$ unset IFS
$ echo “$*”
foo bar bam
The "["or "test" command is the shell’s Boolean check.
If you prefer putting thenon the same line as if, you must add a semicolon to separate the "test" from the "then".
String and arithmetic comparison is illustrated in page 33.
elif=>else if
If you want the echo command to delete the trailing new line, the most portable choice is to use the printf command rather than the echo command. Some shells use echo –e, but that’s not supported on all systems. bash allows echo -n to suppress the new line, so if you are confident your script needs to work only on bash, you can use that syntax.
The * wildcard expression doesn’t work within quotes.
The AND list construct allows us to execute a series of commands, executing the next command only if all the previous commands have succeeded. The syntax is statement1 &&statement2 &&statement3 &&...
if [ -f file_one ] && echo “hello” && [ -f file_two ] && echo “ there”
then
echo “in if”
else
echo “in else”
fi
"echo" always returns true.
The OR list construct allows us to execute a series of commands until one succeeds, then not execute any more. The syntax is statement1 ||statement2 ||statement3 ||...
Monday, 19 May 2014
Fastx command specifications
FASTQ Quality Filter
$ fastq_quality_filter -h
usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h]         = This helpful help screen.
[-q N]       = Minimum quality score to keep.
[-p N]       = Minimum percent of bases that must have [-q] quality.
[-z]         = Compress output with GZIP.
[-i INFILE]  = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-v]         = Verbose - report number of sequences. If [-o] is specified, report will be printed to STDOUT. If [-o] is not specified (and output goes to STDOUT), report will be printed to STDERR.
Cited from Fastx toolkit
FASTX-Toolkit can only output files in compressed format, not use them as input.
According to the documentation:
Some tools can compress the output with GZIP (-z).
Only in one reported case can the input be gzipped: fasta_clipping_histogram.pl
Its help page includes this: INPUT_FILE.FA = input file (in FASTA format, can be GZIPped)
Cited from https://www.biostars.org/p/83237/
FastQC command specifications
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN
DESCRIPTION
FastQC reads a set of sequence files and produces from each one a quality control report consisting of a number of different modules, each one of which will help to identify a different potential type of problem in your data.
If no files to process are specified on the command line then the program will start as an interactive graphical application. If files are provided on the command line then the program will run with no user interaction required. In this mode it is suitable for inclusion into a standardised analysis pipeline.
The options for the program are as follows:
-h --help          Print this help file and exit
-v --version       Print the version of the program and exit
-o --outdir        Create all output files in the specified output directory. Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed.
--casava           Files come from raw casava output. Files in the same sample group (differing only by the group number) will be analysed as a set rather than individually. Sequences with the filter flag set in the header will be excluded from the analysis. Files must have the same names given to them by casava (including being gzipped and ending with .gz) otherwise they won't be grouped together correctly.
--extract          If set then the zipped output file will be uncompressed in the same directory after it has been created. By default this option will be set if fastqc is run in non-interactive mode.
-j --java          Provides the full path to the java binary you want to use to launch fastqc. If not supplied then java is assumed to be in your path.
--noextract        Do not uncompress the output file after creating it. You should set this option if you do not wish to uncompress the output when running in non-interactive mode.
--nogroup          Disable grouping of bases for reads >50bp. All reports will show data for every base in the read. WARNING: Using this option will cause fastqc to crash and burn if you use it on really long reads, and your plots may end up a ridiculous size. You have been warned!
-f --format        Bypasses the normal sequence file format detection and forces the program to use the specified format. Valid formats are bam, sam, bam_mapped, sam_mapped and fastq
-t --threads       Specifies the number of files which can be processed simultaneously. Each thread will be allocated 250MB of memory so you shouldn't run more threads than your available memory will cope with, and not more than 6 threads on a 32 bit machine
-c --contaminants  Specifies a non-default file which contains the list of contaminants to screen overrepresented sequences against. The file must contain sets of named contaminants in the form name[tab]sequence. Lines prefixed with a hash will be ignored.
-k --kmers         Specifies the length of Kmer to look for in the Kmer content module. Specified Kmer length must be between 2 and 10. Default length is 5 if not specified.
-q --quiet         Supress all progress messages on stdout and only report errors.
Cited from Wiki: FastQC
Sunday, 18 May 2014
PBS learning about resource profiles of queues
qsub                 # submit a job, see man qsub
qdel -p jobid        # will force purge the job if it is not killed by qdel
qstat                # list information about queues and jobs
qstat -q             # list all queues on system
qstat -Q             # list queue limits for all queues
qstat -a             # list all jobs on system
qstat -s             # list all jobs with status comments
qstat -r             # list all running jobs
qstat -f jobid       # list full information known about jobid
qstat -Qf queueid    # list all information known about queueid
qstat -B             # list summary information about the PBS server
qstat -iu userid     # get info for queued jobs of userid
qstat -u userid      # get info for all the jobs of userid
qstat -n -1 jobid    # will list nodes on which jobid is running in one line
checkjob jobid       # will list job details
Source (A few basic PBS commands)
Information regarding the screen output is explained here
(PBS Configuration Instructions for the NorduGrid Sites).
Thursday, 15 May 2014
PBS Job Script
qsub -v var1=value1,var2=value2 PBS-jobscript
In the PBS job script, you can get the parameters by $var1 and $var2.
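A minimal sketch of such a job script (the resource requests and the commands are placeholders; $PBS_O_WORKDIR is the directory from which the job was submitted):
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
echo "var1 is $var1"
echo "var2 is $var2"
Submitted, for example, with: qsub -v var1=hello,var2=world jobscript.pbs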
Sunday, 11 May 2014
Linux Cluster Commands
lscpu - display information about the CPU architecture
uniq - report or omit repeated lines; -c, --count prefix lines by the number of occurrences
PBS:
pbsnodes -a (display information of cluster nodes).
Thursday, 8 May 2014
Convert sra files to fastq files
fastq-dump --split-files SRA_filename
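To convert every .sra file in the current directory in one go (a sketch; --split-files writes _1.fastq and _2.fastq for paired-end runs):
for f in *.sra; do fastq-dump --split-files "$f"; done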
Bash script to download SRA files
#!/bin/bash
#shell script to download SRA files
#
#Author: Minghui Wang, minghui.wang@mssm.edu
#Date: 8 May, 2014
#
cd $WORK_DIR #you must set this directory
#the ftp address to the sequence experiment on SRA (you must change this to the real ftp)
link=ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX016/SRX016366/
#list all the samples shown under the link
sampleLists=`curl $link 2>/dev/null | awk '{print $NF}'`
#download *.sra files
for f in $sampleLists; do wget $link${f}/${f}.sra 2>/dev/null; done
echo "Job completed!"