Tuesday, 20 January 2015

R Density Plot

R Scatter Plot: symbol color represents number of overlapping points

Monday, 19 January 2015

GATKReport

What is a GATKReport?

VCF Information Field Code

##INFO=<ID=ASP,Number=0,Type=Flag,Description="Is Assembly specific. This is set if the variant only maps to one assembly">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=VP,Number=1,Type=String,Description="Variation Property.  Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf">
##INFO=<ID=WGT,Number=1,Type=Integer,Description="Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 or more">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
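The header lines above all share the same shape: a kind (INFO or FORMAT) followed by comma-separated key=value pairs, with the Description value in double quotes. As a minimal sketch of how such lines could be read into a dictionary, assuming only that shape, the following uses the Python standard library. The function name parse_meta_line is hypothetical and is not part of GATK or of any VCF library.

```python
import re

# Structured meta lines look like ##INFO=<...> or ##FORMAT=<...>
META_RE = re.compile(r'^##(?P<kind>\w+)=<(?P<body>.*)>$')
# A value is either a double-quoted string (which may contain commas)
# or a bare token that runs up to the next comma.
FIELD_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_meta_line(line):
    """Return (kind, fields) for a structured meta line, or None otherwise."""
    match = META_RE.match(line.strip())
    if match is None:
        return None
    fields = {}
    for key, value in FIELD_RE.findall(match.group("body")):
        # Strip the surrounding quotes from quoted values such as Description
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return match.group("kind"), fields


if __name__ == "__main__":
    example = '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">'
    print(parse_meta_line(example))
```

For real pipelines a dedicated VCF parser is the safer choice; this sketch only illustrates how the ID, Number, Type and Description fields are laid out.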


GATK: VariantEval

Variant Eval

Friday, 9 January 2015

SureSelect XT Target Enrichment System for Illumina Paired-End Sequencing

SureSelect XT Target Enrichment System for Illumina Paired-End Sequencing Library: Illumina HiSeq and MiSeq Multiplexed Sequencing Platforms

Sequencing Depth and Coverage

"Sequencing depth and coverage: key considerations in genomic analyses"

The theoretical or expected coverage is the average number of times that each nucleotide is expected to be sequenced given a certain number of reads of a given length and the assumption that reads are randomly distributed across an idealized genome.
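The definition above is usually written as C = N × L / G, where N is the number of reads, L the read length and G the genome size. A small sketch of that arithmetic follows; the function names and the example numbers are illustrative assumptions, not values taken from the review.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is expected to be sequenced."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target expected coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Illustrative numbers: 100 bp reads on a 3 Gb genome
    print(expected_coverage(900_000_000, 100, 3_000_000_000))
    print(reads_needed(30, 100, 3_000_000_000))
```

This is the idealized figure only; it says nothing about how evenly the reads actually land on the genome.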

Actual empirical per-base coverage represents the exact number of times that a base in the reference is covered by a high-quality aligned read from a given sequencing experiment.

Redundancy of coverage is also called the depth or the depth of coverage.

Although the terms depth and coverage can be used interchangeably (as they are in this Review), coverage has also been used to denote the breadth of coverage of a target genome, which is defined as the percentage of target bases that are sequenced a given number of times. For example, a genome sequencing study may sequence a genome to 30× average depth and achieve a 95% breadth of coverage of the reference genome at a minimum depth of ten reads.
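To make the difference between average depth and breadth of coverage concrete, here is a minimal sketch that computes both from a list of per-base depths. The input list is a made-up toy example; in practice the per-base depths would come from an alignment tool, which is outside the scope of this sketch.

```python
def mean_depth(depths):
    """Average depth across all target bases."""
    return sum(depths) / len(depths)


def breadth_of_coverage(depths, min_depth=1):
    """Fraction of target bases covered by at least min_depth reads."""
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


if __name__ == "__main__":
    # Toy per-base depths for a ten-base target (hypothetical values)
    depths = [0, 5, 10, 12, 30, 40, 8, 10, 25, 60]
    print(mean_depth(depths))
    print(breadth_of_coverage(depths, min_depth=10))
```

Two experiments with the same mean depth can have very different breadth at a given minimum depth, which is why both numbers are usually reported together.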

GC-rich regions, such as CpG islands, are particularly prone to low depth of coverage partly because these regions remain annealed during amplification. Consequently, it is important to assess the uniformity of coverage, and thus data quality, by calculating the variance in sequencing depth across the genome.
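One simple way to put a number on uniformity is the variance of per-base depth, or the coefficient of variation (standard deviation divided by mean) so that runs with different average depth can be compared. The sketch below assumes a plain list of per-base depths and uses the population variance; the choice of the coefficient of variation as the summary statistic is an assumption here, not something the review prescribes.

```python
import statistics


def depth_variance(depths):
    """Population variance of per-base sequencing depth."""
    return statistics.pvariance(depths)


def depth_cv(depths):
    """Coefficient of variation of depth: lower means more uniform coverage."""
    mean = statistics.fmean(depths)
    if mean == 0:
        raise ValueError("mean depth is zero; coefficient of variation is undefined")
    return statistics.pstdev(depths) / mean


if __name__ == "__main__":
    even = [30, 30, 30, 30]      # perfectly uniform toy example
    uneven = [10, 20, 30, 40]    # same order of magnitude, less uniform
    print(depth_cv(even), depth_cv(uneven))
```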

In a sequencing experiment, only some of the DNA fragments in the library are sampled. The number of distinct fragments sequenced is positively correlated with the depth of the true biological variation that has been sampled.

The first human genome that was sequenced using Illumina short-read technology showed that, although almost all homozygous SNVs are detected at a 15× average depth, an average depth of 33× is required to detect the same proportion of heterozygous SNVs.
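A toy model helps show why heterozygous sites need roughly twice the depth. Assume the depth at a site is Poisson-distributed around the average depth, that a heterozygous variant appears on about half of the reads and a homozygous variant on all of them, and that a variant counts as detected once at least some minimum number of reads carry it. None of these assumptions comes from the study quoted above, and the threshold min_alt_reads is a hypothetical parameter; real variant callers are considerably more involved.

```python
import math


def poisson_at_least(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


def detection_probability(mean_depth, min_alt_reads, alt_fraction):
    """Chance of seeing at least min_alt_reads variant-supporting reads.

    alt_fraction is the expected share of reads carrying the variant:
    1.0 for a homozygous site, 0.5 for a heterozygous site (toy assumption).
    Thinning a Poisson depth by alt_fraction gives a Poisson count of
    variant reads with mean mean_depth * alt_fraction.
    """
    return poisson_at_least(min_alt_reads, mean_depth * alt_fraction)


if __name__ == "__main__":
    for depth in (15, 30):
        hom = detection_probability(depth, min_alt_reads=4, alt_fraction=1.0)
        het = detection_probability(depth, min_alt_reads=4, alt_fraction=0.5)
        print(depth, hom, het)
```

Under this model a heterozygous site at 30× behaves exactly like a homozygous site at 15×, which is in line with the roughly twofold difference reported above.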

Consequently, an average depth that exceeds 30× rapidly became the de facto standard [13, 14]. In 2011, one study [15] suggested that an average mapped depth of 50× would be required to allow reliable calling of SNVs and small indels across 95% of the genome. However, improvements in sequencing chemistry reduced GC bias and thus yielded a more uniform coverage of the genome, which later reduced the required average mapped depth to 35×.

The power to detect variants is reduced by low base quality and by non-uniformity of coverage. Increasing sequencing depth can both improve these issues and reduce the false-discovery rate for variant calling. Although read quality is mostly governed by sequencing technology, the uniformity of depth of coverage can also be affected by sample preparation. A GC bias that is introduced during DNA amplification by PCR has been identified as a major source of variation in coverage. Elimination of PCR amplification results in improved coverage of high GC regions of the genome and in fewer duplicate reads [16].