Thursday, 5 June 2014

ChIP-seq: normalisation of signals

To normalize for differences in sequencing depth among different experiments, the number of tags per genomic position in each ChIP-Seq library was first rescaled by the total number of mapped tags.

Cited from http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001442#s4

This is done as
RPM (reads per million) = number of reads aligning to a position per million aligned reads, i.e. (reads at the position × 1,000,000) / total aligned reads in the library.
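
As a minimal sketch of the RPM rescaling described above (the function name and the example counts are hypothetical, chosen only for illustration), the calculation could look like this in Python:

```python
def reads_per_million(position_counts, total_mapped_reads):
    """Rescale raw per-position read counts to reads per million (RPM)."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    # RPM = count * 1,000,000 / total number of mapped reads in the library
    return [count * 1_000_000 / total_mapped_reads for count in position_counts]


# Hypothetical library with 20 million mapped reads and three positions.
rpm = reads_per_million([0, 10, 40], 20_000_000)
print(rpm)  # [0.0, 0.5, 2.0]
```

Two libraries rescaled this way can be compared position by position, although only differences in sequencing depth have been corrected at this point.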

"As the signal-to-noise ratio in comparative ChIP-seq experiments may differ, we recommend following one of two strategies. First, if samples from different experimental conditions are compared and the same antibody is used32, we recommend performing the series of experiments side by side as this minimizes differences in signal-to-noise ratios due to experimental variability. By using this strategy, the ChIP-seq data do not need to be normalized to each other (other than by the total library read count) and differences in overall binding enrichments can be detected. Second, if this strategy is not possible because different species or antibodies are used, quantile normalization can be used to adjust for differences in signal-to-noise ratios between samples if one can reasonably assume that the overall binding of the factor is similar (e.g., if the factor is well conserved and is expressed at similar levels in the same tissue across species). If this assumption is not justified, it is still possible to identify qualitative differences between samples while being aware that conclusions on the overall binding strength cannot be made. "

Cited from A computational pipeline for comparative ChIP-seq analyses
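
The quantile normalization mentioned in the passage above can be illustrated with a small sketch. This is not the pipeline from the cited paper, only a generic version of the technique: it assumes each sample is a list of signal values over the same regions, and it breaks ties by position, which is a simplification. The function name and the numbers are hypothetical.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length lists of signal values (one list per sample)."""
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must cover the same number of regions")
    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_samples = [sorted(s) for s in samples]
    reference = [sum(column) / len(samples) for column in zip(*sorted_samples)]
    normalized = []
    for s in samples:
        # Indices of the regions ordered from lowest to highest signal.
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Replace each value by the reference value of the same rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


sample_a = [5.0, 2.0, 3.0]
sample_b = [4.0, 14.0, 8.0]
print(quantile_normalize([sample_a, sample_b]))
# [[9.5, 3.0, 5.5], [3.0, 9.5, 5.5]]
```

After normalization both samples share the same distribution of values, so only the ranking of regions within each sample is preserved. This is why the approach is only appropriate when the overall binding can be assumed to be similar.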

align2rawsignal (a.k.a. WIGGLER, because it generates wiggle files) reads in a set of tagAlign/BAM files, filters out multi-mapping tags, and creates a consolidated genome-wide signal file using various tag-shift and smoothing parameters as well as various normalization schemes.

The method accounts for the following:
1) depth of sequencing
2) mappability of the genome (based on read length and ambiguous bases)
3) the difference between positions that show 0 signal simply because they are unmappable and positions that are mappable but have no reads. The former are not represented in the output wiggle or bedgraph files, while the latter are represented as 0s.
4) different tag shifts for the different datasets being combined

Signal values generated by Wiggler should NOT be used as-is for the following applications:

1) In general, if you are comparing signal values across multiple experiments, you should not use signal values from Wiggler directly. This is because even though Wiggler normalizes for sequencing depth, mappability, etc., it cannot account for basic data quality (signal-to-noise ratio) differences between experiments. You will want to use some type of explicit cross-experiment normalization.

2) For differential analysis across experiments (e.g. you have ChIP-seq data for a TF in 2 conditions), it is much better to use statistical methods explicitly designed to capture this, e.g. DESeq, edgeR or other differential analysis methods. If you are using norm=5 (fold-change) as the normalization method within Wiggler, it is recommended to use an asinh() or log() transform on the values for statistical correlative analyses.

Cited from https://code.google.com/p/align2rawsignal/
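
The asinh() or log() transform recommended above for fold-change values can be sketched as follows. This is an illustration, not part of Wiggler: the function name is hypothetical, and the pseudocount of 1 in the log variant is an assumption added here so that zero values stay defined.

```python
import math


def transform_fold_change(values, method="asinh", pseudocount=1.0):
    """Compress the dynamic range of fold-change signal values."""
    if method == "asinh":
        # asinh is defined at 0 and behaves like a logarithm for large values.
        return [math.asinh(v) for v in values]
    if method == "log":
        # Assumed pseudocount keeps log2 defined when the signal is 0.
        return [math.log2(v + pseudocount) for v in values]
    raise ValueError("method must be 'asinh' or 'log'")


signal = [0.0, 3.0, 15.0]  # hypothetical fold-change values
print(transform_fold_change(signal, method="log"))  # [0.0, 2.0, 4.0]
print(transform_fold_change(signal, method="asinh"))
```

Either transform reduces the influence of a few very strong peaks when computing correlations between tracks.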

Normalization by sequencing depth (i.e. total read count) is probably the simplest approach but is widely used. There are some drawbacks with this approach. For example, if the signal-to-noise ratio is very different between two libraries, one library is going to contain more background reads than the other. However, these background reads are taken into consideration when you calculate total read counts. This will certainly cause bias in your estimation.

In diffReps, a better approach is taken to do normalization. Basically, the low count regions are first removed from consideration, then a normalization ratio is calculated for each library and each of the regions left. Finally, the medians of the ratios are used as normalization factors. This way, a relatively unbiased, robust estimate can be used for normalization.

Cited from How To Do Normalization Of Two Chip-Seq Data
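
The median-of-ratios idea described above can be sketched roughly as follows. This is not the diffReps implementation: the count threshold, the use of the per-region mean across libraries as the reference, and all names and numbers are assumptions made for illustration.

```python
from statistics import mean, median


def median_ratio_factors(counts, min_count=10):
    """Return one normalization factor per library.

    counts: list of libraries, each a list of read counts over the same regions.
    """
    n_regions = len(counts[0])
    # Step 1: drop low-count regions (assumed rule: every library >= min_count).
    kept = [i for i in range(n_regions)
            if all(library[i] >= min_count for library in counts)]
    if not kept:
        raise ValueError("no regions pass the count filter")
    factors = []
    for library in counts:
        # Step 2: ratio of this library's count to the per-region mean.
        ratios = [library[i] / mean(other[i] for other in counts) for i in kept]
        # Step 3: the median ratio is robust to a few strongly changed regions.
        factors.append(median(ratios))
    return factors


library_a = [30, 60, 3, 90, 500]
library_b = [10, 20, 50, 30, 100]
print(median_ratio_factors([library_a, library_b]))  # [1.5, 0.5]
```

Dividing each library's counts by its factor puts the libraries on a common scale. In this toy example the third region is filtered out and the fifth region, which changes more strongly than the others, does not shift the median.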

Other relevant articles:
A signal–noise model for significance analysis of ChIP-seq with negative control
MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets
