Topic 3: Epigenome clustering
Relevant papers:
Review paper: https://doi.org/10.1371/journal.pcbi.1009423
http://dx.doi.org/10.1038/nmeth.1937
http://dx.doi.org/10.1093/nar/gks1284
https://doi.org/10.1186/s13059-019-1784-2
https://doi.org/10.1093/nar/gkw278
Data: https://vault.sfu.ca/index.php/s/0zGk19YAAIpon3x
Data sets: histone_modifications.csv: ChIP-seq data for 6 tracks. Data is binned to 100 base pairs. Data is from a selected 1% of the genome, in the human lymphoblastoid cell line GM12878. Columns = (1) chromosome; (2) bin start position; (3) bin end position (start+100); (4-9) ChIP-seq measurements, one column per track. ChIP-seq data is for the following six histone modifications: H3K27ac (associated with regulatory activity), H3K27me3 (repression), H3K36me3 (transcribed genes), H3K4me1 (enhancers), H3K4me3 (promoters), H3K9me3 (repression).
gene_expression.csv: RNA-seq measurement of gene expression. Columns = (1) chromosome; (2) location of the gene's transcription start site (TSS); (3) RNA-seq gene expression, quantified as arcsinh(TPM); (4) gene ID.
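To relate the two files, each gene's TSS has to be matched to the 100 bp bin that contains it. The following Python sketch shows one possible way to do this with the standard library only. It rests on several assumptions that are not stated above and should be checked against the actual files: that neither file has a header row, that positions are plain integers, and that bins are half-open intervals (start <= position < end). The function names (read_rows, build_bin_index, find_bin) and the toy values are made up for illustration. Because the data covers only 1% of the genome, many TSSs may fall outside every bin; the sketch returns None for those.

```python
import bisect
import csv
import io
from collections import defaultdict


def read_rows(handle):
    """Read all non-empty CSV rows from an open text file (assumes no header row)."""
    return [row for row in csv.reader(handle) if row]


def build_bin_index(bin_rows):
    """Map each chromosome to a sorted list of (start, end, row_number)."""
    index = defaultdict(list)
    for row_number, row in enumerate(bin_rows):
        index[row[0]].append((int(row[1]), int(row[2]), row_number))
    for bins in index.values():
        bins.sort()
    return index


def find_bin(index, chrom, pos):
    """Return the row number of the bin with start <= pos < end, or None."""
    bins = index.get(chrom, [])
    # Position of the rightmost bin whose start is <= pos.
    j = bisect.bisect_right(bins, (pos, float("inf"), float("inf"))) - 1
    if j >= 0 and bins[j][0] <= pos < bins[j][1]:
        return bins[j][2]
    return None


# Toy stand-ins for the two files (made-up values, not real data).
BINS_CSV = """chr1,1000,1100,0.1,0.0,0.2,0.5,0.9,0.0
chr1,1100,1200,0.8,0.0,0.1,0.3,2.1,0.0
chr2,500,600,0.0,1.4,0.0,0.1,0.0,0.7
"""
GENES_CSV = """chr1,1150,2.5,geneA
chr1,5000,0.0,geneB
chr2,500,1.2,geneC
"""

bin_rows = read_rows(io.StringIO(BINS_CSV))
gene_rows = read_rows(io.StringIO(GENES_CSV))
index = build_bin_index(bin_rows)

# gene ID -> row number of the bin containing its TSS (None if not covered)
tss_bin = {g[3]: find_bin(index, g[0], int(g[1])) for g in gene_rows}
print(tss_bin)
```

With the real files, the io.StringIO objects would be replaced by open("histone_modifications.csv", newline="") and open("gene_expression.csv", newline=""). Once the bins have been clustered, the cluster of gene i is simply the cluster label of the bin at tss_bin[gene ID].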
The recommended goal is to cluster genomic positions (5-15 clusters) according to their epigenomic activity. Because this is an unsupervised problem, there is no perfect measure of performance. To evaluate the quality of your clustering, we recommend the following metric, which measures the degree to which the clusters predict gene expression. Let e_i be the expression of gene i, let y_i be the cluster that the TSS of gene i falls into, let mu_k be the average expression of the genes whose TSSs fall in cluster k, and let N be the number of genes. Define the quality as variance(e) - (1/N) sum_i (e_i - mu_{y_i})^2, that is, the variance in expression explained by the clusters. You should convince yourself that a random clustering produces a low value (close to zero) and that a clustering that groups together all the high-expression genes produces a high value.
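The quality metric can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes that variance(e) is the population variance and that the squared-error term is averaged over the N genes (divided by N), so that both terms are on the same scale. The function name clustering_quality and the simulated expression values are made up for the example; the toy data simply mixes a low-expression and a high-expression group of genes.

```python
import random
from collections import defaultdict


def clustering_quality(expression, labels):
    """Variance of expression minus the mean squared deviation from cluster means."""
    n = len(expression)
    overall_mean = sum(expression) / n
    total_var = sum((e - overall_mean) ** 2 for e in expression) / n

    # mu_k: average expression of the genes assigned to cluster k
    by_cluster = defaultdict(list)
    for e, k in zip(expression, labels):
        by_cluster[k].append(e)
    mu = {k: sum(v) / len(v) for k, v in by_cluster.items()}

    # (1/N) sum_i (e_i - mu_{y_i})^2  (the 1/N factor is an assumption, see above)
    within = sum((e - mu[k]) ** 2 for e, k in zip(expression, labels)) / n
    return total_var - within


# Simulated toy data: 200 low-expression and 200 high-expression genes.
rng = random.Random(0)
expression = [rng.gauss(0.0, 0.3) for _ in range(200)]
expression += [rng.gauss(5.0, 0.3) for _ in range(200)]

informative = [0] * 200 + [1] * 200   # separates low from high expression
shuffled = informative[:]             # same cluster sizes, random assignment
rng.shuffle(shuffled)
one_cluster = [0] * 400               # trivial clustering

q_informative = clustering_quality(expression, informative)
q_random = clustering_quality(expression, shuffled)
q_trivial = clustering_quality(expression, one_cluster)
print(q_informative, q_random, q_trivial)
```

Under these assumptions the trivial one-cluster labeling scores zero, the shuffled labeling scores close to zero, and the labeling that separates high-expression from low-expression genes scores close to the total variance. Note that the value can only grow as clusters are added, so comparisons are most meaningful between clusterings with the same number of clusters.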