Clustering Analysis in ChlamyNET

The goal of clustering techniques consists of the identification of groups or clusters of genes such that genes from the same cluster are highly co-expressed whereas genes from different clusters are not co-expressed.

The following libraries are used to apply clustering techniques over ChlamyNET.

library("WGCNA")
## Loading required package: dynamicTreeCut
## Loading required package: fastcluster
## 
## Attaching package: 'fastcluster'
## 
## The following object is masked from 'package:stats':
## 
##     hclust
## 
## Loading required package: DBI
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Loading required package: GenomeInfoDb
## Loading required package: S4Vectors
## Loading required package: IRanges
## 
## Attaching package: 'AnnotationDbi'
## 
## The following object is masked from 'package:GenomeInfoDb':
## 
##     species
## ==========================================================================
## *
## *  Package WGCNA 1.46 loaded.
## *
## *    Important note: It appears that your system supports multi-threading,
## *    but it is not enabled within WGCNA in R. 
## *    To allow multi-threading within WGCNA with all available cores, use 
## *
## *          allowWGCNAThreads()
## *
## *    within R. Use disableWGCNAThreads() to disable threading if necessary.
## *    Alternatively, set the following environment variable on your system:
## *
## *          ALLOW_WGCNA_THREADS=<number_of_processors>
## *
## *    for example 
## *
## *          ALLOW_WGCNA_THREADS=4
## *
## *    To set the environment variable in linux bash shell, type 
## *
## *           export ALLOW_WGCNA_THREADS=4
## *
## *     before running R. Other operating systems or shells will
## *     have a similar command to achieve the same aim.
## *
## ==========================================================================
## 
## Attaching package: 'WGCNA'
## 
## The following object is masked from 'package:IRanges':
## 
##     cor
## 
## The following object is masked from 'package:stats':
## 
##     cor
library("cluster")
allowWGCNAThreads()
## Allowing multi-threading with up to 4 threads.

We use as similiraty between genes one minus the absolute value of the Pearson correlation coefficciente between gene profiles. To compute this it is necessary to extract the gene expression profiles.

setwd("~/Documents/ChlamyNET_data")

library(cummeRbund)
## Loading required package: ggplot2
## Loading required package: reshape2
## Loading required package: rtracklayer
## Loading required package: GenomicRanges
## Loading required package: Gviz
## Loading required package: grid
## 
## Attaching package: 'cummeRbund'
## 
## The following object is masked from 'package:GenomicRanges':
## 
##     promoters
## 
## The following object is masked from 'package:IRanges':
## 
##     promoters
## 
## The following object is masked from 'package:Biobase':
## 
##     samples
## 
## The following object is masked from 'package:BiocGenerics':
## 
##     conditions
data <- readCufflinks(dbFile="chlamyData.db")

xloc.diff <- read.table(file="differentially_expressed_genes.txt",header=FALSE)
DEGs <- as.vector(xloc.diff[[1]])

gene.fpkm <- fpkm(genes(data))

matrix.levels <- matrix(nrow=20,ncol=length(DEGs))

for(i in 1:length(DEGs))
{
  matrix.levels[,i] <- subset(gene.fpkm,(gene_id == DEGs[i]))[["fpkm"]]
}
rm(gene.fpkm)
rm(xloc.diff)
rm(DEGs)
rm(cre.names)
## Warning in rm(cre.names): object 'cre.names' not found
rm(data)
rm(i)
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 3640471 194.5    5543382 296.1  4170209 222.8
## Vcells 2959167  22.6    8713995  66.5  8688249  66.3
cre.names <- read.table(file="cre_names.txt",header=FALSE)
cre.names <- as.vector(cre.names[["V1"]])
colnames(matrix.levels) <- cre.names
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 3654337 195.2    5543382 296.1  4170209 222.8
## Vcells 2999865  22.9    8713995  66.5  8688249  66.3
non.isolated.genes <- read.table(file="non_isolated_nodes.txt")
non.isolated.genes <- as.vector(non.isolated.genes[[1]])

matrix.levels <- matrix.levels[,non.isolated.genes]

similarity.matrix <- 1 - abs(cor(matrix.levels))
rm(matrix.levels)
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3654518 195.2    5543382  296.1   4170209  222.8
## Vcells 86860906 662.7  179944658 1372.9 171182883 1306.1

We determine different number of gene clusters ranging from 4 to 20 using the two most widely used clustering algorithms hierarchical clustering with the function hclust and partition around medoids with the function pam.

hierarchical.clustering <- hclust(as.dist(similarity.matrix),method="average")
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3655466 195.3    5543382  296.1   4170209  222.8
## Vcells 86887891 663.0  235198357 1794.5 234045741 1785.7
hclust.4 <- cutree(hierarchical.clustering,k=4)
hclust.5 <- cutree(hierarchical.clustering,k=5)
hclust.6 <- cutree(hierarchical.clustering,k=6)
hclust.7 <- cutree(hierarchical.clustering,k=7)
hclust.8 <- cutree(hierarchical.clustering,k=8)
hclust.9 <- cutree(hierarchical.clustering,k=9)
hclust.10 <- cutree(hierarchical.clustering,k=10)
hclust.11 <- cutree(hierarchical.clustering,k=11)
hclust.12 <- cutree(hierarchical.clustering,k=12)
hclust.13 <- cutree(hierarchical.clustering,k=13)
hclust.14 <- cutree(hierarchical.clustering,k=14)
hclust.15 <- cutree(hierarchical.clustering,k=15)
hclust.16 <- cutree(hierarchical.clustering,k=16)
hclust.17 <- cutree(hierarchical.clustering,k=17)
hclust.18 <- cutree(hierarchical.clustering,k=18)
hclust.19 <- cutree(hierarchical.clustering,k=19)
hclust.20 <- cutree(hierarchical.clustering,k=20)
rm(hierarchical.clustering)
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3655950 195.3    5543382  296.1   4170209  222.8
## Vcells 86943671 663.4  235198357 1794.5 234045741 1785.7
pam.4 <- pam(as.dist(similarity.matrix),k=4,diss=TRUE)
pam.5 <- pam(as.dist(similarity.matrix),k=5,diss=TRUE)
pam.6 <- pam(as.dist(similarity.matrix),k=6,diss=TRUE)
pam.7 <- pam(as.dist(similarity.matrix),k=7,diss=TRUE)
pam.8 <- pam(as.dist(similarity.matrix),k=8,diss=TRUE)
pam.9 <- pam(as.dist(similarity.matrix),k=9,diss=TRUE)
pam.10 <- pam(as.dist(similarity.matrix),k=10,diss=TRUE)
pam.11 <- pam(as.dist(similarity.matrix),k=11,diss=TRUE)
pam.12 <- pam(as.dist(similarity.matrix),k=12,diss=TRUE)
pam.13 <- pam(as.dist(similarity.matrix),k=13,diss=TRUE)
pam.14 <- pam(as.dist(similarity.matrix),k=14,diss=TRUE)
pam.15 <- pam(as.dist(similarity.matrix),k=15,diss=TRUE)
pam.16 <- pam(as.dist(similarity.matrix),k=16,diss=TRUE)
pam.17 <- pam(as.dist(similarity.matrix),k=17,diss=TRUE)
pam.18 <- pam(as.dist(similarity.matrix),k=18,diss=TRUE)
pam.19 <- pam(as.dist(similarity.matrix),k=19,diss=TRUE)
pam.20 <- pam(as.dist(similarity.matrix),k=20,diss=TRUE)
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3658604 195.4    5543382  296.1   4170209  222.8
## Vcells 87651750 668.8  292601573 2232.4 348258323 2657.1

We compute the silhouette of the clustering results combining different number of gene clusters ranging from 4 to 20 with the two most widely used clustering algorithms hierarchical clustering and partition around medoids with the function silhouette.

h.sil4 <- silhouette(hclust.4,dist=similarity.matrix)
h.sil5 <- silhouette(hclust.5,dist=similarity.matrix)
h.sil6 <- silhouette(hclust.6,dist=similarity.matrix)
h.sil7 <- silhouette(hclust.7,dist=similarity.matrix)
h.sil8 <- silhouette(hclust.8,dist=similarity.matrix)
h.sil9 <- silhouette(hclust.9,dist=similarity.matrix)
h.sil10 <- silhouette(hclust.10,dist=similarity.matrix)
h.sil11 <- silhouette(hclust.11,dist=similarity.matrix)
h.sil12 <- silhouette(hclust.12,dist=similarity.matrix)
h.sil13 <- silhouette(hclust.13,dist=similarity.matrix)
h.sil14 <- silhouette(hclust.14,dist=similarity.matrix)
h.sil15 <- silhouette(hclust.15,dist=similarity.matrix)
h.sil16 <- silhouette(hclust.16,dist=similarity.matrix)
h.sil17 <- silhouette(hclust.17,dist=similarity.matrix)
h.sil18 <- silhouette(hclust.18,dist=similarity.matrix)
h.sil19 <- silhouette(hclust.19,dist=similarity.matrix)
h.sil20 <- silhouette(hclust.20,dist=similarity.matrix)


hclust.sil.values <- c(summary(h.sil4)[["avg.width"]],summary(h.sil5)[["avg.width"]],summary(h.sil6)[["avg.width"]],summary(h.sil7)[["avg.width"]],summary(h.sil8)[["avg.width"]],summary(h.sil9)[["avg.width"]],summary(h.sil10)[["avg.width"]],summary(h.sil11)[["avg.width"]],summary(h.sil12)[["avg.width"]],summary(h.sil13)[["avg.width"]],summary(h.sil14)[["avg.width"]],summary(h.sil15)[["avg.width"]],summary(h.sil16)[["avg.width"]],summary(h.sil17)[["avg.width"]],summary(h.sil18)[["avg.width"]],summary(h.sil19)[["avg.width"]],summary(h.sil20)[["avg.width"]])

p.sil4 <- silhouette(pam.4)
p.sil5 <- silhouette(pam.5)
p.sil6 <- silhouette(pam.6)
p.sil7 <- silhouette(pam.7)
p.sil8 <- silhouette(pam.8)
p.sil9 <- silhouette(pam.9)
p.sil10 <- silhouette(pam.10)
p.sil11 <- silhouette(pam.11)
p.sil12 <- silhouette(pam.12)
p.sil13 <- silhouette(pam.13)
p.sil14 <- silhouette(pam.14)
p.sil15 <- silhouette(pam.15)
p.sil16 <- silhouette(pam.16)
p.sil17 <- silhouette(pam.17)
p.sil18 <- silhouette(pam.18)
p.sil19 <- silhouette(pam.19)
p.sil20 <- silhouette(pam.20)
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3663571 195.7    5543382  296.1   4170209  222.8
## Vcells 88596699 676.0  245849320 1875.7 348258323 2657.1
pam.sil.values <- c(summary(p.sil4)[["avg.width"]],summary(p.sil5)[["avg.width"]],summary(p.sil6)[["avg.width"]],summary(p.sil7)[["avg.width"]],summary(p.sil8)[["avg.width"]],summary(p.sil9)[["avg.width"]],summary(p.sil10)[["avg.width"]],summary(p.sil11)[["avg.width"]],summary(p.sil12)[["avg.width"]],summary(p.sil13)[["avg.width"]],summary(p.sil14)[["avg.width"]],summary(p.sil15)[["avg.width"]],summary(p.sil16)[["avg.width"]],summary(p.sil17)[["avg.width"]],summary(p.sil18)[["avg.width"]],summary(p.sil19)[["avg.width"]],summary(p.sil20)[["avg.width"]])

We plot the silhouettes of the two different algorithms, hclust and PAM, for different number of gene clusters in order to chose the best option.

plot(4:20,type="overplotted",pam.sil.values,col="red",ylim=c(0.05,0.33),lwd=5,cex.axis=1.25,font.axis=2,xlab="Number of Clusters",ylab="Average Silhouette Width",cex.lab=1.5,pch=2)
## Warning in plot.xy(xy, type, ...): plot type 'overplotted' will be
## truncated to first character
lines(4:20,hclust.sil.values,type="overplotted",col="blue",lwd=5,pch=0)
## Warning in plot.xy(xy.coords(x, y), type = type, ...): plot type
## 'overplotted' will be truncated to first character
legend("topright",legend=c("PAM","HCLUST"),pch=c(2,0),lwd=5,col=c("red","blue"),cex=1.2)

We plot the silhouette obtained using PAM with nine gene clusters.

plot(p.sil9,border="blue",main="",axis.cex=1.5)
## Warning in plot.window(xlim, ylim, log = log, ...): "axis.cex" is not a
## graphical parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...):
## "axis.cex" is not a graphical parameter
## Warning in axis(if (horiz) 1 else 2, cex.axis = cex.axis, ...): "axis.cex"
## is not a graphical parameter