
Network Clustering

Tao edited this page Jul 28, 2022 · 12 revisions

After running SynNet-Build, we obtain a synteny network edge list (with a name like SynNet-k6s5m25). The first few lines look like this:

Alyr0-0b 936.0 AlyrAL1G19310 AlyrAL1G65100
Alyr0-1b 936.0 AlyrAL1G19350 AlyrAL1G65150
Alyr0-2b 936.0 AlyrAL1G19400 AlyrAL1G65290
Alyr0-3b 936.0 AlyrAL1G19420 AlyrAL1G65430
Alyr0-4b 936.0 AlyrAL1G19430 AlyrAL1G65470

An edge list in this format can be imported and visualized in network tools (e.g. Cytoscape, Gephi) or analyzed with packages (e.g. igraph, networkx). However, depending on the number of genomes used, the resulting network is usually too big to visualize directly in these tools. Clustering is therefore a crucial step for further synteny network analysis. Different network clustering methods (e.g. Girvan-Newman, Clauset-Newman-Moore, MCL, and walktrap) report quite different clusters. Empirically, the Infomap algorithm is recommended for clustering synteny network data. K-cliques can also be a good alternative, depending on the purpose of the analysis. This choice deserves further discussion and exploration if you are interested.

Here, we use the Infomap algorithm implemented in igraph (an R package). We extract the last two columns from the synteny network edge list, and use the script infomap.r to cluster the network.
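The column extraction step can be done with any tool. As a minimal sketch in Python, assuming the edge list is whitespace-separated as in the sample above, the following keeps only the last two fields of each line. The function names and the tab-separated output are illustrative assumptions and are not part of the SynNet scripts; check which delimiter infomap.r expects before using the result.

```python
import io


def extract_gene_pairs(lines):
    """Yield (gene_a, gene_b) taken from the last two fields of each line."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        yield fields[-2], fields[-1]


def write_two_columns(in_path, out_path):
    """Write the last two columns of an edge list file to out_path.

    Assumption: tab-separated output; adjust if infomap.r expects otherwise.
    """
    with open(in_path) as src, open(out_path, "w") as dst:
        for gene_a, gene_b in extract_gene_pairs(src):
            dst.write(f"{gene_a}\t{gene_b}\n")


# Two lines copied from the sample edge list above.
sample = io.StringIO(
    "Alyr0-0b 936.0 AlyrAL1G19310 AlyrAL1G65100\n"
    "Alyr0-1b 936.0 AlyrAL1G19350 AlyrAL1G65150\n"
)
pairs = list(extract_gene_pairs(sample))
for gene_a, gene_b in pairs:
    print(gene_a, gene_b)
```

Calling `write_two_columns("SynNet-k6s5m25", "SynNet-k6s5m25_2cols")` would then produce a two-column file with the name used in the usage example.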

Usage example: 'Rscript infomap.r SynNet-k6s5m25_2cols SynNet-k6s5m25_2cols_infoclusters'

The output is a two-column file that looks like this:

names mem
AlyrAL1G19310 2228
AlyrAL1G19350 1094
AlyrAL1G19400 835
AlyrAL1G19420 3456
AlyrAL1G19430 1614
AlyrAL1G19480 2227
AlyrAL1G19600 3040
AlyrAL1G19650 3039
AlyrAL1G19680 4254

The result simply tells you which cluster each node belongs to. You can sort by the second column to summarize cluster sizes, for example. You may also be interested in which species each cluster contains. Such an analysis is called phylogenomic profiling, and the script Phylogenomic_Profiling.r is provided for this purpose.
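Summarizing cluster sizes amounts to counting how often each cluster ID appears in the second column. A minimal Python sketch, assuming the two-column format with a "names mem" header line shown above, might look like this. The gene names and cluster IDs in the toy input are invented for illustration only.

```python
from collections import Counter


def cluster_sizes(lines):
    """Count how many nodes are assigned to each cluster ID."""
    rows = iter(lines)
    next(rows, None)  # skip the header line ("names mem")
    sizes = Counter()
    for line in rows:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        sizes[fields[1]] += 1
    return sizes


# Toy input (hypothetical genes and cluster IDs, not real results).
toy_clusters = [
    "names mem",
    "geneA 7",
    "geneB 7",
    "geneC 3",
    "geneD 7",
]
sizes = cluster_sizes(toy_clusters)

# List clusters from largest to smallest.
for cluster_id, n_nodes in sizes.most_common():
    print(cluster_id, n_nodes)
```

For a real run, pass an open file handle for the cluster file (e.g. SynNet-k6s5m25_2cols_infoclusters) instead of the toy list.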

Usage example: 'Rscript Phylogenomic_Profiling.r SynNet-k6s5m25_2cols_infoclusters SynNet-k6s5m25_2cols_infoclusters_profiled SynNet-k6s5m25_2cols_infoclusters_profiled_clustered'

Each cluster is now profiled by the number of nodes it contains from each species.
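To make the idea of a profile concrete, the sketch below counts nodes per species within each cluster. It is only an illustration of the concept, not the logic of Phylogenomic_Profiling.r. It assumes that the species can be read from the first four characters of the gene ID (as in "Alyr"), which may not hold for your data; the second species prefix and all gene names and cluster IDs in the toy input are hypothetical.

```python
from collections import Counter, defaultdict


def species_prefix(gene_id):
    """Assumption: the first four characters of a gene ID encode the species."""
    return gene_id[:4]


def profile_clusters(memberships, species_of=species_prefix):
    """Return {cluster_id: Counter({species: number_of_nodes})}."""
    profile = defaultdict(Counter)
    for gene_id, cluster_id in memberships:
        profile[cluster_id][species_of(gene_id)] += 1
    return profile


# Toy memberships (hypothetical gene IDs and cluster assignments).
toy_memberships = [
    ("AlyrGENE0001", "10"),
    ("AlyrGENE0002", "10"),
    ("SpcBGENE0001", "10"),
    ("SpcBGENE0002", "11"),
]
profile = profile_clusters(toy_memberships)

for cluster_id, counts in sorted(profile.items()):
    print(cluster_id, dict(counts))
```

If your gene IDs encode species differently, pass a different `species_of` function, for example a lookup in a gene-to-species table.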

In the script, we use Jaccard distances and the ward.D method to cluster the profiles of all the clusters. The script also produces a profiling figure using the R package 'pheatmap'.
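For readers unfamiliar with the Jaccard distance, here is a minimal Python sketch of the binary (presence/absence) form, where each cluster is represented by the set of species it contains. Whether the script uses the binary form or a quantitative variant is an assumption here, the species labels are hypothetical, and the ward.D clustering step is not shown.

```python
def jaccard_distance(species_a, species_b):
    """Binary Jaccard distance: 1 - |A intersect B| / |A union B|."""
    a, b = set(species_a), set(species_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty profiles are identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical presence/absence profiles for three clusters.
cluster_x = {"sp1", "sp2"}
cluster_y = {"sp1"}
cluster_z = {"sp3"}

d_xy = jaccard_distance(cluster_x, cluster_y)  # one of two species shared
d_xz = jaccard_distance(cluster_x, cluster_z)  # no species shared
print(d_xy, d_xz)
```

Clusters that contain exactly the same species have distance 0, and clusters with no species in common have distance 1, so clustering on this distance groups clusters with similar species composition.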