
Clustering

Clustering is the process of grouping objects so that those in the same group (cluster) are more similar to one another than to those in other groups. It takes all of the input data into account and is a common technique in machine learning (ML). To create clusters, ML engineers or data scientists examine the data points and group them by the characteristics they share, in contrast to the characteristics of other data. The clustering method depends on the algorithm being used. Approaches include measuring the average distance between data points in a multidimensional space, counting the number of intervals for each set of data, specifying an expected number of clusters, or identifying dense regions of data. A good clustering makes the relationships in the data explicit and gives a reason why each data point belongs to its cluster.
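The idea that members of a cluster are closer to one another than to members of other clusters can be checked numerically. The following is a minimal sketch using only the Python standard library; the data points and the helper names `mean_within_distance` and `mean_between_distance` are hypothetical and chosen purely for illustration.

```python
from itertools import combinations, product
from math import dist
from statistics import mean


def mean_within_distance(cluster):
    """Average Euclidean distance between every pair of points in one cluster."""
    return mean(dist(a, b) for a, b in combinations(cluster, 2))


def mean_between_distance(cluster_a, cluster_b):
    """Average Euclidean distance between points taken from two different clusters."""
    return mean(dist(a, b) for a, b in product(cluster_a, cluster_b))


# Hypothetical 2-D data: two hand-picked groups of points.
group_a = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)]
group_b = [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

within_a = mean_within_distance(group_a)
within_b = mean_within_distance(group_b)
between = mean_between_distance(group_a, group_b)

print(f"within A: {within_a:.2f}, within B: {within_b:.2f}, between: {between:.2f}")
```

If the grouping is sensible, both within-cluster averages should come out well below the between-cluster average. Euclidean distance is only one possible similarity measure; other data may call for a different one.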

Clustered data can then be used to perform a cluster analysis. Just as there are different ways of clustering data, there are different ways of analyzing the clusters. Cluster analysis is now most often carried out through machine learning, which can apply different algorithms to the input data. Popular approaches include hierarchical clustering (including average-linkage clustering), which connects clusters based on the distances between data points, with closer points considered more similar than those farther away; density-based clustering, which defines clusters as regions where data points are densely packed relative to their surroundings; and centroid-based clustering, which assigns each data point to the nearest cluster center, which is not necessarily an actual data point. The results can vary depending on the approach used, so multiple analyses can be run on the same data to get a better overall view of how the data interacts. Several tools can assist with cluster analysis, including scikit-learn, a free, open-source library for the Python programming language. Rather than requiring many hours of manual calculation, ML programs can be set to run automatically, and the results can be checked once the run is complete.
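As an illustration of running several analyses on the same data, the sketch below applies one centroid-based, one hierarchical, and one density-based algorithm from scikit-learn to a small data set. The data points, the parameter values (`n_clusters=2`, `eps=1.0`, `min_samples=2`), and the helper name `cluster_three_ways` are assumptions made for this example, not recommended settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans

# Hypothetical 2-D data: two tight groups plus one isolated point.
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # group near (1, 1)
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],   # group near (8, 8)
    [4.5, 15.0],                          # isolated point
])


def cluster_three_ways(data):
    """Return the cluster labels assigned by three different approaches."""
    # Centroid-based: the number of clusters is chosen up front.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    # Hierarchical: repeatedly merge the closest clusters, using average linkage.
    hierarchical = AgglomerativeClustering(
        n_clusters=2, linkage="average"
    ).fit_predict(data)
    # Density-based: points with no close neighbors receive the noise label -1.
    dbscan = DBSCAN(eps=1.0, min_samples=2).fit_predict(data)
    return {"kmeans": kmeans, "hierarchical": hierarchical, "dbscan": dbscan}


labels = cluster_three_ways(X)
for name, assigned in labels.items():
    print(f"{name:>12}: {assigned.tolist()}")
```

With these settings, k-means and agglomerative clustering must place the isolated point in one of the two clusters, whereas DBSCAN can leave it unassigned as noise. The label numbers themselves are arbitrary; only which points share a label is meaningful.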

Businesses and organizations can apply clustering in a number of fields, such as:

  • Biology: Clustering can compare and contrast communities of organisms in different environments, reveal similarities in genetic data, and help assign genotypes automatically.

  • Medicine: Clustering can help identify patterns in how a disease spreads, or distinguish between types of tissue in a PET scan. It can also be used for anomaly detection when comparing pathological and non-pathological data points.

  • Business: By analyzing large groups of shoppers and the items they purchase, clustering can suggest which items to place next to each other to increase sales.

  • Marketing: Market research often uses clustering to better understand large segments of the population through surveys and panels. Social network analysis can also identify clusters of communities and help determine the best ways to interact with them.

  • Finance: Clustering can be used to classify stocks into categories and, with proper analysis, to support informed decisions based on how each category is likely to behave.