In text analytics, cluster analysis is a method of grouping similar items together. The objects being clustered may be text documents, images, or other data objects.
Objects can be grouped according to criteria such as:
- similarity in terms of content (e.g., for text documents: same topic, sentiment, style)
- similarity in terms of form or structure (e.g., for images: same shape, color)
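To make "similarity in terms of content" concrete, here is a minimal sketch that scores two text documents by the cosine similarity of their word counts. The function name `cosine_similarity`, the whitespace tokenization, and the sample sentences are illustrative assumptions; real systems typically use richer features and weighting.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts, using raw word counts as features."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that appear in both texts.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical sample documents: two on the same topic, one unrelated.
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
]
sim_related = cosine_similarity(docs[0], docs[1])
sim_unrelated = cosine_similarity(docs[0], docs[2])
```

A clustering method would then tend to place the first two documents in the same group, because their similarity score is higher than either one's score against the third.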
Methods of Cluster Analysis
Clustering can be performed with various methods, including but not limited to:
- K-means clustering partitions data into k clusters. It starts from a random initialization (random cluster assignments or randomly chosen centers), then alternates between assigning each point to its nearest cluster center and recomputing each center as the mean of its assigned points. This reduces the within-cluster sum of squared distances, so that points within a cluster are as close to each other as possible.
- Hierarchical clustering builds a hierarchy of nested clusters, either by repeatedly merging the two closest clusters (agglomerative) or by repeatedly splitting larger clusters (divisive), so that each cluster is contained in a cluster at the level above it.
- Density-based clustering finds groups by identifying points that are densely packed together and expanding clusters outward from these points; points in sparse regions are typically treated as noise.
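As an illustration of the first method in the list, the following is a minimal pure-Python sketch of k-means (the variant usually called Lloyd's algorithm). It initializes the centers from k randomly chosen data points, which is one common choice among several. The names `k_means`, `squared_distance`, and `mean_point`, the `seed` and `max_iter` parameters, and the sample points are all assumptions made for this example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) after running k-means until labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, labels

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(points, k=2)
```

Because the result depends on the random initialization, k-means can settle on different groupings for different seeds; a common precaution is to run it several times and keep the run with the smallest total within-cluster distance.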
Classification, Clustering, and Segmentation
Classification, clustering, and segmentation are all similar in that they involve grouping data points together. However, there are some key differences between these terms.
Classification is usually done using supervised learning, where the groups (classes) are already defined and the goal is to assign new data points to the correct group.
Clustering is usually done using unsupervised learning, where the goal is to find groups in the data without having any prior information about what those groups might be.
Segmentation is usually done with the goal of dividing a larger group into smaller groups (segments), where each segment is more homogeneous than the larger group.
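The contrast with clustering can be sketched in code: in classification the class labels are given up front, and the task is only to assign new points to one of them. The example below uses a nearest-centroid classifier, chosen here purely because it is simple; the function names, the class names "small" and "large", and the data are hypothetical.

```python
from collections import defaultdict

def fit_nearest_centroid(points, class_labels):
    """Supervised step: compute one centroid per predefined class."""
    grouped = defaultdict(list)
    for p, label in zip(points, class_labels):
        grouped[label].append(p)
    return {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in grouped.items()
    }

def classify(point, class_centroids):
    """Assign a new point to the class whose centroid is nearest."""
    return min(
        class_centroids,
        key=lambda label: sum(
            (a - b) ** 2 for a, b in zip(point, class_centroids[label])
        ),
    )

# Labeled training data: the groups are known before any learning happens.
train_points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
train_labels = ["small", "small", "large", "large"]

model = fit_nearest_centroid(train_points, train_labels)
prediction = classify((7.5, 8.1), model)
```

A clustering method given the same four points would receive no `train_labels` at all and would have to discover the two groups on its own.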
When to Use Cluster Analysis
Cluster analysis can be used for a variety of purposes, including but not limited to:
- finding groups of similar items in a dataset
- compressing a dataset by replacing each cluster of items with a single representative item (e.g., the cluster's center)
- identifying outliers or unique items in a dataset
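The last two uses can be sketched once cluster assignments are available. The example below assumes the clusters have already been computed by some method; the cluster names, the data, and the distance threshold of 2.0 are hypothetical values chosen for illustration.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

# Assumed output of an earlier clustering step: cluster name -> member points.
clusters = {
    "a": [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (1.0, 4.0)],
    "b": [(8.0, 8.0), (8.2, 7.8), (7.8, 8.2)],
}

# Compression: keep one representative (the centroid) per cluster
# instead of every individual point.
representatives = {name: centroid(members) for name, members in clusters.items()}

# Outlier detection: flag points that lie unusually far from their own
# cluster's centroid. The threshold is an arbitrary illustrative choice.
THRESHOLD = 2.0
outliers = [
    p
    for name, members in clusters.items()
    for p in members
    if math.dist(p, representatives[name]) > THRESHOLD
]
```

Here seven points are summarized by two representatives, and the point at (1.0, 4.0) is flagged because it sits well away from the rest of its cluster. In practice the threshold would be tuned to the data, for example relative to the typical distance within each cluster.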