Cluster Analysis Tool
© Phil Feldman 2014 -

Finding clusters of similar points in a data set can be a tricky problem. For a good overview with lots of citations, check out Pavel Berkhin's survey. Generally, one uses statistical tools such as R's cluster package or SPSS. I wanted to make something that was a little more interactive to play with and didn't come with a lot of overhead. This tool takes CSV text pasted from the clipboard, renders it in 3D, and runs a similarity analysis that determines the amount of attraction and repulsion between each pair of elements (rows), based on the algorithm in this spreadsheet. (If there's a name for this algorithm, I'd love to know it.) You can try it out with the test case on the lower left of this page. Once it is ingested, you'll see that the similar purple and green balls cluster together, while the lone gray ball is excluded from both groups. Basically, points move until the attraction and repulsion forces on all the points reach an equilibrium. In some ways this resembles simulated annealing, but with a physics-based approach to the modelling.
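As a rough illustration of the attraction and repulsion idea (not the algorithm in the spreadsheet, which is not reproduced here), the sketch below pulls each pair of points together with a spring scaled by a similarity score and pushes them apart with an inverse-square force. The names similarity, step and layout, the similarity formula, and all constants are assumptions made for this example.

```python
import math
import random


def similarity(row_a, row_b):
    """Assumed similarity score in (0, 1]: 1.0 for identical rows."""
    return 1.0 / (1.0 + math.dist(row_a, row_b))


def step(positions, rows, attraction=0.05, repulsion=0.01, max_step=0.1):
    """Move every point once and return the largest displacement."""
    forces = []
    for i, p in enumerate(positions):
        force = [0.0, 0.0, 0.0]
        for j, q in enumerate(positions):
            if i == j:
                continue
            delta = [q[k] - p[k] for k in range(3)]
            dist = max(math.sqrt(sum(d * d for d in delta)), 1e-6)
            # Similar rows pull harder; every pair pushes apart at short range.
            pull = attraction * similarity(rows[i], rows[j]) * dist
            push = repulsion / (dist * dist)
            for k in range(3):
                force[k] += (pull - push) * delta[k] / dist
        # Clamp the move so a near-collision cannot fling a point away.
        norm = math.sqrt(sum(f * f for f in force))
        if norm > max_step:
            force = [f * max_step / norm for f in force]
        forces.append(force)
    largest = 0.0
    for p, force in zip(positions, forces):
        for k in range(3):
            p[k] += force[k]
        largest = max(largest, math.sqrt(sum(f * f for f in force)))
    return largest


def layout(rows, iterations=1000, tolerance=1e-4, seed=0):
    """Start from random 3D positions and iterate until movement is small."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in rows]
    for _ in range(iterations):
        if step(positions, rows) < tolerance:
            break
    return positions
```

With rows that form two tight groups, the points in each group should settle closer to each other than to points in the other group, which is the behaviour described for the purple and green balls.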

To use the tool, create a file that resembles the example and paste it into the text area at the bottom of the page. The _name column is required. The color columns (_red, _green, _blue) are optional; they set the colors of the shapes and must be between 0.0 and 1.0. Any number of columns and rows can be used, but the speed of the system will be limited by the speed of your processor and graphics card. Cells can only contain numbers. Once the text is pasted, click the "Ingest" button and the analysis will start. You can move the point of view as follows:

Clicking on a shape highlights it and prints a list of the shapes in the set in the bottom text panel. The list is sorted by distance from the clicked shape, which appears first at distance zero. Clicking the shape again turns off the highlight.
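The sorted listing can be sketched in a few lines. This example assumes the distance is measured between the shapes' current 3D positions, which the text does not state explicitly; neighbours_by_distance and the shapes dictionary are hypothetical names.

```python
import math


def neighbours_by_distance(shapes, clicked_name):
    """Return (name, distance) pairs sorted from nearest to farthest.

    shapes maps each shape name to its (x, y, z) position. The clicked
    shape comes first because its distance to itself is zero.
    """
    origin = shapes[clicked_name]
    ranked = [(name, math.dist(origin, pos)) for name, pos in shapes.items()]
    ranked.sort(key=lambda pair: pair[1])
    return ranked
```

Calling neighbours_by_distance(shapes, "a") would give the same ordering as the list printed after clicking shape "a", under the assumption above.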

The "Perturb" button "nudges" all the objects in the analysis relative to their current positions. If you think the shapes have settled at a saddle point, try this button. The "Randomize" button resets all the shapes to new, random points. The "Clear" button deletes all shapes and clears the text fields. Lastly, the "Attraction" and "Repulsion" dials let you adjust those factors interactively.
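The difference between the two buttons can be shown with a small sketch: perturbing keeps each point near where it was, while randomizing discards the old position entirely. The function names, the amount and extent parameters, and the use of uniform noise are assumptions for illustration, not details taken from the tool.

```python
import random


def perturb(positions, amount=0.1, rng=random):
    """Nudge every coordinate by a small random offset in [-amount, amount]."""
    return [[c + rng.uniform(-amount, amount) for c in p] for p in positions]


def randomize(positions, extent=1.0, rng=random):
    """Replace every point with a fresh random point in a cube of half-width extent."""
    return [[rng.uniform(-extent, extent) for _ in p] for p in positions]
```

A small nudge is often enough to let points slide off a saddle point without throwing away the clusters that have already formed.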

This is very much a work in progress. My to-do list includes support for custom similarity functions that the user can load, and fancier graphics options.

#this is a comment.
#below are column headers. Items without preceding underscores are values that will be used for clustering.
#The rest of the items are comma-separated values. The number of items must match the number of headers.
#You can paste your text file into this text area and then "Ingest" by pressing the button
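For readers who want to check a file before pasting it, here is a minimal sketch of a parser for the format described above. It assumes that lines starting with # are comments, that the first remaining line holds the headers, that _name is a text label, and that missing colors default to a mid gray of 0.5; parse_clipboard and that default are inventions for this example, not the tool's own code.

```python
import csv
import io


def parse_clipboard(text):
    """Parse pasted CSV text into a list of records."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    if not lines:
        raise ValueError("no header line found")
    reader = csv.reader(io.StringIO("\n".join(lines)), skipinitialspace=True)
    headers = [h.strip() for h in next(reader)]
    if "_name" not in headers:
        raise ValueError("a _name column is required")
    records = []
    for cells in reader:
        if len(cells) != len(headers):
            raise ValueError("the number of items must match the number of headers")
        row = dict(zip(headers, (c.strip() for c in cells)))
        # Optional color columns; 0.5 is an assumed default when absent.
        color = tuple(float(row.get(key, 0.5)) for key in ("_red", "_green", "_blue"))
        if not all(0.0 <= c <= 1.0 for c in color):
            raise ValueError("colors must be between 0.0 and 1.0")
        # Headers without a leading underscore are the clustering values.
        values = [float(row[h]) for h in headers if not h.startswith("_")]
        records.append({"name": row["_name"], "color": color, "values": values})
    return records
```

Each record keeps the clustering values separate from the underscore-prefixed columns, mirroring the rule that only items without a preceding underscore are used for clustering.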