K-means++ clustering
From Rosetta Code
K-means++ clustering is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
K-means++ clustering is a classification of data, so that points assigned to the same cluster are similar (in some sense). It is identical to the K-means algorithm, except for the selection of initial centroids: each new centroid is chosen at random from the data points, with probability proportional to its squared distance from the nearest centroid already chosen. (The example image shows data partitioned into 7 clusters using the K-means algorithm.)
The task is to implement the K-means++ algorithm. Produce a function which takes two arguments: the number of clusters K, and the dataset to classify. K is a positive integer and the dataset is a list of points in the Cartesian plane. The output is a list of clusters (related sets of points, according to the algorithm).
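For concreteness, here is a minimal sketch of the task in Python (not part of any entry below; the function and parameter names are my own): K-means++ seeding followed by the standard Lloyd iteration.

```python
import random

def sqdist(p, q):
    """Squared Euclidean distance; works for any dimensionality."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp(k, points, iterations=100):
    """Partition `points` (a list of tuples) into k clusters."""
    # K-means++ seeding: first centroid uniformly at random; each later
    # centroid drawn with probability proportional to squared distance
    # from the nearest centroid already chosen.
    centroids = [random.choice(points)]
    while len(centroids) < k:
        weights = [min(sqdist(p, c) for c in centroids) for p in points]
        centroids.append(random.choices(points, weights=weights)[0])
    # Standard K-means (Lloyd) iteration: assign each point to its
    # nearest centroid, recompute centroids as cluster means, repeat.
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sqdist(p, centroids[i]))
            clusters[nearest].append(p)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # fixed point reached
            break
        centroids = new
    return clusters
```

Because `sqdist` zips coordinates, the same code already handles N-tuples, which previews extra credit 4 below.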
For extra credit (in order):
- Provide a function to exercise your code, which generates a list of random points.
- Provide a visualization of your results, including centroids (see example image).
- Generalize the function to polar coordinates (in radians).
- Generalize the function to points in an arbitrary N-dimensional space (i.e. instead of x,y pairs, points are N-tuples representing coordinates in ℝ^N).
If this is different or more difficult than the naive solution for ℝ^2, discuss what had to change to support N dimensions.
Extra credit is only awarded if the examples given demonstrate the feature in question. To earn credit for 1. and 2., visualize 6 clusters of 30,000 points in ℝ^2. It is not necessary to provide visualization for spaces higher than ℝ^2, but the examples should demonstrate features 3. and 4. if the solution has them.
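As a starting point for extra credits 1 and 4, random test data can be generated in any dimensionality. A hedged Python sketch (the helper name and defaults are illustrative, not from any entry below) that scatters points around random centers so the clustering has visible structure to recover:

```python
import random

def random_clustered_points(n, k=6, dims=2, spread=0.05):
    """n points drawn around k random centers in the unit N-cube."""
    centers = [[random.random() for _ in range(dims)] for _ in range(k)]
    return [tuple(c + random.gauss(0, spread) for c in random.choice(centers))
            for _ in range(n)]
```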
J
Solution:
NB. Selection of initial centroids, per K-means++
initialCentroids =: (] , randomCentroid)^:(<:@:]`(,:@:seedCentroid@:[))~
seedCentroid =: {~ ?@#
randomCentroid =: [ {~ [: wghtProb [: <./ distance/~
distance =: +/&.:*:@:-"1 NB. Extra credit #3 (N-dimensional is the same as 2-dimensional in J)
wghtProb =: 1&$: : ((%{:)@:(+/\)@:] I. [ ?@$ 0:)"0 1 NB. Due to Roger Hui http://j.mp/lj5Pnt
NB. Having selected the initial centroids, the standard K-means algo follows
centroids =: ([ mean/.~ closestCentroid)^:(]`_:`initialCentroids)
closestCentroid =: [: (i.<./)"1 distance/
mean =: +/ % #
NB. Visualization code. Due to Max Harms http://j.mp/l8L45V
NB. as is the example image in this task
packPoints =: <"1@:|:
plotClusters =: dyad define NB. Extra credit #2
require 'plot'
pd 'reset;aspect 1;type dot;pensize 2'
pd@:packPoints&> y
pd 'type marker;markersize 1.5;color 0 0 0'
pd@:packPoints x
pd 'markersize 0.8;color 255 255 0'
pd@:packPoints x
pd 'show'
)
randMatrix =: ?@$&0 NB. Extra credit #1
plotRandomClusters =: 3&$: : (dyad define)
dataset =. randMatrix 2 {. y,2
centers =. x centroids dataset
clusters =. centers (closestCentroid~ </. ]) dataset
centers plotClusters clusters
)
Example:
6 plotRandomClusters 30000 NB. 30000 points, 6 clusters
plotRandomClusters 300 NB. 300 points, 3 clusters
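The wghtProb verb defined above (running sum, normalize by the total, then interval search with a uniform random number) is the usual inverse-CDF trick for sampling an index in proportion to its weight. The same idea in Python, for readers who don't speak J (a sketch; the function name is mine):

```python
import bisect
import itertools
import random

def weighted_index(weights):
    """Return i with probability weights[i] / sum(weights),
    via binary search on the cumulative sum."""
    cumulative = list(itertools.accumulate(weights))
    return bisect.bisect_right(cumulative, random.random() * cumulative[-1])
```

In the seeding step the weights are each point's distance to the nearest centroid chosen so far, so far-away points are likely to become the next centroid.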
Extra credit:
- Given a data set size & dimensionality, randMatrix generates random points in that shape.
- Two-dimensional visualizations provided by plotRandomClusters.
- In J, the naive 2D approach is identical to the ND approach; no changes are necessary.
- Polar coordinates are not available in this version (but wouldn't be hard to provide with &.cartToPole).
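The &.cartToPole remark refers to J's "under" conjunction: apply a coordinate conversion, cluster, then apply the inverse conversion. In a language without that conjunction the same plan needs an explicit conversion pair; a sketch (function names are mine, angles in radians):

```python
import math

def polar_to_cart(p):
    """(r, theta) -> (x, y); cluster polar data after this conversion."""
    r, theta = p
    return (r * math.cos(theta), r * math.sin(theta))

def cart_to_polar(p):
    """(x, y) -> (r, theta); inverse of the above."""
    x, y = p
    return (math.hypot(x, y), math.atan2(y, x))
```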
