
ML::Clustering zef:antononcube last updated on 2022-08-04
# Raku ML::Clustering

License: Artistic-2.0

This repository contains the code of a Raku package for
Machine Learning (ML)
[Clustering (or Cluster analysis)](https://en.wikipedia.org/wiki/Cluster_analysis)
functions, [Wk1].

The Clustering framework includes:

- The algorithms K-means, K-medoids, and Bi-sectional K-means,
  and others

- The distance functions Euclidean, Cosine, Hamming, Manhattan, and others,
  and their corresponding similarity functions
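
A particular distance function can presumably be requested when clustering. Here is a minimal sketch; the named argument `distance-function` and the spec `'Cosine'` are assumptions for illustration, not confirmed by this README:

```raku
use ML::Clustering;

# ASSUMPTION: the argument name distance-function and the value 'Cosine'
# are illustrative; consult the package documentation for the exact names.
my @points = (1, 2), (1.2, 2.1), (10, 12), (10.5, 11.8);
my %res = find-clusters(@points, 2, distance-function => 'Cosine', prop => 'All');
```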

The data in the examples below is generated and manipulated with the packages
"Data::Generators", [AAp2], "Data::Reshapers", [AAp3], and "Data::Summarizers", [AAp4],
described in the article
"Introduction to data wrangling with Raku", [AA1].

The plots are made with the package "Text::Plot", [AAp6].


## Installation

Via zef-ecosystem:

```
zef install ML::Clustering
```

From GitHub:

```
zef install
```


## Usage example

Here we derive a set of random points, and summarize it:

```raku
use Data::Generators;
use Data::Summarizers;

my $n = 100;
# NOTE: the distribution means below are illustrative placeholders;
# only the standard deviations (1.5 and 1) survive in the original text.
my @data1 = (random-variate(NormalDistribution.new(5, 1.5), $n) X random-variate(NormalDistribution.new(5, 1), $n)).pick(30);
my @data2 = (random-variate(NormalDistribution.new(10, 1), $n) X random-variate(NormalDistribution.new(10, 1), $n)).pick(50);
my @data3 = [|@data1, |@data2].pick(*);
records-summary(@data3);
```

Here we plot the points:

```raku
use Text::Plot;
text-list-plot(@data3)
```

**Problem:** Group the points in such a way that each group has close (or similar) points.

Here is how we use the function `find-clusters` to give an answer:

```raku
use ML::Clustering;
my %res = find-clusters(@data3, 2, prop => 'All');
```

**Remark:** The first argument is the data points: a list of numeric lists.
The second argument is the number of clusters to be found.
(Automatic determination of the number of clusters is on the TODO list; currently it is not implemented.)

**Remark:** The function `find-clusters` can return results of different types controlled with the named argument "prop".
Using `prop => 'All'` returns a hash with all properties of the cluster finding result.
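
For instance, asking only for the clusters might look like the sketch below; the value `'Clusters'` is inferred from the keys of the `prop => 'All'` result and may be spelled differently in the actual package:

```raku
# ASSUMPTION: 'Clusters' as a prop value is inferred from the keys of the
# prop => 'All' result hash; check the package documentation to confirm.
my @clusters = |find-clusters(@data3, 2, prop => 'Clusters');
```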

Here are sample points from each found cluster:

```raku
.say for %res<Clusters>>>.pick(3);
```

Here are the centers of the clusters (the mean points):
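
A minimal sketch, using the `MeanPoints` key of the `prop => 'All'` result (the same key the plotting command in this section uses):

```raku
# %res was computed with prop => 'All'; MeanPoints holds the cluster centers.
.say for %res<MeanPoints>;
```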


We can verify the result by looking at the plot of the found clusters:

```raku
text-list-plot((|%res<Clusters>, %res<MeanPoints>), point-char => <▽ ☐ ●>,
               title => '▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers')
```

**Remark:** By default `find-clusters` uses the K-means algorithm. The functions `k-means` and `k-medoids`
call `find-clusters` with the option settings `method=>'K-means'` and `method=>'K-medoids'` respectively.
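
Based on the remark above, these two calls should be equivalent ways to request K-means clustering; the exact return shape of `k-means` without a `prop` argument is not shown in this README:

```raku
# Equivalent per the remark above: k-means delegates to find-clusters
# with method => 'K-means'.
my %res1 = find-clusters(@data3, 2, method => 'K-means', prop => 'All');
my $res2 = k-means(@data3, 2);
```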


## More interesting looking data

Here is two-dimensional data with a more interesting shape, `@data2D5`:

```raku
use Data::Reshapers;

my $pointsPerCluster = 200;
# Each spec is [x-mean, y-mean, standard deviation] for one cluster.
my @data2D5 = [[10,20,4], [20,60,6], [40,10,6], [-30,0,4], [100,100,8]].map({
    (random-variate(NormalDistribution.new($_[0], $_[2]), $pointsPerCluster)
     Z random-variate(NormalDistribution.new($_[1], $_[2]), $pointsPerCluster)).List
});
@data2D5 = flatten(@data2D5, max-level => 1).pick(*);
```

Here is a plot of that data:
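
Presumably with "Text::Plot" again; a minimal sketch:

```raku
text-list-plot(@data2D5)
```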


Here we find clusters and plot them together with their mean points:

```raku
my %clRes = find-clusters(@data2D5, 5, prop => 'All');
text-list-plot([|%clRes<Clusters>, %clRes<MeanPoints>], point-char => <1 2 3 4 5 ●>)
```


## Detailed function pages

Detailed parameter explanations and usage examples for the functions provided by the package are given in:

- ["K-means function page"](./doc/

- ["K-medoids function page"]()

- ["Bi-sectional-K-means function page"]()


## Implementation considerations

### UML diagram

Here is a UML diagram that shows the package's structure:

The
[PlantUML spec](./resources/class-diagram.puml)
and the diagram were obtained with the CLI script `to-uml-spec` of the package "UML::Translators", [AAp5].

Here we get the [PlantUML spec](./resources/class-diagram.puml):

```shell
to-uml-spec ML::Clustering > ./resources/class-diagram.puml
```

Here we get the [diagram](./resources/class-diagram.png):

```shell
to-uml-spec ML::Clustering | java -jar ~/PlantUML/plantuml-1.2022.5.jar -pipe > ./resources/class-diagram.png
```

**Remark:** Maybe it is a good idea to have an abstract class named, say,
`ML::Clustering::AbstractFinder`, that is a parent of
`ML::Clustering::KMeans`, `ML::Clustering::KMedoids`, `ML::Clustering::BiSectionalKMeans`, etc.,
but I have not found that to be necessary at this point of development.

**Remark:** It seems better to have a separate package for the distance functions, named, say,
"ML::DistanceFunctions". (Although distance functions are not useful just for ML...)
After settling on package and function names I will make such a package.



## TODO

- [ ] Implement Bi-sectional K-means algorithm, [AAp1].

- [ ] Implement K-medoids algorithm.

- [ ] Automatic determination of the number of clusters.

- [ ] Allow data points to be `Pair` objects the keys of which are point labels.

   - Hence, the returned clusters consist of those labels, not points themselves.

- [ ] Implement Agglomerate algorithm.

- [ ] Factor-out the distance functions in a separate package.


## References

### Articles

[Wk1] Wikipedia entry, ["Cluster analysis"](https://en.wikipedia.org/wiki/Cluster_analysis).

[AA1] Anton Antonov,
"Introduction to data wrangling with Raku",
RakuForPrediction at WordPress.

### Packages

[AAp1] Anton Antonov,
Bi-sectional K-means algorithm in Mathematica,
MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov,
Data::Generators Raku package.

[AAp3] Anton Antonov,
Data::Reshapers Raku package.

[AAp4] Anton Antonov,
Data::Summarizers Raku package.

[AAp5] Anton Antonov,
UML::Translators Raku package.

[AAp6] Anton Antonov,
Text::Plot Raku package.