This repository has the code of a Raku package for Machine Learning (ML) Clustering (or Cluster analysis) functions, [Wk1].
The Clustering framework includes:
The distance functions Euclidean, Cosine, Hamming, Manhattan, and others, and their corresponding similarity functions
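As a rough illustration of what such distance functions compute, here is a minimal pure-Raku sketch. The sub names below are hypothetical stand-ins chosen for this example; they are not taken from the package, whose own implementations and names may differ.

```raku
# Illustrative stand-ins only; these names are NOT the ML::Clustering API.
sub euclidean-distance(@a, @b) { (@a Z- @b).map(* ** 2).sum.sqrt }
sub manhattan-distance(@a, @b) { (@a Z- @b).map(*.abs).sum }

# Number of positions at which two equal-length numeric vectors differ.
sub hamming-distance(@a, @b) { (@a Z @b).grep({ $_[0] != $_[1] }).elems }

# One minus the cosine of the angle between the vectors;
# the corresponding similarity is 1 minus this distance.
sub cosine-distance(@a, @b) {
    my $dot  = (@a Z* @b).sum;
    my $norm = @a.map(* ** 2).sum.sqrt * @b.map(* ** 2).sum.sqrt;
    1 - $dot / $norm
}

say euclidean-distance([0, 0], [3, 4]);      # 5
say manhattan-distance([0, 0], [3, 4]);      # 7
say hamming-distance([1, 0, 1], [1, 1, 0]);  # 2
say cosine-distance([1, 0], [0, 1]);         # 1
```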
The data in the examples below is generated and manipulated with the packages "Data::Generators", "Data::Reshapers", and "Data::Summarizers", described in the article "Introduction to data wrangling with Raku", [AA1].
The plots are made with the package "Text::Plot", [AAp6].
Via zef-ecosystem:
zef install ML::Clustering
From GitHub:
zef install https://github.com/antononcube/Raku-ML-Clustering
Here we derive a set of random points, and summarize it:
use Data::Generators;
use Data::Summarizers;
use Text::Plot;

my $n = 100;
my @data1 = (random-variate(NormalDistribution.new(5,1.5), $n) X random-variate(NormalDistribution.new(5,1), $n)).pick(30);
my @data2 = (random-variate(NormalDistribution.new(10,1), $n) X random-variate(NormalDistribution.new(10,1), $n)).pick(50);
my @data3 = [|@data1, |@data2].pick(*);
records-summary(@data3)
# +------------------------------+------------------------------+
# | 0                            | 1                            |
# +------------------------------+------------------------------+
# | Min    => 1.9152332258517637 | Min    => 3.5534718877004092 |
# | 1st-Qu => 5.981652120865826  | 1st-Qu => 5.688429919222849  |
# | Mean   => 8.134616163671051  | Mean   => 8.277055937453321  |
# | Median => 8.921578933301301  | Median => 9.365632049881459  |
# | 3rd-Qu => 10.002478575801664 | 3rd-Qu => 10.36445713798808  |
# | Max    => 12.12409031158045  | Max    => 11.905774375487244 |
# +------------------------------+------------------------------+
Here we plot the points:
use Text::Plot;
text-list-plot(@data3)
# [Text plot of @data3; x-axis ticks from 2.00 to 12.00, y-axis ticks from 4.00 to 12.00.]
Problem: Group the points in such a way that each group has close (or similar) points.
Here is how we use the function find-clusters to give an answer:
use ML::Clustering;
my %res = find-clusters(@data3, 2, prop => 'All');
%res<Clusters>>>.elems
# (50 30)
Remark: The first argument is the data points, given as a list of numeric lists. The second argument is the number of clusters to find. (Automatic determination of the number of clusters is on the TODO list; it is not implemented yet.)
Remark: The function find-clusters can return results of different types, controlled with the named argument "prop". Using prop => 'All' returns a hash with all properties of the cluster finding result.
Here are sample points from each found cluster:
.say for %res<Clusters>>>.pick(3);
# ((8.730149280196386 8.689860840806768) (9.199160845916436 11.23129146924298) (9.296283411759815 11.905774375487244))
# ((4.167575475531873 5.118250052000011) (4.856378776838952 6.519994510725237) (2.941219217209155 3.6148455159938666))
Here are the centers of the clusters (the mean points):
%res<MeanPoints>
# [(10.033388803123739 10.0788732614687) (6.382563428067344 6.117153830280937)]
We can verify the result by looking at the plot of the found clusters:
text-list-plot((|%res<Clusters>, %res<MeanPoints>),
    point-char => <▽ ☐ ●>,
    title => '▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers')
# [Text plot titled "▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers"; x-axis ticks from 2.00 to 12.00, y-axis ticks from 4.00 to 12.00. The ▽ points occupy the rows above y = 8.00 and the ☐ points the rows below it.]
Remark: By default find-clusters uses the K-means algorithm. The functions k-means and k-medoids call find-clusters with the option settings method=>'K-means' and method=>'K-medoids' respectively.
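For readers unfamiliar with K-means, the following is a minimal pure-Raku sketch of the classic iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not the package's implementation: the sub names, the choice of the first k points as initial centers, and the fixed step limit are assumptions made only for this example. The returned keys merely mirror the 'Clusters' and 'MeanPoints' keys shown above.

```raku
# A sketch of K-means iteration; NOT the ML::Clustering implementation.
sub sq-dist(@a, @b) { (@a Z- @b).map(* ** 2).sum }

# Coordinate-wise mean of a non-empty list of points.
sub mean-point(@points) {
    my $dim = @points[0].elems;
    (^$dim).map(-> $j { @points.map(*[$j]).sum / @points.elems }).List
}

sub kmeans-sketch(@points, Int $k, Int :$max-steps = 20) {
    # Assumption: the first $k points serve as initial centers.
    my @centers = @points[^$k];
    my @labels;
    for ^$max-steps {
        # Assignment step: index of the nearest center for every point.
        my @new = @points.map(-> $p {
            @centers.keys.min({ sq-dist($p, @centers[$_]) })
        });
        last if @new eqv @labels;    # assignments stopped changing
        @labels = @new;
        # Update step: each center becomes the mean of its assigned points.
        for ^$k -> $c {
            my @members = @labels.grep($c, :k).map({ @points[$_] });
            @centers[$c] = mean-point(@members) if @members;
        }
    }
    return %(
        Clusters   => (^$k).map(-> $c {
            @labels.grep($c, :k).map({ @points[$_] }).List
        }).List,
        MeanPoints => @centers.List,
    );
}

my @points = (1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9);
my %res = kmeans-sketch(@points, 2);
say %res<Clusters>.map(*.elems);   # (3 3)
```

K-medoids follows the same loop, except that each center is restricted to be one of the data points of its cluster rather than their mean.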
Here is more interesting looking two-dimensional data, data2D5:
use Data::Reshapers;

my $pointsPerCluster = 200;
my @data2D5 = [[10,20,4],[20,60,6],[40,10,6],[-30,0,4],[100,100,8]].map({
    random-variate(NormalDistribution.new($_[0], $_[2]), $pointsPerCluster) Z random-variate(NormalDistribution.new($_[1], $_[2]), $pointsPerCluster)
}).Array;
@data2D5 = flatten(@data2D5, max-level=>1).pick(*);
@data2D5.elems
# 1000
Here is a plot of that data:
text-list-plot(@data2D5)
# [Text plot of @data2D5; x-axis ticks from 0.00 to 100.00, y-axis ticks from 0.00 to 100.00.]
Here we find clusters and plot them together with their mean points:
srand(32);
my %clRes = find-clusters(@data2D5, 5, prop=>'All');
text-list-plot([|%clRes<Clusters>, %clRes<MeanPoints>], point-char=><1 2 3 4 5 ●>)
# [Text plot of the five found clusters, marked 1 to 5, with ● marking the mean points; x-axis ticks from 0.00 to 100.00, y-axis ticks from 0.00 to 120.00.]
Detailed parameter explanations and usage examples for the functions provided by the package are given in:
Here is a UML diagram that shows the package's structure:
The PlantUML spec and diagram were obtained with the CLI script to-uml-spec of the package "UML::Translators", [AAp5].
Here we get the PlantUML spec:
to-uml-spec ML::Clustering > ./resources/class-diagram.puml
#
Here we get the diagram:
to-uml-spec ML::Clustering | java -jar ~/PlantUML/plantuml-1.2022.5.jar -pipe > ./resources/class-diagram.png
#
Remark: Maybe it is a good idea to have an abstract class named, say, ML::Clustering::AbstractFinder that is a parent of ML::Clustering::KMeans, ML::Clustering::KMedoids, ML::Clustering::BiSectionalKMeans, etc., but I have not found it to be necessary at this point of development.
Remark: It seems it is better to have a separate package for the distance functions, named, say, "ML::DistanceFunctions". (Although distance functions are not just for ML...) After thinking over package and function names I will make such a package.
Allow data points to be Pair objects, the keys of which are point labels.
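Until that is supported, labelled points can be split by hand into labels and coordinates. The sketch below uses plain Raku only; the variable names and the sample data are hypothetical, and it simply shows one way to obtain the list of numeric lists that find-clusters is documented above to accept.

```raku
# Hypothetical labelled data: keys are point labels, values are coordinates.
my @labelled = a => (1, 1), b => (9, 8), c => (2, 1);

# Separate the labels from the coordinates; positions stay aligned,
# so the i-th label belongs to the i-th coordinate list.
my @labels = @labelled.map(*.key);
my @coords = @labelled.map(*.value);

say @labels;   # [a b c]
say @coords;   # [(1 1) (9 8) (2 1)]
```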
[Wk1] Wikipedia entry, "Cluster Analysis".
[AA1] Anton Antonov, "Introduction to data wrangling with Raku", (2021), RakuForPrediction at WordPress.
[AAp1] Anton Antonov, Bi-sectional K-means algorithm in Mathematica, (2020), MathematicaForPrediction at GitHub/antononcube.
[AAp2] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.
[AAp3] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.
[AAp4] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.
[AAp5] Anton Antonov, UML::Translators Raku package, (2022), GitHub/antononcube.
[AAp6] Anton Antonov, Text::Plot Raku package, (2022), GitHub/antononcube.