Help language development. Donate to The Perl Foundation

ML::TriesWithFrequencies cpan:ANTONOV last updated on 2022-04-03
# Raku ML::TriesWithFrequencies

This Raku package has functions for creation and manipulation of 
[Tries (Prefix trees)]( 
with frequencies.

The package provides Machine Learning (ML) functionalities, 
not "just" a Trie data structure.

This Raku implementation closely follows the Java implementation [AAp3].

The system of function names follows the one used in the Mathematica package [AAp2].

**Remark:** Below Mathematica and Wolfram Language (WL) are used as synonyms.

**Remark:** There is a Raku package with an alternative implementation, [AAp6], 
made mostly for comparison studies. (See the implementation notes below.) 
The package in this repository, `ML::TriesWithFrequencies`, is my *primary* 
Tries-with-frequencies package.


## Usage 

Consider a trie (prefix tree) created over a list of words:

use ML::TriesWithFrequencies;
my $tr = trie-create-by-split( <bar bark bars balm cert cell> );

Here we convert the trie with frequencies above into a trie with probabilities:

my $ptr = trie-node-probabilities( $tr );

Here we shrink the trie with probabilities above:


Here we retrieve a sub-trie with a key:

trie-say(trie-retrieve($ptr, 'bar'.comb))


## Representation

Each trie is a tree of objects of the class `ML::TriesWithFrequencies::Trie`.
Such trees can be nicely represented as hash-maps. For example:

my $tr = trie-shrink(trie-create-by-split(<core cort>));
say $tr.gist;

The function `trie-say` uses that Hash-representation:


### JSON

The JSON-representation follows the inherent object-tree
representation with `ML::TriesWithFrequencies::Trie`:

say $tr.JSON;

### XML

The XML-representation follows (resembles) the Hash-representation 
(and output from `trie-say`):

say $tr.XML;

Using the XML representation allows for 
searches, say, using the package 
Here is an example:

use XML::XPath;
my $tr0 = trie-create-by-split(<bell best>);
Convert to XML:

say $tr0.XML;

Search for `<b e l>`:


### WL

The Hash-representation is used in the Mathematica package [AAp2].
Hence, such WL format is provided by the Raku package:

say $tr.WL;


## Implementation notes

This package is a Raku re-implementation of the Java Trie package [AAp3].

The initial implementation was:
- ≈ 5-6 times slower than the Mathematica implementation [AAp2]
- ≈ 100 times slower than the Java implementation [AAp3]

The initial implementation used:
- General types for Trie nodes, i.e. `Str` for the key and `Numeric` for the value
- Argument type verification with `where` statements in the signatures of the `trie-*` functions

After reading [RAC1] I refactored the code to use native types (`num`, `str`)
and moved the `where` verifications inside the functions. 

I also refactored the function `trie-merge` to use less copying of data and
to take into account which of the two tries has smaller number of children.

After those changes the current Raku implementation is:
- ≈ 2.5 times slower than the Mathematica implementation [AAp2]
- ≈ 40 times slower than the Java implementation [AAp3]

After the (monumental) work on 
[the new MoarVM dispatch mechanism](,
[JW1], was incorporated in standard Rakudo releases (September/October 2021)
additional 20% speed-up was obtained. Currently this package is:
- ≈ 2.0 times slower than the Mathematica implementation [AAp2]
- ≈ 30 times slower than the Java implementation [AAp3]

These speed improvements are definitely not satisfactory. I strongly consider:

1. Re-implementing in Raku the Mathematica package [AAp2], i.e. to move into Tries that are hashes.

   - (It turned out option 1 does not produce better results; see [AAp6].)
2. Re-implementing in C or C++ the Java package [AAp3] and hooking it up to Raku.



In the following list the most important items are placed first.

- [X] Implement "get words" and "get root-to-leaf paths" functions.
     - See `trie-words` and `trie-root-to-leaf-paths`.
- [X] Convert most of the WL unit tests in [AAp5] into Raku tests.

- [X] Implement Trie traversal functions.

     - The general `trie-map` function is in a separate role.
         - A concrete traversal functionality is a class that does the role 
           and provides additional context.
- [X] Implement (sub-)trie removal functions.

     - [X] By threshold (below and above)
     - [X] By Pareto principle adherence (top and bottom)
     - [X] By regex over the keys

- [ ] Implement optional ULP spec argument for relevant functions:
     - [X] `trie-root-to-leaf-paths`
     - [X] `trie-words`
     - [ ] Membership test functions?
- [ ] Design and code refactoring so trie objects to have OOP interface.

    - Instead of just having `trie-words($tr, <c>)` we should be also able to say `$tr.trie-words(<c>)`.
- [ ] Implement `trie-prune` function.

- [ ] Implement Trie-based classification.

- [ ] Investigate faster implementations.
  - [X] Re-implement the Trie functionalities using hash representation (instead of a tree of Trie-node objects.)
     - See [AAp6].
  - [ ] Make a C or C++ implementation and hook it up to Raku.  
- [ ] Document examples of doing Trie-based text mining or data-mining.

- [ ] Program a trie-form visualization that is "wide", i.e. places the children nodes horizontally.


## References

### Articles

[AA1] Anton Antonov,
["Tries with frequencies for data mining"](,
[MathematicaForPrediction at WordPress](

[AA2] Anton Antonov,
["Removal of sub-trees in tries"](,
[MathematicaForPrediction at WordPress](

[AA3] Anton Antonov,
["Tries with frequencies in Java"](,
[MathematicaForPrediction at WordPress](
[GitHub Markdown](

[JW1] Jonathan Worthington,
["The new MoarVM dispatch mechanism is here!"](,
[6guts at WordPress](

[RAC1] Tib,
["Day 10: My 10 commandments for Raku performances"](,
[Raku Advent Calendar](

[WK1] Wikipedia entry, [Trie](

### Packages

[AAp1] Anton Antonov, 
[Tries with frequencies Mathematica Version 9.0 package](,
[MathematicaForPrediction at GitHub](

[AAp2] Anton Antonov,
[Tries with frequencies Mathematica package](,
[MathematicaForPrediction at GitHub](

[AAp3] Anton Antonov, 
[Tries with frequencies in Java](, 
[MathematicaForPrediction at GitHub](

[AAp4] Anton Antonov, 
[Java tries with frequencies Mathematica package](, 
[MathematicaForPrediction at GitHub](

[AAp5] Anton Antonov, 
[Java tries with frequencies Mathematica unit tests](, 
[MathematicaForPrediction at GitHub](

[AAp6] Anton Antonov,
[ML::HashTriesWithFrequencies Raku package](,

### Videos

[AAv1] Anton Antonov,
["Prefix Trees with Frequencies for Data Analysis and Machine Learning"](,
Wolfram Technology Conference 2017,
[Wolfram channel at YouTube](