Help language development. Donate to The Perl Foundation
This Raku package has functions for creation and manipulation of Tries (Prefix trees) with frequencies.
The package provides Machine Learning (ML) functionalities, not "just" a Trie data structure.
This Raku implementation closely follows the Java implementation [AAp3].
The system of function names follows the one used in the Mathematica package [AAp2].
Remark: Below Mathematica and Wolfram Language (WL) are used as synonyms.
Remark: There is a Raku package with an alternative implementation, [AAp6],
made mostly for comparison studies. (See the implementation notes below.)
The package in this repository, ML::TriesWithFrequencies
, is my primary
Tries-with-frequencies package.
Consider a trie (prefix tree) created over a list of words:
use ML::TriesWithFrequencies; my $tr = trie-create-by-split( <bar bark bars balm cert cell> ); trie-say($tr);
# TRIEROOT => 6 # ├─b => 4 # │ └─a => 4 # │ ├─l => 1 # │ │ └─m => 1 # │ └─r => 3 # │ ├─k => 1 # │ └─s => 1 # └─c => 2 # └─e => 2 # ├─l => 1 # │ └─l => 1 # └─r => 1 # └─t => 1
Here we convert the trie with frequencies above into a trie with probabilities:
my $ptr = trie-node-probabilities( $tr ); trie-say($ptr);
# TRIEROOT => 1 # ├─b => 0.6666666666666666 # │ └─a => 1 # │ ├─l => 0.25 # │ │ └─m => 1 # │ └─r => 0.75 # │ ├─k => 0.3333333333333333 # │ └─s => 0.3333333333333333 # └─c => 0.3333333333333333 # └─e => 1 # ├─l => 0.5 # │ └─l => 1 # └─r => 0.5 # └─t => 1
Here we shrink the trie with probabilities above:
trie-say(trie-shrink($ptr));
# TRIEROOT => 1 # ├─ba => 0.6666666666666666 # │ ├─lm => 0.25 # │ └─r => 0.75 # │ ├─k => 0.3333333333333333 # │ └─s => 0.3333333333333333 # └─ce => 0.3333333333333333 # ├─ll => 0.5 # └─rt => 0.5
Here we retrieve a sub-trie with a key:
trie-say(trie-retrieve($ptr, 'bar'.comb))
# r => 0.75 # ├─k => 0.3333333333333333 # └─s => 0.3333333333333333
Each trie is a tree of objects of the class ML::TriesWithFrequencies::Trie
.
Such trees can be nicely represented as hash-maps. For example:
my $tr = trie-shrink(trie-create-by-split(<core cort>)); say $tr.gist;
# {TRIEROOT => {TRIEVALUE => 2, cor => {TRIEVALUE => 2, e => {TRIEVALUE => 1}, t => {TRIEVALUE => 1}}}}
The function trie-say
uses that Hash-representation:
trie-say($tr)
# TRIEROOT => 2 # └─cor => 2 # ├─e => 1 # └─t => 1
The JSON-representation follows the inherent object-tree
representation with ML::TriesWithFrequencies::Trie
:
say $tr.JSON;
# {"key":"TRIEROOT", "value":2, "children":[{"key":"cor", "value":2, "children":[{"key":"t", "value":1, "children":[]}, {"key":"e", "value":1, "children":[]}]}]}
The XML-representation follows (resembles) the Hash-representation
(and output from trie-say
):
say $tr.XML;
# <TRIEROOT> # <TRIEVALUE>2</TRIEVALUE> # <cor> # <TRIEVALUE>2</TRIEVALUE> # <t> # <TRIEVALUE>1</TRIEVALUE> # </t> # <e> # <TRIEVALUE>1</TRIEVALUE> # </e> # </cor> # </TRIEROOT>
Using the XML representation allows for
XPath
searches, say, using the package
XML::XPath
.
Here is an example:
use XML::XPath; my $tr0 = trie-create-by-split(<bell best>); trie-say($tr0);
# TRIEROOT => 2 # └─b => 2 # └─e => 2 # ├─l => 1 # │ └─l => 1 # └─s => 1 # └─t => 1
Convert to XML:
say $tr0.XML;
# <TRIEROOT> # <TRIEVALUE>2</TRIEVALUE> # <b> # <TRIEVALUE>2</TRIEVALUE> # <e> # <TRIEVALUE>2</TRIEVALUE> # <s> # <TRIEVALUE>1</TRIEVALUE> # <t> # <TRIEVALUE>1</TRIEVALUE> # </t> # </s> # <l> # <TRIEVALUE>1</TRIEVALUE> # <l> # <TRIEVALUE>1</TRIEVALUE> # </l> # </l> # </e> # </b> # </TRIEROOT>
Search for <b e l>
:
say XML::XPath.new(xml=>$tr0.XML).find('//b/e/l');
# <l> # <TRIEVALUE>1</TRIEVALUE> # <l> # <TRIEVALUE>1</TRIEVALUE> # </l> # </l>
The Hash-representation is used in the Mathematica package [AAp2]. Hence, such WL format is provided by the Raku package:
say $tr.WL;
# <|$TrieRoot -> <|$TrieValue -> 2, "cor" -> <|$TrieValue -> 2, "t" -> <|$TrieValue -> 1|>, "e" -> <|$TrieValue -> 1|>|>|>|>
This package is a Raku re-implementation of the Java Trie package [AAp3].
The initial implementation was: - ≈ 5-6 times slower than the Mathematica implementation [AAp2] - ≈ 100 times slower than the Java implementation [AAp3]
The initial implementation used:
- General types for Trie nodes, i.e. Str
for the key and Numeric
for the value
- Argument type verification with where
statements in the signatures of the trie-*
functions
After reading [RAC1] I refactored the code to use native types (num
, str
)
and moved the where
verifications inside the functions.
I also refactored the function trie-merge
to use less copying of data and
to take into account which of the two tries has smaller number of children.
After those changes the current Raku implementation is: - ≈ 2.5 times slower than the Mathematica implementation [AAp2] - ≈ 40 times slower than the Java implementation [AAp3]
After the (monumental) work on the new MoarVM dispatch mechanism, [JW1], was incorporated in standard Rakudo releases (September/October 2021) additional 20% speed-up was obtained. Currently this package is: - ≈ 2.0 times slower than the Mathematica implementation [AAp2] - ≈ 30 times slower than the Java implementation [AAp3]
These speed improvements are definitely not satisfactory. I strongly consider:
Re-implementing in Raku the Mathematica package [AAp2], i.e. to move into Tries that are hashes.
Re-implementing in C or C++ the Java package [AAp3] and hooking it up to Raku.
In the following list the most important items are placed first.
[X] Implement "get words" and "get root-to-leaf paths" functions.
trie-words
and trie-root-to-leaf-paths
.[X] Convert most of the WL unit tests in [AAp5] into Raku tests.
[X] Implement Trie traversal functions.
The general trie-map
function is in a separate role.
[X] Implement (sub-)trie removal functions.
[X] By threshold (below and above)
[X] By Pareto principle adherence (top and bottom)
[X] By regex over the keys
[ ] Implement optional ULP spec argument for relevant functions:
[X] trie-root-to-leaf-paths
[X] trie-words
[ ] Membership test functions?
[ ] Design and code refactoring so trie objects to have OOP interface.
trie-words($tr, <c>)
we should be also able to say $tr.trie-words(<c>)
.[ ] Implement trie-prune
function.
[ ] Implement Trie-based classification.
[ ] Investigate faster implementations.
[X] Re-implement the Trie functionalities using hash representation (instead of a tree of Trie-node objects.)
[ ] Make a C or C++ implementation and hook it up to Raku.
[ ] Document examples of doing Trie-based text mining or data-mining.
[ ] Program a trie-form visualization that is "wide", i.e. places the children nodes horizontally.
[AA1] Anton Antonov, "Tries with frequencies for data mining", (2013), MathematicaForPrediction at WordPress.
[AA2] Anton Antonov, "Removal of sub-trees in tries", (2013), MathematicaForPrediction at WordPress.
[AA3] Anton Antonov, "Tries with frequencies in Java", (2017), MathematicaForPrediction at WordPress. GitHub Markdown.
[JW1] Jonathan Worthington, "The new MoarVM dispatch mechanism is here!", (2021), 6guts at WordPress.
[RAC1] Tib, "Day 10: My 10 commandments for Raku performances", (2020), Raku Advent Calendar.
[WK1] Wikipedia entry, Trie.
[AAp1] Anton Antonov, Tries with frequencies Mathematica Version 9.0 package, (2013), MathematicaForPrediction at GitHub.
[AAp2] Anton Antonov, Tries with frequencies Mathematica package, (2013-2018), MathematicaForPrediction at GitHub.
[AAp3] Anton Antonov, Tries with frequencies in Java, (2017), MathematicaForPrediction at GitHub.
[AAp4] Anton Antonov, Java tries with frequencies Mathematica package, (2017), MathematicaForPrediction at GitHub.
[AAp5] Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), MathematicaForPrediction at GitHub.
[AAp6] Anton Antonov, ML::HashTriesWithFrequencies Raku package, (2021), GitHub/antononcube.
[AAv1] Anton Antonov, "Prefix Trees with Frequencies for Data Analysis and Machine Learning", (2017), Wolfram Technology Conference 2017, Wolfram channel at YouTube.