This Raku package has data reshaping functions for different data structures that are coercible to full arrays.
The supported data structures are:
- Positional-of-hashes
- Positional-of-arrays
The five data reshaping operations provided by the package over those data structures are:
- cross-tabulate
- to-long-format
- to-wide-format
- join-across (aka SQL JOIN)
- transpose
The first four operations are fundamental in data wrangling and data analysis; see [AA1, Wk1, Wk2, AAv1-AAv2].
(Transposing tabular data is, of course, also fundamental, but it can also be seen as a basic functional programming operation.)
Making contingency tables -- or cross tabulation -- is a fundamental statistics and data analysis operation, [Wk1, AA1].
Here is an example using the Titanic dataset (that is provided by this package through the function get-titanic-dataset):
use Data::Reshapers;
my @tbl = get-titanic-dataset();
my $res = cross-tabulate( @tbl, 'passengerSex', 'passengerClass');
say $res;
# {female => {1st => 144, 2nd => 106, 3rd => 216}, male => {1st => 179, 2nd => 171, 3rd => 493}}
to-pretty-table($res);
# +--------+-----+-----+-----+
# |        | 2nd | 3rd | 1st |
# +--------+-----+-----+-----+
# | female | 106 | 216 | 144 |
# | male   | 171 | 493 | 179 |
# +--------+-----+-----+-----+
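To make the idea of cross tabulation concrete, here is a minimal sketch in core Raku that counts co-occurrences of two column values in a nested hash. It is not the package's implementation; the function name `cross-tab-sketch` and the sample records are hypothetical and used only for illustration.

```raku
# Hypothetical sample records (not taken from the Titanic dataset).
my @records =
    { sex => 'female', class => '1st' },
    { sex => 'male',   class => '3rd' },
    { sex => 'female', class => '1st' },
    { sex => 'male',   class => '1st' };

# Count how often each (row value, column value) pair occurs.
# Nested hash entries are autovivified on first increment.
sub cross-tab-sketch(@rows, Str $row-key, Str $col-key) {
    my %counts;
    for @rows -> %r {
        %counts{ %r{$row-key} }{ %r{$col-key} }++;
    }
    return %counts;
}

my %ct = cross-tab-sketch(@records, 'sex', 'class');
say %ct<female><1st>;   # 2
say %ct<male><3rd>;     # 1
```

The result has the same hash-of-hashes shape as the output of cross-tabulate shown above, which is why it can be passed on to table printing or transposing.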
Conversion to long format allows column names to be treated as data.
(More precisely, when converting to long format, the specified column names of a tabular dataset become values in a dedicated column, e.g. "Variable" in the long format.)
my @tbl1 = @tbl.roll(3);
.say for @tbl1;
# {id => 671, passengerAge => 30, passengerClass => 3rd, passengerSex => male, passengerSurvival => died}
# {id => 1256, passengerAge => -1, passengerClass => 3rd, passengerSex => male, passengerSurvival => died}
# {id => 796, passengerAge => -1, passengerClass => 3rd, passengerSex => male, passengerSurvival => died}
.say for to-long-format( @tbl1 );
# {AutomaticKey => 0, Value => male, Variable => passengerSex}
# {AutomaticKey => 0, Value => 30, Variable => passengerAge}
# {AutomaticKey => 0, Value => died, Variable => passengerSurvival}
# {AutomaticKey => 0, Value => 671, Variable => id}
# {AutomaticKey => 0, Value => 3rd, Variable => passengerClass}
# {AutomaticKey => 1, Value => male, Variable => passengerSex}
# {AutomaticKey => 1, Value => -1, Variable => passengerAge}
# {AutomaticKey => 1, Value => died, Variable => passengerSurvival}
# {AutomaticKey => 1, Value => 1256, Variable => id}
# {AutomaticKey => 1, Value => 3rd, Variable => passengerClass}
# {AutomaticKey => 2, Value => male, Variable => passengerSex}
# {AutomaticKey => 2, Value => -1, Variable => passengerAge}
# {AutomaticKey => 2, Value => died, Variable => passengerSurvival}
# {AutomaticKey => 2, Value => 796, Variable => id}
# {AutomaticKey => 2, Value => 3rd, Variable => passengerClass}
my @lfRes1 = to-long-format( @tbl1, 'id', [], variablesTo => "VAR", valuesTo => "VAL2" );
.say for @lfRes1;
# {VAL2 => died, VAR => passengerSurvival, id => 1256}
# {VAL2 => male, VAR => passengerSex, id => 1256}
# {VAL2 => 3rd, VAR => passengerClass, id => 1256}
# {VAL2 => -1, VAR => passengerAge, id => 1256}
# {VAL2 => died, VAR => passengerSurvival, id => 671}
# {VAL2 => male, VAR => passengerSex, id => 671}
# {VAL2 => 3rd, VAR => passengerClass, id => 671}
# {VAL2 => 30, VAR => passengerAge, id => 671}
# {VAL2 => died, VAR => passengerSurvival, id => 796}
# {VAL2 => male, VAR => passengerSex, id => 796}
# {VAL2 => 3rd, VAR => passengerClass, id => 796}
# {VAL2 => -1, VAR => passengerAge, id => 796}
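The conversion itself can be sketched in a few lines of core Raku: every non-identifier column of a record becomes its own (identifier, variable, value) record. This is only an illustration of the idea, not the package's code; `long-format-sketch`, its named parameters, and the sample data are assumptions made for this example.

```raku
# Hypothetical wide-format records.
my @wide =
    { id => 1, age => 30, class => '3rd' },
    { id => 2, age => 41, class => '1st' };

# Turn each non-identifier column into a separate long-format record.
sub long-format-sketch(@rows, Str $id-col,
                       Str :$variables-to = 'Variable',
                       Str :$values-to    = 'Value') {
    my @long;
    for @rows -> %r {
        # Sort the keys so that the output order is deterministic.
        for %r.keys.grep(* ne $id-col).sort -> $k {
            @long.push: %( $id-col      => %r{$id-col},
                           $variables-to => $k,
                           $values-to    => %r{$k} );
        }
    }
    return @long;
}

my @long = long-format-sketch(@wide, 'id');
say @long.elems;          # 4
say @long[0]<Variable>;   # age
say @long[0]<Value>;      # 30
```

Two records with two non-identifier columns each yield four long-format records, mirroring how the three Titanic records above yield twelve.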
Here we transform the long format result @lfRes1 above into wide format -- the result has the same records as @tbl1:
to-pretty-table( to-wide-format( @lfRes1, 'id', 'VAR', 'VAL2' ) );
# +-------------------+----------------+--------------+------+--------------+
# | passengerSurvival | passengerClass | passengerSex | id   | passengerAge |
# +-------------------+----------------+--------------+------+--------------+
# | died              | 3rd            | male         | 1256 | -1           |
# | died              | 3rd            | male         | 671  | 30           |
# | died              | 3rd            | male         | 796  | -1           |
# +-------------------+----------------+--------------+------+--------------+
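The reverse direction can be sketched the same way: group the long-format records by identifier and turn each variable name back into a column. Again, this is a core-Raku illustration under stated assumptions, not the package's implementation; `wide-format-sketch` and the sample records are hypothetical.

```raku
# Hypothetical long-format records.
my @long =
    { id => 1, VAR => 'age',   VAL => 30    },
    { id => 1, VAR => 'class', VAL => '3rd' },
    { id => 2, VAR => 'age',   VAL => 41    },
    { id => 2, VAR => 'class', VAL => '1st' };

# Collect all (variable, value) pairs that share an identifier into one record.
sub wide-format-sketch(@rows, Str $id-col, Str $var-col, Str $val-col) {
    my %by-id;
    for @rows -> %r {
        my $id = %r{$id-col};
        %by-id{$id}{$id-col}         = $id;
        %by-id{$id}{ %r{$var-col} }  = %r{$val-col};
    }
    # Hash keys are strings, so this ordering is lexicographic by identifier.
    return %by-id.sort(*.key).map(*.value).Array;
}

my @wide = wide-format-sketch(@long, 'id', 'VAR', 'VAL');
say @wide.elems;       # 2
say @wide[0]<age>;     # 30
say @wide[1]<class>;   # 1st
```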
Using the cross tabulation result above:
my $tres = transpose( $res );
to-pretty-table($res, title => "Original");
# +--------------------------+
# |         Original         |
# +--------+-----+-----+-----+
# |        | 1st | 2nd | 3rd |
# +--------+-----+-----+-----+
# | female | 144 | 106 | 216 |
# | male   | 179 | 171 | 493 |
# +--------+-----+-----+-----+
to-pretty-table($tres, title => "Transposed");
# +---------------------+
# |     Transposed      |
# +-----+--------+------+
# |     | female | male |
# +-----+--------+------+
# | 1st | 144    | 179  |
# | 2nd | 106    | 171  |
# | 3rd | 216    | 493  |
# +-----+--------+------+
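For a hash of hashes, transposing amounts to swapping the outer and inner keys. The following core-Raku sketch shows that on the counts from the cross tabulation above; `transpose-sketch` is a hypothetical name, and the package's own transpose also covers other data structures.

```raku
# The contingency table from the cross tabulation example, written out by hand.
my %table =
    female => { '1st' => 144, '2nd' => 106, '3rd' => 216 },
    male   => { '1st' => 179, '2nd' => 171, '3rd' => 493 };

# Swap outer and inner keys of a hash of hashes.
sub transpose-sketch(%t) {
    my %res;
    for %t.kv -> $row, %cols {
        for %cols.kv -> $col, $val {
            %res{$col}{$row} = $val;
        }
    }
    return %res;
}

my %tr = transpose-sketch(%table);
say %tr<1st><male>;      # 179
say %tr<3rd><female>;    # 216
```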
Earlier versions of the package implemented a type "deduction" system. Currently, the type system is provided by the package "Data::TypeSystem", [AAp1].
The type system conventions follow those of Mathematica's Dataset -- see the presentation "Dataset improvements".
Here we get the Titanic dataset, change the "passengerAge" column values to be numeric, and show the dataset's dimensions:
my @dsTitanic = get-titanic-dataset(headers => 'auto');
@dsTitanic = @dsTitanic.map({$_<passengerAge> = $_<passengerAge>.Numeric; $_}).Array;
dimensions(@dsTitanic)
# (1309 5)
Here is a sample of the dataset's records:
to-pretty-table(@dsTitanic.pick(5).List, field-names => <id passengerAge passengerClass passengerSex passengerSurvival>)
# +-----+--------------+----------------+--------------+-------------------+
# | id  | passengerAge | passengerClass | passengerSex | passengerSurvival |
# +-----+--------------+----------------+--------------+-------------------+
# | 743 | 40           | 3rd            | male         | survived          |
# | 157 | 40           | 1st            | male         | died              |
# | 659 | 0            | 3rd            | female       | survived          |
# | 228 | 20           | 1st            | female       | survived          |
# | 738 | 20           | 3rd            | male         | died              |
# +-----+--------------+----------------+--------------+-------------------+
Here is the type of a single record:
use Data::TypeSystem;
deduce-type(@dsTitanic[12])
# Struct([id, passengerAge, passengerClass, passengerSex, passengerSurvival], [Str, Int, Str, Str, Str])
Here is the type of a single record's values:
deduce-type(@dsTitanic[12].values.List)
# Tuple([Atom((Str)), Atom((Str)), Atom((Int)), Atom((Str)), Atom((Str))])
Here is the type of the whole dataset:
deduce-type(@dsTitanic)
# Vector(Struct([id, passengerAge, passengerClass, passengerSex, passengerSurvival], [Str, Int, Str, Str, Str]), 1309)
Here is the type of "values only" records:
my @valArr = @dsTitanic>>.values>>.Array;
deduce-type(@valArr)
# Vector((Any), 1309)
Here is the type of the "string values only" records:
my @strValArr = delete-columns(@dsTitanic, 'passengerAge')>>.values>>.Array;
deduce-type(@strValArr)
# Vector(Vector(Atom((Str)), 4), 1309)
[X] DONE Simpler more convenient interface.
[ ] TODO More extensive long format tests.
[ ] TODO More extensive wide format tests.
[ ] TODO Implement verifications for:
See the type system implementation -- it has all of the functionalities listed here.
[X] DONE Positional-of-hashes
[X] DONE Positional-of-arrays
[X] DONE Positional-of-key-to-array-pairs
[X] DONE Positional-of-hashes, each record of which has:
[X] DONE Positional-of-arrays, each record of which has:
[X] DONE Implement "nice tabular visualization" using Pretty::Table and/or Text::Table::Simple.
[X] DONE Document examples using pretty tables.
[X] DONE Implement transposing operation for:
[X] DONE Implement to-pretty-table for:
[ ] DONE Implement join-across:
[X] DONE Implement semi- and anti-join
[ ] TODO Implement to long format conversion for:
[ ] TODO Speed/performance profiling.
[ ] TODO Type system.
[X] DONE Refactor the type system into a separate package.
[X] DONE "Simple" or fundamental functions
- flatten
- take-drop
- tally (obtainable in core Raku with .BagHash.Hash)
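The `.BagHash.Hash` idiom mentioned at the end of the list above can be illustrated with core Raku alone: coercing a list to a BagHash counts the occurrences of each element, and `.Hash` turns those counts into an ordinary hash. The sample values below are hypothetical.

```raku
# Hypothetical list of passenger classes.
my @classes = <1st 3rd 3rd 2nd 3rd>;

# Tally the elements: BagHash counts occurrences, Hash gives a plain key => count map.
my %tally = @classes.BagHash.Hash;

say %tally<3rd>;   # 3
say %tally<1st>;   # 1
```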
[AA1] Anton Antonov, "Contingency tables creation examples", (2016), MathematicaForPrediction at WordPress.
[Wk1] Wikipedia entry, Contingency table.
[Wk2] Wikipedia entry, Wide and narrow data.
[AAf1] Anton Antonov, CrossTabulate, (2019), Wolfram Function Repository.
[AAf2] Anton Antonov, LongFormDataset, (2020), Wolfram Function Repository.
[AAf3] Anton Antonov, WideFormDataset, (2021), Wolfram Function Repository.
[AAf4] Anton Antonov, RecordsSummary, (2019), Wolfram Function Repository.
[AAp1] Anton Antonov, Data::TypeSystem Raku package, (2023), GitHub/antononcube.
[AAv1] Anton Antonov, "Multi-language Data-Wrangling Conversational Agent", (2020), YouTube channel of Wolfram Research, Inc. (Wolfram Technology Conference 2020 presentation.)
[AAv2] Anton Antonov, "Data Transformation Workflows with Anton Antonov, Session #1", (2020), YouTube channel of Wolfram Research, Inc.
[AAv3] Anton Antonov, "Data Transformation Workflows with Anton Antonov, Session #2", (2020), YouTube channel of Wolfram Research, Inc.