Hash join: Difference between revisions

Rewrite the task description for clarity. Doesn't change the task content, except that the join column in table B has been renamed from "Name" to "Character".
(→‎{{header|Perl 6}}: correct code to handle non-unique keys in both tables)
(Rewrite the task description for clarity. Doesn't change the task content, except that the join column in table B has been renamed from "Name" to "Character".)
Line 1:
{{task}}
 
The classic [[wp:Hash Join|hash join]] algorithm for an inner join of two relations has the following steps:
An [[wp:Join_(SQL)#Inner_join|inner join]] is an operation that combines two data tables into one table, based on matching column values. The simplest way of implementing this operation is the [[wp:Nested loop join|nested loop join]] algorithm, but a more scalable alternative is the [[wp:hash join|hash join]] algorithm.
<ul>
 
<li>Hash phase: Create a hash table for one of the two relations by applying a hash
{{introheader|Task}}
function to the join attribute of each row. Ideally we should create a hash table for the
 
smaller relation, thus optimizing for creation time and memory size of the hash table.</li>
Implement the "hash join" algorithm, and demonstrate that it passes the test-case listed below.
<li>Join phase: Scan the larger relation and find the relevant rows by looking in the
 
hash table created before.</li>
You should represent the tables as data structures that feel natural in your programming language.
</ul>
 
{{introheader|Guidance}}
 
The "hash join" algorithm consists of two steps:
 
# '''Hash phase:''' Create a [[wp:Multimap|multimap]] from one of the two tables, mapping from each join column value to all the rows that contain it.<br>
#* The multimap must support hash-based lookup which scales better than a simple linear search, because that's the whole point of this algorithm.
#* Ideally we should create the multimap for the ''smaller'' table, thus minimizing its creation time and memory size.
# '''Join phase:''' Scan the other table, and find matching rows by looking in the multimap created before.
 
<br>
TheIn pseudo-code, the algorithm iscould be expressed as follows:
 
'''let''' ''A'' = the first input table (or ideally, the larger one)
'''let''' ''B'' = the second input table (or ideally, the smaller one)
'''let''' ''j<sub>A</sub>'' = the join column ID of table ''A''
'''let''' ''j<sub>B</sub>'' = the join column ID of table ''B''
'''let''' ''M<sub>B</sub>'' = a multimap for mapping from single values to multiple rows of table ''B'' (starts out empty)
'''let''' ''C'' = the output table (starts out empty)
'''for each''' row ''b'' '''in''' table ''B'' '''do'''
'''place''' ''b'' '''in''' multimap ''M<sub>B</sub>'' under key ''b''(''j<sub>B</sub>'')
'''for each''' row ''a'' '''in''' table ''A'' '''do'''
'''for each''' row ''b'' '''in''' multimap ''M<sub>B</sub>'' under key ''a''(''j<sub>A</sub>'')
'''let''' ''c'' = the concatenation of row ''a'' and row ''b''
'''place''' ''c'' in table ''C''
 
{{introheader|Test-case}}
 
{| class="wikitable"
|-
! Input
! Output
|-
|
 
{| style="border:none; border-collapse:collapse;"
|-
| style="border:none" | '''<math>A</math>''' =
| style="border:none" |
 
{| class="wikitable"
|-
! Age !! Name
|-
| 27 || Jonah
|-
| 18 || Alan
|-
| 28 || Glory
|-
| 18 || Popeye
|-
| 28 || Alan
|}
 
| style="border:none; padding-left:1.5em;" rowspan="2" |
| style="border:none" | <math>B</math> =
| style="border:none" |
 
{| class="wikitable"
|-
! Character !! Nemesis
|-
| Jonah || Whales
|-
| Jonah || Spiders
|-
| Alan || Ghosts
|-
| Alan || Zombies
|-
| Glory || Buffy
|}
 
|-
| style="border:none" | <math>j_A</math> =
| style="border:none" | <code>Name</code> (i.e. column 1)
 
| style="border:none" | <math>j_B</math> =
| style="border:none" | <code>Character</code> (i.e. column 0)
|}
 
|
'''for each''' tuple ''s'' '''in''' ''S'' '''do'''
'''let''' ''h'' = hash on join attributes ''s''(b)
'''place''' ''s'' '''in''' hash table ''S<sub>h</sub>'' '''in''' bucket '''keyed by''' hash value ''h''
'''for each''' tuple ''r'' '''in''' ''R'' '''do'''
'''let''' ''h'' = hash on join attributes ''r''(a)
'''if''' ''h'' indicates a nonempty bucket (''B'') of hash table ''S<sub>h</sub>''
'''if''' ''h'' matches any ''s'' in ''B''
'''concatenate''' ''r'' and ''s''
'''place''' relation in ''Q''
 
{| class="wikitable" style="margin-left:1em"
|-
! A.Age !! A.Name !! B.Character !! B.Nemesis
|-
| 27 || Jonah || Jonah || Whales
|-
| 27 || Jonah || Jonah || Spiders
|-
| 18 || Alan || Alan || Ghosts
|-
| 18 || Alan || Alan || Zombies
|-
| 28 || Glory || Glory || Buffy
|-
| 28 || Alan || Alan || Ghosts
|-
| 28 || Alan || Alan || Zombies
|}
 
|}
;Task:
Implement the Hash Join algorithm and show the result of joining two tables with it.
 
The order of the rows in the output table is not significant.<br>
You should use your implementation to show the joining of these tables:
If you're using numerically indexed arrays to represent table rows (rather than referring to columns by name), you could represent the output rows in the form <code style="white-space:nowrap">[[27, "Jonah"], ["Jonah", "Whales"]]</code>.
 
<br><hr>
<table style="border-collapse:separate; border-spacing:2em 0;"><tr><td><table border>
<tr><th>Age</th><th>Name</th></tr>
<tr><td>27</td><td>Jonah</td></tr>
<tr><td>18</td><td>Alan</td></tr>
<tr><td>28</td><td>Glory</td></tr>
<tr><td>18</td><td>Popeye</td></tr>
<tr><td>28</td><td>Alan</td></tr>
</table></td><td><table border>
<tr><th>Name</th><th>Nemesis</th></tr>
<tr><td>Jonah</td><td>Whales</td></tr>
<tr><td>Jonah</td><td>Spiders</td></tr>
<tr><td>Alan</td><td>Ghosts</td></tr>
<tr><td>Alan</td><td>Zombies</td></tr>
<tr><td>Glory</td><td>Buffy</td></tr>
</table></td></tr></table>
<br><br>
 
=={{header|Bracmat}}==
Anonymous user