Hash join: Difference between revisions

Content added Content deleted

Inline

Revision as of 22:17, 12 December 2013

The classic hash join algorithm for an inner join of two relations has the following steps:

Hash phase: Create a hash table for one of the two relations by applying a hash function to the join attribute of each row. Ideally we should create a hash table for the smaller relation, thus optimizing for creation time and memory size of the hash table.
Join phase: Scan the larger relation and find the relevant rows by looking in the hash table created before.

The algorithm is as follows:

for each tuple s in S do
   let h = hash on join attributes s(b)
   place s in hash table S_h in bucket keyed by hash value h
for each tuple r in R do
   let h = hash on join attributes r(a)
   if h indicates a nonempty bucket (B) of hash table S_h
      if h matches any s in B
         concatenate r and s
      place relation in Q

Task: implement the Hash Join algorithm and show the result of joining two tables with it. You should use your implementation to show the joining of these tables:

Age	Name
27	Jonah
18	Alan
28	Glory
18	Popeye
28	Alan

Name	Nemesis
Jonah	Whales
Jonah	Spiders
Alan	Ghosts
Alan	Zombies
Glory	Buffy

Common Lisp

<lang lisp>(defparameter *table-A* '((27 "Jonah") (18 "Alan") (28 "Glory") (18 "Popeye") (28 "Alan")))

(defparameter *table-B* '(("Jonah" "Whales") ("Jonah" "Spiders") ("Alan" "Ghosts") ("Alan" "Zombies") ("Glory" "Buffy")))

Hash phase

(defparameter *hash-table* (make-hash-table :test #'equal))

(loop for (i r) in *table-A*

  for value = (gethash r *hash-table* (list nil))  do
  (setf (gethash r *hash-table*) value)
  (push (list i r) (first value)))

Join phase

(loop for (i r) in *table-B* do

    (let ((val (car (gethash i *hash-table*))))
      (loop for (a b) in val do

(format t "{~a ~a} {~a ~a}~%" a b i r)))) </lang>

Output:

{27 Jonah} {Jonah Whales}
{27 Jonah} {Jonah Spiders}
{28 Alan} {Alan Ghosts}
{18 Alan} {Alan Ghosts}
{28 Alan} {Alan Zombies}
{18 Alan} {Alan Zombies}
{28 Glory} {Glory Buffy}

Go

<lang go>package main

import "fmt"

func main() {

   tableA := []struct {
       value int
       key   string
   }{
       {27, "Jonah"}, {18, "Alan"}, {28, "Glory"}, {18, "Popeye"},
       {28, "Alan"},
   }
   tableB := []struct {
       key   string
       value string
   }{
       {"Jonah", "Whales"}, {"Jonah", "Spiders"},
       {"Alan", "Ghosts"}, {"Alan", "Zombies"}, {"Glory", "Buffy"},
   }
   // hash phase
   h := map[string][]int{}
   for i, r := range tableA {
       h[r.key] = append(h[r.key], i)
   }
   // join phase
   for b := range tableB {
       for _, a := range h[tableB[b].key] {
           fmt.Println(tableA[a], tableB[b])
       }
   }

}</lang>

Output:

{27 Jonah} {Jonah Whales}
{27 Jonah} {Jonah Spiders}
{18 Alan} {Alan Ghosts}
{28 Alan} {Alan Ghosts}
{18 Alan} {Alan Zombies}
{28 Alan} {Alan Zombies}
{28 Glory} {Glory Buffy}

Haskell

The ST monad allows us to utilise mutable memory behind a referentially transparent interface, allowing us to use hashtables (efficiently).

Our hashJoin function takes two lists and two selector functions.

Placing all relations with the same selector value in a list in the hashtable allows us to join many to one/many relations. <lang Haskell>{-# LANGUAGE LambdaCase, TupleSections #-} import qualified Data.HashTable.ST.Basic as H import Data.Hashable import Control.Monad.ST import Control.Monad import Data.STRef

hashJoin :: (Eq k, Hashable k) =>

           [t] -> (t -> k) -> [a] -> (a -> k) -> [(t, a)]

hashJoin xs fx ys fy = runST $ do

 l <- newSTRef []
 ht <- H.new
 forM_ ys $ \y -> H.insert ht (fy y) =<< 
   (H.lookup ht (fy y) >>= \case
     Nothing -> return [y]
     Just v -> return (y:v))
 forM_ xs $ \x -> do
   H.lookup ht (fx x) >>= \case
     Nothing -> return ()
     Just v -> modifySTRef' l ((map (x,)  v) ++) 
 readSTRef l

test = mapM_ print $ hashJoin

   [(1, "Jonah"), (2, "Alan"), (3, "Glory"), (4, "Popeye")]
       snd
   [("Jonah", "Whales"), ("Jonah", "Spiders"), 
     ("Alan", "Ghosts"), ("Alan", "Zombies"), ("Glory", "Buffy")]
       fst

</lang>

λ> test
((3,"Glory"),("Glory","Buffy"))
((2,"Alan"),("Alan","Zombies"))
((2,"Alan"),("Alan","Ghosts"))
((1,"Jonah"),("Jonah","Spiders"))
((1,"Jonah"),("Jonah","Whales"))

The task require hashtables; however, a cleaner and more functional solution would be to use Data.Map (based on binary trees): <lang Haskell>{-# LANGUAGE TupleSections #-} import qualified Data.Map as M import Data.List import Data.Maybe import Control.Applicative

mapJoin xs fx ys fy = joined

 where yMap = foldl' f M.empty ys
       f m y = M.insertWith (++) (fy y) [y] m
       joined = concat . catMaybes . 
                map (\x -> map (x,) <$> M.lookup (fx x) yMap) $ xs

test = mapM_ print $ mapJoin

   [(1, "Jonah"), (2, "Alan"), (3, "Glory"), (4, "Popeye")]
       snd
   [("Jonah", "Whales"), ("Jonah", "Spiders"), 
    ("Alan", "Ghosts"), ("Alan", "Zombies"), ("Glory", "Buffy")]
       fst

</lang>

λ> test
((1,"Jonah"),("Jonah","Spiders"))
((1,"Jonah"),("Jonah","Whales"))
((2,"Alan"),("Alan","Zombies"))
((2,"Alan"),("Alan","Ghosts"))
((3,"Glory"),("Glory","Buffy"))

Perl 6

<lang perl6>my @A = [1, "Jonah"],

       [2, "Alan"],
       [3, "Glory"],
       [4, "Popeye"];

my @B = ["Jonah", "Whales"],

       ["Jonah", "Spiders"],
       ["Alan", "Ghosts"],
       ["Alan", "Zombies"],
       ["Glory", "Buffy"];

sub hash-join(@a, &a, @b, &b) {

   my %hash{Any};
   %hash{.&a} = $_ for @a;
   ([%hash{.&b} // next, $_] for @b);

}

.perl.say for hash-join @A, *.[1], @B, *.[0];</lang>

Output:

[[1, "Jonah"], ["Jonah", "Whales"]]
[[1, "Jonah"], ["Jonah", "Spiders"]]
[[2, "Alan"], ["Alan", "Ghosts"]]
[[2, "Alan"], ["Alan", "Zombies"]]
[[3, "Glory"], ["Glory", "Buffy"]]

Tcl

Tcl uses hash tables to implement both its associative arrays and its dictionaries. <lang tcl>package require Tcl 8.6

Only for lmap, which can be replaced with foreach

proc joinTables {tableA indexA tableB indexB} {

   # Optimisation: if the first table is longer, do in reverse order
   if {[llength $tableB] < [llength $tableA]} {

return [lmap pair [joinTables $tableB $indexB $tableA $indexA] { lreverse $pair }]

   foreach value $tableA {

set hashmap([lindex $value $indexA]) $value #dict version# dict set hashmap [lindex $value $indexA] $value

   }
   lmap value $tableB {

set key [lindex $value $indexB] if {![info exist hashmap($key)]} continue #dict version# if {![dict exists $hashmap $key]} continue list $hashmap($key) $value #dict version# list [dict get $hashmap $key] $value

}

set tableA {

   {27 "Jonah"} {18 "Alan"} {28 "Glory"} {18 "Popeye"} {28 "Alan"}

} set tableB {

   {"Jonah" "Whales"} {"Jonah" "Spiders"} {"Alan" "Ghosts"} {"Alan" "Zombies"}
   {"Glory" "Buffy"}

} set joined [joinTables $tableA 1 $tableB 0] foreach row $joined {

   puts $row

}</lang>

Output:

{27 "Jonah"} {"Jonah" "Whales"}
{27 "Jonah"} {"Jonah" "Spiders"}
{28 "Alan"} {"Alan" "Ghosts"}
{28 "Alan"} {"Alan" "Zombies"}
{28 "Glory"} {"Glory" "Buffy"}