Hash join
Revision as of 21:34, 11 December 2013
The classic hash join algorithm for an inner join of two relations has the following steps:
- Hash phase: Create a hash table for one of the two relations by applying a hash function to the join attribute of each row. Ideally we should create a hash table for the smaller relation, thus optimizing for creation time and memory size of the hash table.
- Join phase: Scan the larger relation and find the relevant rows by looking in the hash table created before.
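The two phases above can be sketched as follows. This is a minimal illustration rather than a reference solution: `Row`, `Pair` and `HashJoin` are hypothetical names, rows are assumed to have a single string join key, and the smaller input is chosen as the hashed side by comparing slice lengths.

```go
package main

import "fmt"

// Row is a hypothetical two-column row: a join key plus one other field.
type Row struct {
	Key, Val string
}

// Pair is one joined result: a row from the hashed (build) side
// and a row from the scanned (probe) side.
type Pair struct {
	Build, Probe Row
}

// HashJoin hashes the smaller input and scans the larger one.
func HashJoin(r, s []Row) []Pair {
	build, probe := r, s
	if len(s) < len(r) {
		build, probe = s, r
	}
	// Hash phase: group the rows of the smaller relation by join key.
	table := make(map[string][]Row, len(build))
	for _, b := range build {
		table[b.Key] = append(table[b.Key], b)
	}
	// Join phase: scan the larger relation and look each key up.
	var out []Pair
	for _, p := range probe {
		for _, b := range table[p.Key] {
			out = append(out, Pair{Build: b, Probe: p})
		}
	}
	return out
}

func main() {
	ages := []Row{{"Jonah", "27"}, {"Alan", "18"}, {"Glory", "28"}, {"Popeye", "18"}, {"Alan", "28"}}
	nemeses := []Row{{"Jonah", "Whales"}, {"Glory", "Buffy"}}
	for _, p := range HashJoin(ages, nemeses) {
		fmt.Println(p.Build, p.Probe)
	}
}
```

Because the inputs may be swapped, each `Pair` records which row came from the hashed side instead of promising a fixed column order.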
The algorithm is as follows:
 for each tuple s in S do
    let h = hash on join attributes s(b)
    place s in hash table Sh in bucket keyed by hash value h
 for each tuple r in R do
    let h = hash on join attributes r(a)
    if h indicates a nonempty bucket (B) of hash table Sh
       if h matches any s in B
          concatenate r and s
       place relation in Q
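The pseudocode keeps buckets keyed by hash value and then checks for a real match inside the bucket, a step that built-in map types usually hide. The sketch below makes that step explicit. It assumes string join attributes; `Tuple`, `hashOf` and `Join` are hypothetical names, and FNV-1a from the Go standard library is an arbitrary choice of hash function.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Tuple is a hypothetical row: one join attribute and one other column.
type Tuple struct {
	Join, Other string
}

// hashOf stands in for "hash on join attributes" (assumption: FNV-1a).
func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Join hashes S into buckets, then scans R, as in the pseudocode.
func Join(R, S []Tuple) [][2]Tuple {
	// for each tuple s in S: place s in the bucket keyed by its hash value
	Sh := map[uint32][]Tuple{}
	for _, s := range S {
		h := hashOf(s.Join)
		Sh[h] = append(Sh[h], s)
	}
	// for each tuple r in R: find the bucket, then compare the real keys
	var Q [][2]Tuple
	for _, r := range R {
		for _, s := range Sh[hashOf(r.Join)] {
			if s.Join == r.Join { // equal hashes do not guarantee equal keys
				Q = append(Q, [2]Tuple{r, s})
			}
		}
	}
	return Q
}

func main() {
	R := []Tuple{{"Jonah", "27"}, {"Alan", "18"}, {"Popeye", "18"}}
	S := []Tuple{{"Jonah", "Whales"}, {"Jonah", "Spiders"}, {"Alan", "Ghosts"}}
	for _, q := range Join(R, S) {
		fmt.Println(q[0], q[1])
	}
}
```

The equality test inside the bucket is what keeps the result correct when two different join values happen to share a hash value.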
Task: implement the Hash Join algorithm and show the result of joining two tables with it.
Common Lisp
<lang lisp>;; Uses the same example as in Go. Assumes that all the join attributes in
;; *table-B* are hashed into *hash-table*; otherwise a 'when' condition needs
;; to be added.

(defparameter *table-A* '((27 "Jonah") (18 "Alan") (28 "Glory") (18 "Popeye") (28 "Alan")))

(defparameter *table-B* '(("Jonah" "Whales") ("Jonah" "Spiders") ("Alan" "Ghosts") ("Alan" "Zombies") ("Glory" "Buffy")))

;; Hash phase
(defparameter *hash-table* (make-hash-table :test #'equal))

(loop for (i r) in *table-A*
      for value = (gethash r *hash-table* (list nil)) do
      (setf (gethash r *hash-table*) value)
      (push (list i r) (first value)))

;; Join phase
(loop for (i r) in *table-B* do
      (let ((val (car (gethash i *hash-table*))))
        (loop for (a b) in val do
              (format t "{~a ~a} {~a ~a}~%" a b i r))))</lang>
- Output:
{27 Jonah} {Jonah Whales}
{27 Jonah} {Jonah Spiders}
{28 Alan} {Alan Ghosts}
{18 Alan} {Alan Ghosts}
{28 Alan} {Alan Zombies}
{18 Alan} {Alan Zombies}
{28 Glory} {Glory Buffy}
Go
<lang go>package main

import "fmt"

func main() {
    tableA := []struct {
        value int
        key   string
    }{
        {27, "Jonah"}, {18, "Alan"}, {28, "Glory"}, {18, "Popeye"},
        {28, "Alan"},
    }
    tableB := []struct {
        key   string
        value string
    }{
        {"Jonah", "Whales"}, {"Jonah", "Spiders"},
        {"Alan", "Ghosts"}, {"Alan", "Zombies"}, {"Glory", "Buffy"},
    }
    // hash phase
    h := map[string][]int{}
    for i, r := range tableA {
        h[r.key] = append(h[r.key], i)
    }
    // join phase
    for b := range tableB {
        for _, a := range h[tableB[b].key] {
            fmt.Println(tableA[a], tableB[b])
        }
    }
}</lang>
- Output:
{27 Jonah} {Jonah Whales}
{27 Jonah} {Jonah Spiders}
{18 Alan} {Alan Ghosts}
{28 Alan} {Alan Ghosts}
{18 Alan} {Alan Zombies}
{28 Alan} {Alan Zombies}
{28 Glory} {Glory Buffy}
Haskell
The ST monad allows us to utilise mutable memory behind a referentially transparent interface, allowing us to use hashtables (efficiently).
Our hashJoin function takes two lists and two selector functions.
Placing all relations with the same selector value in a list in the hashtable allows us to join many to one/many relations.
<lang Haskell>{-# LANGUAGE LambdaCase, TupleSections #-}
import qualified Data.HashTable.ST.Basic as H
import Data.Hashable
import Control.Monad.ST
import Control.Monad
import Data.STRef

hashJoin :: (Eq k, Hashable k) =>
            [t] -> (t -> k) -> [a] -> (a -> k) -> [(t, a)]
hashJoin xs fx ys fy = runST $ do
  l <- newSTRef []
  ht <- H.new
  forM_ ys $ \y -> H.insert ht (fy y) =<<
    (H.lookup ht (fy y) >>= \case
      Nothing -> return [y]
      Just v -> return (y:v))
  forM_ xs $ \x -> do
    H.lookup ht (fx x) >>= \case
      Nothing -> return ()
      Just v -> modifySTRef' l ((map (x,) v) ++)
  readSTRef l

test = mapM_ print $ hashJoin
    [(1, "Jonah"), (2, "Alan"), (3, "Glory"), (4, "Popeye")]
        snd
    [("Jonah", "Whales"), ("Jonah", "Spiders"),
      ("Alan", "Ghosts"), ("Alan", "Zombies"), ("Glory", "Buffy")]
        fst
</lang>
λ> test
((3,"Glory"),("Glory","Buffy"))
((2,"Alan"),("Alan","Zombies"))
((2,"Alan"),("Alan","Ghosts"))
((1,"Jonah"),("Jonah","Spiders"))
((1,"Jonah"),("Jonah","Whales"))
The task requires hashtables; however, a cleaner and more functional solution would be to use Data.Map (based on binary trees):
<lang Haskell>{-# LANGUAGE TupleSections #-}
import qualified Data.Map as M
import Data.List
import Data.Maybe
import Control.Applicative

mapJoin xs fx ys fy = joined
  where yMap = foldl' f M.empty ys
        f m y = M.insertWith (++) (fy y) [y] m
        joined = concat .
                 catMaybes .
                 map (\x -> map (x,) <$> M.lookup (fx x) yMap) $ xs

test = mapM_ print $ mapJoin
    [(1, "Jonah"), (2, "Alan"), (3, "Glory"), (4, "Popeye")]
        snd
    [("Jonah", "Whales"), ("Jonah", "Spiders"),
      ("Alan", "Ghosts"), ("Alan", "Zombies"), ("Glory", "Buffy")]
        fst
</lang>
λ> test
((1,"Jonah"),("Jonah","Spiders"))
((1,"Jonah"),("Jonah","Whales"))
((2,"Alan"),("Alan","Zombies"))
((2,"Alan"),("Alan","Ghosts"))
((3,"Glory"),("Glory","Buffy"))
Perl 6
<lang perl6>my @A = [1, "Jonah"],
         [2, "Alan"],
         [3, "Glory"],
         [4, "Popeye"];

my @B = ["Jonah", "Whales"],
        ["Jonah", "Spiders"],
        ["Alan", "Ghosts"],
        ["Alan", "Zombies"],
        ["Glory", "Buffy"];

sub hash-join(@a, &a, @b, &b) {
    my %hash{Any};
    %hash{.&a} = $_ for @a;
    ([%hash{.&b} // next, $_] for @b);
}

.perl.say for hash-join @A, *.[1], @B, *.[0];</lang>
- Output:
[[1, "Jonah"], ["Jonah", "Whales"]]
[[1, "Jonah"], ["Jonah", "Spiders"]]
[[2, "Alan"], ["Alan", "Ghosts"]]
[[2, "Alan"], ["Alan", "Zombies"]]
[[3, "Glory"], ["Glory", "Buffy"]]