Hash join: Difference between revisions
mNo edit summary |
(tidy up task description) |
||
Line 1:
{{draft task}}
The classic [[wp:Hash Join|hash join]] algorithm for an inner join of two relations has the following steps:
<ul>
<li>Hash phase : Creating a hash table for one of the two relations by applying a hash
Line 10:
The algorithm is as follows:
'''for each''' tuple ''s'' '''in''' ''S'' '''do'''
'''place'''
'''for each''' tuple ''r'' '''in''' ''R'' '''do'''
'''if''' ''h'' matches any ''s'' in ''B''
'''place''' relation in ''Q
Implement the Hash Join algorithm in your programming language (optionally providing a test case as well).
|
Revision as of 22:42, 30 November 2013
The classic hash join algorithm for an inner join of two relations has the following steps:
- Hash phase : Creating a hash table for one of the two relations by applying a hash function to the join attribute of each row. Ideally we should create a hash table for the smaller relation. Thus, optimizing for creation time and memory size of the hash table.
- Join phase : Scanning the larger relation and finding the relevant rows by looking in the hash table created before.
The algorithm is as follows:
for each tuple s in S do let h = hash on join attributes s(b) place s in hash table Sh in bucket keyed by hash value h for each tuple r in R do let h = hash on join attributes r(a) if h indicates a nonempty bucket (B) of hash table Sh if h matches any s in B concatenate r and s place relation in Q
Implement the Hash Join algorithm in your programming language (optionally providing a test case as well).
Haskell
The ST monad allows us to utilise mutable memory behind a referentially transparent interface, allowing us to use hashtables (efficiently).
Our hashJoin function takes two lists and two selector functions.
Placing all relations with the same selector value in a list in the hashtable allows us to join many to one/many relations. <lang Haskell>{-# LANGUAGE LambdaCase, TupleSections #-} import qualified Data.HashTable.ST.Basic as H import Data.Hashable import Control.Monad.ST import Control.Monad import Data.STRef
hashJoin :: (Eq k, Hashable k) =>
[t] -> (t -> k) -> [a] -> (a -> k) -> [(t, a)]
hashJoin xs fx ys fy = runST $ do
l <- newSTRef [] ht <- H.new forM_ ys $ \y -> H.insert ht (fy y) =<< (H.lookup ht (fy y) >>= \case Nothing -> return [y] Just v -> return (y:v)) forM_ xs $ \x -> do H.lookup ht (fx x) >>= \case Nothing -> return () Just v -> modifySTRef' l ((map (x,) v) ++) readSTRef l
test = mapM_ print $ hashJoin
[(1, "Jonah"), (2, "Alan"), (3, "Glory"), (4, "Popeye")] snd [("Jonah", "Whales"), ("Jonah", "Spiders"), ("Alan", "Ghosts"), ("Alan", "Zombies"), ("Glory", "Buffy")] fst
</lang>
λ> test ((3,"Glory"),("Glory","Buffy")) ((2,"Alan"),("Alan","Zombies")) ((2,"Alan"),("Alan","Ghosts")) ((1,"Jonah"),("Jonah","Spiders")) ((1,"Jonah"),("Jonah","Whales"))
The task require hashtables; however, a cleaner and more functional solution would be to use Data.Map (based on binary trees): <lang Haskell>{-# LANGUAGE TupleSections #-} import qualified Data.Map as M import Data.List import Data.Maybe import Control.Applicative
mapJoin xs fx ys fy = joined
where yMap = foldl' f M.empty ys f m y = M.insertWith (++) (fy y) [y] m joined = concat . catMaybes . map (\x -> map (x,) <$> M.lookup (fx x) yMap) $ xs
test = mapM_ print $ mapJoin
[(1, "Jonah"), (2, "Alan"), (3, "Glory"), (4, "Popeye")] snd [("Jonah", "Whales"), ("Jonah", "Spiders"), ("Alan", "Ghosts"), ("Alan", "Zombies"), ("Glory", "Buffy")] fst
</lang>
λ> test ((1,"Jonah"),("Jonah","Spiders")) ((1,"Jonah"),("Jonah","Whales")) ((2,"Alan"),("Alan","Zombies")) ((2,"Alan"),("Alan","Ghosts")) ((3,"Glory"),("Glory","Buffy"))