Hash join: Difference between revisions

From Rosetta Code
Content added Content deleted
(Revised the task description to give the sample data to use)
No edit summary
Line 40: Line 40:


=={{header|Common Lisp}}==
=={{header|Common Lisp}}==
<lang lisp>(defparameter *table-A* '((27 "Jonah") (18 "Alan") (28 "Glory") (18 "Popeye") (28 "Alan")))
<lang lisp>;; Uses the same example as in Go.

(defparameter *table-A* '((27 "Jonah") (18 "Alan") (28 "Glory") (18 "Popeye") (28 "Alan")))


(defparameter *table-B* '(("Jonah" "Whales") ("Jonah" "Spiders") ("Alan" "Ghosts") ("Alan" "Zombies") ("Glory" "Buffy")))
(defparameter *table-B* '(("Jonah" "Whales") ("Jonah" "Spiders") ("Alan" "Ghosts") ("Alan" "Zombies") ("Glory" "Buffy")))

Revision as of 22:17, 12 December 2013

Task
Hash join
You are encouraged to solve this task according to the task description, using any language you may know.

The classic hash join algorithm for an inner join of two relations has the following steps:

  • Hash phase: Create a hash table for one of the two relations by applying a hash function to the join attribute of each row. Ideally we should create a hash table for the smaller relation, thus optimizing for creation time and memory size of the hash table.
  • Join phase: Scan the larger relation and find the relevant rows by looking in the hash table created before.

The algorithm is as follows:

for each tuple s in S do
   let h = hash on join attributes s(b)
   place s in hash table Sh in bucket keyed by hash value h
for each tuple r in R do
   let h = hash on join attributes r(a)
   if h indicates a nonempty bucket (B) of hash table Sh
      if h matches any s in B
         concatenate r and s
      place relation in Q

Task: implement the Hash Join algorithm and show the result of joining two tables with it. You should use your implementation to show the joining of these tables:

AgeName
27Jonah
18Alan
28Glory
18Popeye
28Alan
NameNemesis
JonahWhales
JonahSpiders
AlanGhosts
AlanZombies
GloryBuffy

Common Lisp

<lang lisp>(defparameter *table-A* '((27 "Jonah") (18 "Alan") (28 "Glory") (18 "Popeye") (28 "Alan")))

(defparameter *table-B* '(("Jonah" "Whales") ("Jonah" "Spiders") ("Alan" "Ghosts") ("Alan" "Zombies") ("Glory" "Buffy")))

Hash phase

(defparameter *hash-table* (make-hash-table :test #'equal))

(loop for (i r) in *table-A*

  for value = (gethash r *hash-table* (list nil))  do
  (setf (gethash r *hash-table*) value)
  (push (list i r) (first value)))
Join phase

(loop for (i r) in *table-B* do

    (let ((val (car (gethash i *hash-table*))))
      (loop for (a b) in val do 

(format t "{~a ~a} {~a ~a}~%" a b i r)))) </lang>

Output:
{27 Jonah} {Jonah Whales}
{27 Jonah} {Jonah Spiders}
{28 Alan} {Alan Ghosts}
{18 Alan} {Alan Ghosts}
{28 Alan} {Alan Zombies}
{18 Alan} {Alan Zombies}
{28 Glory} {Glory Buffy}

Go

<lang go>package main

import "fmt"

func main() {

   tableA := []struct {
       value int
       key   string
   }{
       {27, "Jonah"}, {18, "Alan"}, {28, "Glory"}, {18, "Popeye"},
       {28, "Alan"},
   }
   tableB := []struct {
       key   string
       value string
   }{
       {"Jonah", "Whales"}, {"Jonah", "Spiders"},
       {"Alan", "Ghosts"}, {"Alan", "Zombies"}, {"Glory", "Buffy"},
   }
   // hash phase
   h := map[string][]int{}
   for i, r := range tableA {
       h[r.key] = append(h[r.key], i)
   }
   // join phase
   for b := range tableB {
       for _, a := range h[tableB[b].key] {
           fmt.Println(tableA[a], tableB[b])
       }
   }

}</lang>

Output:
{27 Jonah} {Jonah Whales}
{27 Jonah} {Jonah Spiders}
{18 Alan} {Alan Ghosts}
{28 Alan} {Alan Ghosts}
{18 Alan} {Alan Zombies}
{28 Alan} {Alan Zombies}
{28 Glory} {Glory Buffy}

Haskell

The ST monad allows us to utilise mutable memory behind a referentially transparent interface, allowing us to use hashtables (efficiently).

Our hashJoin function takes two lists and two selector functions.

Placing all relations with the same selector value in a list in the hashtable allows us to join many to one/many relations. <lang Haskell>{-# LANGUAGE LambdaCase, TupleSections #-} import qualified Data.HashTable.ST.Basic as H import Data.Hashable import Control.Monad.ST import Control.Monad import Data.STRef

hashJoin :: (Eq k, Hashable k) =>

           [t] -> (t -> k) -> [a] -> (a -> k) -> [(t, a)]

hashJoin xs fx ys fy = runST $ do

 l <- newSTRef []
 ht <- H.new
 forM_ ys $ \y -> H.insert ht (fy y) =<< 
   (H.lookup ht (fy y) >>= \case
     Nothing -> return [y]
     Just v -> return (y:v))
 forM_ xs $ \x -> do
   H.lookup ht (fx x) >>= \case
     Nothing -> return ()
     Just v -> modifySTRef' l ((map (x,)  v) ++) 
 readSTRef l

test = mapM_ print $ hashJoin

   [(1, "Jonah"), (2, "Alan"), (3, "Glory"), (4, "Popeye")]
       snd
   [("Jonah", "Whales"), ("Jonah", "Spiders"), 
     ("Alan", "Ghosts"), ("Alan", "Zombies"), ("Glory", "Buffy")]
       fst

</lang>

λ> test
((3,"Glory"),("Glory","Buffy"))
((2,"Alan"),("Alan","Zombies"))
((2,"Alan"),("Alan","Ghosts"))
((1,"Jonah"),("Jonah","Spiders"))
((1,"Jonah"),("Jonah","Whales"))

The task require hashtables; however, a cleaner and more functional solution would be to use Data.Map (based on binary trees): <lang Haskell>{-# LANGUAGE TupleSections #-} import qualified Data.Map as M import Data.List import Data.Maybe import Control.Applicative

mapJoin xs fx ys fy = joined

 where yMap = foldl' f M.empty ys
       f m y = M.insertWith (++) (fy y) [y] m
       joined = concat . catMaybes . 
                map (\x -> map (x,) <$> M.lookup (fx x) yMap) $ xs

test = mapM_ print $ mapJoin

   [(1, "Jonah"), (2, "Alan"), (3, "Glory"), (4, "Popeye")]
       snd
   [("Jonah", "Whales"), ("Jonah", "Spiders"), 
    ("Alan", "Ghosts"), ("Alan", "Zombies"), ("Glory", "Buffy")]
       fst

</lang>

λ> test
((1,"Jonah"),("Jonah","Spiders"))
((1,"Jonah"),("Jonah","Whales"))
((2,"Alan"),("Alan","Zombies"))
((2,"Alan"),("Alan","Ghosts"))
((3,"Glory"),("Glory","Buffy"))

Perl 6

<lang perl6>my @A = [1, "Jonah"],

       [2, "Alan"],
       [3, "Glory"],
       [4, "Popeye"];

my @B = ["Jonah", "Whales"],

       ["Jonah", "Spiders"],
       ["Alan", "Ghosts"],
       ["Alan", "Zombies"],
       ["Glory", "Buffy"];

sub hash-join(@a, &a, @b, &b) {

   my %hash{Any};
   %hash{.&a} = $_ for @a;
   ([%hash{.&b} // next, $_] for @b);

}

.perl.say for hash-join @A, *.[1], @B, *.[0];</lang>

Output:
[[1, "Jonah"], ["Jonah", "Whales"]]
[[1, "Jonah"], ["Jonah", "Spiders"]]
[[2, "Alan"], ["Alan", "Ghosts"]]
[[2, "Alan"], ["Alan", "Zombies"]]
[[3, "Glory"], ["Glory", "Buffy"]]

Tcl

Tcl uses hash tables to implement both its associative arrays and its dictionaries. <lang tcl>package require Tcl 8.6

  1. Only for lmap, which can be replaced with foreach

proc joinTables {tableA indexA tableB indexB} {

   # Optimisation: if the first table is longer, do in reverse order
   if {[llength $tableB] < [llength $tableA]} {

return [lmap pair [joinTables $tableB $indexB $tableA $indexA] { lreverse $pair }]

   }
   foreach value $tableA {

set hashmap([lindex $value $indexA]) $value #dict version# dict set hashmap [lindex $value $indexA] $value

   }
   lmap value $tableB {

set key [lindex $value $indexB] if {![info exist hashmap($key)]} continue #dict version# if {![dict exists $hashmap $key]} continue list $hashmap($key) $value #dict version# list [dict get $hashmap $key] $value

   }

}

set tableA {

   {27 "Jonah"} {18 "Alan"} {28 "Glory"} {18 "Popeye"} {28 "Alan"}

} set tableB {

   {"Jonah" "Whales"} {"Jonah" "Spiders"} {"Alan" "Ghosts"} {"Alan" "Zombies"}
   {"Glory" "Buffy"}

} set joined [joinTables $tableA 1 $tableB 0] foreach row $joined {

   puts $row

}</lang>

Output:
{27 "Jonah"} {"Jonah" "Whales"}
{27 "Jonah"} {"Jonah" "Spiders"}
{28 "Alan"} {"Alan" "Ghosts"}
{28 "Alan"} {"Alan" "Zombies"}
{28 "Glory"} {"Glory" "Buffy"}