Inverted index: Difference between revisions
Content added Content deleted
m (→{{header|Phix}}: added syntax colouring, marked p2js incompatible) |
Thundergnat (talk | contribs) m (syntax highlighting fixup automation) |
||
Line 15: | Line 15: | ||
{{trans|D}} |
{{trans|D}} |
||
< |
<syntaxhighlight lang="11l">DefaultDict[String, Set[String]] index |
||
F parse_file(fname, fcontents) |
F parse_file(fname, fcontents) |
||
Line 31: | Line 31: | ||
print(‘'#.' found in #..’.format(w, sorted(Array(index[w])))) |
print(‘'#.' found in #..’.format(w, sorted(Array(index[w])))) |
||
E |
E |
||
print(‘'#.' not found.’.format(w))</ |
print(‘'#.' not found.’.format(w))</syntaxhighlight> |
||
{{out}} |
{{out}} |
||
Line 58: | Line 58: | ||
Here is the main program (file inverted_index.adb): |
Here is the main program (file inverted_index.adb): |
||
< |
<syntaxhighlight lang="ada">with Ada.Text_IO, Generic_Inverted_Index, Ada.Strings.Hash, Parse_Lines; |
||
use Ada.Text_IO; |
use Ada.Text_IO; |
||
Line 171: | Line 171: | ||
end; |
end; |
||
end loop; |
end loop; |
||
end Inverted_Index;</ |
end Inverted_Index;</syntaxhighlight> |
||
A sample output: |
A sample output: |
||
Line 196: | Line 196: | ||
The real work is actually done in the package Generic_Inverted_Index. Here is the specification (file generic_inverted_index.ads): |
The real work is actually done in the package Generic_Inverted_Index. Here is the specification (file generic_inverted_index.ads): |
||
< |
<syntaxhighlight lang="ada">with Ada.Containers.Indefinite_Vectors; |
||
private with Ada.Containers.Indefinite_Hashed_Maps; |
private with Ada.Containers.Indefinite_Hashed_Maps; |
||
Line 258: | Line 258: | ||
type Storage_Type is new Maps.Map with null record; |
type Storage_Type is new Maps.Map with null record; |
||
end Generic_Inverted_Index;</ |
end Generic_Inverted_Index;</syntaxhighlight> |
||
Here is the implementation (generic_inverted_index.adb): |
Here is the implementation (generic_inverted_index.adb): |
||
< |
<syntaxhighlight lang="ada">package body Generic_Inverted_Index is |
||
use Source_Vecs; |
use Source_Vecs; |
||
Line 356: | Line 356: | ||
end Same_Vector; |
end Same_Vector; |
||
end Generic_Inverted_Index;</ |
end Generic_Inverted_Index;</syntaxhighlight> |
||
===Package Parse_Lines=== |
===Package Parse_Lines=== |
||
Line 362: | Line 362: | ||
The main program also uses an auxiliary package Parse_Lines. Note the usage of Gnat.Regpat, which itself is pattern matching package, specific for gnat/gcc. This package is derived from the [[Regular expressions#Ada|Ada implementation of the regular expressions task]]. Here is the spec (parse_lines.ads): |
The main program also uses an auxiliary package Parse_Lines. Note the usage of Gnat.Regpat, which itself is pattern matching package, specific for gnat/gcc. This package is derived from the [[Regular expressions#Ada|Ada implementation of the regular expressions task]]. Here is the spec (parse_lines.ads): |
||
< |
<syntaxhighlight lang="ada">with Gnat.Regpat; |
||
package Parse_Lines is |
package Parse_Lines is |
||
Line 381: | Line 381: | ||
procedure Iterate_Words(S: String); |
procedure Iterate_Words(S: String); |
||
end Parse_Lines;</ |
end Parse_Lines;</syntaxhighlight> |
||
And here is the implementation (parse_lines.adb): |
And here is the implementation (parse_lines.adb): |
||
< |
<syntaxhighlight lang="ada">with Gnat.Regpat; |
||
package body Parse_Lines is |
package body Parse_Lines is |
||
Line 426: | Line 426: | ||
end Iterate_Words; |
end Iterate_Words; |
||
end Parse_Lines;</ |
end Parse_Lines;</syntaxhighlight> |
||
===Alternative Implementation of Generic_Inverted_Index (Ada 2012)=== |
===Alternative Implementation of Generic_Inverted_Index (Ada 2012)=== |
||
Line 432: | Line 432: | ||
The new standard Ada 2012 simplifies the usage of containers significantly. The following runs under gnat (GNAT GPL 2011 (20110419)), when using the experimental -gnat2012 switch. The main program is the same. Here is the spec for Generic_Inverted_Index: |
The new standard Ada 2012 simplifies the usage of containers significantly. The following runs under gnat (GNAT GPL 2011 (20110419)), when using the experimental -gnat2012 switch. The main program is the same. Here is the spec for Generic_Inverted_Index: |
||
< |
<syntaxhighlight lang="ada">with Ada.Containers.Indefinite_Vectors; |
||
private with Ada.Containers.Indefinite_Hashed_Maps; |
private with Ada.Containers.Indefinite_Hashed_Maps; |
||
Line 487: | Line 487: | ||
type Storage_Type is new Maps.Map with null record; |
type Storage_Type is new Maps.Map with null record; |
||
end Generic_Inverted_Index;</ |
end Generic_Inverted_Index;</syntaxhighlight> |
||
The implementation: |
The implementation: |
||
< |
<syntaxhighlight lang="ada">package body Generic_Inverted_Index is |
||
-- uses some of the new Ada 2012 syntax |
-- uses some of the new Ada 2012 syntax |
||
use Source_Vecs; |
use Source_Vecs; |
||
Line 572: | Line 572: | ||
end Same_Vector; |
end Same_Vector; |
||
end Generic_Inverted_Index;</ |
end Generic_Inverted_Index;</syntaxhighlight> |
||
=={{header|AutoHotkey}}== |
=={{header|AutoHotkey}}== |
||
{{works with|AutoHotkey_L}} |
{{works with|AutoHotkey_L}} |
||
< |
<syntaxhighlight lang="autohotkey">; http://www.autohotkey.com/forum/viewtopic.php?t=41479 |
||
inputbox, files, files, file pattern such as c:\files\*.txt |
inputbox, files, files, file pattern such as c:\files\*.txt |
||
Line 641: | Line 641: | ||
else |
else |
||
return word2docs[word2find] |
return word2docs[word2find] |
||
}</ |
}</syntaxhighlight> |
||
=={{header|BBC BASIC}}== |
=={{header|BBC BASIC}}== |
||
{{works with|BBC BASIC for Windows}} |
{{works with|BBC BASIC for Windows}} |
||
This uses a hashed index and linked lists to hold the file numbers. |
This uses a hashed index and linked lists to hold the file numbers. |
||
< |
<syntaxhighlight lang="bbcbasic"> DIM FileList$(4) |
||
FileList$() = "BBCKEY0.TXT", "BBCKEY1.TXT", "BBCKEY2.TXT", \ |
FileList$() = "BBCKEY0.TXT", "BBCKEY1.TXT", "BBCKEY2.TXT", \ |
||
\ "BBCKEY3.TXT", "BBCKEY4.TXT" |
\ "BBCKEY3.TXT", "BBCKEY4.TXT" |
||
Line 717: | Line 717: | ||
IF C% >= 65 IF C% <= 90 MID$(A$,A%,1) = CHR$(C%+32) |
IF C% >= 65 IF C% <= 90 MID$(A$,A%,1) = CHR$(C%+32) |
||
NEXT |
NEXT |
||
= A$</ |
= A$</syntaxhighlight> |
||
'''Output:''' |
'''Output:''' |
||
<pre> |
<pre> |
||
Line 728: | Line 728: | ||
=={{header|C}}== |
=={{header|C}}== |
||
The code is stupidly long, having to implement a Trie to store strings and all -- the program doesn't do anything shiny, but Tries may be interesting to look at. |
The code is stupidly long, having to implement a Trie to store strings and all -- the program doesn't do anything shiny, but Tries may be interesting to look at. |
||
< |
<syntaxhighlight lang="c">#include <stdio.h> |
||
#include <stdlib.h> |
#include <stdlib.h> |
||
Line 878: | Line 878: | ||
search_index(root, "boo"); |
search_index(root, "boo"); |
||
return 0; |
return 0; |
||
}</ |
}</syntaxhighlight>Output:<syntaxhighlight lang="text">Search for "what": f1.txt source/f2.txt |
||
Search for "is": f1.txt other_file source/f2.txt |
Search for "is": f1.txt other_file source/f2.txt |
||
Search for "banana": other_file |
Search for "banana": other_file |
||
Search for "boo": not found</ |
Search for "boo": not found</syntaxhighlight> |
||
=={{header|C sharp|C#}}== |
=={{header|C sharp|C#}}== |
||
< |
<syntaxhighlight lang="csharp">using System; |
||
using System.Collections.Generic; |
using System.Collections.Generic; |
||
using System.IO; |
using System.IO; |
||
Line 908: | Line 908: | ||
Console.WriteLine("{0} found in: {1}", find, string.Join(" ", Invert(dictionary)[find])); |
Console.WriteLine("{0} found in: {1}", find, string.Join(" ", Invert(dictionary)[find])); |
||
} |
} |
||
}</ |
}</syntaxhighlight> |
||
Sample output: |
Sample output: |
||
<lang>files: file1 file2 file3 |
<syntaxhighlight lang="text">files: file1 file2 file3 |
||
find: what |
find: what |
||
what found in: file1 file2</ |
what found in: file1 file2</syntaxhighlight> |
||
=={{header|C++}}== |
=={{header|C++}}== |
||
Same idea as the C implementation - trie to store the words |
Same idea as the C implementation - trie to store the words |
||
< |
<syntaxhighlight lang="cpp"> |
||
#include <algorithm> |
#include <algorithm> |
||
#include <fstream> |
#include <fstream> |
||
Line 1,033: | Line 1,033: | ||
return 0; |
return 0; |
||
} |
} |
||
</syntaxhighlight> |
|||
</lang> |
|||
{{out}} |
{{out}} |
||
<pre> |
<pre> |
||
Line 1,054: | Line 1,054: | ||
=={{header|Clojure}}== |
=={{header|Clojure}}== |
||
< |
<syntaxhighlight lang="clojure">(ns inverted-index.core |
||
(:require [clojure.set :as sets] |
(:require [clojure.set :as sets] |
||
[clojure.java.io :as io])) |
[clojure.java.io :as io])) |
||
Line 1,079: | Line 1,079: | ||
(defn search [index query] |
(defn search [index query] |
||
(apply sets/intersection (map index (term-seq query)))) |
(apply sets/intersection (map index (term-seq query)))) |
||
</syntaxhighlight> |
|||
</lang> |
|||
=={{header|CoffeeScript}}== |
=={{header|CoffeeScript}}== |
||
< |
<syntaxhighlight lang="coffeescript"> |
||
fs = require 'fs' |
fs = require 'fs' |
||
Line 1,118: | Line 1,118: | ||
grep index, 'make_index' |
grep index, 'make_index' |
||
grep index, 'sort' |
grep index, 'sort' |
||
</syntaxhighlight> |
|||
</lang> |
|||
output |
output |
||
<lang> |
<syntaxhighlight lang="text"> |
||
> coffee inverted_index.coffee |
> coffee inverted_index.coffee |
||
locations for 'make_index': |
locations for 'make_index': |
||
Line 1,136: | Line 1,136: | ||
inverted_index.coffee:35 |
inverted_index.coffee:35 |
||
knuth_sample.coffee:12 |
knuth_sample.coffee:12 |
||
</syntaxhighlight> |
|||
</lang> |
|||
=={{header|Common Lisp}}== |
=={{header|Common Lisp}}== |
||
< |
<syntaxhighlight lang="lisp">(defpackage rosettacode.inverted-index |
||
(:use cl)) |
(:use cl)) |
||
(in-package rosettacode.inverted-index) |
(in-package rosettacode.inverted-index) |
||
Line 1,174: | Line 1,174: | ||
:test #'equal)) |
:test #'equal)) |
||
</syntaxhighlight> |
|||
</lang> |
|||
Example: |
Example: |
||
< |
<syntaxhighlight lang="lisp">(defparameter *index* (build-index '("file1.txt" "file2.txt" "file3.txt"))) |
||
(defparameter *query* "foo bar") |
(defparameter *query* "foo bar") |
||
(defparameter *result* (lookup *index* *query*)) |
(defparameter *result* (lookup *index* *query*)) |
||
(format t "Result for query ~s: ~{~a~^, ~}~%" *query* *result*)</ |
(format t "Result for query ~s: ~{~a~^, ~}~%" *query* *result*)</syntaxhighlight> |
||
=={{header|D}}== |
=={{header|D}}== |
||
< |
<syntaxhighlight lang="d">import std.stdio, std.algorithm, std.string, std.file, std.regex; |
||
void main() { |
void main() { |
||
Line 1,214: | Line 1,214: | ||
writefln("'%s' not found.", w); |
writefln("'%s' not found.", w); |
||
} |
} |
||
}</ |
}</syntaxhighlight> |
||
Both the demo text files and the queries are from the Wikipedia page, they contain: |
Both the demo text files and the queries are from the Wikipedia page, they contain: |
||
It is what it is. |
It is what it is. |
||
Line 1,243: | Line 1,243: | ||
{{libheader| system.Generics.Collections}} |
{{libheader| system.Generics.Collections}} |
||
{{libheader| SYstem.IOUtils}} |
{{libheader| SYstem.IOUtils}} |
||
<syntaxhighlight lang="delphi"> |
|||
<lang Delphi> |
|||
program Inverted_index; |
program Inverted_index; |
||
Line 1,433: | Line 1,433: | ||
Index.Free; |
Index.Free; |
||
end.</ |
end.</syntaxhighlight> |
||
Input: Same of [[#D|D]]. |
Input: Same of [[#D|D]]. |
||
{{out}} |
{{out}} |
||
Line 1,453: | Line 1,453: | ||
===Indexing=== |
===Indexing=== |
||
Index values are sets associated with each word (key). We use the local-put-value function to permanently store the index, in the browser local storage. |
Index values are sets associated with each word (key). We use the local-put-value function to permanently store the index, in the browser local storage. |
||
< |
<syntaxhighlight lang="lisp"> |
||
;; set of input files |
;; set of input files |
||
(define FILES {T0.txt T1.txt T2.txt}) |
(define FILES {T0.txt T1.txt T2.txt}) |
||
Line 1,477: | Line 1,477: | ||
(let ((text (text-parse text))) |
(let ((text (text-parse text))) |
||
(for ((word text)) (invert-word (string-downcase word) file INVERT)))) |
(for ((word text)) (invert-word (string-downcase word) file INVERT)))) |
||
</syntaxhighlight> |
|||
</lang> |
|||
=== Query === |
=== Query === |
||
Intersect sets values of each word. |
Intersect sets values of each word. |
||
< |
<syntaxhighlight lang="lisp"> |
||
;; usage : (inverted-search w1 w2 ..) |
;; usage : (inverted-search w1 w2 ..) |
||
(define-syntax-rule (inverted-search w ...) |
(define-syntax-rule (inverted-search w ...) |
||
Line 1,493: | Line 1,493: | ||
(set-intersect res (local-get-value word INVERT))) |
(set-intersect res (local-get-value word INVERT))) |
||
FILES words)) |
FILES words)) |
||
</syntaxhighlight> |
|||
</lang> |
|||
Output : |
Output : |
||
< |
<syntaxhighlight lang="lisp"> |
||
(map-files invert-text FILES) |
(map-files invert-text FILES) |
||
(inverted-search is it) |
(inverted-search is it) |
||
Line 1,505: | Line 1,505: | ||
(inverted-search boule) |
(inverted-search boule) |
||
[3]→ null |
[3]→ null |
||
</syntaxhighlight> |
|||
</lang> |
|||
=={{header|Erlang}}== |
=={{header|Erlang}}== |
||
Line 1,512: | Line 1,512: | ||
Ditto for any other character. |
Ditto for any other character. |
||
<syntaxhighlight lang="erlang"> |
|||
<lang Erlang> |
|||
-module( inverted_index ). |
-module( inverted_index ). |
||
Line 1,547: | Line 1,547: | ||
search_common( Files, Acc ) -> [X || X <- Acc, lists:member(X, Files)]. |
search_common( Files, Acc ) -> [X || X <- Acc, lists:member(X, Files)]. |
||
</syntaxhighlight> |
|||
</lang> |
|||
=={{header|F Sharp|F#}}== |
=={{header|F Sharp|F#}}== |
||
< |
<syntaxhighlight lang="fsharp">open System |
||
open System.IO |
open System.IO |
||
Line 1,584: | Line 1,584: | ||
|> Set.intersectMany |
|> Set.intersectMany |
||
printf "Found in: " ; searchResults |> Set.iter (printf "%s ") ; printfn ""</ |
printf "Found in: " ; searchResults |> Set.iter (printf "%s ") ; printfn ""</syntaxhighlight> |
||
Sample usage: |
Sample usage: |
||
<pre> |
<pre> |
||
Line 1,594: | Line 1,594: | ||
=={{header|Factor}}== |
=={{header|Factor}}== |
||
< |
<syntaxhighlight lang="factor">USING: assocs fry io.encodings.utf8 io.files kernel sequences |
||
sets splitting vectors ; |
sets splitting vectors ; |
||
IN: rosettacode.inverted-index |
IN: rosettacode.inverted-index |
||
Line 1,610: | Line 1,610: | ||
: query ( terms index -- files ) |
: query ( terms index -- files ) |
||
[ at ] curry map [ ] [ intersect ] map-reduce ; |
[ at ] curry map [ ] [ intersect ] map-reduce ; |
||
</syntaxhighlight> |
|||
</lang> |
|||
Example use : |
Example use : |
||
<lang>( scratchpad ) { "f1" "f2" "f3" } index-files |
<syntaxhighlight lang="text">( scratchpad ) { "f1" "f2" "f3" } index-files |
||
--- Data stack: |
--- Data stack: |
||
Line 1,619: | Line 1,619: | ||
( scratchpad ) { "what" "is" "it" } swap query . |
( scratchpad ) { "what" "is" "it" } swap query . |
||
V{ "f1" "f2" } |
V{ "f1" "f2" } |
||
</syntaxhighlight> |
|||
</lang> |
|||
=={{header|Go}}== |
=={{header|Go}}== |
||
< |
<syntaxhighlight lang="go">package main |
||
import ( |
import ( |
||
Line 1,756: | Line 1,756: | ||
} |
} |
||
} |
} |
||
}</ |
}</syntaxhighlight> |
||
Session: |
Session: |
||
<pre> |
<pre> |
||
Line 1,779: | Line 1,779: | ||
=={{header|Haskell}}== |
=={{header|Haskell}}== |
||
< |
<syntaxhighlight lang="haskell">import Control.Monad |
||
import Data.Char (isAlpha, toLower) |
import Data.Char (isAlpha, toLower) |
||
import qualified Data.Map as M |
import qualified Data.Map as M |
||
Line 1,810: | Line 1,810: | ||
intersections xs = foldl1 S.intersection xs |
intersections xs = foldl1 S.intersection xs |
||
lowercase = map toLower</ |
lowercase = map toLower</syntaxhighlight> |
||
An example of use, assuming the program is named <code>iindex</code> and there exist files <code>t0</code>, <code>t1</code>, and <code>t2</code> with contents "It is what it is.", "What is it?", and "It is a banana.": |
An example of use, assuming the program is named <code>iindex</code> and there exist files <code>t0</code>, <code>t1</code>, and <code>t2</code> with contents "It is what it is.", "What is it?", and "It is a banana.": |
||
Line 1,821: | Line 1,821: | ||
The following implements a simple case insensitive inverse index using lists simulating texts. |
The following implements a simple case insensitive inverse index using lists simulating texts. |
||
< |
<syntaxhighlight lang="icon">procedure main() |
||
texts := table() # substitute for read and parse files |
texts := table() # substitute for read and parse files |
||
Line 1,875: | Line 1,875: | ||
write() |
write() |
||
return |
return |
||
end</ |
end</syntaxhighlight> |
||
Output:<pre>Enter search terms (^z to quit) : is it |
Output:<pre>Enter search terms (^z to quit) : is it |
||
Line 1,891: | Line 1,891: | ||
The following code will build a full index. Modification of search routines is left as an exercise: |
The following code will build a full index. Modification of search routines is left as an exercise: |
||
< |
<syntaxhighlight lang="unicon">record InvertedIndexRec(simple,full) |
||
procedure FullInvertedIndex(ii,k,words) #: accumulate a full inverted index |
procedure FullInvertedIndex(ii,k,words) #: accumulate a full inverted index |
||
Line 1,909: | Line 1,909: | ||
return ii |
return ii |
||
end</ |
end</syntaxhighlight> |
||
=={{header|J}}== |
=={{header|J}}== |
||
Line 1,915: | Line 1,915: | ||
This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching. |
This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching. |
||
< |
<syntaxhighlight lang="j">require'files regex strings' |
||
rxutf8 0 NB. support latin1 searches for this example, instead of utf8 |
rxutf8 0 NB. support latin1 searches for this example, instead of utf8 |
||
Line 1,938: | Line 1,938: | ||
hits=. buckets{~words i.~.parse tolower y |
hits=. buckets{~words i.~.parse tolower y |
||
files {~ >([-.-.)each/hits |
files {~ >([-.-.)each/hits |
||
)</ |
)</syntaxhighlight> |
||
Example use: |
Example use: |
||
< |
<syntaxhighlight lang="j"> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm' |
||
>search 'finally learning' |
>search 'finally learning' |
||
~help/primer/end.htm |
~help/primer/end.htm |
||
Line 1,950: | Line 1,950: | ||
~help/primer/gui.htm |
~help/primer/gui.htm |
||
>search 'around' |
>search 'around' |
||
~help/primer/gui.htm</ |
~help/primer/gui.htm</syntaxhighlight> |
||
=={{header|Java}}== |
=={{header|Java}}== |
||
<syntaxhighlight lang="java"> |
|||
<lang Java> |
|||
package org.rosettacode; |
package org.rosettacode; |
||
Line 2,058: | Line 2,058: | ||
} |
} |
||
</syntaxhighlight> |
|||
</lang> |
|||
Example output: |
Example output: |
||
<syntaxhighlight lang="java"> |
|||
<lang Java> |
|||
java -cp bin org.rosettacode.InvertedIndex "huntsman,merit,dog,the,gutenberg,lovecraft,olympian" pg30637.txt pg7025.txt pg82.txt pg9090.txt |
java -cp bin org.rosettacode.InvertedIndex "huntsman,merit,dog,the,gutenberg,lovecraft,olympian" pg30637.txt pg7025.txt pg82.txt pg9090.txt |
||
indexed pg30637.txt 106473 words |
indexed pg30637.txt 106473 words |
||
Line 2,075: | Line 2,075: | ||
olympian pg30637.txt |
olympian pg30637.txt |
||
</syntaxhighlight> |
|||
</lang> |
|||
=={{header|jq}}== |
=={{header|jq}}== |
||
Line 2,085: | Line 2,085: | ||
'''Part 1: inverted_index and search''' |
'''Part 1: inverted_index and search''' |
||
< |
<syntaxhighlight lang="jq"># Given an array of [ doc, array_of_distinct_words ] |
||
# construct a lookup table: { word: array_of_docs } |
# construct a lookup table: { word: array_of_docs } |
||
def inverted_index: |
def inverted_index: |
||
Line 2,102: | Line 2,102: | ||
else reduce words[1:][] as $word |
else reduce words[1:][] as $word |
||
( $dict[words[0]]; overlap( $dict[$word] ) ) |
( $dict[words[0]]; overlap( $dict[$word] ) ) |
||
end ; </ |
end ; </syntaxhighlight> |
||
'''Part 2: Interactive Search''' |
'''Part 2: Interactive Search''' |
||
Line 2,111: | Line 2,111: | ||
could create a temporary file to hold the parsed output. |
could create a temporary file to hold the parsed output. |
||
< |
<syntaxhighlight lang="jq">def prompt_search: |
||
"Enter a string or an array of strings to search for, quoting each string, or 0 to exit:", |
"Enter a string or an array of strings to search for, quoting each string, or 0 to exit:", |
||
( (input | if type == "array" then . elif type == "string" then [.] |
( (input | if type == "array" then . elif type == "string" then [.] |
||
Line 2,118: | Line 2,118: | ||
| search($in), prompt_search ) ; |
| search($in), prompt_search ) ; |
||
$in | inverted_index | prompt_search</ |
$in | inverted_index | prompt_search</syntaxhighlight> |
||
'''Example''': |
'''Example''': |
||
< |
<syntaxhighlight lang="sh">$ jq -r -c -n --argfile in <(jq -R 'split(" ") | select(length>0) | [input_filename, unique]' T?.txt) -f Inverted_index.jq |
||
Enter a string or an array of strings to search for, quoting each string, or 0 to exit: |
Enter a string or an array of strings to search for, quoting each string, or 0 to exit: |
||
"is" |
"is" |
||
Line 2,130: | Line 2,130: | ||
Enter a string or an array of strings to search for, quoting each string, or 0 to exit: |
Enter a string or an array of strings to search for, quoting each string, or 0 to exit: |
||
0 |
0 |
||
$</ |
$</syntaxhighlight> |
||
=={{header|Julia}}== |
=={{header|Julia}}== |
||
< |
<syntaxhighlight lang="julia">function makedoubleindex(files) |
||
idx = Dict{String, Dict}() |
idx = Dict{String, Dict}() |
||
for file in files |
for file in files |
||
Line 2,169: | Line 2,169: | ||
const searchterms = ["forehead", "of", "hand", "a", "foot"] |
const searchterms = ["forehead", "of", "hand", "a", "foot"] |
||
wordsearch(didx, searchterms) |
wordsearch(didx, searchterms) |
||
</syntaxhighlight> |
|||
</lang> |
|||
=={{header|Kotlin}}== |
=={{header|Kotlin}}== |
||
< |
<syntaxhighlight lang="scala">// version 1.1.51 |
||
import java.io.File |
import java.io.File |
||
Line 2,229: | Line 2,229: | ||
findWord(word) |
findWord(word) |
||
} |
} |
||
}</ |
}</syntaxhighlight> |
||
Contents of files: |
Contents of files: |
||
Line 2,286: | Line 2,286: | ||
=={{header|Mathematica}}/{{header|Wolfram Language}}== |
=={{header|Mathematica}}/{{header|Wolfram Language}}== |
||
< |
<syntaxhighlight lang="mathematica">si = CreateSearchIndex["ExampleData/Text", Method -> "TFIDF"]; |
||
Manipulate[Grid[sr = TextSearch[si, ToString[s]]; |
Manipulate[Grid[sr = TextSearch[si, ToString[s]]; |
||
{FileBaseName /@ Normal[Dataset[sr][All, "ReferenceLocation"]], |
{FileBaseName /@ Normal[Dataset[sr][All, "ReferenceLocation"]], |
||
Column[#, Frame -> All] & /@ sr[All, "Snippet"]} // Transpose, |
Column[#, Frame -> All] & /@ sr[All, "Snippet"]} // Transpose, |
||
Frame -> All], {s, "tree"}]</ |
Frame -> All], {s, "tree"}]</syntaxhighlight> |
||
{{out}} |
{{out}} |
||
Outputs a graphical user interface with an interactive input field showing filename and the snippet of text that includes the search string. |
Outputs a graphical user interface with an interactive input field showing filename and the snippet of text that includes the search string. |
||
Line 2,297: | Line 2,297: | ||
We have used as examples three files containing the text from Wikipedia article. We used here an index which keep document name and line number. It would be easy to add the position in the line. |
We have used as examples three files containing the text from Wikipedia article. We used here an index which keep document name and line number. It would be easy to add the position in the line. |
||
< |
<syntaxhighlight lang="nim">import os, strformat, strutils, tables |
||
type |
type |
||
Line 2,451: | Line 2,451: | ||
stdout.write &"... in “{docName}”, line{plural(linenums.len)}" |
stdout.write &"... in “{docName}”, line{plural(linenums.len)}" |
||
for num in linenums: stdout.write ' ', num |
for num in linenums: stdout.write ' ', num |
||
echo ""</ |
echo ""</syntaxhighlight> |
||
{{out}} |
{{out}} |
||
Line 2,482: | Line 2,482: | ||
ocamlc -o inv.byte unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma inv.cmo |
ocamlc -o inv.byte unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma inv.cmo |
||
< |
<syntaxhighlight lang="ocaml">TYPE_CONV_PATH "Inverted_index" |
||
type files = string array with sexp |
type files = string array with sexp |
||
Line 2,562: | Line 2,562: | ||
| "--index-file", in_file -> index_file in_file |
| "--index-file", in_file -> index_file in_file |
||
| "--search-word", word -> search_word word |
| "--search-word", word -> search_word word |
||
| _ -> usage()</ |
| _ -> usage()</syntaxhighlight> |
||
=={{header|Perl}}== |
=={{header|Perl}}== |
||
< |
<syntaxhighlight lang="perl">use Set::Object 'set'; |
||
# given an array of files, returns the index |
# given an array of files, returns the index |
||
Line 2,618: | Line 2,618: | ||
# first arg is a comma-separated list of words to search for |
# first arg is a comma-separated list of words to search for |
||
print "$_\n" |
print "$_\n" |
||
foreach search_words_with_index({createindex(@ARGV)}, @searchwords);</ |
foreach search_words_with_index({createindex(@ARGV)}, @searchwords);</syntaxhighlight> |
||
=={{header|Phix}}== |
=={{header|Phix}}== |
||
Line 2,625: | Line 2,625: | ||
Might be better (and almost as easy) for the dictionary values to be say |
Might be better (and almost as easy) for the dictionary values to be say |
||
{total_count, {file nos}, {file counts}}. |
{total_count, {file nos}, {file counts}}. |
||
<!--< |
<!--<syntaxhighlight lang="phix">(notonline)--> |
||
<span style="color: #000080;font-style:italic;">-- demo\rosetta\Inverted_index.exw</span> |
<span style="color: #000080;font-style:italic;">-- demo\rosetta\Inverted_index.exw</span> |
||
<span style="color: #008080;">without</span> <span style="color: #008080;">js</span> <span style="color: #000080;font-style:italic;">-- (file i/o)</span> |
<span style="color: #008080;">without</span> <span style="color: #008080;">js</span> <span style="color: #000080;font-style:italic;">-- (file i/o)</span> |
||
Line 2,709: | Line 2,709: | ||
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be the same two</span> |
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be the same two</span> |
||
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"ban"</span><span style="color: #0000FF;">&</span><span style="color: #008000;">"anafish"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be none ({})</span> |
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"ban"</span><span style="color: #0000FF;">&</span><span style="color: #008000;">"anafish"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be none ({})</span> |
||
<!--</ |
<!--</syntaxhighlight>--> |
||
{{out}} |
{{out}} |
||
Note the distributed version has been modified and is occasionally used for real, so the output will likely differ. |
Note the distributed version has been modified and is occasionally used for real, so the output will likely differ. |
||
Line 2,724: | Line 2,724: | ||
=={{header|PHP}}== |
=={{header|PHP}}== |
||
< |
<syntaxhighlight lang="php"><?php |
||
function buildInvertedIndex($filenames) |
function buildInvertedIndex($filenames) |
||
Line 2,769: | Line 2,769: | ||
echo "Unable to find the word \"$word\" in the index\n"; |
echo "Unable to find the word \"$word\" in the index\n"; |
||
} |
} |
||
}</ |
}</syntaxhighlight> |
||
=={{header|PicoLisp}}== |
=={{header|PicoLisp}}== |
||
Line 2,782: | Line 2,782: | ||
it is a banana</pre> |
it is a banana</pre> |
||
we can read them into a binary tree in the global variable '*MyIndex' |
we can read them into a binary tree in the global variable '*MyIndex' |
||
< |
<syntaxhighlight lang="picolisp">(off *MyIndex) |
||
(use Word |
(use Word |
||
Line 2,796: | Line 2,796: | ||
(extract |
(extract |
||
'((Word) (val (car (idx '*MyIndex Word)))) |
'((Word) (val (car (idx '*MyIndex Word)))) |
||
(rest) ) ) )</ |
(rest) ) ) )</syntaxhighlight> |
||
Output: |
Output: |
||
<pre>: (searchFor "what" "is" "it") |
<pre>: (searchFor "what" "is" "it") |
||
Line 2,809: | Line 2,809: | ||
=={{header|PowerShell}}== |
=={{header|PowerShell}}== |
||
{{works with|PowerShell|2}} |
{{works with|PowerShell|2}} |
||
< |
<syntaxhighlight lang="powershell">function Index-File ( [string[]]$FileList ) |
||
{ |
{ |
||
# Create index hashtable, as needed |
# Create index hashtable, as needed |
||
Line 2,851: | Line 2,851: | ||
{ |
{ |
||
return $WordIndex[$Word] |
return $WordIndex[$Word] |
||
}</ |
}</syntaxhighlight> |
||
< |
<syntaxhighlight lang="powershell"># Populate test files |
||
@' |
@' |
||
Files full of |
Files full of |
||
Line 2,866: | Line 2,866: | ||
Use the index |
Use the index |
||
to find the files. |
to find the files. |
||
'@ | Out-File -FilePath C:\Test\File3.txt</ |
'@ | Out-File -FilePath C:\Test\File3.txt</syntaxhighlight> |
||
< |
<syntaxhighlight lang="powershell"># Index files |
||
Index-File C:\Test\File1.txt, C:\Test\File2.txt, C:\Test\File3.txt</ |
Index-File C:\Test\File1.txt, C:\Test\File2.txt, C:\Test\File3.txt</syntaxhighlight> |
||
Because PowerShell is a shell language, it is "a user interface to do a search". After running the script defining the functions and running a command to index the files, the user can simply run the search function at the PowerShell command prompt. |
Because PowerShell is a shell language, it is "a user interface to do a search". After running the script defining the functions and running a command to index the files, the user can simply run the search function at the PowerShell command prompt. |
||
Alternatively, one could create a more complex custom UI or GUI if desired. |
Alternatively, one could create a more complex custom UI or GUI if desired. |
||
< |
<syntaxhighlight lang="powershell"># Query index |
||
Find-Word files</ |
Find-Word files</syntaxhighlight> |
||
{{out}} |
{{out}} |
||
<pre>C:\Test\File1.txt |
<pre>C:\Test\File1.txt |
||
Line 2,882: | Line 2,882: | ||
First the simple inverted index from [[wp:Inverted index|here]] together with an implementation of a search for (multiple) terms from that index. |
First the simple inverted index from [[wp:Inverted index|here]] together with an implementation of a search for (multiple) terms from that index. |
||
< |
<syntaxhighlight lang="python">''' |
||
This implements: http://en.wikipedia.org/wiki/Inverted_index of 28/07/10 |
This implements: http://en.wikipedia.org/wiki/Inverted_index of 28/07/10 |
||
''' |
''' |
||
Line 2,922: | Line 2,922: | ||
terms = ["what", "is", "it"] |
terms = ["what", "is", "it"] |
||
print('\nTerm Search for: ' + repr(terms)) |
print('\nTerm Search for: ' + repr(terms)) |
||
pp(sorted(termsearch(terms)))</ |
pp(sorted(termsearch(terms)))</syntaxhighlight> |
||
'''Sample Output''' |
'''Sample Output''' |
||
Line 2,950: | Line 2,950: | ||
<small>It is assumed that the following code is added to the end of the code for the simple case above and so shares its file opening and parsing results</small> |
<small>It is assumed that the following code is added to the end of the code for the simple case above and so shares its file opening and parsing results</small> |
||
< |
<syntaxhighlight lang="python">from collections import Counter |
||
Line 3,001: | Line 3,001: | ||
print(ans) |
print(ans) |
||
ans = Counter(ans) |
ans = Counter(ans) |
||
print(' The phrase is found most commonly in text: ' + repr(ans.most_common(1)[0][0]))</ |
print(' The phrase is found most commonly in text: ' + repr(ans.most_common(1)[0][0]))</syntaxhighlight> |
||
'''Sample Output''' |
'''Sample Output''' |
||
Line 3,022: | Line 3,022: | ||
=={{header|Racket}}== |
=={{header|Racket}}== |
||
< |
<syntaxhighlight lang="racket"> |
||
#!/usr/bin/env racket |
#!/usr/bin/env racket |
||
#lang racket |
#lang racket |
||
Line 3,045: | Line 3,045: | ||
(printf "No matching files.\n") |
(printf "No matching files.\n") |
||
(printf "Terms found at: ~a.\n" (string-join all ", ")))) |
(printf "Terms found at: ~a.\n" (string-join all ", ")))) |
||
</syntaxhighlight> |
|||
</lang> |
|||
{{out}} |
{{out}} |
||
<pre> |
<pre> |
||
Line 3,064: | Line 3,064: | ||
(formerly Perl 6) |
(formerly Perl 6) |
||
{{works with|rakudo|2015-09-16}} |
{{works with|rakudo|2015-09-16}} |
||
<lang |
<syntaxhighlight lang="raku" line>sub MAIN (*@files) { |
||
my %norm; |
my %norm; |
||
do for @files -> $file { |
do for @files -> $file { |
||
Line 3,076: | Line 3,076: | ||
} |
} |
||
} |
} |
||
}</ |
}</syntaxhighlight> |
||
=={{header|REXX}}== |
=={{header|REXX}}== |
||
Line 3,084: | Line 3,084: | ||
To see more about Burma Shave signs, see the Wikipedia entry: [http://en.wikipedia.org/wiki/Burma-Shave Burma Shave signs.] |
To see more about Burma Shave signs, see the Wikipedia entry: [http://en.wikipedia.org/wiki/Burma-Shave Burma Shave signs.] |
||
< |
<syntaxhighlight lang="rexx">/*REXX program illustrates building a simple inverted index and a method of word find.*/ |
||
@.= /*a dictionary of words (so far). */ |
@.= /*a dictionary of words (so far). */ |
||
!= /*a list of found words (so far). */ |
!= /*a list of found words (so far). */ |
||
Line 3,141: | Line 3,141: | ||
q=strip(q, 'T', substr(@punctuation, j, 1) ) |
q=strip(q, 'T', substr(@punctuation, j, 1) ) |
||
end /*j*/ |
end /*j*/ |
||
return q</ |
return q</syntaxhighlight> |
||
{{out|output|text= when using the default inputs:}} |
{{out|output|text= when using the default inputs:}} |
||
Line 3,240: | Line 3,240: | ||
'''indexmerge.rb''' |
'''indexmerge.rb''' |
||
< |
<syntaxhighlight lang="ruby">if File.exist? "index.dat" |
||
@data = Marshal.load open("index.dat") |
@data = Marshal.load open("index.dat") |
||
else |
else |
||
Line 3,268: | Line 3,268: | ||
open("index.dat", "w") do |index| |
open("index.dat", "w") do |index| |
||
index.write Marshal.dump(@data) |
index.write Marshal.dump(@data) |
||
end</ |
end</syntaxhighlight> |
||
'''indexsearch.rb''' |
'''indexsearch.rb''' |
||
< |
<syntaxhighlight lang="ruby">if File.exist? "index.dat" |
||
@data = Marshal.load open("index.dat") |
@data = Marshal.load open("index.dat") |
||
else |
else |
||
Line 3,292: | Line 3,292: | ||
end |
end |
||
p @result</ |
p @result</syntaxhighlight> |
||
'''Output''' |
'''Output''' |
||
Line 3,305: | Line 3,305: | ||
=={{header|Rust}}== |
=={{header|Rust}}== |
||
< |
<syntaxhighlight lang="rust">// Part 1: Inverted index structure |
||
use std::{ |
use std::{ |
||
Line 3,545: | Line 3,545: | ||
Ok(()) |
Ok(()) |
||
}</ |
}</syntaxhighlight> |
||
{{out}} |
{{out}} |
||
Line 3,580: | Line 3,580: | ||
=={{header|Scala}}== |
=={{header|Scala}}== |
||
< |
<syntaxhighlight lang="scala">object InvertedIndex extends App { |
||
import java.io.File |
import java.io.File |
||
Line 3,610: | Line 3,610: | ||
} |
} |
||
} |
} |
||
}</ |
}</syntaxhighlight> |
||
{{out}} |
{{out}} |
||
<pre>> InvertedIndex "the" file1.txt file2.txt file3.txt |
<pre>> InvertedIndex "the" file1.txt file2.txt file3.txt |
||
Line 3,632: | Line 3,632: | ||
=={{header|Tcl}}== |
=={{header|Tcl}}== |
||
< |
<syntaxhighlight lang="tcl">package require Tcl 8.5 |
||
proc wordsInString str { |
proc wordsInString str { |
||
# We define "words" to be "maximal sequences of 'word' characters". |
# We define "words" to be "maximal sequences of 'word' characters". |
||
Line 3,720: | Line 3,720: | ||
} |
} |
||
return $files |
return $files |
||
}</ |
}</syntaxhighlight> |
||
For the GUI: |
For the GUI: |
||
< |
<syntaxhighlight lang="tcl">package require Tk |
||
pack [labelframe .files -text Files] -side left -fill y |
pack [labelframe .files -text Files] -side left -fill y |
||
pack [listbox .files.list -listvariable files] |
pack [listbox .files.list -listvariable files] |
||
Line 3,756: | Line 3,756: | ||
lappend found "Searching for files with \"$terms\"" {*}$fs \ |
lappend found "Searching for files with \"$terms\"" {*}$fs \ |
||
"---------------------" |
"---------------------" |
||
}</ |
}</syntaxhighlight> |
||
=={{header|TUSCRIPT}}== |
=={{header|TUSCRIPT}}== |
||
< |
<syntaxhighlight lang="tuscript"> |
||
$$ MODE TUSCRIPT |
$$ MODE TUSCRIPT |
||
Line 3,790: | Line 3,790: | ||
ENDLOOP |
ENDLOOP |
||
PRINT "-> ",files |
PRINT "-> ",files |
||
</syntaxhighlight> |
|||
</lang> |
|||
Output: |
Output: |
||
<pre> |
<pre> |
||
Line 3,807: | Line 3,807: | ||
{{works with|ksh93}} |
{{works with|ksh93}} |
||
< |
<syntaxhighlight lang="bash">#!/bin/ksh |
||
typeset -A INDEX |
typeset -A INDEX |
||
Line 3,831: | Line 3,831: | ||
(( count == $# )) && echo $file |
(( count == $# )) && echo $file |
||
done |
done |
||
}</ |
}</syntaxhighlight> |
||
Example use: |
Example use: |
||
< |
<syntaxhighlight lang="korn">index *.txt |
||
search hello world |
search hello world |
||
</syntaxhighlight> |
|||
</lang> |
|||
===Directory on filesystem=== |
===Directory on filesystem=== |
||
Line 3,846: | Line 3,846: | ||
* Add note about slowness. |
* Add note about slowness. |
||
< |
<syntaxhighlight lang="bash">#!/bin/sh |
||
# index.sh - create an inverted index |
# index.sh - create an inverted index |
||
Line 3,894: | Line 3,894: | ||
echo >&2 |
echo >&2 |
||
: $((fi += 1)) |
: $((fi += 1)) |
||
done</ |
done</syntaxhighlight> |
||
< |
<syntaxhighlight lang="bash">#!/bin/sh |
||
# search.sh - search an inverted index |
# search.sh - search an inverted index |
||
Line 3,928: | Line 3,928: | ||
} |
} |
||
$want "$@"</ |
$want "$@"</syntaxhighlight> |
||
=={{header|Wren}}== |
=={{header|Wren}}== |
||
Line 3,935: | Line 3,935: | ||
{{libheader|Wren-pattern}} |
{{libheader|Wren-pattern}} |
||
{{libheader|Wren-str}} |
{{libheader|Wren-str}} |
||
< |
<syntaxhighlight lang="ecmascript">import "/ioutil" for FileUtil, Input |
||
import "/pattern" for Pattern |
import "/pattern" for Pattern |
||
import "/str" for Str |
import "/str" for Str |
||
Line 4,000: | Line 4,000: | ||
if (word == "q" || word == "Q") return |
if (word == "q" || word == "Q") return |
||
findWord.call(word) |
findWord.call(word) |
||
}</ |
}</syntaxhighlight> |
||
{{out}} |
{{out}} |