Inverted index: Difference between revisions

m
(Added 11l)
m (→‎{{header|Wren}}: Minor tidy)
 
(3 intermediate revisions by 3 users not shown)
Line 15:
{{trans|D}}
 
<langsyntaxhighlight lang="11l">DefaultDict[String, Set[String]] index
 
F parse_file(fname, fcontents)
Line 31:
print(‘'#.' found in #..’.format(w, sorted(Array(index[w]))))
E
print(‘'#.' not found.’.format(w))</langsyntaxhighlight>
 
{{out}}
Line 58:
Here is the main program (file inverted_index.adb):
 
<langsyntaxhighlight Adalang="ada">with Ada.Text_IO, Generic_Inverted_Index, Ada.Strings.Hash, Parse_Lines;
use Ada.Text_IO;
 
Line 171:
end;
end loop;
end Inverted_Index;</langsyntaxhighlight>
 
A sample output:
Line 196:
The real work is actually done in the package Generic_Inverted_Index. Here is the specification (file generic_inverted_index.ads):
 
<langsyntaxhighlight Adalang="ada">with Ada.Containers.Indefinite_Vectors;
private with Ada.Containers.Indefinite_Hashed_Maps;
 
Line 258:
type Storage_Type is new Maps.Map with null record;
 
end Generic_Inverted_Index;</langsyntaxhighlight>
 
Here is the implementation (generic_inverted_index.adb):
 
<langsyntaxhighlight Adalang="ada">package body Generic_Inverted_Index is
 
use Source_Vecs;
Line 356:
end Same_Vector;
 
end Generic_Inverted_Index;</langsyntaxhighlight>
 
===Package Parse_Lines===
Line 362:
The main program also uses an auxiliary package Parse_Lines. Note the usage of Gnat.Regpat, which itself is pattern matching package, specific for gnat/gcc. This package is derived from the [[Regular expressions#Ada|Ada implementation of the regular expressions task]]. Here is the spec (parse_lines.ads):
 
<langsyntaxhighlight Adalang="ada">with Gnat.Regpat;
 
package Parse_Lines is
Line 381:
procedure Iterate_Words(S: String);
 
end Parse_Lines;</langsyntaxhighlight>
 
And here is the implementation (parse_lines.adb):
 
<langsyntaxhighlight Adalang="ada">with Gnat.Regpat;
 
package body Parse_Lines is
Line 426:
end Iterate_Words;
 
end Parse_Lines;</langsyntaxhighlight>
 
===Alternative Implementation of Generic_Inverted_Index (Ada 2012)===
Line 432:
The new standard Ada 2012 simplifies the usage of containers significantly. The following runs under gnat (GNAT GPL 2011 (20110419)), when using the experimental -gnat2012 switch. The main program is the same. Here is the spec for Generic_Inverted_Index:
 
<langsyntaxhighlight Adalang="ada">with Ada.Containers.Indefinite_Vectors;
private with Ada.Containers.Indefinite_Hashed_Maps;
 
Line 487:
type Storage_Type is new Maps.Map with null record;
 
end Generic_Inverted_Index;</langsyntaxhighlight>
 
The implementation:
 
<langsyntaxhighlight Adalang="ada">package body Generic_Inverted_Index is
-- uses some of the new Ada 2012 syntax
use Source_Vecs;
Line 572:
end Same_Vector;
 
end Generic_Inverted_Index;</langsyntaxhighlight>
 
=={{header|AutoHotkey}}==
{{works with|AutoHotkey_L}}
 
<langsyntaxhighlight AutoHotkeylang="autohotkey">; http://www.autohotkey.com/forum/viewtopic.php?t=41479
inputbox, files, files, file pattern such as c:\files\*.txt
 
Line 641:
else
return word2docs[word2find]
}</langsyntaxhighlight>
 
=={{header|BBC BASIC}}==
{{works with|BBC BASIC for Windows}}
This uses a hashed index and linked lists to hold the file numbers.
<langsyntaxhighlight lang="bbcbasic"> DIM FileList$(4)
FileList$() = "BBCKEY0.TXT", "BBCKEY1.TXT", "BBCKEY2.TXT", \
\ "BBCKEY3.TXT", "BBCKEY4.TXT"
Line 717:
IF C% >= 65 IF C% <= 90 MID$(A$,A%,1) = CHR$(C%+32)
NEXT
= A$</langsyntaxhighlight>
'''Output:'''
<pre>
Line 728:
=={{header|C}}==
The code is stupidly long, having to implement a Trie to store strings and all -- the program doesn't do anything shiny, but Tries may be interesting to look at.
<langsyntaxhighlight Clang="c">#include <stdio.h>
#include <stdlib.h>
 
Line 878:
search_index(root, "boo");
return 0;
}</langsyntaxhighlight>Output:<syntaxhighlight lang="text">Search for "what": f1.txt source/f2.txt
Search for "is": f1.txt other_file source/f2.txt
Search for "banana": other_file
Search for "boo": not found</langsyntaxhighlight>
 
=={{header|C sharp|C#}}==
<langsyntaxhighlight lang="csharp">using System;
using System.Collections.Generic;
using System.IO;
Line 908:
Console.WriteLine("{0} found in: {1}", find, string.Join(" ", Invert(dictionary)[find]));
}
}</langsyntaxhighlight>
Sample output:
<syntaxhighlight lang="text">files: file1 file2 file3
find: what
what found in: file1 file2</langsyntaxhighlight>
 
=={{header|C++}}==
Same idea as the C implementation - trie to store the words
<langsyntaxhighlight lang="cpp">
#include <algorithm>
#include <fstream>
Line 1,033:
return 0;
}
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 1,054:
 
=={{header|Clojure}}==
<langsyntaxhighlight lang="clojure">(ns inverted-index.core
(:require [clojure.set :as sets]
[clojure.java.io :as io]))
Line 1,079:
(defn search [index query]
(apply sets/intersection (map index (term-seq query))))
</syntaxhighlight>
</lang>
 
=={{header|CoffeeScript}}==
<langsyntaxhighlight lang="coffeescript">
fs = require 'fs'
 
Line 1,118:
grep index, 'make_index'
grep index, 'sort'
</syntaxhighlight>
</lang>
output
<syntaxhighlight lang="text">
> coffee inverted_index.coffee
locations for 'make_index':
Line 1,136:
inverted_index.coffee:35
knuth_sample.coffee:12
</syntaxhighlight>
</lang>
 
=={{header|Common Lisp}}==
<langsyntaxhighlight Lisplang="lisp">(defpackage rosettacode.inverted-index
(:use cl))
(in-package rosettacode.inverted-index)
Line 1,174:
:test #'equal))
 
</syntaxhighlight>
</lang>
Example:
<langsyntaxhighlight lang="lisp">(defparameter *index* (build-index '("file1.txt" "file2.txt" "file3.txt")))
(defparameter *query* "foo bar")
(defparameter *result* (lookup *index* *query*))
(format t "Result for query ~s: ~{~a~^, ~}~%" *query* *result*)</langsyntaxhighlight>
 
=={{header|D}}==
<langsyntaxhighlight lang="d">import std.stdio, std.algorithm, std.string, std.file, std.regex;
 
void main() {
Line 1,214:
writefln("'%s' not found.", w);
}
}</langsyntaxhighlight>
Both the demo text files and the queries are from the Wikipedia page, they contain:
It is what it is.
Line 1,243:
{{libheader| system.Generics.Collections}}
{{libheader| SYstem.IOUtils}}
<syntaxhighlight lang="delphi">
<lang Delphi>
program Inverted_index;
 
Line 1,433:
 
Index.Free;
end.</langsyntaxhighlight>
Input: Same of [[#D|D]].
{{out}}
Line 1,453:
===Indexing===
Index values are sets associated with each word (key). We use the local-put-value function to permanently store the index, in the browser local storage.
<langsyntaxhighlight lang="lisp">
;; set of input files
(define FILES {T0.txt T1.txt T2.txt})
Line 1,477:
(let ((text (text-parse text)))
(for ((word text)) (invert-word (string-downcase word) file INVERT))))
</syntaxhighlight>
</lang>
 
=== Query ===
Intersect sets values of each word.
<langsyntaxhighlight lang="lisp">
;; usage : (inverted-search w1 w2 ..)
(define-syntax-rule (inverted-search w ...)
Line 1,493:
(set-intersect res (local-get-value word INVERT)))
FILES words))
</syntaxhighlight>
</lang>
Output :
<langsyntaxhighlight lang="lisp">
(map-files invert-text FILES)
(inverted-search is it)
Line 1,505:
(inverted-search boule)
[3]→ null
</syntaxhighlight>
</lang>
 
=={{header|Erlang}}==
Line 1,512:
Ditto for any other character.
 
<syntaxhighlight lang="erlang">
<lang Erlang>
-module( inverted_index ).
 
Line 1,547:
 
search_common( Files, Acc ) -> [X || X <- Acc, lists:member(X, Files)].
</syntaxhighlight>
</lang>
 
=={{header|F Sharp|F#}}==
<langsyntaxhighlight lang="fsharp">open System
open System.IO
 
Line 1,584:
|> Set.intersectMany
 
printf "Found in: " ; searchResults |> Set.iter (printf "%s ") ; printfn ""</langsyntaxhighlight>
Sample usage:
<pre>
Line 1,594:
 
=={{header|Factor}}==
<langsyntaxhighlight lang="factor">USING: assocs fry io.encodings.utf8 io.files kernel sequences
sets splitting vectors ;
IN: rosettacode.inverted-index
Line 1,610:
: query ( terms index -- files )
[ at ] curry map [ ] [ intersect ] map-reduce ;
</syntaxhighlight>
</lang>
 
Example use :
<syntaxhighlight lang="text">( scratchpad ) { "f1" "f2" "f3" } index-files
 
--- Data stack:
Line 1,619:
( scratchpad ) { "what" "is" "it" } swap query .
V{ "f1" "f2" }
</syntaxhighlight>
</lang>
 
=={{header|Go}}==
<langsyntaxhighlight lang="go">package main
 
import (
Line 1,756:
}
}
}</langsyntaxhighlight>
Session:
<pre>
Line 1,779:
=={{header|Haskell}}==
 
<langsyntaxhighlight lang="haskell">import Control.Monad
import Data.Char (isAlpha, toLower)
import qualified Data.Map as M
Line 1,810:
intersections xs = foldl1 S.intersection xs
 
lowercase = map toLower</langsyntaxhighlight>
 
An example of use, assuming the program is named <code>iindex</code> and there exist files <code>t0</code>, <code>t1</code>, and <code>t2</code> with contents "It is what it is.", "What is it?", and "It is a banana.":
Line 1,821:
 
The following implements a simple case insensitive inverse index using lists simulating texts.
<langsyntaxhighlight Iconlang="icon">procedure main()
 
texts := table() # substitute for read and parse files
Line 1,875:
write()
return
end</langsyntaxhighlight>
 
Output:<pre>Enter search terms (^z to quit) : is it
Line 1,891:
 
The following code will build a full index. Modification of search routines is left as an exercise:
<langsyntaxhighlight Uniconlang="unicon">record InvertedIndexRec(simple,full)
 
procedure FullInvertedIndex(ii,k,words) #: accumulate a full inverted index
Line 1,909:
 
return ii
end</langsyntaxhighlight>
 
=={{header|J}}==
Line 1,915:
This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.
 
<langsyntaxhighlight Jlang="j">require'files regex strings'
 
rxutf8 0 NB. support latin1 searches for this example, instead of utf8
Line 1,938:
hits=. buckets{~words i.~.parse tolower y
files {~ >([-.-.)each/hits
)</langsyntaxhighlight>
 
Example use:
 
<langsyntaxhighlight Jlang="j"> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'
>search 'finally learning'
~help/primer/end.htm
Line 1,950:
~help/primer/gui.htm
>search 'around'
~help/primer/gui.htm</langsyntaxhighlight>
 
=={{header|Java}}==
<syntaxhighlight lang="java">
<lang Java>
package org.rosettacode;
 
Line 2,058:
}
 
</syntaxhighlight>
</lang>
 
Example output:
<syntaxhighlight lang="java">
<lang Java>
java -cp bin org.rosettacode.InvertedIndex "huntsman,merit,dog,the,gutenberg,lovecraft,olympian" pg30637.txt pg7025.txt pg82.txt pg9090.txt
indexed pg30637.txt 106473 words
Line 2,075:
olympian pg30637.txt
 
</syntaxhighlight>
</lang>
 
=={{header|jq}}==
Line 2,085:
 
'''Part 1: inverted_index and search'''
<langsyntaxhighlight lang="jq"># Given an array of [ doc, array_of_distinct_words ]
# construct a lookup table: { word: array_of_docs }
def inverted_index:
Line 2,102:
else reduce words[1:][] as $word
( $dict[words[0]]; overlap( $dict[$word] ) )
end ; </langsyntaxhighlight>
 
'''Part 2: Interactive Search'''
Line 2,111:
could create a temporary file to hold the parsed output.
 
<langsyntaxhighlight lang="jq">def prompt_search:
"Enter a string or an array of strings to search for, quoting each string, or 0 to exit:",
( (input | if type == "array" then . elif type == "string" then [.]
Line 2,118:
| search($in), prompt_search ) ;
 
$in | inverted_index | prompt_search</langsyntaxhighlight>
 
'''Example''':
<langsyntaxhighlight lang="sh">$ jq -r -c -n --argfile in <(jq -R 'split(" ") | select(length>0) | [input_filename, unique]' T?.txt) -f Inverted_index.jq
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
"is"
Line 2,130:
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
0
$</langsyntaxhighlight>
 
=={{header|Julia}}==
<langsyntaxhighlight lang="julia">function makedoubleindex(files)
idx = Dict{String, Dict}()
for file in files
Line 2,169:
const searchterms = ["forehead", "of", "hand", "a", "foot"]
wordsearch(didx, searchterms)
</syntaxhighlight>
</lang>
 
=={{header|Kotlin}}==
<langsyntaxhighlight lang="scala">// version 1.1.51
 
import java.io.File
Line 2,229:
findWord(word)
}
}</langsyntaxhighlight>
 
Contents of files:
Line 2,284:
? : q
</pre>
 
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">si = CreateSearchIndex["ExampleData/Text", Method -> "TFIDF"];
Manipulate[Grid[sr = TextSearch[si, ToString[s]];
{FileBaseName /@ Normal[Dataset[sr][All, "ReferenceLocation"]],
Column[#, Frame -> All] & /@ sr[All, "Snippet"]} // Transpose,
Frame -> All], {s, "tree"}]</syntaxhighlight>
{{out}}
Outputs a graphical user interface with an interactive input field showing filename and the snippet of text that includes the search string.
 
=={{header|Nim}}==
We have used as examples three files containing the text from Wikipedia article. We used here an index which keep document name and line number. It would be easy to add the position in the line.
 
<langsyntaxhighlight Nimlang="nim">import os, strformat, strutils, tables
 
type
Line 2,442 ⟶ 2,451:
stdout.write &"... in “{docName}”, line{plural(linenums.len)}"
for num in linenums: stdout.write ' ', num
echo ""</langsyntaxhighlight>
 
{{out}}
Line 2,473 ⟶ 2,482:
ocamlc -o inv.byte unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma inv.cmo
 
<langsyntaxhighlight lang="ocaml">TYPE_CONV_PATH "Inverted_index"
 
type files = string array with sexp
Line 2,553 ⟶ 2,562:
| "--index-file", in_file -> index_file in_file
| "--search-word", word -> search_word word
| _ -> usage()</langsyntaxhighlight>
 
=={{header|Perl}}==
<langsyntaxhighlight lang="perl">use Set::Object 'set';
 
# given an array of files, returns the index
Line 2,609 ⟶ 2,618:
# first arg is a comma-separated list of words to search for
print "$_\n"
foreach search_words_with_index({createindex(@ARGV)}, @searchwords);</langsyntaxhighlight>
 
=={{header|Phix}}==
Line 2,616 ⟶ 2,625:
Might be better (and almost as easy) for the dictionary values to be say
{total_count, {file nos}, {file counts}}.
<!--<syntaxhighlight lang="phix">(notonline)-->
<lang Phix>-- demo\rosetta\Inverted_index.exw
<span style="color: #000080;font-style:italic;">-- demo\rosetta\Inverted_index.exw</span>
integer word_count = 0
<span style="color: #008080;">without</span> <span style="color: #008080;">js</span> <span style="color: #000080;font-style:italic;">-- (file i/o)</span>
sequence filenames = {}
<span style="color: #004080;">integer</span> <span style="color: #000000;">word_count</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">0</span>
 
<span style="color: #004080;">sequence</span> <span style="color: #000000;">filenames</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span>
function is_ascii(string txt)
for i=1 to length(txt) do
<span style="color: #008080;">function</span> <span style="color: #000000;">is_ascii</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span>
integer ch = txt[i]
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
if ch='\0' or ch>=#7F then return 0 end if
<span style="color: #004080;">integer</span> <span style="color: #000000;">ch</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">txt</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
end for
<span style="color: #008080;">if</span> <span style="color: #000000;">ch</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'\0'</span> <span style="color: #008080;">or</span> <span style="color: #000000;">ch</span><span style="color: #0000FF;">>=</span><span style="color: #000000;">#7F</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #000000;">0</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
return 1
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
end function
<span style="color: #008080;">return</span> <span style="color: #000000;">1</span>
 
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
procedure add_words(string name, sequence words)
filenames = append(filenames,name)
<span style="color: #008080;">procedure</span> <span style="color: #000000;">add_words</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">name</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">)</span>
integer fn = length(filenames)
<span style="color: #000000;">filenames</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">,</span><span style="color: #000000;">name</span><span style="color: #0000FF;">)</span>
for i=1 to length(words) do
<span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">)</span>
string word = words[i]
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
if word[1]>='a' -- skip numbers
<span style="color: #004080;">string</span> <span style="color: #000000;">word</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
and word[1]<='z' then
<span style="color: #008080;">if</span> <span style="color: #000000;">word</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]>=</span><span style="color: #008000;">'a'</span> <span style="color: #000080;font-style:italic;">-- skip numbers</span>
integer node = getd_index(word)
<span style="color: #008080;">and</span> <span style="color: #000000;">word</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]<=</span><span style="color: #008000;">'z'</span> <span style="color: #008080;">then</span>
if node=0 then -- not present
<span style="color: #004080;">integer</span> <span style="color: #000000;">node</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">)</span>
setd(word,{fn})
<span style="color: #008080;">if</span> <span style="color: #000000;">node</span><span style="color: #0000FF;">=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- not present</span>
word_count += 1
<span style="color: #7060A8;">setd</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">})</span>
else
<span style="color: #000000;">word_count</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span>
sequence val = getd_by_index(node)
<span if find(fn,val)style=0"color: then#008080;">else</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">val</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_by_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">node</span><span style="color: #0000FF;">)</span>
setd(word,append(val,fn))
<span style="color: #008080;">if</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">,</span><span style="color: #000000;">val</span><span style="color: #0000FF;">)=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span>
end if
<span style="color: #7060A8;">setd</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">,</span><span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">val</span><span style="color: #0000FF;">,</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">))</span>
end if
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
end if
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
end for
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
end procedure
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
 
<span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span>
procedure load_directory()
sequence d = dir(".")
<span style="color: #008080;">procedure</span> <span style="color: #000000;">load_directory</span><span style="color: #0000FF;">()</span>
for i=1 to length(d) do
<span style="color: #004080;">sequence</span> <span style="color: #000000;">d</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">dir</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"."</span><span style="color: #0000FF;">)</span>
if not find('d',d[i][D_ATTRIBUTES]) -- skip directories
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">d</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
and d[i][D_SIZE]<1024*1024*1024 then -- and files > 1GB
<span style="color: #008080;">if</span> <span style="color: #008080;">not</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #008000;">'d'</span><span style="color: #0000FF;">,</span><span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_ATTRIBUTES</span><span style="color: #0000FF;">])</span> <span style="color: #000080;font-style:italic;">-- skip directories</span>
string name = d[i][D_NAME]
<span style="color: #008080;">and</span> <span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_SIZE</span><span style="color: #0000FF;">]<</span><span style="color: #000000;">1024</span><span style="color: #0000FF;">*</span><span style="color: #000000;">1024</span><span style="color: #0000FF;">*</span><span style="color: #000000;">1024</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- and files &gt; 1GB</span>
integer fn = open(name,"rb")
<span style="color: #004080;">string</span> <span style="color: #000000;">name</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_NAME</span><span style="color: #0000FF;">]</span>
string txt = lower(get_text(fn))
<span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">open</span><span style="color: #0000FF;">(</span><span style="color: #000000;">name</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"rb"</span><span style="color: #0000FF;">)</span>
close(fn)
<span style="color: #004080;">string</span> <span style="color: #000000;">txt</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">lower</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">get_text</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">))</span>
if is_ascii(txt) then -- skip any bitmaps etc
<span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
sequence words = split_any(txt,"\0\r\n\t !\"#$%&\'()*+,-./:;<=>?@[\\]^`{|}~")
<span style="color: #008080;">if</span> <span style="color: #000000;">is_ascii</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- skip any bitmaps etc</span>
add_words(name,words)
<span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">split_any</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\0\r\n\t !\"#$%&\'()*+,-./:;&lt;=&gt;?@[\\]^`{|}~"</span><span style="color: #0000FF;">)</span>
end if
<span style="color: #000000;">add_words</span><span style="color: #0000FF;">(</span><span style="color: #000000;">name</span><span style="color: #0000FF;">,</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span>
end if
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
end for
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
end procedure
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
 
<span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span>
function lookup(sequence words)
sequence files = {} -- indexes to filenames
<span style="color: #008080;">function</span> <span style="color: #000000;">lookup</span><span style="color: #0000FF;">(</span><span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">)</span>
for i=1 to length(words) do
<span style="color: #004080;">sequence</span> <span style="color: #000000;">files</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span> <span style="color: #000080;font-style:italic;">-- indexes to filenames</span>
string word = words[i]
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
integer node = getd_index(word)
<span style="color: #004080;">string</span> <span style="color: #000000;">word</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
if node=0 then return {} end if
<span style="color: #004080;">integer</span> <span style="color: #000000;">node</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">)</span>
sequence val = getd_by_index(node)
<span style="color: #008080;">if</span> <span style="color: #000000;">node</span><span style="color: #0000FF;">=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
if i=1 then
<span style="color: #004080;">sequence</span> <span style="color: #000000;">val</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_by_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">node</span><span style="color: #0000FF;">)</span>
files = val
<span style="color: #008080;">if</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span>
else
<span style="color: #000000;">files</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">val</span>
for j=length(files) to 1 by -1 do
<span style="color: #008080;">else</span>
if not find(files[j],val) then
<span style="color: #008080;">for</span> <span style="color: #000000;">j</span><span style="color: #0000FF;">=</span><span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">to</span> <span style="color: #000000;">1</span> <span style="color: #008080;">by</span> <span style="color: #0000FF;">-</span><span style="color: #000000;">1</span> <span style="color: #008080;">do</span>
files[j..j] = {}
<span style="color: #008080;">if</span> <span style="color: #008080;">not</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">j</span><span style="color: #0000FF;">],</span><span style="color: #000000;">val</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span>
end if
<span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">j</span><span style="color: #0000FF;">..</span><span style="color: #000000;">j</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span>
end for
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
if length(files)=0 then return {} end if
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
end if
<span style="color: #008080;">if</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
end for
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
for i=1 to length(files) do
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
files[i] = filenames[files[i]]
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
end for
<span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">filenames</span><span style="color: #0000FF;">[</span><span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]]</span>
return files
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
end function
<span style="color: #008080;">return</span> <span style="color: #000000;">files</span>
 
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
load_directory()
?word_count
<span style="color: #000000;">load_directory</span><span style="color: #0000FF;">()</span>
?lookup({"load_directory"}) -- should only be this file
<span style="color: #0000FF;">?</span><span style="color: #000000;">word_count</span>
?lookup({"dir"}) -- currently two use this
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"load_directory"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should only be this file</span>
?lookup({"lower"}) -- currently four use this
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently two use this</span>
?lookup({"lower","dir"}) -- currently two use both
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently four use this</span>
?lookup({"dir","lower"}) -- should be the same two
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently two use both</span>
?lookup({"ban"&"anafish"}) -- should be none ({})</lang>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be the same two</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"ban"</span><span style="color: #0000FF;">&</span><span style="color: #008000;">"anafish"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be none ({})</span>
<!--</syntaxhighlight>-->
{{out}}
Note the distributed version has been modified and is occasionally used for real, so the output will likely differ.
<pre>
3365
Line 2,711 ⟶ 2,724:
=={{header|PHP}}==
 
<langsyntaxhighlight lang="php"><?php
 
function buildInvertedIndex($filenames)
Line 2,756 ⟶ 2,769:
echo "Unable to find the word \"$word\" in the index\n";
}
}</langsyntaxhighlight>
 
=={{header|PicoLisp}}==
Line 2,769 ⟶ 2,782:
it is a banana</pre>
we can read them into a binary tree in the global variable '*MyIndex'
<langsyntaxhighlight PicoLisplang="picolisp">(off *MyIndex)
 
(use Word
Line 2,783 ⟶ 2,796:
(extract
'((Word) (val (car (idx '*MyIndex Word))))
(rest) ) ) )</langsyntaxhighlight>
Output:
<pre>: (searchFor "what" "is" "it")
Line 2,796 ⟶ 2,809:
=={{header|PowerShell}}==
{{works with|PowerShell|2}}
<langsyntaxhighlight PowerShelllang="powershell">function Index-File ( [string[]]$FileList )
{
# Create index hashtable, as needed
Line 2,838 ⟶ 2,851:
{
return $WordIndex[$Word]
}</langsyntaxhighlight>
<langsyntaxhighlight PowerShelllang="powershell"># Populate test files
@'
Files full of
Line 2,853 ⟶ 2,866:
Use the index
to find the files.
'@ | Out-File -FilePath C:\Test\File3.txt</langsyntaxhighlight>
<langsyntaxhighlight PowerShelllang="powershell"># Index files
Index-File C:\Test\File1.txt, C:\Test\File2.txt, C:\Test\File3.txt</langsyntaxhighlight>
Because PowerShell is a shell language, it is "a user interface to do a search". After running the script defining the functions and running a command to index the files, the user can simply run the search function at the PowerShell command prompt.
Alternatively, one could create a more complex custom UI or GUI if desired.
<langsyntaxhighlight PowerShelllang="powershell"># Query index
Find-Word files</langsyntaxhighlight>
{{out}}
<pre>C:\Test\File1.txt
Line 2,869 ⟶ 2,882:
 
First the simple inverted index from [[wp:Inverted index|here]] together with an implementation of a search for (multiple) terms from that index.
<langsyntaxhighlight lang="python">'''
This implements: http://en.wikipedia.org/wiki/Inverted_index of 28/07/10
'''
Line 2,909 ⟶ 2,922:
terms = ["what", "is", "it"]
print('\nTerm Search for: ' + repr(terms))
pp(sorted(termsearch(terms)))</langsyntaxhighlight>
 
'''Sample Output'''
Line 2,937 ⟶ 2,950:
 
<small>It is assumed that the following code is added to the end of the code for the simple case above and so shares its file opening and parsing results</small>
<langsyntaxhighlight lang="python">from collections import Counter
 
 
Line 2,988 ⟶ 3,001:
print(ans)
ans = Counter(ans)
print(' The phrase is found most commonly in text: ' + repr(ans.most_common(1)[0][0]))</langsyntaxhighlight>
 
'''Sample Output'''
Line 3,009 ⟶ 3,022:
 
=={{header|Racket}}==
<langsyntaxhighlight lang="racket">
#!/usr/bin/env racket
#lang racket
Line 3,032 ⟶ 3,045:
(printf "No matching files.\n")
(printf "Terms found at: ~a.\n" (string-join all ", "))))
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 3,051 ⟶ 3,064:
(formerly Perl 6)
{{works with|rakudo|2015-09-16}}
<syntaxhighlight lang="raku" perl6line>sub MAIN (*@files) {
my %norm;
do for @files -> $file {
Line 3,063 ⟶ 3,076:
}
}
}</langsyntaxhighlight>
 
=={{header|REXX}}==
Line 3,071 ⟶ 3,084:
 
To see more about Burma Shave signs, see the Wikipedia entry: &nbsp; [http://en.wikipedia.org/wiki/Burma-Shave Burma Shave signs.]
<langsyntaxhighlight lang="rexx">/*REXX program illustrates building a simple inverted index and a method of word find.*/
@.= /*a dictionary of words (so far). */
!= /*a list of found words (so far). */
Line 3,128 ⟶ 3,141:
q=strip(q, 'T', substr(@punctuation, j, 1) )
end /*j*/
return q</langsyntaxhighlight>
{{out|output|text=&nbsp; when using the default inputs:}}
 
Line 3,227 ⟶ 3,240:
 
'''indexmerge.rb'''
<langsyntaxhighlight lang="ruby">if File.exist? "index.dat"
@data = Marshal.load open("index.dat")
else
Line 3,255 ⟶ 3,268:
open("index.dat", "w") do |index|
index.write Marshal.dump(@data)
end</langsyntaxhighlight>
 
'''indexsearch.rb'''
<langsyntaxhighlight lang="ruby">if File.exist? "index.dat"
@data = Marshal.load open("index.dat")
else
Line 3,279 ⟶ 3,292:
end
 
p @result</langsyntaxhighlight>
 
'''Output'''
Line 3,292 ⟶ 3,305:
 
=={{header|Rust}}==
<langsyntaxhighlight Rustlang="rust">// Part 1: Inverted index structure
 
use std::{
Line 3,532 ⟶ 3,545:
 
Ok(())
}</langsyntaxhighlight>
 
{{out}}
Line 3,567 ⟶ 3,580:
 
=={{header|Scala}}==
<langsyntaxhighlight Scalalang="scala">object InvertedIndex extends App {
import java.io.File
 
Line 3,597 ⟶ 3,610:
}
}
}</langsyntaxhighlight>
{{out}}
<pre>> InvertedIndex "the" file1.txt file2.txt file3.txt
Line 3,619 ⟶ 3,632:
 
=={{header|Tcl}}==
<langsyntaxhighlight lang="tcl">package require Tcl 8.5
proc wordsInString str {
# We define "words" to be "maximal sequences of 'word' characters".
Line 3,707 ⟶ 3,720:
}
return $files
}</langsyntaxhighlight>
For the GUI:
<langsyntaxhighlight lang="tcl">package require Tk
pack [labelframe .files -text Files] -side left -fill y
pack [listbox .files.list -listvariable files]
Line 3,743 ⟶ 3,756:
lappend found "Searching for files with \"$terms\"" {*}$fs \
"---------------------"
}</langsyntaxhighlight>
 
=={{header|TUSCRIPT}}==
<langsyntaxhighlight lang="tuscript">
$$ MODE TUSCRIPT
 
Line 3,777 ⟶ 3,790:
ENDLOOP
PRINT "-> ",files
</syntaxhighlight>
</lang>
Output:
<pre>
Line 3,794 ⟶ 3,807:
{{works with|ksh93}}
 
<langsyntaxhighlight lang="bash">#!/bin/ksh
 
typeset -A INDEX
Line 3,818 ⟶ 3,831:
(( count == $# )) && echo $file
done
}</langsyntaxhighlight>
 
Example use:
<langsyntaxhighlight lang="korn">index *.txt
search hello world
</syntaxhighlight>
</lang>
 
===Directory on filesystem===
Line 3,833 ⟶ 3,846:
* Add note about slowness.
 
<langsyntaxhighlight lang="bash">#!/bin/sh
# index.sh - create an inverted index
 
Line 3,881 ⟶ 3,894:
echo >&2
: $((fi += 1))
done</langsyntaxhighlight>
 
<langsyntaxhighlight lang="bash">#!/bin/sh
# search.sh - search an inverted index
 
Line 3,915 ⟶ 3,928:
}
 
$want "$@"</langsyntaxhighlight>
 
=={{header|Wren}}==
Line 3,922 ⟶ 3,935:
{{libheader|Wren-pattern}}
{{libheader|Wren-str}}
<langsyntaxhighlight ecmascriptlang="wren">import "./ioutil" for FileUtil, Input
import "./pattern" for Pattern
import "./str" for Str
import "os" for Process
 
Line 3,987 ⟶ 4,000:
if (word == "q" || word == "Q") return
findWord.call(word)
}</langsyntaxhighlight>
 
{{out}}
9,476

edits