Inverted index: Difference between revisions

m
m (→‎{{header|REXX}}: added DO-END labels, removed a few superflous blanks. -- ~~~~)
m (→‎{{header|Wren}}: Minor tidy)
 
(48 intermediate revisions by 26 users not shown)
Line 1:
{{task|Classic CS problems and programs}}[[Category:Search]]
 
An [[wp:Inverted_index|Inverted Index]] is a data structure used to create full text search.
 
 
Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms. The search index can be in memory.
;Task:
Given a set of text files, implement a program to create an inverted index.
 
Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms.
 
The search index can be in memory.
<br><br>
 
=={{header|11l}}==
{{trans|D}}
 
<syntaxhighlight lang="11l">DefaultDict[String, Set[String]] index
 
F parse_file(fname, fcontents)
L(word) fcontents.split(re:‘\W’)
:index[word.lowercase()].add(fname)
 
L(fname, fcontents) [(‘inv1.txt’, ‘It is what it is.’),
(‘inv2.txt’, ‘What is it?’),
(‘inv3.txt’, ‘It is a banana!’)]
parse_file(fname, fcontents)
 
L(w) [‘cat’, ‘is’, ‘banana’, ‘it’, ‘what’]
print("\nEnter a word to search for: (q to quit): "w)
I w C index
print(‘'#.' found in #..’.format(w, sorted(Array(index[w]))))
E
print(‘'#.' not found.’.format(w))</syntaxhighlight>
 
{{out}}
<pre>
 
Enter a word to search for: (q to quit): cat
'cat' not found.
 
Enter a word to search for: (q to quit): is
'is' found in [inv1.txt, inv2.txt, inv3.txt].
 
Enter a word to search for: (q to quit): banana
'banana' found in [inv3.txt].
 
Enter a word to search for: (q to quit): it
'it' found in [inv1.txt, inv2.txt, inv3.txt].
 
Enter a word to search for: (q to quit): what
'what' found in [inv1.txt, inv2.txt].
</pre>
 
=={{header|Ada}}==
Line 10 ⟶ 58:
Here is the main program (file inverted_index.adb):
 
<langsyntaxhighlight Adalang="ada">with Ada.Text_IO, Generic_Inverted_Index, Ada.Strings.Hash, Parse_Lines;
use Ada.Text_IO;
 
Line 123 ⟶ 171:
end;
end loop;
end Inverted_Index;</langsyntaxhighlight>
 
A sample output:
Line 130 ⟶ 178:
 
Enter one or more words to search for; <return> to finish:
it
Found in the following files: 0.txt, 1.txt, 2.txt
 
Line 148 ⟶ 196:
The real work is actually done in the package Generic_Inverted_Index. Here is the specification (file generic_inverted_index.ads):
 
<langsyntaxhighlight Adalang="ada">with Ada.Containers.Indefinite_Vectors;
private with Ada.Containers.Indefinite_Hashed_Maps;
 
Line 210 ⟶ 258:
type Storage_Type is new Maps.Map with null record;
 
end Generic_Inverted_Index;</langsyntaxhighlight>
 
Here is the implementation (generic_inverted_index.adb):
 
<langsyntaxhighlight Adalang="ada">package body Generic_Inverted_Index is
 
use Source_Vecs;
Line 308 ⟶ 356:
end Same_Vector;
 
end Generic_Inverted_Index;</langsyntaxhighlight>
 
===Package Parse_Lines===
Line 314 ⟶ 362:
The main program also uses an auxiliary package Parse_Lines. Note the usage of Gnat.Regpat, which itself is pattern matching package, specific for gnat/gcc. This package is derived from the [[Regular expressions#Ada|Ada implementation of the regular expressions task]]. Here is the spec (parse_lines.ads):
 
<langsyntaxhighlight Adalang="ada">with Gnat.Regpat;
 
package Parse_Lines is
Line 333 ⟶ 381:
procedure Iterate_Words(S: String);
 
end Parse_Lines;</langsyntaxhighlight>
 
And here is the implementation (parse_lines.adb):
 
<langsyntaxhighlight Adalang="ada">with Gnat.Regpat;
 
package body Parse_Lines is
Line 378 ⟶ 426:
end Iterate_Words;
 
end Parse_Lines;</langsyntaxhighlight>
 
===Alternative Implementation of Generic_Inverted_Index (Ada 2012)===
Line 384 ⟶ 432:
The new standard Ada 2012 simplifies the usage of containers significantly. The following runs under gnat (GNAT GPL 2011 (20110419)), when using the experimental -gnat2012 switch. The main program is the same. Here is the spec for Generic_Inverted_Index:
 
<langsyntaxhighlight Adalang="ada">with Ada.Containers.Indefinite_Vectors;
private with Ada.Containers.Indefinite_Hashed_Maps;
 
Line 439 ⟶ 487:
type Storage_Type is new Maps.Map with null record;
 
end Generic_Inverted_Index;</langsyntaxhighlight>
 
The implementation:
 
<langsyntaxhighlight Adalang="ada">package body Generic_Inverted_Index is
-- uses some of the new Ada 2012 syntax
use Source_Vecs;
Line 524 ⟶ 572:
end Same_Vector;
 
end Generic_Inverted_Index;</langsyntaxhighlight>
 
=={{header|AutoHotkey}}==
{{works with|AutoHotkey_L}}
 
<langsyntaxhighlight AutoHotkeylang="autohotkey">; http://www.autohotkey.com/forum/viewtopic.php?t=41479
inputbox, files, files, file pattern such as c:\files\*.txt
 
Line 537 ⟶ 585:
Loop, %files%, 0,1
{
tooltip,%A_index% / 500
 
wordList := WordsIn(A_LoopFileFullPath)
InvertedIndex(wordList, A_loopFileFullpath)
}
 
Line 558 ⟶ 606:
 
WordsIn(docpath)
{
FileRead, content, %docpath%
spos = 1
Line 568 ⟶ 616:
this_wordList .= match "`n"
}
 
Sort, this_wordList, U
return this_wordList
}
 
Line 577 ⟶ 625:
global word2docs
 
loop, parse, words, `n,`r
{
{
if A_loopField =
continue
Line 593 ⟶ 641:
else
return word2docs[word2find]
}</langsyntaxhighlight>
 
=={{header|BBC BASIC}}==
{{works with|BBC BASIC for Windows}}
This uses a hashed index and linked lists to hold the file numbers.
<syntaxhighlight lang="bbcbasic"> DIM FileList$(4)
FileList$() = "BBCKEY0.TXT", "BBCKEY1.TXT", "BBCKEY2.TXT", \
\ "BBCKEY3.TXT", "BBCKEY4.TXT"
 
DictSize% = 30000
DIM Index{(DictSize%-1) word$, link%}
 
REM Build inverted index:
FOR file% = DIM(FileList$(),1) TO 0 STEP -1
filename$ = FileList$(file%)
F% = OPENIN(filename$)
IF F% = 0 ERROR 100, "Failed to open file"
 
WHILE NOT EOF#F%
REPEAT C%=BGET#F% : UNTIL C%>64 OR EOF#F% : word$ = CHR$(C%)
REPEAT C%=BGET#F% : word$ += CHR$(C%) : UNTIL C%<65
word$ = FNlower(LEFT$(word$))
 
hash% = FNhash(word$)
WHILE Index{(hash%)}.word$<>"" AND Index{(hash%)}.word$<>word$
hash% = (hash% + 1) MOD DictSize% : REM Collision
ENDWHILE
Index{(hash%)}.word$ = word$
link% = Index{(hash%)}.link%
IF link%=0 OR link%!4<>file% THEN
DIM heap% 7 : heap%!4 = file%
!heap% = link%
Index{(hash%)}.link% = heap% : REM Linked list
ENDIF
ENDWHILE
 
CLOSE #F%
NEXT file%
 
REM Now query the index:
PRINT FNquery("random")
PRINT FNquery("envelope")
PRINT FNquery("zebra")
PRINT FNquery("the")
END
 
DEF FNquery(A$)
LOCAL hash%, link%, temp%
A$ = FNlower(A$)
hash% = FNhash(A$)
temp% = hash%
WHILE Index{(hash%)}.word$ <> A$
hash% = (hash% + 1) MOD DictSize%
IF hash% = temp% THEN = """" + A$ + """ not found"
ENDWHILE
link% = Index{(hash%)}.link%
A$ = """" + A$ + """ found in "
WHILE link%
A$ += FileList$(link%!4) + ", "
link% = !link%
ENDWHILE
= LEFT$(LEFT$(A$))
 
DEF FNhash(A$)
LOCAL hash%
IF LEN(A$) < 4 A$ += STRING$(4-LEN(A$),CHR$0)
hash% = !!^A$
IF LEN(A$) > 4 hash% EOR= !(!^A$ + LEN(A$) - 4)
= hash% MOD DictSize%
 
DEF FNlower(A$)
LOCAL A%,C%
FOR A% = 1 TO LEN(A$)
C% = ASCMID$(A$,A%)
IF C% >= 65 IF C% <= 90 MID$(A$,A%,1) = CHR$(C%+32)
NEXT
= A$</syntaxhighlight>
'''Output:'''
<pre>
"random" found in BBCKEY2.TXT, BBCKEY3.TXT, BBCKEY4.TXT
"envelope" found in BBCKEY1.TXT, BBCKEY4.TXT
"zebra" not found
"the" found in BBCKEY0.TXT, BBCKEY1.TXT, BBCKEY2.TXT, BBCKEY3.TXT, BBCKEY4.TXT
</pre>
 
=={{header|C}}==
The code is stupidly long, having to implement a Trie to store strings and all -- the program doesn't do anything shiny, but Tries may be interesting to look at.
<langsyntaxhighlight Clang="c">#include <stdio.h>
#include <stdlib.h>
 
Line 603 ⟶ 734:
int chr_idx[256] = {0};
char idx_chr[256] = {0};
 
#define FNAME 0
typedef struct trie_t *trie, trie_t;
struct trie_t {
trie next[sizeof(chr_legal)]; /* next letter; slot 0 is for file name */
int eow;
};
 
trie trie_new() { return calloc(sizeof(trie_t), 1); }
 
#define find_word(r, w) trie_trav(r, w, 1)
/* tree traversal: returns node if end of word and matches string, optionally
* create node if doesn't exist
*/
trie trie_trav(trie root, const char * str, int no_create)
{
int c;
while (root) {
if ((c = str[0]) == '\0') {
if (!root->eow && no_create) return 0;
break;
}
}
if (! (c = chr_idx[c]) ) {
str++;
continue;
}
}
 
if (!root->next[c]) {
if (no_create) return 0;
root->next[c] = trie_new();
}
}
root = root->next[c];
str++;
}
}
return root;
}
 
/* complete traversal of whole tree, calling callback at each end of word node.
* similar method can be used to free nodes, had we wanted to do that.
Line 645 ⟶ 776:
int trie_all(trie root, char path[], int depth, int (*callback)(char *))
{
int i;
if (root->eow && !callback(path)) return 0;
 
for (i = 1; i < sizeof(chr_legal); i++) {
if (!root->next[i]) continue;
 
path[depth] = idx_chr[i];
path[depth + 1] = '\0';
if (!trie_all(root->next[i], path, depth + 1, callback))
return 0;
}
}
return 1;
}
 
void add_index(trie root, const char *word, const char *fname)
{
trie x = trie_trav(root, word, 0);
x->eow = 1;
 
if (!x->next[FNAME])
x->next[FNAME] = trie_new();
x = trie_trav(x->next[FNAME], fname, 0);
x->eow = 1;
}
 
int print_path(char *path)
{
printf(" %s", path);
return 1;
}
 
/* pretend we parsed text files and got lower cased words: dealing *
* with text file is a whole other animal and would make code too long */
const char *files[] = { "f1.txt", "source/f2.txt", "other_file" };
const char *text[][5] ={{ "it", "is", "what", "it", "is" },
{ "what", "is", "it", 0 },
{ "it", "is", "a", "banana", 0 }};
 
trie init_tables()
{
int i, j;
trie root = trie_new();
for (i = 0; i < sizeof(chr_legal); i++) {
chr_idx[(int)chr_legal[i]] = i + 1;
idx_chr[i + 1] = chr_legal[i];
}
}
 
/* Enable USE_ADVANCED_FILE_HANDLING to use advanced file handling.
* You need to have files named like above files[], with words in them
* like in text[][]. Case doesn't matter (told you it's advanced).
*/
#define USE_ADVANCED_FILE_HANDLING 0
#if USE_ADVANCED_FILE_HANDLING
void read_file(const char * fname) {
char cmd[1024];
char word[1024];
sprintf(cmd, "perl -p -e 'while(/(\\w+)/g) {print lc($1),\"\\n\"}' %s", fname);
FILE *in = popen(cmd, "r");
while (!feof(in)) {
fscanf(in, "%1000s", word);
add_index(root, word, fname);
}
}
pclose(in);
};
 
read_file("f1.txt");
read_file("source/f2.txt");
read_file("other_file");
#else
for (i = 0; i < 3; i++) {
for (j = 0; j < 5; j++) {
if (!text[i][j]) break;
add_index(root, text[i][j], files[i]);
}
}
}
}
#endif /*USE_ADVANCED_FILE_HANDLING*/
 
return root;
}
 
void search_index(trie root, const char *word)
{
char path[1024];
printf("Search for \"%s\": ", word);
trie found = find_word(root, word);
 
if (!found) printf("not found\n");
else {
trie_all(found->next[FNAME], path, 0, print_path);
printf("\n");
}
}
}
 
int main()
{
trie root = init_tables();
 
search_index(root, "what");
search_index(root, "is");
search_index(root, "banana");
search_index(root, "boo");
return 0;
}</langsyntaxhighlight>Output:<syntaxhighlight lang="text">Search for "what": f1.txt source/f2.txt
Search for "is": f1.txt other_file source/f2.txt
Search for "banana": other_file
Search for "boo": not found</langsyntaxhighlight>
 
=={{header|C sharp|C#}}==
<langsyntaxhighlight lang="csharp">using System;
using System.Collections.Generic;
using System.IO;
Line 777 ⟶ 908:
Console.WriteLine("{0} found in: {1}", find, string.Join(" ", Invert(dictionary)[find]));
}
}</langsyntaxhighlight>
Sample output:
<syntaxhighlight lang="text">files: file1 file2 file3
find: what
what found in: file1 file2</langsyntaxhighlight>
 
=={{header|C++}}==
Same idea as the C implementation - trie to store the words
<syntaxhighlight lang="cpp">
#include <algorithm>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>
 
const std::string _CHARS = "abcdefghijklmnopqrstuvwxyz0123456789.:-_/";
const size_t MAX_NODES = 41;
 
class node
{
public:
node() { clear(); }
node( char z ) { clear(); }
~node() { for( int x = 0; x < MAX_NODES; x++ ) if( next[x] ) delete next[x]; }
void clear() { for( int x = 0; x < MAX_NODES; x++ ) next[x] = 0; isWord = false; }
bool isWord;
std::vector<std::string> files;
node* next[MAX_NODES];
};
 
class index {
public:
void add( std::string s, std::string fileName ) {
std::transform( s.begin(), s.end(), s.begin(), tolower );
std::string h;
for( std::string::iterator i = s.begin(); i != s.end(); i++ ) {
if( *i == 32 ) {
pushFileName( addWord( h ), fileName );
h.clear();
continue;
}
h.append( 1, *i );
}
if( h.length() )
pushFileName( addWord( h ), fileName );
}
void findWord( std::string s ) {
std::vector<std::string> v = find( s );
if( !v.size() ) {
std::cout << s + " was not found!\n";
return;
}
std::cout << s << " found in:\n";
for( std::vector<std::string>::iterator i = v.begin(); i != v.end(); i++ ) {
std::cout << *i << "\n";
}
std::cout << "\n";
}
private:
void pushFileName( node* n, std::string fn ) {
std::vector<std::string>::iterator i = std::find( n->files.begin(), n->files.end(), fn );
if( i == n->files.end() ) n->files.push_back( fn );
}
const std::vector<std::string>& find( std::string s ) {
size_t idx;
std::transform( s.begin(), s.end(), s.begin(), tolower );
node* rt = &root;
for( std::string::iterator i = s.begin(); i != s.end(); i++ ) {
idx = _CHARS.find( *i );
if( idx < MAX_NODES ) {
if( !rt->next[idx] ) return std::vector<std::string>();
rt = rt->next[idx];
}
}
if( rt->isWord ) return rt->files;
return std::vector<std::string>();
}
node* addWord( std::string s ) {
size_t idx;
node* rt = &root, *n;
for( std::string::iterator i = s.begin(); i != s.end(); i++ ) {
idx = _CHARS.find( *i );
if( idx < MAX_NODES ) {
n = rt->next[idx];
if( n ){
rt = n;
continue;
}
n = new node( *i );
rt->next[idx] = n;
rt = n;
}
}
rt->isWord = true;
return rt;
}
node root;
};
int main( int argc, char* argv[] ) {
index t;
std::string s;
std::string files[] = { "file1.txt", "f_text.txt", "text_1b.txt" };
 
for( int x = 0; x < 3; x++ ) {
std::ifstream f;
f.open( files[x].c_str(), std::ios::in );
if( f.good() ) {
while( !f.eof() ) {
f >> s;
t.add( s, files[x] );
s.clear();
}
f.close();
}
}
 
while( true ) {
std::cout << "Enter one word to search for, return to exit: ";
std::getline( std::cin, s );
if( !s.length() ) break;
t.findWord( s );
 
}
return 0;
}
</syntaxhighlight>
{{out}}
<pre>
Enter one word to search for, return to exit: goodness
goodness found in:
file1.txt
f_text.txt
 
Enter one word to search for, return to exit: because
because found in:
f_text.txt
 
Enter one word to search for, return to exit: her
her found in:
text_1b.txt
 
Enter one word to search for, return to exit: fat
fat was not found!
</pre>
 
=={{header|Clojure}}==
<syntaxhighlight lang="clojure">(ns inverted-index.core
(:require [clojure.set :as sets]
[clojure.java.io :as io]))
 
(def pattern #"\w+") ; Java regex for a raw term: here a substring of alphanums
(defn normalize [match] (.toLowerCase match)) ; normalization of a raw term
 
(defn term-seq [text] (map normalize (re-seq pattern text)))
 
(defn set-assoc
"Produces map with v added to the set associated with key k in map m"
[m k v] (assoc m k (conj (get m k #{}) v)))
 
(defn index-file [index file]
(with-open [reader (io/reader file)]
(reduce
(fn [idx term] (set-assoc idx term file))
index
(mapcat term-seq (line-seq reader)))))
 
(defn make-index [files]
(reduce index-file {} files))
 
(defn search [index query]
(apply sets/intersection (map index (term-seq query))))
</syntaxhighlight>
 
=={{header|CoffeeScript}}==
<langsyntaxhighlight lang="coffeescript">
fs = require 'fs'
 
Line 805 ⟶ 1,104:
console.log "#{fn}:#{line_num}"
console.log "\n"
 
get_words = (line) ->
words = line.replace(/\W/g, ' ').split ' '
Line 819 ⟶ 1,118:
grep index, 'make_index'
grep index, 'sort'
</syntaxhighlight>
</lang>
output
<syntaxhighlight lang="text">
> coffee inverted_index.coffee
locations for 'make_index':
inverted_index.coffee:3
Line 837 ⟶ 1,136:
inverted_index.coffee:35
knuth_sample.coffee:12
</syntaxhighlight>
</lang>
 
=={{header|Common Lisp}}==
<langsyntaxhighlight Lisplang="lisp">(defpackage rosettacode.inverted-index
(:use cl))
(in-package rosettacode.inverted-index)
Line 875 ⟶ 1,174:
:test #'equal))
 
</syntaxhighlight>
</lang>
Example:
<langsyntaxhighlight lang="lisp">(defparameter *index* (build-index '("file1.txt" "file2.txt" "file3.txt")))
(defparameter *query* "foo bar")
(defparameter *result* (lookup *index* *query*))
(format t "Result for query ~s: ~{~a~^, ~}~%" *query* *result*)</langsyntaxhighlight>
 
=={{header|FactorD}}==
<syntaxhighlight lang="d">import std.stdio, std.algorithm, std.string, std.file, std.regex;
<lang factor>USING: assocs fry io.encodings.utf8 io.files kernel sequences
sets splitting vectors ;
IN: rosettacode.inverted-index
 
void main() {
: file-words ( file -- assoc )
string[][string] index;
utf8 file-contents " ,;:!?.()[]{}\n\r" split harvest ;
: add-to-file-list ( files file -- files )
over [ swap [ adjoin ] keep ] [ nip 1vector ] if ;
: add-to-index ( words index file -- )
'[ _ [ _ add-to-file-list ] change-at ] each ;
: (index-files) ( files index -- )
[ [ [ file-words ] keep ] dip swap add-to-index ] curry each ;
: index-files ( files -- index )
H{ } clone [ (index-files) ] keep ;
: query ( terms index -- files )
[ at ] curry map [ ] [ intersect ] map-reduce ;
</lang>
 
void parseFile(in string fn) {
Example use :
if (!exists(fn) || !isFile(fn))
<lang>( scratchpad ) { "f1" "f2" "f3" } index-files
throw new Exception("File not found");
 
foreach (word; readText(fn).splitter(regex(r"\W"))) {
word = word.toLower();
if (!index.get(word, null).canFind(fn))
index[word] ~= fn;
}
}
 
immutable fileNames = ["inv1.txt", "inv2.txt", "inv3.txt"];
foreach (fName; fileNames)
parseFile(fName);
 
while (true) {
writef("\nEnter a word to search for: (q to quit): ");
immutable w = readln().strip().toLower();
if (w == "q") {
writeln("quitting.");
break;
}
if (w in index)
writefln("'%s' found in%( %).", w, index[w]);
else
writefln("'%s' not found.", w);
}
}</syntaxhighlight>
Both the demo text files and the queries are from the Wikipedia page, they contain:
It is what it is.
 
What is it?
 
It is a banana!
{{out}}
<pre>Enter a word to search for: (q to quit): cat
'cat' not found.
 
Enter a word to search for: (q to quit): is
'is' found in "inv1.txt" "inv2.txt" "inv3.txt".
 
Enter a word to search for: (q to quit): banana
'banana' found in "inv3.txt".
 
Enter a word to search for: (q to quit): it
'it' found in "inv1.txt" "inv2.txt" "inv3.txt".
 
Enter a word to search for: (q to quit): what
'what' found in "inv1.txt" "inv2.txt".
 
Enter a word to search for: (q to quit): q
quitting.</pre>
=={{header|Delphi}}==
{{libheader| System.SysUtils}}
{{libheader| system.Generics.Collections}}
{{libheader| SYstem.IOUtils}}
<syntaxhighlight lang="delphi">
program Inverted_index;
 
{$APPTYPE CONSOLE}
 
uses
System.SysUtils,
system.Generics.Collections,
SYstem.IOUtils;
 
type
TIndex = class
private
FFileNames: TArray<string>;
FIndexs: TDictionary<string, string>;
class function SplitWords(Text: string): TArray<string>; static;
function StoreFileName(FileName: TFileName): string;
procedure StoreLine(Line, Id: string);
procedure StoreWord(aWord, Id: string);
function DecodeFileName(Indexed: string): string;
public
procedure AddFile(FileName: TFileName);
constructor Create;
destructor Destroy; override;
function Search(aWord: string): string;
property Index[aWord: string]: string read Search; default;
end;
 
{ TIndex }
 
constructor TIndex.Create;
begin
FIndexs := TDictionary<string, string>.Create();
SetLength(FFileNames, 0);
end;
 
function TIndex.DecodeFileName(Indexed: string): string;
var
f: string;
i: integer;
begin
Result := Indexed;
if Length(FFileNames) > 0 then
for i := 0 to High(FFileNames) do
begin
f := FFileNames[i];
Result := Result.Replace(format('(%x)', [i]), f);
end;
end;
 
destructor TIndex.Destroy;
begin
FIndexs.Free;
inherited;
end;
 
function TIndex.Search(aWord: string): string;
begin
aWord := AnsiLowerCase(aWord);
if FIndexs.ContainsKey(aWord) then
Result := 'found in ' + DecodeFileName(FIndexs[aWord])
else
Result := 'not found';
end;
 
class function TIndex.SplitWords(Text: string): TArray<string>;
const
WORD_CHARSET: set of char = (['a'..'z', 'A'..'Z', 'á', 'é', 'í', 'ó', 'ú', 'ã',
'õ', 'à', 'è', 'ì', 'ò', 'ù', 'Á', 'É', 'Í', 'Ó', 'Ú', 'Ã', 'Õ', 'À', 'È',
'Ì', 'Ò', 'Ù', 'ä', 'ë', 'ï', 'ö', 'ü', 'ç', 'Ç', '_', '0'..'9']);
 
procedure Append(var value: string);
begin
if not value.IsEmpty then
begin
SetLength(result, length(result) + 1);
result[high(result)] := value;
value := '';
end;
end;
 
var
c: Char;
inWord, isWordChar: boolean;
w: string;
begin
inWord := False;
w := '';
SetLength(result, 0);
for c in Text do
begin
isWordChar := (c in WORD_CHARSET);
if inWord then
if isWordChar then
w := w + c
else
begin
Append(w);
inWord := false;
Continue;
end;
 
if not inWord and isWordChar then
begin
inWord := True;
w := c;
end;
end;
if inWord then
Append(w);
end;
 
function TIndex.StoreFileName(FileName: TFileName): string;
begin
SetLength(FFileNames, length(FFileNames) + 1);
FFileNames[High(FFileNames)] := FileName;
Result := format('"(%x)"', [High(FFileNames)]);
end;
 
procedure TIndex.StoreLine(Line, Id: string);
var
Words: TArray<string>;
w: string;
begin
Words := SplitWords(Line);
for w in Words do
StoreWord(w, Id);
end;
 
procedure TIndex.StoreWord(aWord, Id: string);
begin
aWord := AnsiLowercase(aWord);
if FIndexs.ContainsKey(aWord) then
begin
if (FIndexs[aWord].IndexOf(Id) = -1) then
FIndexs[aWord] := FIndexs[aWord] + ', ' + Id;
end
else
FIndexs.Add(aWord, Id);
end;
 
procedure TIndex.AddFile(FileName: TFileName);
var
Lines: TArray<string>;
Line: string;
Id: string;
begin
if not FileExists(FileName) then
exit;
Lines := TFile.ReadAllLines(FileName);
 
if length(Lines) = 0 then
exit;
 
// Remove this line if you want full filename
FileName := ExtractFileName(FileName);
 
Id := StoreFileName(FileName);
 
for Line in Lines do
StoreLine(Line, Id);
end;
 
var
Index: TIndex;
FileNames: TArray<TFileName> = ['file1.txt', 'file2.txt', 'file3.txt'];
FileName: TFileName;
Querys: TArray<string> = ['Cat', 'is', 'banana', 'it', 'what'];
Query, Found: string;
 
begin
Index := TIndex.Create;
 
for FileName in FileNames do
Index.AddFile(FileName);
 
for Query in Querys do
Writeln(format('"%s" %s'#10, [Query, Index[Query]]));
 
repeat
Writeln('Enter a word to search for: (q to quit)');
Readln(Query);
 
if Query = 'q' then
Break;
 
Writeln(format('"%s" %s'#10, [Query, Index[Query]]));
until False;
 
Index.Free;
end.</syntaxhighlight>
Input: Same of [[#D|D]].
{{out}}
<pre>"Cat" not found
 
"is" found in "file1.txt", "file2.txt", "file3.txt"
 
"banana" found in "file3.txt"
 
"it" found in "file1.txt", "file2.txt", "file3.txt"
 
"what" found in "file1.txt", "file2.txt"
 
Enter a word to search for: (q to quit):
q</pre>
 
=={{header|EchoLisp}}==
 
===Indexing===
Index values are sets associated with each word (key). We use the local-put-value function to permanently store the index, in the browser local storage.
<syntaxhighlight lang="lisp">
;; set of input files
(define FILES {T0.txt T1.txt T2.txt})
;; store name for permanent inverted index
(define INVERT "INVERTED-INDEX")
 
;; get text for each file, and call (action filename text)
(define (map-files action files)
(for ((file files))
(file->string action file)))
 
;; create store
(local-make-store INVERT)
 
; invert-word : word -> set of files
(define (invert-word word file store)
(local-put-value word
(make-set (cons file (local-get-value word store))) store))
 
; parse file text and invert each word
(define (invert-text file text)
(writeln 'Inverting file text)
(let ((text (text-parse text)))
(for ((word text)) (invert-word (string-downcase word) file INVERT))))
</syntaxhighlight>
 
=== Query ===
Intersect sets values of each word.
<syntaxhighlight lang="lisp">
;; usage : (inverted-search w1 w2 ..)
(define-syntax-rule (inverted-search w ...)
(and-get-invert (quote w )))
 
;; intersects all sets referenced by words
;; returns the intersection
(define (and-get-invert words)
(foldl
(lambda(word res)
(set-intersect res (local-get-value word INVERT)))
FILES words))
</syntaxhighlight>
Output :
<syntaxhighlight lang="lisp">
(map-files invert-text FILES)
(inverted-search is it)
[0]→ { T0.txt T1.txt T2.txt }
(inverted-search is banana)
[1]→ { T2.txt }
(inverted-search is what)
[2]→ { T0.txt T1.txt }
(inverted-search boule)
[3]→ null
</syntaxhighlight>
 
=={{header|Erlang}}==
This might be used with a lot of large files so we use binaries to save space. That adds <<>> to the search terms.
If somebody wants to avoid "end." and "end" being two different terms, just add <<".">> to binary:compile_pattern/1
Ditto for any other character.
 
<syntaxhighlight lang="erlang">
-module( inverted_index ).
 
-export( [from_files/1, search/2, task/0] ).
 
from_files( Files ) ->
lists:foldl( fun import_from_file/2, dict:new(), Files ).
 
search( Binaries, Inverted_index ) ->
[Files | T] = [dict:fetch(X, Inverted_index) || X <- Binaries],
lists:foldl( fun search_common/2, Files, T ).
 
task() ->
Files_contents = [{"file_1", <<"it is what it is">>}, {"file_2", <<"what is it">>}, {"file_3", <<"it is a banana">>}],
[file:write_file(X, Y) || {X, Y} <- Files_contents],
Inverted_index = from_files( [X || {X, _Y} <- Files_contents] ),
Result = search( [<<"what">>, <<"is">>, <<"it">>], Inverted_index ),
io:fwrite( "~p~n", [Result] ),
[file:delete(X) || {X, _Y} <- Files_contents].
 
 
 
import_from_file( File, Dict_acc ) ->
New_dict = dict:from_list( import_from_file_contents(File, file:read_file(File)) ),
dict:merge( fun import_from_file_merge/3, Dict_acc, New_dict ).
 
import_from_file_contents( File, {ok, Binary} ) ->
[{X, [File]} || X <- binary:split( Binary, binary:compile_pattern([<<" ">>, <<"\n">>]), [global] )];
import_from_file_contents( File, {error, Error} ) ->
io:fwrite( "Error: could not open file ~p: ~p~nContinuing with the rest of them~n", [File, Error] ),
[].
 
import_from_file_merge( _Key, Files, [New_file] ) -> [New_file | Files].
 
search_common( Files, Acc ) -> [X || X <- Acc, lists:member(X, Files)].
</syntaxhighlight>
 
--- Data stack:
H{ { "a" ~vector~ } { "is" ~vector~ } { "what" ~vector~ } { ...
( scratchpad ) { "what" "is" "it" } swap query .
V{ "f1" "f2" }
</lang>
=={{header|F Sharp|F#}}==
<langsyntaxhighlight lang="fsharp">open System
open System.IO
 
// Map search terms to associated set of files
type searchIndexMap = Map<string, Set<string>>
 
let inputSearchCriteria() =
let readLine prompt =
printf "%s: " prompt
Console.ReadLine().Split()
 
readLine "Files", (readLine "Find") |> Array.map (fun s -> s.ToLower())
 
let updateIndex indexMap keyValuePair =
let k, v = keyValuePair
 
match Map.tryFind k indexMap with
| None -> Map.add k (Set.singleton v) indexMap
| Some set -> Map.add k (Set.add v set) indexMap
 
let buildIndex files =
let fileData file =
File.ReadAllText(file).Split() |> Seq.map (fun word -> word.ToLower(), file)
 
files |> Seq.collect fileData
|> Seq.fold updateIndex Map.empty
 
let searchFiles() =
let files, terms = inputSearchCriteria()
Line 944 ⟶ 1,584:
|> Set.intersectMany
 
printf "Found in: " ; searchResults |> Set.iter (printf "%s ") ; printfn ""</langsyntaxhighlight>
Sample usage:
<pre>
Line 952 ⟶ 1,592:
Find: what is
Found in: file1.txt file2.txt</pre>
 
=={{header|Factor}}==
<syntaxhighlight lang="factor">USING: assocs fry io.encodings.utf8 io.files kernel sequences
sets splitting vectors ;
IN: rosettacode.inverted-index
 
: file-words ( file -- assoc )
utf8 file-contents " ,;:!?.()[]{}\n\r" split harvest ;
: add-to-file-list ( files file -- files )
over [ swap [ adjoin ] keep ] [ nip 1vector ] if ;
: add-to-index ( words index file -- )
'[ _ [ _ add-to-file-list ] change-at ] each ;
: (index-files) ( files index -- )
[ [ [ file-words ] keep ] dip swap add-to-index ] curry each ;
: index-files ( files -- index )
H{ } clone [ (index-files) ] keep ;
: query ( terms index -- files )
[ at ] curry map [ ] [ intersect ] map-reduce ;
</syntaxhighlight>
 
Example use :
<syntaxhighlight lang="text">( scratchpad ) { "f1" "f2" "f3" } index-files
 
--- Data stack:
H{ { "a" ~vector~ } { "is" ~vector~ } { "what" ~vector~ } { ...
( scratchpad ) { "what" "is" "it" } swap query .
V{ "f1" "f2" }
</syntaxhighlight>
 
=={{header|Go}}==
<langsyntaxhighlight lang="go">package main
 
import (
Line 1,062 ⟶ 1,730:
}
return true
}
 
func ui() {
fmt.Println(len(index), "words indexed in", len(indexed), "files")
Line 1,081 ⟶ 1,749:
fmt.Println("one match:")
fmt.Println(" ", indexed[dl[0]].file, indexed[dl[0]].title)
default:
fmt.Println(len(dl), "matches:")
for _, d := range dl {
Line 1,088 ⟶ 1,756:
}
}
}</langsyntaxhighlight>
Session:
<pre>
Line 1,111 ⟶ 1,779:
=={{header|Haskell}}==
 
<langsyntaxhighlight lang="haskell">import Control.Monad
import Data.Char (isAlpha, toLower)
import qualified Data.Map as M
Line 1,142 ⟶ 1,810:
intersections xs = foldl1 S.intersection xs
 
lowercase = map toLower</langsyntaxhighlight>
 
An example of use, assuming the program is named <code>iindex</code> and there exist files <code>t0</code>, <code>t1</code>, and <code>t2</code> with contents "It is what it is.", "What is it?", and "It is a banana.":
Line 1,153 ⟶ 1,821:
 
The following implements a simple case insensitive inverse index using lists simulating texts.
<langsyntaxhighlight Iconlang="icon">procedure main()
 
texts := table() # substitute for read and parse files
Line 1,161 ⟶ 1,829:
 
every textname := key(texts) do # build index for each 'text'
SII := InvertedIndex(SII,textname,texts[textname])
 
TermSearchUI(SII) # search UI
 
end
 
Line 1,182 ⟶ 1,850:
repeat {
writes("Enter search terms (^z to quit) : ")
terms := map(trim(read() | break))
 
x := []
terms ? while not pos(0) do {
tab(many(' \t'))
put(x,tab(upto('\ \t')|0))
}
 
show("Searching for : ",x)
show("Found in : ",s := TermSearch(ii,x)) | show("Not found : ",x)
}
write("End of search")
Line 1,198 ⟶ 1,866:
 
procedure TermSearch(ii,x) #: return set of matches or fail
every s := !x do
( /u := ii[s] ) | (u **:= ii[s])
if *u > 0 then return u
Line 1,206 ⟶ 1,874:
every writes(s|!x) do writes(" ")
write()
return
end</langsyntaxhighlight>
 
Output:<pre>Enter search terms (^z to quit) : is it
Line 1,223 ⟶ 1,891:
 
The following code will build a full index. Modification of search routines is left as an exercise:
<langsyntaxhighlight Uniconlang="unicon">record InvertedIndexRec(simple,full)
 
procedure FullInvertedIndex(ii,k,words) #: accumulate a full inverted index
Line 1,241 ⟶ 1,909:
 
return ii
end</langsyntaxhighlight>
 
=={{header|J}}==
Line 1,247 ⟶ 1,915:
This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.
 
<langsyntaxhighlight Jlang="j">require'files regex strings'
 
rxutf8 0 NB. support latin1 searches for this example, instead of utf8
Line 1,270 ⟶ 1,938:
hits=. buckets{~words i.~.parse tolower y
files {~ >([-.-.)each/hits
)</langsyntaxhighlight>
 
Example use:
 
<langsyntaxhighlight Jlang="j"> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'
>search 'finally learning'
~help/primer/end.htm
Line 1,282 ⟶ 1,950:
~help/primer/gui.htm
>search 'around'
~help/primer/gui.htm</langsyntaxhighlight>
 
=={{header|Java}}==
<syntaxhighlight lang="java">
<lang Java>
package org.rosettacode;
 
Line 1,302 ⟶ 1,971:
public class InvertedIndex {
 
List<String> stopwords = Arrays.asList("a", "able", "about",
"across", "after", "all", "almost", "also", "am", "among", "an",
"and", "any", "are", "as", "at", "be", "because", "been", "but",
"by", "can", "cannot", "could", "dear", "did", "do", "does",
"either", "else", "ever", "every", "for", "from", "get", "got",
"had", "has", "have", "he", "her", "hers", "him", "his", "how",
"however", "i", "if", "in", "into", "is", "it", "its", "just",
"least", "let", "like", "likely", "may", "me", "might", "most",
"must", "my", "neither", "no", "nor", "not", "of", "off", "often",
"on", "only", "or", "other", "our", "own", "rather", "said", "say",
"says", "she", "should", "since", "so", "some", "than", "that",
"the", "their", "them", "then", "there", "these", "they", "this",
"tis", "to", "too", "twas", "us", "wants", "was", "we", "were",
"what", "when", "where", "which", "while", "who", "whom", "why",
"will", "with", "would", "yet", "you", "your");
 
Map<String, List<Tuple>> index = new HashMap<String, List<Tuple>>();
List<String> files = new ArrayList<String>();
 
public void indexFile(File file) throws IOException {
int fileno = files.indexOf(file.getPath());
if (fileno == -1) {
files.add(file.getPath());
fileno = files.size() - 1;
}
}
 
int pos = 0;
BufferedReader reader = new BufferedReader(new FileReader(file));
for (String line = reader.readLine(); line != null; line = reader
.readLine()) {
for (String _word : line.split("\\W+")) {
String word = _word.toLowerCase();
pos++;
if (stopwords.contains(word))
continue;
List<Tuple> idx = index.get(word);
if (idx == null) {
idx = new LinkedList<Tuple>();
index.put(word, idx);
}
}
idx.add(new Tuple(fileno, pos));
}
}
}
}
System.out.println("indexed " + file.getPath() + " " + pos + " words");
}
}
 
public void search(List<String> words) {
for (String _word : words) {
Set<String> answer = new HashSet<String>();
String word = _word.toLowerCase();
List<Tuple> idx = index.get(word);
if (idx != null) {
for (Tuple t : idx) {
answer.add(files.get(t.fileno));
}
}
}
}
System.out.print(word);
for (String f : answer) {
System.out.print(" " + f);
}
}
System.out.println("");
}
}
}
}
 
public static void main(String[] args) {
try {
InvertedIndex idx = new InvertedIndex();
for (int i = 1; i < args.length; i++) {
idx.indexFile(new File(args[i]));
}
}
idx.search(Arrays.asList(args[0].split(",")));
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
 
private class Tuple {
private int fileno;
private int position;
 
public Tuple(int fileno, int position) {
this.fileno = fileno;
this.position = position;
}
}
}
}
}
 
</syntaxhighlight>
</lang>
 
Example output:
<syntaxhighlight lang="java">
<lang Java>
java -cp bin org.rosettacode.InvertedIndex "huntsman,merit,dog,the,gutenberg,lovecraft,olympian" pg30637.txt pg7025.txt pg82.txt pg9090.txt
indexed pg30637.txt 106473 words
indexed pg7025.txt 205714 words
Line 1,406 ⟶ 2,075:
olympian pg30637.txt
 
</syntaxhighlight>
</lang>
 
=={{header|jq}}==
 
In the first part of this section, the core functions for computing an inverted index and for searching it are presented.
These functions will work with jq 1.4 as well as later (and possibly earlier) versions.
 
The second section shows how to accomplish the interactive task using a version of jq with support for 'input' and 'input_filename' (e.g. jq 1.5).
 
'''Part 1: inverted_index and search'''
<syntaxhighlight lang="jq"># Given an array of [ doc, array_of_distinct_words ]
# construct a lookup table: { word: array_of_docs }
def inverted_index:
reduce .[] as $pair
({};
$pair[0] as $doc
| reduce $pair[1][] as $word
(.; .[$word] += [$doc]));
 
def search(words):
def overlap(that): . as $this
| reduce that[] as $item ([]; if $this|index($item) then . + [$item] else . end);
 
. as $dict
| if (words|length) == 0 then []
else reduce words[1:][] as $word
( $dict[words[0]]; overlap( $dict[$word] ) )
end ; </syntaxhighlight>
 
'''Part 2: Interactive Search'''
 
In this section, a solution to the task is presented using two
invocations of jq: one parses the input files, and the other does
everything else. If your shell does not support <(...) then you
could create a temporary file to hold the parsed output.
 
<syntaxhighlight lang="jq">def prompt_search:
"Enter a string or an array of strings to search for, quoting each string, or 0 to exit:",
( (input | if type == "array" then . elif type == "string" then [.]
else empty
end) as $in
| search($in), prompt_search ) ;
 
$in | inverted_index | prompt_search</syntaxhighlight>
 
'''Example''':
<syntaxhighlight lang="sh">$ jq -r -c -n --argfile in <(jq -R 'split(" ") | select(length>0) | [input_filename, unique]' T?.txt) -f Inverted_index.jq
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
"is"
["T0.txt","T1.txt","T2.txt"]
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
["is", "banana"]
["T2.txt"]
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
0
$</syntaxhighlight>
 
=={{header|Julia}}==
<syntaxhighlight lang="julia">function makedoubleindex(files)
idx = Dict{String, Dict}()
for file in files
str = lowercase(read(file, String))
words = split(str, r"\s+")
for word in words
if !haskey(idx, word)
idx[word] = Dict{String, Int}()
elseif !haskey(idx[word], file)
(idx[word])[file] = 1
else
(idx[word])[file] += 1
end
end
end
idx
end
 
function wordsearch(dict, words::Vector{String})
for word in words
if haskey(dict, word)
for (f, n) in dict[word]
println("File $f contains $n instances of <$word>.")
end
else
println("No instances of \"$word\" were found.")
end
end
end
wordsearch(dict, word::String) = wordsearch(dict, [word])
 
 
const filenames = ["file1.txt", "file2.txt", "file3.txt"]
const didx = makedoubleindex(filenames)
const searchterms = ["forehead", "of", "hand", "a", "foot"]
wordsearch(didx, searchterms)
</syntaxhighlight>
 
=={{header|Kotlin}}==
<syntaxhighlight lang="scala">// version 1.1.51
 
import java.io.File
 
val invIndex = mutableMapOf<String, MutableList<Location>>()
val fileNames = mutableListOf<String>()
val splitter = Regex("""\W+""")
 
class Location(val fileName: String, val wordNum: Int) {
override fun toString() = "{$fileName, word number $wordNum}"
}
 
fun indexFile(fileName: String) {
if (fileName in fileNames) {
println("'$fileName' already indexed")
return
}
fileNames.add(fileName)
File(fileName).forEachLine { line ->
for ((i, w) in line.toLowerCase().split(splitter).withIndex()) {
var locations = invIndex[w]
if (locations == null) {
locations = mutableListOf<Location>()
invIndex.put(w, locations)
}
locations.add(Location(fileName, i + 1))
}
}
println("'$fileName' has been indexed")
}
 
fun findWord(word: String) {
val w = word.toLowerCase()
val locations = invIndex[w]
if (locations != null) {
println("\n'$word' found in the following locations:")
println(locations.map { " $it" }.joinToString("\n"))
}
else println("\n'$word' not found")
println()
}
 
fun main(args: Array<String>) {
// files to be indexed entered as command line arguments
if (args.size == 0) {
println("No file names have been supplied")
return
}
for (arg in args) indexFile(arg)
println()
println("Enter word(s) to be searched for in these files or 'q' to quit")
while (true) {
print(" ? : ")
val word = readLine()!!
if (word.toLowerCase() == "q") return
findWord(word)
}
}</syntaxhighlight>
 
Contents of files:
<pre>
inv1.txt -> It is what it is.
inv2.txt -> What is it?
inv3.txt -> It is a banana!
</pre>
 
Sample output:
<pre>$ java -jar inverted_index.jar inv1.txt inv2.txt inv3.txt inv1.txt
'inv1.txt' has been indexed
'inv2.txt' has been indexed
'inv3.txt' has been indexed
'inv1.txt' already indexed
 
Enter word(s) to be searched for in these files or 'q' to quit
? : cat
 
'cat' not found
 
? : is
 
'is' found in the following locations:
{inv1.txt, word number 2}
{inv1.txt, word number 5}
{inv2.txt, word number 2}
{inv3.txt, word number 2}
 
? : banana
 
'banana' found in the following locations:
{inv3.txt, word number 4}
 
? : it
 
'it' found in the following locations:
{inv1.txt, word number 1}
{inv1.txt, word number 4}
{inv2.txt, word number 3}
{inv3.txt, word number 1}
 
? : what
 
'what' found in the following locations:
{inv1.txt, word number 3}
{inv2.txt, word number 1}
 
? : a
 
'a' found in the following locations:
{inv3.txt, word number 3}
 
? : q
</pre>
 
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">si = CreateSearchIndex["ExampleData/Text", Method -> "TFIDF"];
Manipulate[Grid[sr = TextSearch[si, ToString[s]];
{FileBaseName /@ Normal[Dataset[sr][All, "ReferenceLocation"]],
Column[#, Frame -> All] & /@ sr[All, "Snippet"]} // Transpose,
Frame -> All], {s, "tree"}]</syntaxhighlight>
{{out}}
Outputs a graphical user interface with an interactive input field showing filename and the snippet of text that includes the search string.
 
=={{header|Nim}}==
We have used as examples three files containing the text from Wikipedia article. We used here an index which keep document name and line number. It would be easy to add the position in the line.
 
<syntaxhighlight lang="nim">import os, strformat, strutils, tables
 
type
 
WordRef = tuple[docnum, linenum: int]
 
InvertedIndex = object
docs: seq[string]
index: Table[string, seq[WordRef]]
 
SearchResult = object
total: int
refs: seq[tuple[docname: string; linenums: seq[int]]]
 
IndexingError = object of CatchableError
 
const
 
# Characters composing a word (letters + "'").
WordChars = Letters + {'\''}
 
# Words to ignore.
StopWords = ["about", "above", "after", "again", "against", "all", "am", "an", "and", "any",
"are", "aren't", "as", "at", "be", "because", "been", "before", "being", "below",
"between", "both", "but", "by", "can't", "cannot", "could", "couldn't", "did",
"didn't", "do", "does", "doesn't", "doing", "don't", "down", "during", "each",
"few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have",
"haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's",
"hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll",
"i'm", "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself",
"let's", "me", "more", "most", "mustn't", "my", "myself", "no", "nor", "not",
"of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours",
"ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll",
"she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's",
"the", "their", "theirs", "them", "themselves", "then", "there", "there's",
"these", "they", "they'd", "they'll", "they're", "they've", "this", "those",
"through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we",
"we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when",
"when's", "where", "where's", "which", "while", "who", "who's", "whom", "why",
"why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're",
"you've", "your", "yours", "yourself", "yourselves"]
 
 
template plural(n: int): string = (if n > 1: "s" else: "")
 
 
proc add(invIndex: var InvertedIndex; docPath: string) =
## Add a document to the index.
 
let docName = docPath.extractFilename()
if docName in invIndex.docs:
raise newException(IndexingError, &"{docName}: a document with this name is alreadry indexed.")
invIndex.docs.add docName
let docIndex = invIndex.docs.high
 
var linenum = 0
var count = 0
 
try:
for line in docPath.lines:
inc linenum
var word = ""
for ch in line:
if ch in WordChars:
word.add ch.toLowerAscii()
else:
if word.len > 1 and word notin StopWords:
invIndex.index.mgetOrPut(word, newSeq[WordRef]()).add (docIndex, linenum)
inc count
word = ""
 
except IOError:
raise newException(IndexingError, &"{docName}: error while reading file.")
 
if count > 0:
echo &"{docName}: {count} word{plural(count)} indexed."
else:
raise newException(IndexingError, &"{docName}: nothing to index.")
 
 
func status(invIndex: InvertedIndex): string =
## Display information about the inverted index status.
let words = invIndex.index.len
let docs = invIndex.docs.len
&"Index contains {words} word{plural(words)} from {docs} document{plural(docs)}."
 
 
proc search(invIndex: InvertedIndex; word: string): SearchResult =
## Search a word in the inverted index.
## The references are grouped by document.
 
let refs = invIndex.index.getOrDefault(word)
if refs.len == 0: return
 
result.total = refs.len
var prevdoc = ""
var prevline = 0
for (docnum, linenum) in refs:
let docname = invIndex.docs[docnum]
if docname == prevDoc:
if linenum != prevline:
result.refs[^1].linenums.add(linenum)
prevline = linenum
else:
result.refs.add((docname, @[linenum]))
prevdoc = docname
prevline = linenum
 
 
#———————————————————————————————————————————————————————————————————————————————————————————————————
 
var invertedIndex: InvertedIndex
 
if paramCount() != 1 or not paramStr(1).dirExists():
quit &"Usage: {getAppFileName().lastPathPart} directory"
 
# Index the documents.
for doc in walkFiles(paramStr(1) / "*.txt"):
try:
invertedIndex.add(doc)
except IndexingError:
echo getCurrentExceptionMsg()
 
echo invertedIndex.status()
echo ""
 
# Enter search interface.
var word: string
while true:
try:
stdout.write "Word to search? "
word = stdin.readLine().toLowerAscii().strip()
if word == "q":
echo "Quitting"
break
if not word.allCharsInSet(WordChars):
echo "This word contains invalid characters."
continue
except EOFError:
echo ""
echo "Quitting"
break
 
# Search word.
let result = invertedIndex.search(word)
if result.refs.len == 0:
echo "No reference found for this word."
continue
 
# Display the references.
echo &"Found {result.total} reference{plural(result.total)} to “{word}”:"
for (docname, linenums) in result.refs:
stdout.write &"... in “{docName}”, line{plural(linenums.len)}"
for num in linenums: stdout.write ' ', num
echo ""</syntaxhighlight>
 
{{out}}
Example of session.
<pre>./inverted_index texts
a.txt: 129 words indexed.
b.txt: 183 words indexed.
c.txt: 16 words indexed.
Index contains 197 words from 3 documents.
 
Word to search? inverted
Found 23 references to “inverted”:
... in “a.txt”, lines 1 3 5
... in “b.txt”, lines 1 3 5 7
... in “c.txt”, line 1
Word to search? q
Quitting</pre>
 
=={{header|OCaml}}==
Line 1,418 ⟶ 2,479:
unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma \
inv.ml
 
ocamlc -o inv.byte unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma inv.cmo
 
<langsyntaxhighlight lang="ocaml">TYPE_CONV_PATH "Inverted_index"
 
type files = string array with sexp
Line 1,460 ⟶ 2,521:
List.iter (fun (w, from) -> Hashtbl.replace h w from) inv_index;
List.iter (fun w ->
iflet Hashtbl.memfrom h w=
try Hashtbl.find h w
then begin
except Not_found -> []
let from = Hashtbl.find h w in
in
Hashtbl.replace h w (i::from)
Hashtbl.replace h w (i::from)
end else
Hashtbl.add h w [i]
) words;
Hashtbl.fold (fun w from acc -> (w, from) :: acc) h []
Line 1,502 ⟶ 2,562:
| "--index-file", in_file -> index_file in_file
| "--search-word", word -> search_word word
| _ -> usage()</langsyntaxhighlight>
 
 
=={{header|Perl}}==
<langsyntaxhighlight lang="perl">use Set::Object 'set';
 
# given an array of files, returns the index
Line 1,517 ⟶ 2,576:
foreach my $file (@files)
{
open(F, "<", $file) or die "Can't read file $file: $!";
while(<F>) {
s/\A\W+//;
foreach my $w (map {lc} grep {length() >= 3} split /\W+/)
{
if ( exists($iindex{$w}) )
{
{
$iindex{$w}->insert($file);
} else {
$iindex{$w} = set($file);
}
}
}
}
}
close(F);
}
return %iindex;
Line 1,541 ⟶ 2,600:
my @words = @_;
my $res = set();
 
foreach my $w (map {lc} @words)
{
$w =~ s/\W+//g; # strip non-words chars
length $w < 3 and next;
exists $idx{$w} or return set();
Line 1,559 ⟶ 2,618:
# first arg is a comma-separated list of words to search for
print "$_\n"
foreach search_words_with_index({createindex(@ARGV)}, @searchwords);</langsyntaxhighlight>
=={{header|Perl 6}}==
<lang perl6>sub MAIN (*@files) {
(my %norm).push: do for @files -> $file {
$file X=> slurp($file).lc.words;
}
(my %inv).push: %norm.invert.uniq;
 
=={{header|Phix}}==
while prompt("Search terms: ").words -> @words {
Loads all text files in demo\rosetta\ and builds a list of filenames and
for @words -> $word {
a dictionary of {word,file_indexes}, before a handful of quick tests.<br>
say "$word => %inv.{$word.lc}";
Might be better (and almost as easy) for the dictionary values to be say
{total_count, {file nos}, {file counts}}.
<!--<syntaxhighlight lang="phix">(notonline)-->
<span style="color: #000080;font-style:italic;">-- demo\rosetta\Inverted_index.exw</span>
<span style="color: #008080;">without</span> <span style="color: #008080;">js</span> <span style="color: #000080;font-style:italic;">-- (file i/o)</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">word_count</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">0</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">filenames</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span>
<span style="color: #008080;">function</span> <span style="color: #000000;">is_ascii</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">ch</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">txt</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">ch</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'\0'</span> <span style="color: #008080;">or</span> <span style="color: #000000;">ch</span><span style="color: #0000FF;">>=</span><span style="color: #000000;">#7F</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #000000;">0</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">return</span> <span style="color: #000000;">1</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<span style="color: #008080;">procedure</span> <span style="color: #000000;">add_words</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">name</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">)</span>
<span style="color: #000000;">filenames</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">,</span><span style="color: #000000;">name</span><span style="color: #0000FF;">)</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">word</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">word</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]>=</span><span style="color: #008000;">'a'</span> <span style="color: #000080;font-style:italic;">-- skip numbers</span>
<span style="color: #008080;">and</span> <span style="color: #000000;">word</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]<=</span><span style="color: #008000;">'z'</span> <span style="color: #008080;">then</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">node</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">node</span><span style="color: #0000FF;">=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- not present</span>
<span style="color: #7060A8;">setd</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">})</span>
<span style="color: #000000;">word_count</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span>
<span style="color: #008080;">else</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">val</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_by_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">node</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">,</span><span style="color: #000000;">val</span><span style="color: #0000FF;">)=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span>
<span style="color: #7060A8;">setd</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">,</span><span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">val</span><span style="color: #0000FF;">,</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">))</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span>
<span style="color: #008080;">procedure</span> <span style="color: #000000;">load_directory</span><span style="color: #0000FF;">()</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">d</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">dir</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"."</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">d</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #008080;">if</span> <span style="color: #008080;">not</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #008000;">'d'</span><span style="color: #0000FF;">,</span><span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_ATTRIBUTES</span><span style="color: #0000FF;">])</span> <span style="color: #000080;font-style:italic;">-- skip directories</span>
<span style="color: #008080;">and</span> <span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_SIZE</span><span style="color: #0000FF;">]<</span><span style="color: #000000;">1024</span><span style="color: #0000FF;">*</span><span style="color: #000000;">1024</span><span style="color: #0000FF;">*</span><span style="color: #000000;">1024</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- and files &gt; 1GB</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">name</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_NAME</span><span style="color: #0000FF;">]</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">open</span><span style="color: #0000FF;">(</span><span style="color: #000000;">name</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"rb"</span><span style="color: #0000FF;">)</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">txt</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">lower</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">get_text</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">))</span>
<span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">is_ascii</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- skip any bitmaps etc</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">split_any</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\0\r\n\t !\"#$%&\'()*+,-./:;&lt;=&gt;?@[\\]^`{|}~"</span><span style="color: #0000FF;">)</span>
<span style="color: #000000;">add_words</span><span style="color: #0000FF;">(</span><span style="color: #000000;">name</span><span style="color: #0000FF;">,</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span>
<span style="color: #008080;">function</span> <span style="color: #000000;">lookup</span><span style="color: #0000FF;">(</span><span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">)</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">files</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span> <span style="color: #000080;font-style:italic;">-- indexes to filenames</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">word</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">node</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">node</span><span style="color: #0000FF;">=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">val</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_by_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">node</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">files</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">val</span>
<span style="color: #008080;">else</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">j</span><span style="color: #0000FF;">=</span><span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">to</span> <span style="color: #000000;">1</span> <span style="color: #008080;">by</span> <span style="color: #0000FF;">-</span><span style="color: #000000;">1</span> <span style="color: #008080;">do</span>
<span style="color: #008080;">if</span> <span style="color: #008080;">not</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">j</span><span style="color: #0000FF;">],</span><span style="color: #000000;">val</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">j</span><span style="color: #0000FF;">..</span><span style="color: #000000;">j</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">if</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">filenames</span><span style="color: #0000FF;">[</span><span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]]</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">return</span> <span style="color: #000000;">files</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<span style="color: #000000;">load_directory</span><span style="color: #0000FF;">()</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">word_count</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"load_directory"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should only be this file</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently two use this</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently four use this</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently two use both</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be the same two</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"ban"</span><span style="color: #0000FF;">&</span><span style="color: #008000;">"anafish"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be none ({})</span>
<!--</syntaxhighlight>-->
{{out}}
Note the distributed version has been modified and is occasionally used for real, so the output will likely differ.
<pre>
3365
{"Inverted_index.exw"}
{"Inverted_index.exw","viewppm.exw"}
{"AlmostPrime.exw","Inverted_index.exw","RockPaperScissors.exw","viewppm.exw"}
{"Inverted_index.exw","viewppm.exw"}
{"Inverted_index.exw","viewppm.exw"}
{}
</pre>
 
=={{header|PHP}}==
 
<syntaxhighlight lang="php"><?php
 
function buildInvertedIndex($filenames)
{
$invertedIndex = [];
 
foreach($filenames as $filename)
{
$data = file_get_contents($filename);
 
if($data === false) die('Unable to read file: ' . $filename);
 
preg_match_all('/(\w+)/', $data, $matches, PREG_SET_ORDER);
 
foreach($matches as $match)
{
$word = strtolower($match[0]);
 
if(!array_key_exists($word, $invertedIndex)) $invertedIndex[$word] = [];
if(!in_array($filename, $invertedIndex[$word], true)) $invertedIndex[$word][] = $filename;
}
}
 
}</lang>
return $invertedIndex;
}
 
function lookupWord($invertedIndex, $word)
{
return array_key_exists($word, $invertedIndex) ? $invertedIndex[$word] : false;
}
 
$invertedIndex = buildInvertedIndex2(['file1.txt', 'file2.txt', 'file3.txt']);
 
foreach(['cat', 'is', 'banana', 'it'] as $word)
{
$matches = lookupWord($invertedIndex, $word);
 
if($matches !== false)
{
echo "Found the word \"$word\" in the following files: " . implode(', ', $matches) . "\n";
}
else
{
echo "Unable to find the word \"$word\" in the index\n";
}
}</syntaxhighlight>
 
=={{header|PicoLisp}}==
Line 1,585 ⟶ 2,782:
it is a banana</pre>
we can read them into a binary tree in the global variable '*MyIndex'
<langsyntaxhighlight PicoLisplang="picolisp">(off *MyIndex)
 
(use Word
Line 1,599 ⟶ 2,796:
(extract
'((Word) (val (car (idx '*MyIndex Word))))
(rest) ) ) )</langsyntaxhighlight>
Output:
<pre>: (searchFor "what" "is" "it")
Line 1,609 ⟶ 2,806:
: (searchFor "it" "is")
-> ("file3" "file2" "file1")</pre>
 
=={{header|PowerShell}}==
{{works with|PowerShell|2}}
<syntaxhighlight lang="powershell">function Index-File ( [string[]]$FileList )
{
# Create index hashtable, as needed
If ( -not $Script:WordIndex ) { $Script:WordIndex = @{} }
 
# For each file to be indexed...
ForEach ( $File in $FileList )
{
# Find any previously existing entries for this file
$ExistingEntries = $Script:WordIndex.Keys | Where { $Script:WordIndex[$_] -contains $File }
 
# For any previously existing entries
# Delete them (prior to reindexing the file)
ForEach ( $Key in $ExistingEntries )
{
$Script:WordIndex[$Key] = @( $Script:WordIndex[$Key] | Where { $_ -ne $File } )
}
 
# Get the contents of the file, split on non-alphanumeric characters, and remove duplicates
$Words = ( Get-Content $File ) -split '[^a-zA-Z\d]' | Sort -Unique
 
# For each word in the file...
ForEach ( $Word in $Words )
{
# If the entry for the word already exists...
If ( $Script:WordIndex[$Word] )
{
# Add the file name to the entry
$Script:WordIndex[$Word] += $File
}
Else
{
# Create a new entry
$Script:WordIndex[$Word] = @( $File )
}
}
}
}
 
function Find-Word ( [string]$Word )
{
return $WordIndex[$Word]
}</syntaxhighlight>
<syntaxhighlight lang="powershell"># Populate test files
@'
Files full of
various words.
'@ | Out-File -FilePath C:\Test\File1.txt
 
@'
Create an index
of words.
'@ | Out-File -FilePath C:\Test\File2.txt
 
@'
Use the index
to find the files.
'@ | Out-File -FilePath C:\Test\File3.txt</syntaxhighlight>
<syntaxhighlight lang="powershell"># Index files
Index-File C:\Test\File1.txt, C:\Test\File2.txt, C:\Test\File3.txt</syntaxhighlight>
Because PowerShell is a shell language, it is "a user interface to do a search". After running the script defining the functions and running a command to index the files, the user can simply run the search function at the PowerShell command prompt.
Alternatively, one could create a more complex custom UI or GUI if desired.
<syntaxhighlight lang="powershell"># Query index
Find-Word files</syntaxhighlight>
{{out}}
<pre>C:\Test\File1.txt
C:\Test\File3.txt</pre>
 
=={{header|Python}}==
Line 1,615 ⟶ 2,882:
 
First the simple inverted index from [[wp:Inverted index|here]] together with an implementation of a search for (multiple) terms from that index.
<langsyntaxhighlight lang="python">'''
This implements: http://en.wikipedia.org/wiki/Inverted_index of 28/07/10
'''
Line 1,655 ⟶ 2,922:
terms = ["what", "is", "it"]
print('\nTerm Search for: ' + repr(terms))
pp(sorted(termsearch(terms)))</langsyntaxhighlight>
 
'''Sample Output'''
Line 1,683 ⟶ 2,950:
 
<small>It is assumed that the following code is added to the end of the code for the simple case above and so shares its file opening and parsing results</small>
<langsyntaxhighlight lang="python">from collections import Counter
 
 
Line 1,734 ⟶ 3,001:
print(ans)
ans = Counter(ans)
print(' The phrase is found most commonly in text: ' + repr(ans.most_common(1)[0][0]))</langsyntaxhighlight>
 
'''Sample Output'''
Line 1,753 ⟶ 3,020:
['T0.txt', 'T0.txt', 'T2.txt']
The phrase is found most commonly in text: 'T0.txt'</pre>
 
=={{header|Racket}}==
<syntaxhighlight lang="racket">
#!/usr/bin/env racket
#lang racket
(command-line
#:args (term . files)
(define rindex (make-hasheq))
(for ([file files])
(call-with-input-file file
(λ(in) (let loop ()
(define w (regexp-match #px"\\w+" in))
(when w
(let* ([w (bytes->string/utf-8 (car w))]
[w (string->symbol (string-foldcase w))]
[r (hash-ref rindex w '())])
(unless (member file r) (hash-set! rindex w (cons file r)))
(loop)))))))
(define res
(for/list ([w (regexp-match* #px"\\w+" term)])
(list->set (hash-ref rindex (string->symbol (string-foldcase w)) '()))))
(define all (set->list (apply set-intersect res)))
(if (null? all)
(printf "No matching files.\n")
(printf "Terms found at: ~a.\n" (string-join all ", "))))
</syntaxhighlight>
{{out}}
<pre>
$ echo "It is what it is." > F1
$ echo "What is it?" > F2
$ echo "It is a banana." > F3
$ ./search.rkt "what" F?
Terms found at: F1, F2.
$ ./search.rkt "a" F?
Terms found at: F3.
$ ./search.rkt "what a" F?
No matching files.
$ ./search.rkt "what is it" F?
Terms found at: F1, F2.
</pre>
 
=={{header|Raku}}==
(formerly Perl 6)
{{works with|rakudo|2015-09-16}}
<syntaxhighlight lang="raku" line>sub MAIN (*@files) {
my %norm;
do for @files -> $file {
%norm.push: $file X=> slurp($file).lc.words;
}
(my %inv).push: %norm.invert.unique;
 
while prompt("Search terms: ").words -> @words {
for @words -> $word {
say "$word => {%inv.{$word.lc}//'(not found)'}";
}
}
}</syntaxhighlight>
 
=={{header|REXX}}==
Note: In this algorithm, word indices start at 1.
<lang rexx>/*REXX program illustrates building a simple inverted index & word find.*/
 
Note: &nbsp; the Burma Shave signs were created from &nbsp; 1930 ──► 1951 &nbsp; and were common among the rural byways of America.
@.='' /*dictionary of words (so far).*/
!='' /*a list of found words (so far).*/
 
To see more about Burma Shave signs, see the Wikipedia entry: &nbsp; [http://en.wikipedia.org/wiki/Burma-Shave Burma Shave signs.]
call invertI 0,'BURMA0.TXT' /*read file 0 ... */
<syntaxhighlight lang="rexx">/*REXX program illustrates building a simple inverted index and a method of word find.*/
call invertI 1,'BURMA1.TXT' /* " " 1 ... */
call@.= invertI 2,'BURMA2.TXT' /* " " 2 ... /*a dictionary of words (so far). */
call!= invertI 3,'BURMA3.TXT' /* " " 3 ... /*a list of found words (so far). */
call invertI 40, 'BURMA4BURMA0.TXT' /* " " 4 ... /*read the file: BURMA0.TXT ··· */
call invertI 51, 'BURMA5BURMA1.TXT' /* " " 5 .../* " " " BURMA1.TXT ··· */
call invertI 62, 'BURMA6BURMA2.TXT' /* " " 6 .../* " " " BURMA2.TXT ··· */
call invertI 73, 'BURMA7BURMA3.TXT' /* " " 7 .../* " " " BURMA3.TXT ··· */
call invertI 84, 'BURMA8BURMA4.TXT' /* " " 8 .../* " " " BURMA4.TXT ··· */
call invertI 95, 'BURMA9BURMA5.TXT' /* " " 9 .../* " " " BURMA5.TXT ··· */
call invertI 6, 'BURMA6.TXT' /* " " " BURMA6.TXT ··· */
call invertI 7, 'BURMA7.TXT' /* " " " BURMA7.TXT ··· */
call invertI 8, 'BURMA8.TXT' /* " " " BURMA8.TXT ··· */
call invertI 9, 'BURMA9.TXT' /* " " " BURMA9.TXT ··· */
call findAword "huz" /*find a word. */
call findAword "60" /*find another word. */
call findAword "don't" /*and find another word. */
call findAword "burma-shave" /*and find yet another word. */
exit /*stick a fork in it, we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
findAword: procedure expose @.; arg x /*get an uppercase version of the X arg*/
parse arg ox /*get original (as-is) value of X arg.*/
_=@.x; oxo='───'ox"───"
if _=='' then do
say 'word' oxo "not found."
return 0
end
_@=_ /*save _ text, pass it back to invoker.*/
say 'word' oxo "found in:"
do until _==''; parse var _ f w _
say ' file='f " word="w
end /*until ··· */
return _@
/*──────────────────────────────────────────────────────────────────────────────────────*/
invertI: procedure expose @. !; parse arg #,fn /*the file number and the filename. */
call lineout fn /*close the file, ··· just in case. */
w=0 /*the number of words found (so far). */
do while lines(fn)\==0 /* [↓] process the entire file. */
_=space( linein(fn) ) /*read a line, elide superfluous blanks*/
if _=='' then iterate /*if a blank record, then ignore it. */
say 'file' #", record:" _ /*display the record ──► terminal. */
 
call findAword 'does' do /*finduntil a_=='' word. /*pick off words from record until done*/
call findAword '60' parse upper var /*find another_ word. ? _ /*pick off a word (it's in uppercase).*/
call findAword "don't" ?=stripper(?) /*and find another word. /*strip any trailing punctuation. */
call findAword "burma-shave" /*and find yet another if ?='' then iterate /*is the word. now all blank (or null)? */
exit w=w+1 /*enoughbump ofthe this,word I'mcounter tired(index). */
@.?=@.? # w /*append the new word to a list. */
/*─────────────────────────────────────FINDAWORD subroutine─────────────*/
if wordpos(?,!)==0 then !=! ? /*add it to the list of words found. */
findAword: procedure expose @. /*get A word, and uppercase it. */
parse arg ox; arg x end /*OX= word; X=until uppercase··· version*/
end /*while ··· */
_=@.x
say; call lineout fn /*close the file, just to be neat&safe.*/
oxo='───'ox"───"
return w /*return the index of word in record. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
stripper: procedure; parse arg q /*remove punctuation at the end of word*/
@punctuation= '.,:;?¿!¡∙·'; do j=1 for length(@punctuation)
q=strip(q, 'T', substr(@punctuation, j, 1) )
end /*j*/
return q</syntaxhighlight>
{{out|output|text=&nbsp; when using the default inputs:}}
 
Output note: &nbsp; The Burma-Shave company was shocked at the number of automobile fenders mailed to the company.
if _=='' then do
say 'word' oxo "not found."
return 0
end
 
All fender senders received a half-pound jar of Burma-Shave.
_@=_ /*save _, pass it back to invoker*/
<pre style="height:87ex">
say 'word' oxo "found in:"
file 0, record: Rip a fender
file 0, record: Off your Car
file 0, record: Send it in
file 0, record: For a half-pound jar
file 0, record: Burma-Shave
 
file 1, record: A peach
do until _==''
file 1, record: Looks good
parse var _ f w _
file 1, record: With lots of fuzz
say ' file='f ' word='w
file 1, record: Man's no peach
end /*until*/
file 1, record: And never was
file 1, record: Burma-Shave
 
file 2, record: Does your husband
return _@
file 2, record: Misbehave
/*─────────────────────────────────────INVERTI subroutine───────────────*/
file 2, record: Grunt and grumble
invertI: procedure expose @. !; parse arg #,fn /*file#, filename*/
file 2, record: Rant and rave ?
call lineout fn /*close the file, just in case. */
file 2, record: Shoot the brute some
w=0 /*number of words so far. */
file 2, record: Burma-Shave
 
file 3, record: Don't take a curve
do while lines(fn)\==0 /*read the entire file. */
file 3, record: At 60 per
_=linein(fn) /*read the file, 1 line at a time*/
file 3, record: We hate to lose
_=space(_); if _='' then iterate /*if blank record, then ignore it*/
file 3, record: A customer
file 3, record: Burma-Shave
 
file 4, record: Every shaver
say 'file' #",record="_ /*echo a record, just to be verbose.*/
file 4, record: Now can snore
file 4, record: Six more minutes
file 4, record: Than before
file 4, record: By using
file 4, record: Burma-Shave
 
file 5, record: He played
upper _ /*make it case insensative. */
file 5, record: a sax
file 5, record: Had no B.O.
file 5, record: But his whiskers scratched
file 5, record: So she let him go
file 5, record: Burma-Shave
 
file 6, record: Henry the Eighth
do until _=='' /*pick off words until done. */
file 6, record: Prince of Friskers
parse var _ xxx _ /*pick off a word. */
file 6, record: Lost five wives
xxx=stripper(xxx) /*go and strip off ending punct. */
file 6, record: But kept his whiskers
if xxx='' then iterate /*is the word now blank (null) ? */
file 6, record: Burma-Shave
w=w+1 /*bump the word counter. */
@.xxx=@.xxx # w
if wordpos(xxx,!)==0 then !=! xxx /*add to THE list of words found.*/
end /*until _=='' */
 
file 7, record: Listen birds
end /*while lines...*/
file 7, record: These signs cost
file 7, record: Money
file 7, record: So roost a while
file 7, record: But don't get funny
file 7, record: Burma-Shave
 
file 8, record: My man
call lineout fn /*close the file, just to be neat*/
file 8, record: Won't shave
return w
file 8, record: Sez Hazel Huz
/*─────────────────────────────────────STRIPPER subroutine──────────────*/
file 8, record: But I should worry
stripper: procedure; parse arg q /*remove punctuation at word-end.*/
file 8, record: Dora's does
@punctuation='.,:;?¿!¡' /*serveral punctuation marks. */
file 8, record: Burma-Shave
 
file 9, record: Past
do j=1 for length(@punctuation)
file 9, record: Schoolhouses
q=strip(q,'T',substr(@punctuation,j,1))
file 9, record: Take it slow
end /*j*/
file 9, record: Let the little
file 9, record: Shavers grow
file 9, record: Burma-Shave
 
word ───huz─── found in:
return q</lang>
file=8 word=7
'''output'''
word ───60─── found in:
<pre style="height:30ex;overflow:scroll">
file=3 word=6
file 0,record=Rip a fender
word ───don't─── found in:
file 0,record=off your car
file=3 word=1
file 0,record=send it in
file=7 word=12
file 0,record=for a half-pound jar
word ───burma-shave─── found in:
file 0,record=Burma-shave
file=0 word=14
file 1,record=A peach
file=1 word=15
file 1,record=looks good
file=2 word=15
file 1,record=with lots of fuzz
file=3 word=14
file 1,record=but a man's no peach
file=4 word=13
file 1,record=and never was
file=5 word=17
file 1,record=Burma-shave
file=6 word=14
file 2,record=Does your husband
file=7 word=15
file 2,record=misbehave
file=8 word=14
file 2,record=grunt and grumble
file=9 word=11
file 2,record=rant and rave ?
file 2,record=shoot the brute some
file 2,record=Burma-shave
file 3,record=Don't take a curve
file 3,record=at 60 per
file 3,record=we hate to lose
file 3,record=a customer
file 3,record=Burma-shave
file 4,record=Every shaver
file 4,record=now can snore
file 4,record=six more minutes
file 4,record=than before
file 4,record=by using
file 4,record=Burma-shave
file 5,record=He played
file 5,record=a sax
file 5,record=had no B.O.
file 5,record=but his whiskers scratched
file 5,record=so they let him go
file 5,record=Burma-shave
file 6,record=Henry the Eighth
file 6,record=Prince of Friskers
file 6,record=lost five wives
file 6,record=but kept his whiskers
file 6,record=Burma-shave
file 7,record=Listen, birds
file 7,record=those signs cost
file 7,record=money
file 7,record=so roost a while but
file 7,record=don't get funny
file 7,record=Burma-shave
file 8,record=My man
file 8,record=won't shave
file 8,record=sez Hazel Huz
file 8,record=but I should worry
file 8,record=Dora's does
file 8,record=Burma-shave
file 9,record=Past schoolhouses
file 9,record=take it slow
file 9,record=let the little
file 9,record=shavers
file 9,record=grow
file 9,record=Burma-shave
word ---does--- found in:
file=2 word=1
file=8 word=13
word ---60--- found in:
file=3 word=6
word ---don't--- found in:
file=3 word=1
file=7 word=12
word ---burma-shave--- found in:
file=0 word=14
file=1 word=17
file=2 word=15
file=3 word=14
file=4 word=13
file=5 word=17
file=6 word=14
file=7 word=15
file=8 word=14
file=9 word=11
</pre>
 
Line 1,917 ⟶ 3,240:
 
'''indexmerge.rb'''
<langsyntaxhighlight lang="ruby">if File.exist? "index.dat"
@data = Marshal.load open("index.dat")
else
Line 1,945 ⟶ 3,268:
open("index.dat", "w") do |index|
index.write Marshal.dump(@data)
end</langsyntaxhighlight>
 
'''indexsearch.rb'''
<langsyntaxhighlight lang="ruby">if File.exist? "index.dat"
@data = Marshal.load open("index.dat")
else
Line 1,969 ⟶ 3,292:
end
 
p @result</langsyntaxhighlight>
 
'''Output'''
Line 1,980 ⟶ 3,303:
> ./indexsearch.rb It iS\!
["file1", "file2", "file3"]</pre>
 
=={{header|Rust}}==
<syntaxhighlight lang="rust">// Part 1: Inverted index structure
 
use std::{
borrow::Borrow,
collections::{BTreeMap, BTreeSet},
};
 
#[derive(Debug, Default)]
pub struct InvertedIndex<T> {
indexed: BTreeMap<String, BTreeSet<usize>>,
sources: Vec<T>,
}
 
impl<T> InvertedIndex<T> {
pub fn add<I, V>(&mut self, source: T, tokens: I)
where
I: IntoIterator<Item = V>,
V: Into<String>,
{
let source_id = self.sources.len();
self.sources.push(source);
for token in tokens {
self.indexed
.entry(token.into())
.or_insert_with(BTreeSet::new)
.insert(source_id);
}
}
 
pub fn search<'a, I, K>(&self, tokens: I) -> impl Iterator<Item = &T>
where
String: Borrow<K>,
K: Ord + ?Sized + 'a,
I: IntoIterator<Item = &'a K>,
{
let mut tokens = tokens.into_iter();
 
tokens
.next()
.and_then(|token| self.indexed.get(token).cloned())
.and_then(|first| {
tokens.try_fold(first, |found, token| {
self.indexed
.get(token)
.map(|sources| {
found
.intersection(sources)
.cloned()
.collect::<BTreeSet<_>>()
})
.filter(|update| !update.is_empty())
})
})
.unwrap_or_else(BTreeSet::new)
.into_iter()
.map(move |source| &self.sources[source])
}
 
pub fn tokens(&self) -> impl Iterator<Item = &str> {
self.indexed.keys().map(|it| it.as_str())
}
 
pub fn sources(&self) -> &[T] {
&self.sources
}
}
 
// Part 2: File walking and processing
 
use std::{
ffi::OsString,
fmt::{Debug, Display},
fs::{read_dir, DirEntry, File, ReadDir},
io::{self, stdin, Read},
path::{Path, PathBuf},
};
 
#[derive(Debug)]
pub struct Files {
dirs: Vec<ReadDir>,
}
 
impl Files {
pub fn walk<P: AsRef<Path>>(path: P) -> io::Result<Self> {
Ok(Files {
dirs: vec![read_dir(path)?],
})
}
}
 
impl Iterator for Files {
type Item = DirEntry;
 
fn next(&mut self) -> Option<Self::Item> {
'outer: while let Some(mut current) = self.dirs.pop() {
while let Some(entry) = current.next() {
if let Ok(entry) = entry {
let path = entry.path();
if !path.is_dir() {
self.dirs.push(current);
return Some(entry);
} else if let Ok(dir) = read_dir(path) {
self.dirs.push(current);
self.dirs.push(dir);
continue 'outer;
}
}
}
}
 
None // No directory left
}
}
 
fn tokenize<'a>(input: &'a str) -> impl Iterator<Item = String> + 'a {
input
.split(|c: char| !c.is_alphanumeric())
.filter(|token| !token.is_empty())
.map(|token| token.to_lowercase())
}
 
fn tokenize_file<P: AsRef<Path>>(path: P) -> io::Result<BTreeSet<String>> {
let mut buffer = Vec::new();
File::open(path)?.read_to_end(&mut buffer)?;
let text = String::from_utf8_lossy(&buffer);
Ok(tokenize(&text).collect::<BTreeSet<_>>())
}
 
fn tokenize_query(input: &str) -> Vec<String> {
let result = tokenize(input).collect::<BTreeSet<_>>();
// Make a vector sorted by length, so that longer tokens are processed first.
// This heuristics should narrow the resulting set faster.
let mut result = result.into_iter().collect::<Vec<_>>();
result.sort_by_key(|item| usize::MAX - item.len());
result
}
 
// Part 3: Interactive application
 
fn args() -> io::Result<(OsString, BTreeSet<OsString>)> {
let mut args = std::env::args_os().skip(1); // Skip the executable's name
 
let path = args
.next()
.ok_or_else(|| io::Error::new(io::ErrorKind::Other, "missing path"))?;
 
let extensions = args.collect::<BTreeSet<_>>();
 
Ok((path, extensions))
}
 
fn print_hits<'a, T>(hits: impl Iterator<Item = T>)
where
T: Display,
{
let mut found_none = true;
for (number, hit) in hits.enumerate() {
println!(" [{}] {}", number + 1, hit);
found_none = false;
}
 
if found_none {
println!("(none)")
}
}
 
fn main() -> io::Result<()> {
let (path, extensions) = args()?;
let mut files = InvertedIndex::<PathBuf>::default();
let mut content = InvertedIndex::<PathBuf>::default();
 
println!(
"Indexing {:?} files in '{}'",
extensions,
path.to_string_lossy()
);
 
for path in Files::walk(path)?.map(|file| file.path()).filter(|path| {
path.extension()
.filter(|&ext| extensions.is_empty() || extensions.contains(ext))
.is_some()
}) {
files.add(path.clone(), tokenize(&path.to_string_lossy()));
 
match tokenize_file(&path) {
Ok(tokens) => content.add(path, tokens),
Err(e) => eprintln!("Skipping a file {}: {}", path.display(), e),
}
}
 
println!(
"Indexed {} tokens in {} files.",
content.tokens().count(),
content.sources.len()
);
 
// Run the query UI loop
let mut query = String::new();
 
loop {
query.clear();
println!("Enter search query:");
if stdin().read_line(&mut query).is_err() || query.trim().is_empty() {
break;
}
 
match query.trim() {
"/exit" | "/quit" | "" => break,
 
"/tokens" => {
println!("Tokens:");
for token in content.tokens() {
println!("{}", token);
}
 
println!();
}
 
"/files" => {
println!("Sources:");
for source in content.sources() {
println!("{}", source.display());
}
 
println!();
}
 
_ => {
let query = tokenize_query(&query);
println!();
println!("Found hits:");
print_hits(content.search(query.iter()).map(|it| it.display()));
println!("Found file names:");
print_hits(files.search(query.iter()).map(|it| it.display()));
println!();
}
}
}
 
Ok(())
}</syntaxhighlight>
 
{{out}}
<pre>C:\Temp>inverted_index books html htm txt
Indexing {"htm", "html", "txt"} files in 'books'
Indexed 34810 tokens in 9 files.
Enter search query:
war
 
Found hits:
[1] books\EN\C\Chaucer, Geoffrey - Canterbury Tales.txt
[2] books\EN\H\Homer - The Odyssey.txt
[3] books\EN\O\Orwell, George - 1984.html
[4] books\EN\W\Wells, Herbert George - The Invisible Man.txt
[5] books\EN\W\Wells, Herbert George - The Island of Doctor Moreau.txt
[6] books\EN\W\Wells, Herbert George - The Time Machine.txt
[7] books\EN\W\Wells, Herbert George - War of the Worlds.txt
Found file names:
[1] books\EN\W\Wells, Herbert George - War of the Worlds.txt
 
Enter search query:
war gun
 
Found hits:
[1] books\EN\O\Orwell, George - 1984.html
[2] books\EN\W\Wells, Herbert George - War of the Worlds.txt
Found file names:
(none)
 
Enter search query:
/exit
 
C:\Temp></pre>
 
=={{header|Scala}}==
<syntaxhighlight lang="scala">object InvertedIndex extends App {
import java.io.File
 
// indexer
val WORD = raw"(\w+)".r
def parse(s: String) = WORD.findAllIn(s).map(_.toString.toLowerCase)
def invertedIndex(files: Seq[File]): Map[String,Set[File]] = {
var i = Map[String,Set[File]]() withDefaultValue Set.empty
files.foreach{f => scala.io.Source.fromFile(f).getLines flatMap parse foreach
(w => i = i + (w -> (i(w) + f)))}
i
}
 
// user interface
args match {
case _ if args.length < 2 => println("Usage: InvertedIndex ALLSEARCHWORDS FILENAME...")
case Array(searchwords, filenames @ _*) =>
val queries = parse(searchwords).toList
val files = filenames.map(new File(_)).filter{f => if (!f.exists) println(s"Ignoring $f"); f.exists}
(queries, files) match {
case (q, _) if q.isEmpty => println("Missing search words")
case (_, f) if f.isEmpty => println("Missing extant files")
case _ => val index = invertedIndex(files)
println(s"""Searching for ${queries map ("\""+_+"\"") mkString " and "} in ${files.size} files:""")
queries.map(index).foldLeft(files.toSet)(_ intersect _) match {
case m if m.isEmpty => println("No matching files")
case m => println(m mkString "\n")
}
}
}
}</syntaxhighlight>
{{out}}
<pre>> InvertedIndex "the" file1.txt file2.txt file3.txt
Searching for "the" in 3 files:
data/file1.txt
data/file2.txt
data/file3.txt
 
> InvertedIndex "the cat sat" file1.txt file2.txt file3.txt
Searching for "the" and "cat" and "sat" in 3 files:
file1.txt
file2.txt
 
> InvertedIndex fox file1.txt file2.txt file3.txt
Searching for "fox" in 3 files:
file3.txt
 
> InvertedIndex abc file1.txt file2.txt file3.txt
Searching for "abc" in 3 files:
No matching files</pre>
 
=={{header|Tcl}}==
<langsyntaxhighlight lang="tcl">package require Tcl 8.5
proc wordsInString str {
# We define "words" to be "maximal sequences of 'word' characters".
Line 2,000 ⟶ 3,650:
array set localidx {}
foreach word [wordsInString $data] {
lappend localidx($word) $i
incr i
}
 
# Transcribe into global index
foreach {word places} [array get localidx] {
dict set index($word) $filename $places
}
}
Line 2,014 ⟶ 3,664:
global index
if {[info exists index($word)]} {
return [dict keys $index($word)]
}
}
Line 2,022 ⟶ 3,672:
set files [findFilesForWord [lindex $words 0]]
foreach w [lrange $words 1 end] {
set wf [findFilesForWord $w]
set newfiles {}
foreach f $files {
if {$f in $wf} {lappend newfiles $f}
}
}
set files $newfiles
}
return $files
Line 2,037 ⟶ 3,687:
set files {}
foreach w $words {
if {![info exist index($w)]} {
return
}
}
}
dict for {file places} $index([lindex $words 0]) {
if {$file in $files} continue
foreach start $places {
set gotStart 1
foreach w [lrange $words 1 end] {
incr start
set gotNext 0
foreach {f ps} $index($w) {
if {$f ne $file} continue
foreach p $ps {
if {$p == $start} {
set gotNext 1
break
}
}
}
if {$gotNext} break
}
}
if {!$gotNext} {
set gotStart 0
break
}
}
}
if {$gotStart} {
lappend files $file
break
}
}
}
}
return $files
}</langsyntaxhighlight>
For the GUI:
<langsyntaxhighlight lang="tcl">package require Tk
pack [labelframe .files -text Files] -side left -fill y
pack [listbox .files.list -listvariable files]
Line 2,080 ⟶ 3,730:
pack [entry .found.entry -textvariable terms] -fill x
pack [button .found.findAll -command FindAll \
-text "Find File with All"] -side left
pack [button .found.findSeq -command FindSeq \
-text "Find File with Sequence"] -side right
 
# The actions invoked by various GUI buttons
Line 2,089 ⟶ 3,739:
set f [tk_getOpenFile]
if {$f ne ""} {
addDocumentToIndex $f
lappend files $f
}
}
Line 2,098 ⟶ 3,748:
set fs [findFilesWithAllWords $words]
lappend found "Searching for files with all $terms" {*}$fs \
"---------------------"
}
proc FindSeq {} {
Line 2,105 ⟶ 3,755:
set fs [findFilesWithWordSequence $words]
lappend found "Searching for files with \"$terms\"" {*}$fs \
"---------------------"
}</langsyntaxhighlight>
 
=={{header|TUSCRIPT}}==
<langsyntaxhighlight lang="tuscript">
$$ MODE TUSCRIPT
 
Line 2,140 ⟶ 3,790:
ENDLOOP
PRINT "-> ",files
</syntaxhighlight>
</lang>
Output:
<pre>
Line 2,157 ⟶ 3,807:
{{works with|ksh93}}
 
<langsyntaxhighlight lang="bash">#!/bin/ksh
 
typeset -A INDEX
Line 2,181 ⟶ 3,831:
(( count == $# )) && echo $file
done
}</langsyntaxhighlight>
 
Example use:
<langsyntaxhighlight lang="korn">index *.txt
search hello world
</syntaxhighlight>
</lang>
 
===Directory on filesystem===
Line 2,196 ⟶ 3,846:
* Add note about slowness.
 
<langsyntaxhighlight lang="bash">#!/bin/sh
# index.sh - create an inverted index
 
Line 2,205 ⟶ 3,855:
# the record separator for $INDEX/all.tab).
for file in "$@"; do
# Use printf(1), not echo, because "$file" might start with
# a hyphen and become an option to echo.
test 0 -eq $(printf %s "$file" | wc -l) || {
printf '%s\n' "$file: newline in filename" >&2
exit 1
}
}
done
 
Line 2,219 ⟶ 3,869:
fi=1
for file in "$@"; do
printf %s "Indexing $file." >&2
 
# all.tab maps $fi => $file
echo "$fi $file" >> "$INDEX/all.tab"
 
# Use punctuation ([:punct:]) and whitespace (IFS)
# to split tokens.
ti=1
tr -s '[:punct:]' ' ' < "$file" | while read line; do
for token in $line; do
# Index token by position ($fi, $ti). Ignore
# error from mkdir(1) if directory exists.
mkdir "$INDEX/$token" 2>/dev/null
echo $ti >> "$INDEX/$token/$fi"
: $((ti += 1))
 
# Show progress. Print a dot per 1000 tokens.
case "$ti" in
*000) printf .
esac
done
done
 
echo >&2
: $((fi += 1))
done</langsyntaxhighlight>
 
<langsyntaxhighlight lang="bash">#!/bin/sh
# search.sh - search an inverted index
 
Line 2,254 ⟶ 3,904:
want=sequence
while getopts aos name; do
case "$name" in
a) want=all;;
o) want=one;;
s) want=sequence;;
*) exit 2;;
esac
done
shift $((OPTIND - 1))
 
all() {
echo "TODO"
exit 2
}
 
one() {
echo "TODO"
exit 2
}
 
sequence() {
echo "TODO"
exit 2
}
 
$want "$@"</langsyntaxhighlight>
 
=={{header|Wren}}==
{{trans|Kotlin}}
{{libheader|Wren-ioutil}}
{{libheader|Wren-pattern}}
{{libheader|Wren-str}}
<syntaxhighlight lang="wren">import "./ioutil" for FileUtil, Input
import "./pattern" for Pattern
import "./str" for Str
import "os" for Process
 
var invIndex = {}
var fileNames = []
var splitter = Pattern.new("+1/W")
 
class Location {
construct new(fileName, wordNum) {
_fileName = fileName
_wordNum = wordNum
}
 
toString { "%(_fileName), word number %(_wordNum)" }
}
 
var indexFile = Fn.new { |fileName|
if (fileNames.contains(fileName)) {
System.print("'%(fileName)' already indexed")
return
}
fileNames.add(fileName)
var lines = FileUtil.readLines(fileName)
lines.each { |line|
line = Str.lower(line)
var i = 0
for (w in splitter.splitAll(line)) {
var locations = invIndex[w]
if (!locations) {
locations = []
invIndex[w] = locations
}
locations.add(Location.new(fileName, i + 1))
i = i + 1
}
}
System.print("'%(fileName)' has been indexed")
}
 
var findWord = Fn.new { |word|
var w = Str.lower(word)
var locations = invIndex[w]
if (locations) {
System.print("\n'%(word)' found in the following locations:")
System.print(locations.map { |l| " %(l)" }.join("\n"))
} else {
System.print("\n'%(word)' not found")
}
System.print()
}
 
// files to be indexed entered as command line arguments
var args = Process.arguments
if (args.count == 0) {
System.print("No file names have been supplied")
return
}
for (arg in args) indexFile.call(arg)
System.print("\nEnter word(s) to be searched for in these files or 'q' to quit")
while (true) {
var word = Input.text(" ? : ", 1)
if (word == "q" || word == "Q") return
findWord.call(word)
}</syntaxhighlight>
 
{{out}}
Uses the same files as the Kotlin example:
<pre>
$ wren-cli inverted_index.wren inv1.txt inv2.txt inv3.txt inv1.txt
'inv1.txt' has been indexed
'inv2.txt' has been indexed
'inv3.txt' has been indexed
'inv1.txt' already indexed
 
Enter word(s) to be searched for in these files or 'q' to quit
? : cat
 
'cat' not found
 
? : is
 
'is' found in the following locations:
inv1.txt, word number 2
inv1.txt, word number 5
inv2.txt, word number 2
inv3.txt, word number 2
 
? : banana
 
'banana' found in the following locations:
inv3.txt, word number 4
 
? : it
 
'it' found in the following locations:
inv1.txt, word number 1
inv1.txt, word number 4
inv2.txt, word number 3
inv3.txt, word number 1
 
? : what
 
'what' found in the following locations:
inv1.txt, word number 3
inv2.txt, word number 1
 
? : a
 
'a' found in the following locations:
inv3.txt, word number 3
 
? : q
</pre>
9,476

edits