Words containing "the" substring

Revision as of 13:48, 9 December 2020

Task

Using the dictionary unixdict.txt, search words containing "the" substring,
then display the found words (on this page).

Any word shown should have a length > 11.
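As a minimal sketch of the task (not one of the solutions submitted below), the Python function here filters an in-memory list of words. The name `words_containing` and the sample list are hypothetical; a real run would take its words from unixdict.txt.

```python
def words_containing(words, needle="the", min_len=12):
    """Return the words that contain `needle` and have at least `min_len` characters."""
    # "length > 11" in the task is the same as "length >= 12".
    return [w for w in words if len(w) >= min_len and needle in w]


# Hypothetical sample; the task itself uses the words in unixdict.txt.
sample = ["the", "other", "nevertheless", "mathematician", "northernmost", "abbreviation"]
found = words_containing(sample)
print(found)
```

Short words such as "other" are rejected by the length test, and "abbreviation" is long enough but lacks the substring.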

Other tasks related to string operations:

Metrics

Counting

Remove/replace

Anagrams/Derangements/shuffling

Find/Search/Determine

Formatting

Song lyrics/poems/Mad Libs/phrases

Tokenize

Sequences

ALGOL 68

<lang algol68># find 12 character (or more) words that have "the" in them #
IF  FILE input file;

   STRING file name = "unixdict.txt";
   open( input file, file name, stand in channel ) /= 0

THEN

   # failed to open the file #
   print( ( "Unable to open """ + file name + """", newline ) )

ELSE

   # file opened OK #
   BOOL at eof := FALSE;
   # set the EOF handler for the file #
   on logical file end( input file, ( REF FILE f )BOOL:
                                    BEGIN
                                        # note that we reached EOF on the #
                                        # latest read #
                                        at eof := TRUE;
                                        # return TRUE so processing can continue #
                                        TRUE
                                    END
                      );
   INT the count := 0;
   WHILE STRING word;
         get( input file, ( word, newline ) );
         NOT at eof
   DO
       IF INT w len = ( UPB word + 1 ) - LWB word;
          w len > 11
       THEN
           BOOL found the := FALSE;
           FOR w pos FROM LWB word TO UPB word - 2 WHILE NOT found the DO
               IF word[ w pos : w pos + 2 ] = "the" THEN
                   found the  := TRUE;
                   the count +:= 1;
                   print( ( word, " " ) );
                   IF the count MOD 6 = 0
                   THEN print( ( newline ) )
                   ELSE FROM w len + 1 TO 18 DO print( ( " " ) ) OD
                   FI
               FI
           OD
       FI
   OD;
   print( ( newline, "found ", whole( the count, 0 ), " ""the"" words", newline ) );
   close( input file )

FI</lang>

Output:

authenticate       chemotherapy       chrysanthemum      clothesbrush       clotheshorse       eratosthenes
featherbedding     featherbrain       featherweight      gaithersburg       hydrothermal       lighthearted
mathematician      neurasthenic       nevertheless       northeastern       northernmost       otherworldly
parasympathetic    physiotherapist    physiotherapy      psychotherapeutic  psychotherapist    psychotherapy
radiotherapy       southeastern       southernmost       theoretician       weatherbeaten      weatherproof
weatherstrip       weatherstripping
found 32 "the" words

AppleScript

AppleScripters can tackle this task in a variety of ways. The example handlers below are listed in order of increasing speed, but all complete the task in under 0.2 seconds on my current machine. They all take a file specifier, search string, and minimum length as parameters and return identical results for the same input.

Using just the core language — 'words': <lang applescript>on wordsContaining(textfile, searchText, minLength)

   script o
       property wordList : missing value
       property output : {}
   end script
   
   -- Extract the text's 'words' and return any that meet both the search text and minimum length requirements.
   set o's wordList to words of (read (textfile as alias) as «class utf8»)
   repeat with thisWord in o's wordList
       if ((thisWord contains searchText) and (thisWord's length ≥ minLength)) then
           set end of o's output to thisWord's contents
       end if
   end repeat
   
   return o's output

end wordsContaining</lang>

Using just the core language — 'text items': <lang applescript>on wordsContaining(textFile, searchText, minLength)

   script o
       property textItems : missing value
       property output : {}
   end script
   
   -- Extract the text's search-text-delimited sections.
   set astid to AppleScript's text item delimiters
   set AppleScript's text item delimiters to searchText
   set o's textItems to text items of (read (textFile as alias) as «class utf8»)
   set AppleScript's text item delimiters to astid
   
   -- Reconstitute any words containing the search text from the stubs at the section ends and
   -- the search text itself, returning any results which meet the minimum length requirement.
   set thisSection to beginning of o's textItems
   set sectionHasWords to ((count thisSection's words) > 0)
   considering white space
       repeat with i from 2 to (count o's textItems)
           set foundWord to searchText
           if (sectionHasWords) then
               set thisStub to thisSection's last word
               if (thisSection ends with thisStub) then set foundWord to thisStub & foundWord
           end if
           set thisSection to item i of o's textItems
           set sectionHasWords to ((count thisSection's words) > 0)
           if (sectionHasWords) then
               set thisStub to thisSection's first word
               if (thisSection begins with thisStub) then set foundWord to foundWord & thisStub
           end if
           if (foundWord's length ≥ minLength) then set end of o's output to foundWord
       end repeat
   end considering
   
   return o's output

end wordsContaining</lang>

Using a shell script: <lang applescript>on wordsContaining(textFile, searchText, minLength)

   -- Set up and execute a shell script which uses grep to find words containing the search text
   -- (matching AppleScript's current case-sensitivity setting) and awk to pass those which
   -- satisfy the minimum length requirement.
   if ("A" = "a") then
       set part1 to "grep -io "
   else
       set part1 to "grep -o "
   end if
   set shellCode to part1 & quoted form of ("\\b\\w*" & searchText & "\\w*\\b") & ¬
       (" <" & quoted form of textFile's POSIX path) & ¬
       (" | awk " & quoted form of ("// && length($0) >= " & minLength))
   
   return paragraphs of (do shell script shellCode)

end wordsContaining</lang>

Using Foundation methods (AppleScriptObjC): <lang applescript>use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

on wordsContaining(textFile, searchText, minLength)

   set theText to current application's class "NSMutableString"'s ¬
       stringWithContentsOfFile:(textFile's POSIX path) usedEncoding:(missing value) |error|:(missing value)
   -- Replace every run of non AppleScript 'word' characters with a linefeed.
   tell theText to replaceOccurrencesOfString:("(?:[\\W--[.'’]]|(?<!\\w)[.'’]|[.'’](?!\\w))++") withString:(linefeed) ¬
       options:(current application's NSRegularExpressionSearch) range:({0, its |length|()})
   -- Split the text at the linefeeds.
   set theWords to theText's componentsSeparatedByString:(linefeed)
   -- Filter the resulting array for strings which meet the search text and minimum length requirements,
   -- matching AppleScript's current case-sensitivity setting. NSString lengths are measured in 16-bit
   -- code units so use regex to check the lengths in characters.
   if ("A" = "a") then
       set filterTemplate to "((self CONTAINS[c] %@) && (self MATCHES %@))"
   else
       set filterTemplate to "((self CONTAINS %@) && (self MATCHES %@))"
   end if
   set filter to current application's class "NSPredicate"'s ¬
       predicateWithFormat_(filterTemplate, searchText, ".{" & minLength & ",}+")
   
   return (theWords's filteredArrayUsingPredicate:(filter)) as list

end wordsContaining</lang>
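The comment in the handler above notes that NSString lengths are measured in 16-bit code units rather than characters. The following Python sketch only illustrates that distinction; `utf16_units` is a hypothetical helper, and Python's `len` counts code points, which is an approximation of "characters" that is good enough for this comparison.

```python
def utf16_units(s):
    """Number of 16-bit code units needed to encode `s` as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


word = "the\U0001F600"  # three ASCII letters plus one character outside the BMP
code_points = len(word)          # Python counts code points
code_units = utf16_units(word)   # the non-BMP character needs a surrogate pair
print(code_points, code_units)
```

For plain ASCII words such as those in unixdict.txt the two counts agree, so the difference only matters for text containing characters outside the Basic Multilingual Plane.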

Test code for the task with any of the above: <lang applescript>local textFile, output
set textFile to ((path to desktop as text) & "unixdict.txt") as «class furl»
-- considering case -- Uncomment this and the corresponding 'end' line for case-sensitive searches.
set output to wordsContaining(textFile, "the", 12)
-- end considering
return {count output, output}</lang>

Output:

<lang applescript>{32, {"authenticate", "chemotherapy", "chrysanthemum", "clothesbrush", "clotheshorse", "eratosthenes", "featherbedding", "featherbrain", "featherweight", "gaithersburg", "hydrothermal", "lighthearted", "mathematician", "neurasthenic", "nevertheless", "northeastern", "northernmost", "otherworldly", "parasympathetic", "physiotherapist", "physiotherapy", "psychotherapeutic", "psychotherapist", "psychotherapy", "radiotherapy", "southeastern", "southernmost", "theoretician", "weatherbeaten", "weatherproof", "weatherstrip", "weatherstripping"}}</lang>

AWK

The following is an awk one-liner entered at a POSIX shell.

<lang awk>/Code$ awk '/the/ && length($1) > 11' unixdict.txt
authenticate
chemotherapy
chrysanthemum
clothesbrush
clotheshorse
eratosthenes
featherbedding
featherbrain
featherweight
gaithersburg
hydrothermal
lighthearted
mathematician
neurasthenic
nevertheless
northeastern
northernmost
otherworldly
parasympathetic
physiotherapist
physiotherapy
psychotherapeutic
psychotherapist
psychotherapy
radiotherapy
southeastern
southernmost
theoretician
weatherbeaten
weatherproof
weatherstrip
weatherstripping
/Code$ </lang>

FreeBASIC

Reuses some code from Odd words#FreeBASIC <lang freebasic>#define NULL 0

type node

   word as string*32   'enough space to store any word in the dictionary
   nxt as node ptr

end type

function addword( tail as node ptr, word as string ) as node ptr

   'allocates memory for a new node, links the previous tail to it,
   'and returns the address of the new node
   dim as node ptr newnode = allocate(sizeof(node))
   tail->nxt = newnode
   newnode->nxt = NULL
   newnode->word = word
   return newnode

end function

function length( word as string ) as uinteger

   'necessary replacement for the built-in len function, which in this
   'case would always return 32
   for i as uinteger = 1 to 32
       if asc(mid(word,i,1)) = 0 then return i-1
   next i
   return 999

end function

dim as string word
dim as node ptr tail = allocate( sizeof(node) )
dim as node ptr head = tail, curr = head, currj
tail->nxt = NULL
tail->word = "XXXXHEADER"

open "unixdict.txt" for input as #1
while true

   line input #1, word
   if word = "" then exit while
   if length(word)>11 then tail = addword( tail, word )

wend
close #1

dim as string tempword

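while curr <> NULL    'visit every node, including the last one

   'the last possible start of a 3-letter substring is at length-2
   for i as uinteger = 1 to length(curr->word)-2
       if mid(curr->word,i,3) = "the" then
           print curr->word
           exit for   'print each word once, even if "the" occurs twice
       end if
   next i
   curr = curr->nxt

wend</lang>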

Output:

authenticate                    
chemotherapy                    
chrysanthemum                   
clothesbrush                    
clotheshorse                    
eratosthenes                    
featherbedding                  
featherbrain                    
featherweight                   
gaithersburg                    
hydrothermal                    
lighthearted                    
mathematician                   
neurasthenic                    
nevertheless                    
northeastern                    
northernmost                    
otherworldly                    
parasympathetic                 
physiotherapist                 
physiotherapy                   
psychotherapeutic               
psychotherapist                 
psychotherapy                   
radiotherapy                    
southeastern                    
southernmost                    
theoretician                    
weatherbeaten                   
weatherproof                    
weatherstrip                    
weatherstripping

Go

<lang go>package main

import (

   "bytes"
   "fmt"
   "io/ioutil"
   "log"
   "strings"
   "unicode/utf8"

)

func main() {

   wordList := "unixdict.txt"
   b, err := ioutil.ReadFile(wordList)
   if err != nil {
       log.Fatal("Error reading file")
   }
   bwords := bytes.Fields(b)
   var words []string
   for _, bword := range bwords {
       s := string(bword)
       if utf8.RuneCountInString(s) > 11 {
           words = append(words, s)
       }
   }
   count := 0
   fmt.Println("Words containing 'the' having a length > 11 in", wordList, "\b:")
   for _, word := range words {
       if strings.Contains(word, "the") {
           count++
           fmt.Printf("%2d: %s\n", count, word)
       }
   }

}</lang>

Output:

Words containing 'the' having a length > 11 in unixdict.txt:
 1: authenticate
 2: chemotherapy
 3: chrysanthemum
 4: clothesbrush
 5: clotheshorse
 6: eratosthenes
 7: featherbedding
 8: featherbrain
 9: featherweight
10: gaithersburg
11: hydrothermal
12: lighthearted
13: mathematician
14: neurasthenic
15: nevertheless
16: northeastern
17: northernmost
18: otherworldly
19: parasympathetic
20: physiotherapist
21: physiotherapy
22: psychotherapeutic
23: psychotherapist
24: psychotherapy
25: radiotherapy
26: southeastern
27: southernmost
28: theoretician
29: weatherbeaten
30: weatherproof
31: weatherstrip
32: weatherstripping

Julia

<lang julia>function wordscontaining(needle, overlength, dictfile)

   for haystack in split(read(dictfile, String))
       length(haystack) > overlength && occursin(needle, haystack) && println(haystack)
   end

end

wordscontaining("the", 11, "unixdict.txt")

</lang>

Output:

authenticate
chemotherapy  
chrysanthemum 
clothesbrush  
clotheshorse  
eratosthenes  
featherbedding
featherbrain  
featherweight 
gaithersburg  
hydrothermal  
lighthearted
mathematician
neurasthenic
nevertheless
northeastern
northernmost
otherworldly
parasympathetic
physiotherapist
physiotherapy
psychotherapeutic
psychotherapist
psychotherapy
radiotherapy
southeastern
southernmost
theoretician
weatherbeaten
weatherproof
weatherstrip
weatherstripping

Perl

Perl one-liner entered from a POSIX shell:

<lang perl>/Code$ perl -n -e '/(\w*the\w*)/ && length($1)>11 && print' unixdict.txt
authenticate
chemotherapy
chrysanthemum
clothesbrush
clotheshorse
eratosthenes
featherbedding
featherbrain
featherweight
gaithersburg
hydrothermal
lighthearted
mathematician
neurasthenic
nevertheless
northeastern
northernmost
otherworldly
parasympathetic
physiotherapist
physiotherapy
psychotherapeutic
psychotherapist
psychotherapy
radiotherapy
southeastern
southernmost
theoretician
weatherbeaten
weatherproof
weatherstrip
weatherstripping
/Code$ </lang>

Phix

<lang Phix>function the(string word) return length(word)>11 and match("the",word) end function
sequence words = filter(get_text("demo/unixdict.txt",GT_LF_STRIPPED),the)
printf(1,"found %d 'the' words:\n%s\n",{length(words),join(shorten(words,"",3),", ")})</lang>

Output:

found 32 'the' words:
authenticate, chemotherapy, chrysanthemum, ..., weatherproof, weatherstrip, weatherstripping

Python

Entered from a POSIX shell:

<lang python>/Code$ python -c 'import sys
> for line in sys.stdin:
>     if "the" in line and len(line.strip()) > 11:
>         print(line.rstrip())
> ' < unixdict.txt
authenticate
chemotherapy
chrysanthemum
clothesbrush
clotheshorse
eratosthenes
featherbedding
featherbrain
featherweight
gaithersburg
hydrothermal
lighthearted
mathematician
neurasthenic
nevertheless
northeastern
northernmost
otherworldly
parasympathetic
physiotherapist
physiotherapy
psychotherapeutic
psychotherapist
psychotherapy
radiotherapy
southeastern
southernmost
theoretician
weatherbeaten
weatherproof
weatherstrip
weatherstripping
/Code$ </lang>

Raku

A trivial modification of the ABC words task.

<lang perl6>put 'unixdict.txt'.IO.words».fc.grep({ (.chars > 11) && (.contains: 'the') })\
   .&{"{+$_} words:\n  " ~ .batch(8)».fmt('%-17s').join: "\n  "};</lang>

Output:

32 words:
  authenticate      chemotherapy      chrysanthemum     clothesbrush      clotheshorse      eratosthenes      featherbedding    featherbrain     
  featherweight     gaithersburg      hydrothermal      lighthearted      mathematician     neurasthenic      nevertheless      northeastern     
  northernmost      otherworldly      parasympathetic   physiotherapist   physiotherapy     psychotherapeutic psychotherapist   psychotherapy    
  radiotherapy      southeastern      southernmost      theoretician      weatherbeaten     weatherproof      weatherstrip      weatherstripping

REXX

This REXX version doesn't care what order the words in the dictionary are in, nor does it care what
case (lower/upper/mixed) the words are in; the search for the substring "the" is caseless.

It also allows the substring to be specified on the command line (CL), as well as the dictionary file identifier.

Programming note: If the minimum length is negative, the program finds the words but does not display them, and
only displays the count of found words. <lang rexx>/*REXX program finds words that contain the substring "the" (within an identified dict.)*/
parse arg $ minL iFID .                          /*obtain optional arguments from the CL*/
if    $=='' |    $==","  then    $= 'the'        /*Not specified?  Then use the default.*/
if minL=='' | minL==","  then minL= 12           /* "      "         "   "   "     "    */
if iFID=='' | iFID==","  then iFID='unixdict.txt' /* "      "         "   "   "     "    */
tell= minL>0;                 minL= abs(minL)    /*use absolute value of minimum length.*/
@.=                                              /*default value of any dictionary word.*/

        do #=1  while lines(iFID)\==0           /*read each word in the file  (word=X).*/
        @.#= strip( linein( iFID) )             /*pick off a word from the input line. */
        end   /*#*/

$u= $;                        upper $u           /*obtain an uppercase version of  $.   */
say copies('─', 25)     #     "words in the dictionary file: "       iFID
finds= 0                                         /*count of the substring found in dict.*/

        do j=1  for #-1;   z= @.j;     upper z  /*process all the words that were found*/
        if length(z)<minL  then iterate         /*Is word too short?    Yes, then skip.*/
        if pos($u, z)==0   then iterate         /*Found the substring?   No,   "    "  */
        finds= finds + 1                        /*bump count of substring words found. */
        if tell  then say right(left(@.j, 20), 25)    /*Show it?  Indent original word.*/
        end        /*j*/
                                                /*stick a fork in it,  we're all done. */

say copies('─', 25)      finds      " words  (with a min. length of" ,
                                 minL') that contains the substring: '     $</lang>
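The two behaviours described in the notes above (a caseless search, and a negative minimum length meaning "count only") can be sketched in Python as follows. This is an illustrative sketch, not a port of the REXX program: the name `find_words` and the sample list are hypothetical, and the words are supplied as a list rather than read from the dictionary file.

```python
def find_words(words, needle="the", min_len=12):
    """Caseless search; a negative min_len means 'count only, do not display'."""
    show = min_len > 0           # same role as  tell  in the REXX version
    min_len = abs(min_len)
    target = needle.upper()
    count = 0
    for word in words:
        # Skip words that are too short or that lack the substring.
        if len(word) < min_len or target not in word.upper():
            continue
        count += 1
        if show:
            print(word)
    return count


# Hypothetical sample in mixed case.
sample = ["The", "OTHERWORLDLY", "lighthearted", "then", "abc"]
shown = find_words(sample)              # displays the qualifying words
counted = find_words(sample, "the", -3) # displays nothing, only counts
print(shown, counted)
```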

output when using the default inputs:

───────────────────────── 25105 words in the dictionary file:  unixdict.txt
     authenticate
     chemotherapy
     chrysanthemum
     clothesbrush
     clotheshorse
     eratosthenes
     featherbedding
     featherbrain
     featherweight
     gaithersburg
     hydrothermal
     lighthearted
     mathematician
     neurasthenic
     nevertheless
     northeastern
     northernmost
     otherworldly
     parasympathetic
     physiotherapist
     physiotherapy
     psychotherapeutic
     psychotherapist
     psychotherapy
     radiotherapy
     southeastern
     southernmost
     theoretician
     weatherbeaten
     weatherproof
     weatherstrip
     weatherstripping
───────────────────────── 32  words (with a min. length of 12) that contain the substring:  the

output when using the input of: , -3

───────────────────────── 25105 words in the dictionary file:  unixdict.txt
───────────────────────── 287  words (with a min. length of 3) that contains the substring:  the

Ring

<lang ring>
cStr = read("unixdict.txt")
wordList = str2list(cStr)
num = 0
the = "the"

see "working..." + nl

ln = len(wordList)
for n = ln to 1 step -1
    if len(wordList[n]) < 12
       del(wordList,n)
    ok
next

see 'Founded "the" words are:' + nl

for n = 1 to len(wordList)
    ind = substr(wordList[n],the)
    if ind > 0
       num = num + 1
       see "" + num + ". " + wordList[n] + nl
    ok
next

see "done..." + nl
</lang>

Output:

working...
Founded "the" words are:
1. authenticate
2. chemotherapy
3. chrysanthemum
4. clothesbrush
5. clotheshorse
6. eratosthenes
7. featherbedding
8. featherbrain
9. featherweight
10. gaithersburg
11. hydrothermal
12. lighthearted
13. mathematician
14. neurasthenic
15. nevertheless
16. northeastern
17. northernmost
18. otherworldly
19. parasympathetic
20. physiotherapist
21. physiotherapy
22. psychotherapeutic
23. psychotherapist
24. psychotherapy
25. radiotherapy
26. southeastern
27. southernmost
28. theoretician
29. weatherbeaten
30. weatherproof
31. weatherstrip
32. weatherstripping
done...

Smalltalk

Works with: Smalltalk/X

<lang smalltalk>d := 'unixdict.txt' asFilename contents asSet.
page := 'https://www.rosettacode.org/wiki/Words_containing_%22the%22_substring' asURL retrieveContents.
page asCollectionOfWords

   select:[:word | (word size > 11) and:[word includesString:'the' caseSensitive:trueOrFalseWhoKnows]]
   thenDo:#transcribeCR</lang>

Wren

Library: Wren-fmt

<lang ecmascript>import "io" for File
import "/fmt" for Fmt

var wordList = "unixdict.txt" // local copy
var words = File.read(wordList).trimEnd().split("\n").where { |w| w.count > 11 }.toList
var count = 0
System.print("Words containing 'the' having a length > 11 in %(wordList):")
for (word in words) {

   if (word.contains("the")) {
       count = count + 1
       Fmt.print("$2d: $s", count, word)
   }

}</lang>

Output:

Words containing 'the' having a length > 11 in unixdict.txt:
 1: authenticate
 2: chemotherapy
 3: chrysanthemum
 4: clothesbrush
 5: clotheshorse
 6: eratosthenes
 7: featherbedding
 8: featherbrain
 9: featherweight
10: gaithersburg
11: hydrothermal
12: lighthearted
13: mathematician
14: neurasthenic
15: nevertheless
16: northeastern
17: northernmost
18: otherworldly
19: parasympathetic
20: physiotherapist
21: physiotherapy
22: psychotherapeutic
23: psychotherapist
24: psychotherapy
25: radiotherapy
26: southeastern
27: southernmost
28: theoretician
29: weatherbeaten
30: weatherproof
31: weatherstrip
32: weatherstripping