Letter frequency

From Rosetta Code
Revision as of 16:11, 18 September 2011 by rosettacode>Kernigh (7 programs (Aikido, C, J, OCaml, Pascal, Python, SIMPOL) moved here from Array page, http://rosettacode.org/mw/index.php?title=Array&action=history)
Letter frequency is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Open a text file and compute letter frequency.

The first 7 solutions (Aikido, C, J, OCaml, Pascal, Python, SIMPOL) were moved here from the Array page. They might not all accomplish the same task.

Aikido

<lang aikido>import ctype

var letters = new int [26]

var s = openin (args[0])
while (!s.eof()) {
    var ch = s.getchar()
    if (s.eof()) {
        break
    }
    if (ctype.isalpha (ch)) {
        var n = cast<int>(ctype.tolower(ch) - 'a')
        ++letters[n]
    }
}

foreach i letters.size() {
    println (cast<char>('a' + i) + " " + letters[i])
}
</lang>

C

We wish to open a text file and compute letter frequency.

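<lang c>#include <ctype.h>
#include <stdio.h>

int main(void)
{
    /* declare array, with the freq table initialized to zero */
    int frequency[26] = {0};
    /* the character read; an int so that it can hold EOF */
    int ch;
    FILE *any_text = fopen("a_text_file.txt", "r");

    if (any_text == NULL) {
        perror("a_text_file.txt");
        return 1;
    }
    while ((ch = fgetc(any_text)) != EOF) {
        /* fold to upper case so that 'a' and 'A' share a slot */
        ch = toupper(ch);
        /* if the char is a letter, then increase its slot in the freq table
           (assumes A..Z are contiguous, as in ASCII) */
        if (ch >= 'A' && ch <= 'Z')
            frequency[ch - 'A'] += 1;
    }
    fclose(any_text);
    for (ch = 0; ch < 26; ch++)
        printf("%c: %d\n", 'A' + ch, frequency[ch]);
    return 0;
}</lang>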

J

The task is the same: open a text file and compute letter frequency.
Because J is an array programming language, no explicit loops are needed.
Input is a file path (directory path plus filename). Output is a 26-element one-dimensional integer array.

load 'files'     NB. fread is among these standard file utilities
ltrfreq=: 3 : 0
 letters=. (65+(i.26)) { a.                  NB. We'll work with minimal alphabet.
 reduced=. (#~ e.&letters) toupper fread y   NB. Omit non-letters. (y is the input.)
 sums   =. +/"_1 = reduced                   NB. Count how often each letter occurs.
 sums (letters I. (~. reduced))} 26 # 0      NB. Alphabetize the sums, then return.
)
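
For readers who do not know J, the four steps of ltrfreq can be sketched in Python 3. This is only an illustrative analogue, not a translation endorsed by the J example: the function takes the text directly instead of reading a file, and the name ltrfreq_steps and its intermediate names are chosen here to mirror the J variables.

```python
import string

def ltrfreq_steps(text):
    """Return a 26-element list of counts for A..Z, mirroring the J steps."""
    letters = string.ascii_uppercase
    # Omit non-letters, folding the ASCII letters that remain to upper case.
    reduced = [ch.upper() for ch in text if ch in string.ascii_letters]
    # Unique letters in order of first appearance, and a tally for each one.
    unique = list(dict.fromkeys(reduced))
    sums = [reduced.count(ch) for ch in unique]
    # Place each tally in its alphabetical slot of a zero-filled list.
    result = [0] * 26
    for ch, n in zip(unique, sums):
        result[letters.index(ch)] = n
    return result

counts = ltrfreq_steps("Hello, World!")
print(counts[letters_index] if (letters_index := 11) else None)  # count for 'L'
```

The explicit loop in the last step is what J expresses with a single indexed amend.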

OCaml

Same task: open a text file and compute letter frequency.

<lang ocaml>let () =

 let ic = open_in Sys.argv.(1) in
 let base = int_of_char 'a' in
 let arr = Array.make 26 0 in
 try while true do
   let c = Char.lowercase_ascii (input_char ic) in
   let ndx = int_of_char c - base in
   if ndx < 26 && ndx >= 0 then
     arr.(ndx) <- succ arr.(ndx)
 done
 with End_of_file ->
   close_in ic;
   for i=0 to 25 do
     Printf.printf "%c -> %d\n" (char_of_int(i + base)) arr.(i)
   done</lang>

See the documentation of the Array module (there is also a Bigarray module).

Pascal

This defines an array suitable to hold a 64x64 truecolor image (i.e. the red, green and blue values can each range from 0 to 255) and then sets the color of a single pixel. It was moved here from the Array page and does not compute letter frequency.

<lang pascal>program pixel;

type
  color = (red, green, blue);
  rgbvalue = 0 .. 255;

var
  picture: array[0 .. 63, 0 .. 63, color] of rgbvalue;

begin
  { set pixel (4,7) to yellow }
  picture[4, 7, red]   := 255;
  picture[4, 7, green] := 255;
  picture[4, 7, blue]  := 0
end.</lang>

Python

Example: open a text file and compute letter frequency.

<lang python>import string
import sys

if hasattr(string, 'ascii_lowercase'):
    letters = string.ascii_lowercase       # Python 2.2 and later
else:
    letters = string.lowercase             # Earlier versions

offset = ord('a')

def countletters(file_handle):
    """Traverse a file and compute the number of occurrences of each letter;
    return results as a simple 26 element list of integers.
    """
    results = [0] * len(letters)
    for line in file_handle:
        for char in line:
            char = char.lower()
            if char in letters:
                results[ord(char) - offset] += 1
                # Ordinal of any lowercase ASCII letter minus ordinal of 'a' -> 0..25
    return results

if __name__ == "__main__":
    sourcedata = open(sys.argv[1])
    lettercounts = countletters(sourcedata)
    sourcedata.close()
    for i in range(len(lettercounts)):
        print("%s=%d" % (chr(i + offset), lettercounts[i]))
</lang>

This example defines the function and provides a sample usage. The if __name__ == "__main__": line allows it to be cleanly imported into other Python code while still working as a standalone script (a very common Python idiom).

Using a numerically indexed array (list) for this is artificial and clutters the code somewhat. The more Pythonic approach would be:

Works with: Python version 2.5

<lang python>...
from collections import defaultdict

def countletters(file_handle):
    """Count occurrences of letters and return a dictionary of them
    """
    results = defaultdict(int)
    for line in file_handle:
        for char in line:
            c = char.lower()
            if c in letters:
                results[c] += 1
    return results
</lang>

This eliminates the ungainly fiddling with ordinal values and offsets. More importantly, it allows the results to be printed more simply:

<lang python>lettercounts = countletters(sourcedata)
for letter, count in lettercounts.items():
    print("%s=%s" % (letter, count))
</lang>

This again eliminates all fussing with the details of converting letters into list indices.
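
On Python 2.7, or 3.1 and later, the standard library's collections.Counter takes the dictionary approach one step further. The following is a minimal sketch for Python 3, not part of the original solution; the name count_letters and the sample lines are made up for illustration, and only ASCII letters are counted.

```python
from collections import Counter
import string

def count_letters(lines):
    """Count ASCII letters, ignoring case, over an iterable of strings."""
    counts = Counter()
    for line in lines:
        # Keep only ASCII letters, folded to lower case, and tally them.
        counts.update(ch.lower() for ch in line if ch in string.ascii_letters)
    return counts

# Any iterable of strings works here, including an open text file.
sample = ["Hello, World!\n", "Letter frequency\n"]
for letter, count in sorted(count_letters(sample).items()):
    print("%s=%d" % (letter, count))
```

Sorting the items gives alphabetical output, which a plain dictionary does not guarantee on older Python versions.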

SIMPOL

Example: open a text file and compute letter frequency.

<lang simpol>constant iBUFSIZE 500

function main(string filename)

 fsfileinputstream fpi
 integer e, i, aval, zval, cval
 string s, buf, c
 array chars
 e = 0
 fpi =@ fsfileinputstream.new(filename, error=e)
 if fpi =@= .nul
   s = "Error, file """ + filename + """ not found{d}{a}"
 else
   chars =@ array.new()
   aval = .charval("a")
   zval = .charval("z")
   i = 1
   while i <= 26
     chars[i] = 0
     i = i + 1
   end while
   buf = .lcase(fpi.getstring(iBUFSIZE, 1))
   while not fpi.endofdata and buf > ""
     i = 1
     while i <= .len(buf)
       c = .substr(buf, i, 1)
       cval = .charval(c)
       if cval >= aval and cval <= zval
         chars[cval - aval + 1] = chars[cval - aval + 1] + 1
       end if
       i = i + 1
     end while
     buf = .lcase(fpi.getstring(iBUFSIZE, 1))
   end while
   s = "Character counts for """ + filename + """{d}{a}"
   i = 1
   while i <= chars.count()
     s = s + .char(aval + i - 1) + ": " + .tostr(chars[i], 10) + "{d}{a}"
     i = i + 1
   end while
 end if

end function s</lang>

As this was being created I realized that in SIMPOL I wouldn't have done it this way (in fact, I wrote it differently the first time and had to go back and change it to use an array afterward). In SIMPOL we would have used the set object. It acts similarly to a single-dimensional array, but can also use various set operations, such as difference, unite, intersect, etc. One of the interesting things is that each unique value is stored only once, and the number of duplicates is stored with it. The sample then looks a little cleaner:

<lang simpol>constant iBUFSIZE 500

function main(string filename)

 fsfileinputstream fpi
 integer e, i, aval, zval
 string s, buf, c
 set chars
 e = 0
 fpi =@ fsfileinputstream.new(filename, error=e)
 if fpi =@= .nul
   s = "Error, file """ + filename + """ not found{d}{a}"
 else
   chars =@ set.new()
   aval = .charval("a")
   zval = .charval("z")
   buf = .lcase(fpi.getstring(iBUFSIZE, 1))
   while not fpi.endofdata and buf > ""
     i = 1
     while i <= .len(buf)
       c = .substr(buf, i, 1)
       if .charval(c) >= aval and .charval(c) <= zval
         chars.addvalue(c)
       end if
       i = i + 1
     end while
     buf = .lcase(fpi.getstring(iBUFSIZE, 1))
   end while
   s = "Character counts for """ + filename + """{d}{a}"
   i = 1
   while i <= chars.count()
     s = s + chars[i] + ": " + .tostr(chars.valuecount(chars[i]), 10) + "{d}{a}"
     i = i + 1
   end while
 end if

end function s</lang>

The final stage simply reads the totals for each character. One caveat: if a character is unrepresented, it will not show up at all in this second implementation.
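
The caveat applies to any container that stores only the values it has seen. As a rough illustration in Python 3 (not SIMPOL, and not a description of the SIMPOL set API), one way to report every letter is to walk the alphabet rather than the container and fall back to zero. The helper name with_zero_counts and the sample data are hypothetical.

```python
import string

def with_zero_counts(counts):
    """Return (letter, count) pairs for a..z, using 0 for letters absent from counts."""
    return [(letter, counts.get(letter, 0)) for letter in string.ascii_lowercase]

# Only 'a' and 'b' were observed; the other 24 letters are reported as 0.
observed = {"a": 2, "b": 2}
report = with_zero_counts(observed)
print(report[:3])  # [('a', 2), ('b', 2), ('c', 0)]
```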