Talk:Letter frequency

From Rosetta Code

Task description

A more detailed task description is needed. For example, should we only count ASCII letters A-Z? Case-insensitively?

Or for that matter, what is a letter? For what language? Most programs seemed to assume the Latin alphabet for English. -- Gerard Schildberger 01:47, 30 June 2012 (UTC)

Maybe the results can be displayed with whatever method is the most convenient? I assume opening the file is required (not just handle a file that is already open)?

Since the first solutions were copied from another page and may not be correct solutions, those should be marked somehow. (Was there some specific tag for this purpose?) At least the Pascal solution seems to have nothing to do with the task.

--PauliKL 13:03, 19 September 2011 (UTC)

I took it as anything that is un-said should be whatever is convenient to the implementer.
  • ASCII? Count whatever the file open routine makes most easy.
  • Case sensitivity? Count what you get without applying any uppercase/lowercase filters.
  • Output format? whatever is convenient.
Open the file in your code? I interpreted this as being a requirement.
This leaves the guts as being a way to iterate through the characters keeping count. --Paddy3118 14:58, 19 September 2011 (UTC)
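As a rough illustration of that reading (open the file in the code, then iterate through the characters keeping count, with no case filters), here is a minimal Python sketch. The function name count_characters and the UTF-8 encoding are assumptions made for the example, not anything the task description states.

```python
from collections import Counter


def count_characters(path):
    """Count every character in the file exactly as read (no case folding)."""
    counts = Counter()
    # Opening the file is done here, since that was read as a requirement.
    # UTF-8 is an assumption; the task does not name an encoding.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            counts.update(line)  # adds one to the tally of each character
    return counts
```

Calling count_characters on some text file and printing the items of the result in whatever order is convenient would cover the "output format" point as well.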

"Letter frequency" is not the same as "letter occurrences". The title hints that more is needed in the task description. It would seem that some description of output is required as well. --Demivec 16:43, 9 November 2011 (UTC)

It seems that many program examples interpreted a letter as a character. A (Latin) letter has two forms: its uppercase and lowercase version. So if two H characters and three h characters were in a file, then there would be five occurrences of the letter aitch. [Aitch is the English name for the letter H or h.] The task description could've been clearer on that point, so for the REXX version 1, a count was done for each (Latin) letter, AND also for each character, and the counts are provided in separate lists. This made the loosey-goosey interpretation moot. Since it wasn't stated what a letter is (I used the primary definition that it's any of the symbols of an alphabet), it seemed appropriate to provide both lists: a list of letters, and a list of all characters (for any language's alphabet). -- Gerard Schildberger 21:19, 25 July 2012 (UTC)
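The letter-versus-character distinction described above can be sketched in Python as follows. This is only an illustration, not the REXX version 1 code: the function name is hypothetical, and restricting "letters" to the ASCII Latin alphabet is an assumption, since the task never says which alphabet is meant.

```python
from collections import Counter


def letter_and_character_counts(text):
    """Return (letters, characters) tallies for the given text.

    letters    merges both cases of each ASCII Latin letter (an assumption).
    characters counts every character separately, exactly as it appears.
    """
    characters = Counter(text)
    letters = Counter(
        ch.upper() for ch in text if ch.isascii() and ch.isalpha()
    )
    return letters, characters


letters, characters = letter_and_character_counts("HH hhh!")
print(letters["H"])                      # 5 occurrences of the letter aitch
print(characters["H"], characters["h"])  # 2 3
```

Keeping the two tallies separate means a reader can check either interpretation against the same input.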

The English pronunciation of the letter 'H' is haitch surely! :-)
--Paddy3118 08:48, 26 July 2012 (UTC)
One may pronounce it that way (depending upon which side of the pond you're on), but the spelling of the letter isn't that.     -- Gerard Schildberger (talk) 21:13, 4 August 2015 (UTC)

In hindsight, it would've been nice to make a requirement to use the program example as the primary input (but not necessarily the only input); that way, everyone could see what was used for its input. At least one example used UNIXDICT.TXT, which has no capital letters. Another example only counts capital letters. Still others showed a list, but excluded most of the counts, so it can't be verified if the uppercase letters were included (or not) with the lowercase letters, or kept as separate counts. -- Gerard Schildberger 21:19, 25 July 2012 (UTC)

a few remarks for Rexx

A few typos: carraiage -> carriage, occurances -> occurrences, independant -> independent

(see next section on typos and misspellings) -- Gerard Schildberger 22:17, 25 July 2012 (UTC)

and some more substantial observations:

y=d2x(L); if @.up.y==0 then iterate /*zero count? Then ignore letter*/

 c=d2c(L)                             /*C is the hex version of of char*/

In such cases I use cnt.0up.y so that a possible variable up never interferes. Actually, y is the hex version; drop one ‘of’.

I don't understand what you are observing, substantial or not. Thank you for your suggestion that I modify the version 1 code, but one of "them" can't be dropped as REXX version 1 keeps track of letters as well as characters, and both of "them" are needed for their respective (count) lists. -- Gerard Schildberger 22:01, 25 July 2012 (UTC)

@.=0 /*wouldn't it be neat to use Θ ? */
Why not use cnt. ?

Because it's my style of coding. I use @. for important stemmed arrays, and it makes it easier to find in the code where that stemmed array is referenced. Using such constructs as c.c is very confusing to a novice reader of REXX. There are two variables being used, c and c: one is the stemmed array name, the other the stemmed array index. But --- defending one's programming style will just start a religious war, so there is no sense in pointing out faults in another's coding style. -- Gerard Schildberger 22:01, 25 July 2012 (UTC)

In the discussion I read: “Case sensitivity? Count what you get without applying any uppercase/lowercase filters”

No filters were used in REXX version 1, letters and characters were counted correctly without case sensitivity, and a count was provided for each. -- Gerard Schildberger 22:01, 25 July 2012 (UTC)

upper c -> c=translate(c) would help for other Rexxes (in particular ooRexx)

REXX version 1 was coded for classic REXX (not the object-oriented version of REXX, ooRexx), and the use of the   upper   statement is more intuitive when being read by people who don't know REXX that well, the   upper   BIF explains itself.   It also has uses that   translate   BIF doesn't have (or doesn't support),   but that discussion should be done elsewhere. -- Gerard Schildberger 22:01, 25 July 2012 (UTC)
Also, when uppercasing tens of thousands of lines,   the use of the   UPPER   statement is noticeably faster than the   TRANSLATE   BIF.     -- Gerard Schildberger (talk) 22:59, 11 March 2020 (UTC)

--Walterpachl 09:21, 25 July 2012 (UTC)


By the way, for quite a long time,   I couldn't get   (Rosetta Code's)   spellchecker to work.   I finally found where I had to turn it on (enable it),   so now I can catch typos and misspellings on-the-go, so to speak.     -- Gerard Schildberger (talk) 22:59, 11 March 2020 (UTC)

typos and misspellings

Rather than point out typos and/or misspellings (and hoping that the original author notices the critique and corrects them), I believe it is quite acceptable, and more than that, expedient, to just correct the typo or misspelling as long as it's in a comment (in a program) or within a "talk" page --- if the error is an obvious one. If there's a doubt, don't change it. It's harder to tell if there's an error when the wrong word is used (was it intentional?). If I'd bothered to complain about everybody's bad spelling, typos, or wrong word use, I'd never get any real work done. For the few I did correct, I made sure it was the only thing I did on that update, so other people (especially the original poster) can see what was changed. And even then, I did the Rosetta Code update with trepidation and consternation. -- Gerard Schildberger 22:24, 25 July 2012 (UTC)

Changing program code or (input) data is much more problematic. General rule of thumb: don't.

Some programmers use misspelled words like kount (instead of count) for variable names intentionally, for whatever reasons. I also use misspelled words like Ka-razy (for crazy) at times in the comment portions; sometimes these attempts at humor may be hard to discern. There is a fine line between humor and ... not humor. -- Gerard Schildberger 22:17, 25 July 2012 (UTC)

Humour is in the eye. (Anyone's eye, not just the beholder). --Paddy3118 08:52, 26 July 2012 (UTC)
Invited to do so, I corrected one misspelling (occurences -> occurrences).
However my observation that
/*C is the hex version of of char*/
should read
/*Y is the hex version of char */
was not considered and I dare not change it
When I did the TSO version I realized that the hex stuff isn't needed at all, is it? --Walterpachl 10:18, 26 July 2012 (UTC)
I can't speak for your code, just the code that I entered, and that hex stuff was needed to present a list of letter counts as well as the character counts.
Also, when changing other people's code or comments, please use a summary stating that (or better yet, whenever you make any change). This makes it easier for the original poster to see what the changes were via the friendly Rosetta Code notification system. I know when I first started making changes and entering examples on Rosetta Code, I didn't know about the (edit) summaries. Now, I have the box checked:
  • My preferences
  • Editing
  • [√] Prompt me when entering a blank edit summary
Of course, this only nags, er, prompts you when you don't enter an edit summary, but it helps. -- Gerard Schildberger 12:20, 26 July 2012 (UTC)
Thanks for the hint. That'll keep me from forgetting. --Walterpachl 14:03, 26 July 2012 (UTC)

Whitespace and Assembly code

After the Whitespace code, there is a block of Assembly code. Is that part of the Whitespace implementation, or is there a heading missing? Further, the "Output" section after the Assembly block only contains three numeric values, which doesn't seem to be right. --PauliKL (talk) 12:16, 2 August 2013 (UTC)

AppleScript "generic" alternative

Well. Although the "composition of generic functions" AppleScript alternative is cryptic, obfuscated, and not usefully commented, and therefore practically impossible to assess before running, I took a chance and ran it, trying it with a shorter text file which the first script handles in just under one and a quarter seconds. "Composition of generic functions" was still going after an hour, which is when I stopped it. It was more successful with the text of this comment, although it counts non-letters too and treats different cases of the same letter as different letters. --Nig (talk) 20:12, 12 April 2020 (UTC)

I do, of course, understand your puzzlement and exasperation :-)
Our two examples are complementary: mine is much faster to write (10 new lines of code, just clicking together existing Lego bricks); yours is inevitably slower to write and debug, but demonstrably faster at run-time, especially with larger samples.
Mine is perfectly serviceable for the small text I wanted a count of, and cost me no effort or time to write.
For me (and for anyone familiar with the tradition of composing pure functions, and with that well-established ML tradition of function names, subsequently adopted by various other languages) my version is familiar and easy to read, with a good ratio of signal-to-noise. We can glance at it and see how it defines the problem. Referring to the detail of how each familiar abstraction is implemented can give the curious some useful insight into the particular quirks and data structures of AppleScript.
Your version is clearly more legible to you, but do remember that to others it may also look like a cloud of squid ink, spreading densely and noisily past the Rosetta code 80 character limit, and folding a bit wildly, while tracing through a series of state mutations which, in formal terms, is very much more complex, and very much less safe and predictable, and much harder to model, than any composition of pure functions.
The distinction between these two approaches is usually formulated in terms like Procedural (or Imperative) vs Functional (or Declarative), and something like that might be a little more helpful to readers than the slightly more private 'Straightforward' vs 'Weird' which seems, understandably, to express your own experience :-)
The right tool for a job depends on the moment and the context - sometimes we need much more efficient use of our own time – swift writing, easy refactoring, less debugging. Something quick for a small data sample.
And sometimes we can afford to spend more of our own time, and want to build something that we can use with larger sets of data, optimized for run-time rather than write-time.
Yours works very well for the latter context (possibly at the cost of how much time it takes to write and debug) though I personally might reach for another language when performance or scale are what I need. (In a macOS scripting context, the same composition of pure functions runs very fast in JavaScript for Automation, for example. I hardly need to tell you that AppleScript records are not famous for their speed or powers of introspection). By the time we have to reach for ObjC Foundation classes to get some usable performance at any scale, AppleScript has really rather lost the bloom of its charm and claims to accessibility, and may be an implausible or inappropriate instrument to reach for anyway.

To avoid the risk of a faintly comic impression, might it make sense for us to find a slightly more illuminating, and more widely understood alternative to "straightforward" ? My approach obviously seems more "straightforward" to me – that's a symmetrical relationship :-) Perhaps Imperative ? Procedural ? Something else which has some precedent and familiarity on Rosetta Code, and will immediately be intelligible ? Hout (talk) 10:17, 13 April 2020 (UTC)
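For readers who have not met the two styles being contrasted in this thread, a small Python sketch may help. It is not a translation of either AppleScript entry; both function names are hypothetical, and the functional variant is written so that it never mutates its accumulator, which is one plausible reason such a composition can slow down sharply on larger inputs.

```python
from functools import reduce


def counts_imperative(text):
    """Procedural style: one dictionary, updated in place inside a loop."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            counts[ch] = counts.get(ch, 0) + 1
    return counts


def counts_functional(text):
    """Functional style: a fold over the letters with no mutation.

    Each step builds a fresh dictionary, so the work grows with
    (number of letters) x (number of distinct letters).
    """
    return reduce(
        lambda acc, ch: {**acc, ch: acc.get(ch, 0) + 1},
        filter(str.isalpha, text),
        {},
    )
```

Both return the same tallies for the same text; the difference lies in how the result is built and what that costs at run-time.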