Talk:Letter frequency

From Rosetta Code
Revision as of 14:28, 17 April 2020 by Hout (talk | contribs) (→‎AppleScript functionally composed variant)

Task description

A more detailed task description is needed. For example, should we only count ASCII letters A-Z? Case-insensitive?

Or for that matter, what is a letter? For what language? Most programs seemed to assume the Latin alphabet for English. -- Gerard Schildberger 01:47, 30 June 2012 (UTC)

Maybe the results can be displayed with whatever method is the most convenient? I assume opening the file is required (not just handle a file that is already open)?

Since the first solutions were copied from another page and may not be correct solutions, those should be marked somehow. (Was there some specific tag for this purpose?) At least the Pascal solution seems to have nothing to do with the task.

--PauliKL 13:03, 19 September 2011 (UTC)

I took it that anything left unsaid should be whatever is convenient to the implementer.
  • ASCII? Count whatever the file open routine makes easiest.
  • Case sensitivity? Count what you get without applying any uppercase/lowercase filters.
  • Output format? Whatever is convenient.
Open the file in your code? I interpreted this as being a requirement.
This leaves the guts as being a way to iterate through the characters keeping count. --Paddy3118 14:58, 19 September 2011 (UTC)
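A minimal sketch of that reading, in Python (chosen here only for illustration; the function name count_chars and the default encoding are hypothetical choices, not part of the task): the program opens the file itself, iterates over its characters, and keeps a count of each one exactly as read, with no case folding.

```python
from collections import Counter


def count_chars(path, encoding="utf-8"):
    """Count every character in the file at `path`, exactly as read.

    No uppercase/lowercase filter is applied, so 'H' and 'h' are
    counted separately. The default encoding is an assumption.
    """
    counts = Counter()
    # Opening the file inside the program is treated as part of the task.
    with open(path, encoding=encoding) as handle:
        for line in handle:
            counts.update(line)  # a str is iterated character by character
    return counts
```

Line endings and other non-letters are counted too, which matches the "count what you get" interpretation rather than a strict letters-only one.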

"Letter frequency" is not the same as "letter occurrences". The title hints that more is needed in the task description. It would seem that some description of output is required as well. --Demivec 16:43, 9 November 2011 (UTC)
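To make the distinction concrete, here is a hedged Python sketch (the name relative_frequencies is hypothetical, and the task description itself does not say which output is wanted): occurrences are raw counts, while a frequency is each count divided by the total number of characters counted.

```python
from collections import Counter


def relative_frequencies(text):
    """Map each character to its share of all characters in `text`."""
    counts = Counter(text)          # occurrences: raw counts
    total = sum(counts.values())
    if total == 0:
        return {}                   # empty input has no frequencies
    return {ch: n / total for ch, n in counts.items()}
```

For the text "aab", the letter a has 2 occurrences and a frequency of 2/3.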

It seems that many program examples interpreted a letter as a character. A (Latin) letter has two forms: its uppercase and lowercase version. So if two H characters and three h characters were in a file, then there would be five occurrences of the letter aitch. [Aitch is the English name for the letter H or h.] The task description could've been clearer on that point, so for the REXX version 1, a count was done for each (Latin) letter, AND also for each character, and the counts are provided in separate lists. This made the loosey-goosey interpretation moot. Since it wasn't stated what a letter is (I used the primary definition that it's any of the symbols of an alphabet), it seemed appropriate to provide both lists: a list of letters, and a list of all characters (for any language's alphabet). -- Gerard Schildberger 21:19, 25 July 2012 (UTC)

The English pronunciation of the letter 'H' is haitch surely! :-)
--Paddy3118 08:48, 26 July 2012 (UTC)
One may pronounce it that way (depending upon which side of the pond you're on), but the spelling of the letter isn't that.     -- Gerard Schildberger (talk) 21:13, 4 August 2015 (UTC)
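The letter-versus-character distinction drawn above can be sketched in Python as two separate tallies. This is only an illustration, not the REXX version 1 code; the function name is hypothetical, and restricting "letters" to ASCII A-Z is an assumption made for the example.

```python
from collections import Counter


def letters_and_characters(text):
    """Return (letter_counts, character_counts) for `text`.

    letter_counts folds case, so 'H' and 'h' both count toward 'H';
    only ASCII A-Z are treated as letters here (an assumption).
    character_counts counts every character exactly as given.
    """
    character_counts = Counter(text)
    letter_counts = Counter(
        ch.upper() for ch in text if ch.isascii() and ch.isalpha()
    )
    return letter_counts, character_counts
```

With two H characters and three h characters, the letter tally reports five for H, while the character tally keeps them apart as two and three.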

In hindsight, it would've been nice to make a requirement to use the program example as the primary input (but not necessarily the only input); that way, everyone could see what was used for its input. At least one example used UNIXDICT.TXT, which has no capital letters. Another example only counts capital letters. Still others showed a list, but excluded most of the counts, so it can't be verified if the uppercase letters were included (or not) with the lowercase letters, or kept as separate counts. -- Gerard Schildberger 21:19, 25 July 2012 (UTC)

a few remarks for Rexx

A few typos: carraiage -> carriage occurances -> occurrences independant -> independent

(see next section on typos and misspellings) -- Gerard Schildberger 22:17, 25 July 2012 (UTC)

and some more substantial observations:

y=d2x(L); if @.up.y==0 then iterate /*zero count? Then ignore letter*/
 c=d2c(L)                             /*C is the hex version of of char*/

In such cases I use cnt.0up.y so that a possible variable up never interferes. Actually, y is the hex version; drop one ‘of’.

I don't understand what you are observing, substantial or not. Thank you for your suggestion that I modify the version 1 code, but one of "them" can't be dropped as REXX version 1 keeps track of letters as well as characters, and both of "them" are needed for their respective (count) lists. -- Gerard Schildberger 22:01, 25 July 2012 (UTC)

@.=0 /*wouldn't it be neat to use Θ ? */ why not use cnt. ?

Because it's my style of coding. I use @. for important stemmed arrays, and it makes it easier to find in the code where that stemmed array is referenced. Using such constructs as c.c is very confusing to a novice reader of REXX. There are two variables being used, c and c: one is the stemmed array name, the other the stemmed array index. But --- defending one's programming style will just start a religious war, so there is no sense in pointing out faults in another's coding style. -- Gerard Schildberger 22:01, 25 July 2012 (UTC)

In the discussion I read: “Case sensitivity? Count what you get without applying any uppercase/lowercase filters”

No filters were used in REXX version 1, letters and characters were counted correctly without case sensitivity, and a count was provided for each. -- Gerard Schildberger 22:01, 25 July 2012 (UTC)

upper c -> c=translate(c) would help for other Rexxes (in particular ooRexx)

REXX version 1 was coded for classic REXX (not the object-oriented version of REXX, ooRexx), and the use of the   upper   statement is more intuitive when being read by people who don't know REXX that well, the   upper   BIF explains itself.   It also has uses that   translate   BIF doesn't have (or doesn't support),   but that discussion should be done elsewhere. -- Gerard Schildberger 22:01, 25 July 2012 (UTC)
Also, when uppercasing tens of thousands of lines,   the use of the   UPPER   statement is noticeably faster than the   TRANSLATE   BIF.     -- Gerard Schildberger (talk) 22:59, 11 March 2020 (UTC)

--Walterpachl 09:21, 25 July 2012 (UTC)


By the way, for quite a long time,   I couldn't get   (Rosetta Code's)   spellchecker to work.   I finally found where I had to turn it on (enable it),   so now I can catch typos and misspellings on-the-go, so to speak.     -- Gerard Schildberger (talk) 22:59, 11 March 2020 (UTC)

typos and misspellings

Rather than point out typos and/or misspellings (and hoping that the original author notices the critique and corrects), I believe it is quite acceptable and more than that, expedient to just correct the typo or misspelling as long as it's a comment (in a program) or within a "talk" page --- if the error is an obvious one. If there's a doubt, don't change it. It's harder to tell if there's an error when the wrong word is used (was it intentional?). If I'd bothered to complain about everybody's bad spelling, typos, or wrong word use, I'd never get any real work done. For the few I did correct, I made sure it was the only thing I did on that update, so other people (especially the original poster) can see what was changed. And even then, I did the Rosetta Code update with trepidation and consternation. -- Gerard Schildberger 22:24, 25 July 2012 (UTC)

Changing program code or (input) data is much more problematic. General rule of thumb: don't.

Some programmers use misspelled words like kount (instead of count) for variable names intentionally for whatever reasons. I also use misspelled words like Ka-razy (for crazy) at times in the comment portions; sometimes these attempts at humor may be hard to discern. There is a fine line between humor and ... not humor. -- Gerard Schildberger 22:17, 25 July 2012 (UTC)

Humour is in the eye. (Anyone's eye, not just the beholder). --Paddy3118 08:52, 26 July 2012 (UTC)
Invited to do so, I corrected one misspelling (occurences -> occurrences).
However, my observation that
/*C is the hex version of of char*/
should read
/*Y is the hex version of char */
was not considered, and I dare not change it.
When I did the TSO version I realized that the hex stuff isn't needed at all, is it? --Walterpachl 10:18, 26 July 2012 (UTC)
I can't speak for your code, just the code that I entered, and that hex stuff was needed to present a list of letter counts as well as the character counts.
Also, when changing other people's code or comments, please use a summary stating that (or better yet, whenever you make any change). This makes it easier for the original poster to see what the changes were via the friendly Rosetta Code notification system. I know when I first started making changes and entering examples on Rosetta Code, I didn't know about the (edit) summaries. Now, I have the box checked:
  • My preferences
  • Editing
  • [√] Prompt me when entering a blank edit summary
Of course, this only nags, er, prompts you when you don't enter an edit summary, but it helps. -- Gerard Schildberger 12:20, 26 July 2012 (UTC)
Thanks for the hint. That'll keep me from forgetting. --Walterpachl 14:03, 26 July 2012 (UTC)

Whitespace and Assembly code

After the Whitespace code, there is a block of Assembly code. Is that part of the Whitespace implementation, or is there a heading missing? Further, the "Output" section after the Assembly block contains only three numeric values, which doesn't seem to be right. --PauliKL (talk) 12:16, 2 August 2013 (UTC)

AppleScript "generic" alternative

Well. Although the "composition of generic functions" AppleScript alternative is cryptic, obfuscated, and not usefully commented, and therefore practically impossible to assess before running, I took a chance and ran it, trying it with a shorter text file which the first script handles in just under one and a quarter seconds. "Composition of generic functions" was still going after an hour, which is when I stopped it. It was more successful with the text of this comment, although it counts non-letters too and treats different cases of the same letter as different letters. --Nig (talk) 20:12, 12 April 2020 (UTC)

I do, of course, understand your puzzlement and exasperation :-)
Our two examples are complementary: mine is much faster to write (10 new lines of code, just clicking together existing Lego bricks); yours is inevitably slower to write and debug, but demonstrably faster at run-time, especially with larger samples.
Mine is perfectly serviceable for the small text I wanted a count of, and cost me no effort or time to write.
For me (and for anyone familiar with the tradition of composing pure functions, and with that well-established ML tradition of function names, subsequently adopted by various other languages) my version is familiar and easy to read, with a good ratio of signal-to-noise. We can glance at it and see immediately how it defines the problem. Referring to the detail of how each familiar abstraction is implemented, can give the curious some useful insight into the particular quirks and data structures of AppleScript.
Your version is clearly more legible to you, but do remember that to others it may also look like a cloud of squid ink, spreading densely and noisily past the Rosetta Code 80-character limit, and folding a bit wildly, while tracing through a series of state mutations which, in formal terms, is very much more complex, and very much less safe and predictable, and much harder to model, than any composition of pure functions.
The distinction between these two approaches is usually formulated in terms like Procedural (or Imperative) vs Functional (or Declarative), and something like that might be a little more helpful to readers than the slightly more private 'Straightforward' vs 'Unfamiliar' which seems, quite understandably, to express your own experience :-)
The right tool for a job depends on the moment and the context - sometimes we need much more efficient use of our own time – swift writing, easy refactoring, less debugging. Something quick and reliable for a smallish data sample.
And sometimes we can afford to spend more of our own time, and want to build something that we can use with larger sets of data, optimized for run-time rather than write-time.
Yours works very well for the latter context (possibly at the cost of how much time it takes to write and debug) though I personally might reach for another language when performance or scale are what I need. (In a macOS scripting context, the same composition of pure functions runs very fast in JavaScript for Automation, for example. I hardly need to tell you that AppleScript records are not famous for their speed or powers of introspection). By the time we have to reach for ObjC Foundation classes to get some usable performance at any scale, AppleScript has really rather lost the bloom of its charm and claims to accessibility, and may be an implausible or inappropriate instrument to reach for anyway.
To avoid the risk of a faintly comic impression, might it make sense for us to find a more illuminating, and more widely understood alternative to "straightforward" ? My approach obviously seems more "straightforward" to me – that's a symmetrical relationship :-) Perhaps Imperative ? Procedural ? Something else which has some precedent and familiarity on Rosetta Code, and will immediately be intelligible ? Hout (talk) 10:17, 13 April 2020 (UTC)
"Straightforward" will do where I've used it: that is, doing the required task simply and effectively and with code which can be understood and maintained by people familiar with the language who aren't the original coder. This means not using handlers with cryptic labels and cryptic contents which do little but call other handlers with cryptic labels and cryptic contents and which aren't adequately commented.
AppleScript's designed as far as possible to allow users of Macintosh computers, or those responsible for Macs in, say, small school or office environments, to automate tasks for themselves without having to call in a programmer wedded to a "Declarative", "Procedural", or whatever other programming philosophy. Its English-like syntax is one of its main features, allowing rank beginners often to guess their way to something — however terrible — which does what they need. On the other hand, more experienced coders coming to AppleScript from other languages find they can often still use those languages from AppleScript (courtesy of the do shell script command) or can at least follow their accustomed thinking habits when getting AppleScript to work for them. Those who take the bother to become familiar with AppleScript have a great deal of useful power at their disposal, including direct control of many applications. It's basically a getting-things-done language for Mac users, although many do grow to love it in its own right.
As you've found, it can even accommodate your style of coding. If you're only writing for yourself and aren't interested in performance (!), it's perfectly legitimate to code that way in AppleScript (coding errors excepted). But it becomes problematic when it's presented in public as "the way things are done in AppleScript", which is simply misleading. You're the only AppleScripter in the known universe who writes code like that. It goes against the spirit of the language and against the general programming rule that code should be understandable and maintainable by coders other than the original author. Any novice Mac owner thinking of having a go at learning AppleScript and seeing your code would be put off immediately, thinking the language was all about terseness and chasing around in circles. Even an experienced AppleScripter trying to assess one of your scripts before running it, or trying to find out why it wasn't behaving as expected, would similarly lose patience very quickly and just write something more intelligible from scratch.
I won't discuss this any further as I see you've had similar exchanges with practitioners of other languages on this site and are not to be budged from your conviction. I'm also aware that I'm in your debt for having dropped me a few hints when I first started posting here a couple of months ago, for which I couldn't at the time work out how to post my thanks. Thanks here if it's not too late. :) --Nig (talk) 18:51, 13 April 2020 (UTC)
My pleasure, and I should have thought to warn you about the 80 character norm, which would help legibility, particularly with your excellent and generous comments.
On the other issues, well, disagreement and varying approaches are the life blood of Rosetta Code.
To quote from the landing page, the idea is not only to demonstrate how languages are similar and different but also to aid a person with a grounding in one approach to a problem in learning another.
So the more we disagree, the more useful material and revealing comparisons we can generate :-)
No amount of iterative programming with sequence branches and loops, even for fifty years, will ever introduce us to even the basics of maps, folds, curried and higher-order functions, and monadic compositions, or enable us to reframe from 'doing' to 'defining', so the notion of even an experienced AppleScripter ... XYZ ... probably looks a bit less rational to me than it apparently does to you – first-year kids in college often take courses in Scheme or Haskell these days.
The only really useful commentary is, however, just an alternative draft.
As Knuth famously puts it, premature optimization is the root of all evil, but if I did want something faster for macOS scripting, my comment on your AppleScript draft would probably just be the JavaScript ES6 draft (on this page) which uses the macOS Automation library for the file access.
(Just a few lines of unoptimized new code, and on this system it gives counts for each distinct character in the text of Les Misérables in well under a second. I would feel ashamed of myself if I wasted any time on thoughts of further optimizing that :-)
Keep well, and enjoy the coding ! Hout (talk) 18:17, 14 April 2020 (UTC)
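For readers unfamiliar with the two styles contrasted in this thread, here is a small Python sketch of the same character count written both ways. It is an illustration only: it does not reproduce either AppleScript entry, and both function names are hypothetical.

```python
from functools import reduce


def count_imperative(text):
    """Procedural style: mutate one dictionary inside a loop."""
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts


def count_functional(text):
    """Functional style: a left fold that returns a new dictionary per step."""
    def step(acc, ch):
        # Build a fresh dict instead of mutating the accumulator.
        return {**acc, ch: acc.get(ch, 0) + 1}
    return reduce(step, text, {})
```

Both return the same counts. The fold avoids mutation but copies the accumulator at every step, so in this naive form it is slower on large inputs, which mirrors the write-time versus run-time trade-off discussed above.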