Talk:Text processing/1: Difference between revisions

 
==Why?==
I was reading through [http://paddy3118.blogspot.com/2007/01/data-mining-in-three-language05.html old blog entries] and thought it would be appropriate (minus the focus on speed).
 
==The Sample File Was Not Found==
Still missing. [[User:Petelomax|Pete Lomax]] ([[User talk:Petelomax|talk]]) 06:16, 1 August 2016 (UTC)
: I do have the file on my disk, but I'm not sure how to make it available. Maybe I should mail it to an admin. [[User:Fwend|Fwend]] ([[User talk:Fwend|talk]]) 07:26, 1 August 2016 (UTC)
 
==Please clarify the task==
::: --[[User:PauliKL|PauliKL]] 13:54, 11 November 2008 (UTC)
::::PauliKL is right, this task seems to be more calculation than munging. Perhaps a new, similar, locally defined task could be made using the same data file (the date formatting is especially good for munging and some languages use _ for negative numbers instead of -)? The gap calculation seems like too much--like it's getting too far away from munging. Maybe just count the number of bad values in each line and report that in the line stats? I wouldn't worry about errors that much either. If anything, I would consider it bad data on an error (bad line for errors on dates and bad value for errors on values or flags) and just skip it.--[[User:Mwn3d|Mwn3d]] 14:25, 11 November 2008 (UTC)
:::::The book "Data Munging with Perl" by David Cross describes the data munging process as:
<pre>More specifically, data munging consists of a number of processes that are applied to
an initial data set to convert it into a different, but related data set. These processes will
fall into a number of categories: recognition, parsing, filtering, and transformation.</pre>
:::::Data munging is a loose term and ''does'' apply to the task.
:::::There was a very real requirement to find the longest time that things were broke, and, compared to some requests, it is quite reasonable. The newsgroup reference I added was to show that this was a real-world problem. The data already in the article should be enough to complete the task. The task may be more challenging than some, but maybe trying to solve it will impart useful (and marketable) skills, as well as making it possible to contrast solutions in different languages. Unfortunately there is no way to see the development process used in different programming languages, because that might show whether such a task is easier done in a scripting language. Some of the questions asked above, for example, might not occur if the language chosen made it easy to just assume correctly formed data and quickly write a parser that could be quickly rewritten if your assumptions were wrong. This task is quite straightforward for a data munging task: the full file follows the syntax of the excerpt. There are no hand editing errors, no funny escape characters, the needed results can be calculated from the data shown, ... --[[User:Paddy3118|Paddy3118]] 20:15, 11 November 2008 (UTC)
::::::Just because it's a real-world problem doesn't mean that it's a good instructional problem. People are coming here to learn and they don't need to filter through a complex task to see how to take in pre-formatted input and do a few little calculations with it. It's fine to mention that it's real, but I don't think it makes the task any more valid for RC. I believe that the "broke" time calculation is simple, it just seems a little weird for learning purposes (once again, being real doesn't imply people will easily learn from it). I don't think the task as a whole is very difficult, I just think its complexity is overriding its purpose. As for the errors, in this particular case you may be able to assume perfect input, but in general it's good practice to be thinking "what if", and with data munging jobs in general, you may not be able to assume clean input. The people who ask about errors are just following their good programming instincts. So basically: it doesn't matter if it's real when people need to learn from it, simpler is better when creating tasks here, and it would be nice if we could just agree on what to do for some errors.--[[User:Mwn3d|Mwn3d]] 20:48, 11 November 2008 (UTC)
:::::::The task is a bit complicated because it performs two unrelated functions: calculating sums and averages line by line, and simultaneously finding the longest gap asynchronously over multiple lines. But maybe it is not too complicated (and I already made an implementation for the Vedit macro language). Anyway, a few words about how the gap is calculated could be added to the task description. --[[User:PauliKL|PauliKL]] 16:39, 12 November 2008 (UTC)
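
For illustration, the two functions described above (per-line sums and averages, plus the longest gap carried across lines) could be sketched in Python roughly as follows. This is only a sketch under assumptions taken from this discussion: fields are separated by a single Tab, the first field is the date, the rest are value/flag pairs, and a reading is bad when its flag is <= 0. The function name <code>summarize</code> and the tuple layouts are hypothetical and do not come from the task description.

```python
def summarize(lines):
    """Per-line totals plus the longest run of bad readings across lines.

    Assumed record layout: date, then value/flag pairs, Tab-separated.
    A reading counts as bad when its flag is <= 0.
    """
    line_stats = []            # one (date, good_count, total, mean) per line
    run = 0                    # length of the current run of bad readings
    run_start = None           # date on which the current run began
    longest = (0, None, None)  # (length, first date, last date)

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        date = fields[0]
        total, good = 0.0, 0
        for i in range(1, len(fields) - 1, 2):
            value, flag = float(fields[i]), int(fields[i + 1])
            if flag > 0:
                total += value
                good += 1
                run = 0        # a good reading ends any run of bad ones
            else:
                if run == 0:
                    run_start = date
                run += 1       # the run is NOT reset at the end of a line
                if run > longest[0]:
                    longest = (run, run_start, date)
        mean = total / good if good else None
        line_stats.append((date, good, total, mean))

    return line_stats, longest
```

The only state shared between lines is the run counter, which is why the gap can span several records while the sums and averages stay strictly per line.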
 
: I think we should assume that the format is as in the example input file readings.txt. That is:
:* Field separator = single Tab character
flag<=0)</pre>
::Is where it is saying it wants the longest gap, not restricting it to a single record(line), as the sentence above it in the article is. --[[User:Paddy3118|Paddy3118]] 18:55, 11 November 2008 (UTC)
:::That sentence does not say whether they want the longest gap in a line or over multiple lines. However, the code included in the original message
 
 for (i=1;i<=24;i++)
   if ($(i*2+1)>0){
     num_valid++
     sum+=$(i*2)
   } else {
     # find out what the max_gap for this row is
   }
 if (num_valid>13 && max_gap<=6){
   print date,sum/num_valid,1
 } else {
   # print something else
 }
 
:::indicates that the max gap for each row is required. In addition, if the gap is longer than 6, the average for the line would not be displayed. But if the data is from some measurement device such as a weather station, it makes sense to find out the longest overall gap period. Anyway, from the Rosetta Code point of view, this is insignificant. --[[User:PauliKL|PauliKL]] 13:06, 12 November 2008 (UTC)
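
To make the per-row reading of the quoted code concrete, here is a rough Python sketch of what its elided "max_gap for this row" branch might compute. The thresholds (more than 13 valid readings, gap of at most 6) are taken from the quoted snippet; the function names <code>row_stats</code> and <code>daily_average</code>, and the choice to return <code>None</code> where the original says "print something else", are assumptions made only for illustration.

```python
def row_stats(fields):
    """Return (date, num_valid, total, max_gap) for one Tab-split record.

    fields[0] is the date; the rest are value/flag pairs, and a reading
    is valid when its flag is > 0 (as in the quoted snippet).
    """
    date = fields[0]
    num_valid, total = 0, 0.0
    gap = max_gap = 0          # the gap is counted within this row only
    for i in range(1, len(fields) - 1, 2):
        if int(fields[i + 1]) > 0:
            num_valid += 1
            total += float(fields[i])
            gap = 0
        else:
            gap += 1
            max_gap = max(max_gap, gap)
    return date, num_valid, total, max_gap


def daily_average(fields):
    """Average for the row, or None when the row fails the quoted criteria."""
    _, num_valid, total, max_gap = row_stats(fields)
    if num_valid > 13 and max_gap <= 6:
        return total / num_valid
    return None                # placeholder for "print something else"
```

Because <code>gap</code> starts at zero for every row, a run of bad readings that continues past midnight is split in two here, which is the difference from the longest overall gap discussed above.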