Talk:Text processing/1: Difference between revisions

::: --[[User:PauliKL|PauliKL]] 13:54, 11 November 2008 (UTC)
::::PauliKL is right: this task seems to be more calculation than munging. Perhaps a new, similar, locally defined task could be made using the same data file (the date formatting is especially good for munging, and some languages use _ for negative numbers instead of -)? The gap calculation seems like too much, like it's getting too far away from munging. Maybe just count the number of bad values in each line and report that in the line stats? I wouldn't worry about errors that much either. If anything, I would treat an error as bad data (a bad line for errors in dates, a bad value for errors in values or flags) and just skip it.--[[User:Mwn3d|Mwn3d]] 14:25, 11 November 2008 (UTC)
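:::::For illustration only, the per-line count suggested above might look like the following Python sketch. It assumes (this is not stated in the thread) that each line is a date followed by tab-separated value/flag pairs and that a flag below 1 marks a bad value; the name <code>line_stats</code> is hypothetical.

```python
def line_stats(line):
    """Return (date, good_count, bad_count) for one tab-separated line.

    Assumed layout: date, then alternating value and flag fields.
    Assumed rule: a flag below 1 marks the preceding value as bad.
    """
    fields = line.rstrip("\n").split("\t")
    date, rest = fields[0], fields[1:]
    # rest alternates value, flag, value, flag, ... so flags sit at odd indices
    flags = [int(f) for f in rest[1::2]]
    bad = sum(1 for f in flags if f < 1)
    return date, len(flags) - bad, bad


# Made-up sample line, not taken from readings.txt
sample = "1991-03-30\t10.000\t1\t10.000\t-2\t10.000\t0\t10.000\t1"
print(line_stats(sample))
```

:::::A line whose flags fail to parse would raise <code>ValueError</code> here; a caller wanting the "just skip it" behaviour could catch that and move on to the next line.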
:::::The book "Data Munging with Perl" by David Cross describes the data munging process as:
<pre>More specifically, data munging consists of a number of processes that are applied to
an initial data set to convert it into a different, but related data set. These processes will
fall into a number of categories: recognition, parsing, filtering, and transformation.</pre>
:::::Data munging is a loose term and ''does'' apply to the task.
:::::There was a very real requirement to find the longest time that things were broken, and, compared to some requests, it is quite reasonable. The newsgroup reference I added was to show that this was a real-world problem. The data already in the article should be enough to complete the task. The task may be more challenging than some, but maybe trying to solve it will impart useful (and marketable) skills, as well as allowing solutions in different languages to be contrasted. Unfortunately there is no way to see the development process used in different programming languages, which might show whether such a task is easier done in a scripting language. Some of the questions asked above, for example, might not occur if the language chosen made it easy to just assume correctly formed data and quickly write a parser that could be quickly rewritten if your assumptions were wrong. This task is quite straightforward for a data munging task: the full file follows the syntax of the excerpt. There are no hand-editing errors, no funny escape characters, the needed results can be calculated from the data shown, ... --[[User:Paddy3118|Paddy3118]] 20:15, 11 November 2008 (UTC)
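::::::As a rough sketch of the "longest time that things were broken" calculation, the following Python tracks the longest run of consecutive bad readings across lines. It assumes (not confirmed in this thread) a date field followed by tab-separated value/flag pairs, with a flag below 1 meaning a bad reading; <code>longest_gap</code> is a hypothetical name, and the sample data is invented.

```python
def longest_gap(lines):
    """Return (run_length, date_of_last_bad_reading) for the longest run
    of consecutive bad readings, carrying runs across line boundaries.

    Assumed layout: date, then alternating value and flag fields.
    Assumed rule: a flag below 1 marks a bad reading.
    """
    best_len, best_end = 0, None
    run = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        date = fields[0]
        # fields[2], fields[4], ... are the flags
        for flag in fields[2::2]:
            if int(flag) < 1:
                run += 1
                if run > best_len:
                    best_len, best_end = run, date
            else:
                run = 0
    return best_len, best_end


# Invented sample data for illustration
sample_lines = [
    "2004-01-01\t1.0\t1\t2.0\t0",
    "2004-01-02\t0.0\t-1\t0.0\t0\t3.0\t1",
    "2004-01-03\t0.0\t0\t1.0\t1",
]
print(longest_gap(sample_lines))
```

::::::Because the run counter is not reset at the end of a line, a gap that spans midnight is counted as one gap, which is the point of the exercise.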
 
: I think we should assume that the format is as in the example input file readings.txt. That is:
:* Field separator = single Tab character