Talk:Text processing/1: Difference between revisions


Revision as of 08:07, 9 November 2008

Why?

I was reading through old blog entries and thought it would be appropriate (minus the focus on speed).

Please clarify the task

Are syntax errors in the file to be detected?
What is the field separator: one space, any non-empty run of spaces, any non-empty run of spaces or tabs, or something else?
Is the average to be evaluated over all fields, or over each field separately?
When some fields on a line are flagged invalid but others are not, is it a gap? Or is it a gap only when all fields are invalid?
Further, do valid fields participate in averaging when other fields on the same line are invalid?
When a field is not present, is it a syntax error or a gap?
What should be done when syntactically wrong fields appear (not a number, a number that is too large, etc.)?

--Dmitry-kazakov 12:13, 8 November 2008 (UTC)

Hi Dmitry, the comp.lang.awk newsgroup thread contains all the information necessary for the original poster to get his job done. The example records are probably typical, but you need to try something out and make your own decisions on the format and error handling. The original newsgroup thread actually has more information than you get on some data munging problems, as in many cases someone just says "wouldn't it be good if this talked to this" or "when wasn't this working".

Data format information might be here. (Sorry if I seem patronising; it was not meant.) --Paddy3118 17:34, 8 November 2008 (UTC)

I suppose that any task should be defined in the article. The code presented for this task looks like a translation from one language into another, rather than independent implementations. Actually there is no way to verify whether they do the job or not. What would be the right output if the input file were:

2008/Mar/21    -1E-2 1

On second thought, I would suggest replacing it with a more general and better-defined text processing task, like parsing a CSV file, for example. --Dmitry-kazakov 18:37, 8 November 2008 (UTC)
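To make the question about that sample line concrete, here is a minimal sketch of one possible reading of the format. It is not taken from any of the posted solutions. It assumes that fields are separated by any run of spaces or tabs, that the first field is a date kept as an opaque string, and that the remaining fields come in value/flag pairs; `parse_line` is a hypothetical helper name.

```python
def parse_line(line):
    """Split one record into (date, [(value, flag), ...]).

    Assumptions, none of them settled in this discussion:
    - fields are separated by any run of spaces or tabs;
    - the first field is a date, kept as an uninterpreted string;
    - the remaining fields come in value/flag pairs;
    - a value is whatever float() accepts, a flag whatever int() accepts.
    """
    fields = line.split()  # no argument: split on any run of whitespace
    if not fields:
        raise ValueError("empty line")
    date, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("value/flag fields are not paired")
    pairs = []
    for i in range(0, len(rest), 2):
        # float() and int() raise ValueError on malformed text
        pairs.append((float(rest[i]), int(rest[i + 1])))
    return date, pairs


date, pairs = parse_line("2008/Mar/21    -1E-2 1")
print(date, pairs)  # 2008/Mar/21 [(-0.01, 1)]
```

Under these assumptions the line is accepted as a single reading of -0.01 with flag 1. A stricter reading could just as well reject the date format or the exponent notation, which is the ambiguity being raised.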

How about we wait a week and see what others think? If no one else can figure it out then I will add further explanations.

If you can find such malformed data in the readings.txt file, then it becomes an issue. Asking such what-if questions seems to be finding ways to fail.

On the Python, AWK, and Perl examples being similar: people are able to submit their own solutions in those languages if they think their solutions better fit the usual style of writing in Python/Perl/AWK. Indeed, someone suggests changes to the AWK solution in the newsgroup, although they weren't obvious to me. --Paddy3118 08:07, 9 November 2008 (UTC)
