Talk:Convert CSV records to TSV

On the general matter of "robustness"

A "robust" system is more likely to arise from failing as loudly and as soon as possible than from quietly hiding errors without even logging them [on-screen would be apt for RC], or worse still presenting them as valid [omission would be better]. While I accept there may be some [remote] sources over which you have no control and for which "the robustness principle of processing" might apply, generally "garbage in, garbage out" is beaten by "garbage in, nothing out", which is beaten by "errors/logs", which is beaten by "ask again". If the input contains unexpected whitespace, has missing closing quotes or missing fields (oh, yeah, btw, all CSV lines should have as many fields as the first header line), or has the number "1Q7A", it is a far safer bet to assume the whole thing is kaput, at least the whole line and quite probably the whole file. Just my personal opinion. --Petelomax (talk) 02:09, 14 November 2022 (UTC)
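
To make the "fail loudly" position concrete, here is a minimal sketch in Python of a reader that rejects the whole input as soon as a record's field count differs from the header's. The function name read_strict and the error wording are hypothetical, chosen only for illustration; the sketch relies on the standard csv module, whose strict=True option asks it to raise csv.Error on bad CSV input.

```python
import csv
import io


def read_strict(text):
    """Return all rows of `text`, or raise at the first inconsistent record.

    Hypothetical helper: the first row is treated as the header, and every
    later row (including a blank line, which csv yields as []) must have
    exactly as many fields as the header.
    """
    reader = csv.reader(io.StringIO(text), strict=True)
    rows = []
    header = None
    for row in reader:
        if header is None:
            header = row
        elif len(row) != len(header):
            # Fail loudly: name the line and stop, rather than carry on.
            raise ValueError(
                f"line {reader.line_num}: expected {len(header)} fields, "
                f"got {len(row)}"
            )
        rows.append(row)
    return rows
```

Under this policy a single short record invalidates the whole file, which is the "garbage in, nothing out" end of the scale described above.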

I added quotation marks ("Robustness"). Whether the "robustness principle" leads to a "robust system" seems a bit off-topic, no? See https://en.wikipedia.org/wiki/Robustness_principle --Peak (talk) 09:16, 14 November 2022 (UTC)

There are indeed different forms of robustness.

The "fail loudly" mechanism works in the context where the user is sufficiently informed to address the complaint and has sufficient time available to make that a priority.

The "discard trash" mechanism works in some other contexts.

Note, also, that a "lint-like" approach might be used to detect and address issues which would be neglected by the "discard trash" mechanism. --Rdm (talk) 16:04, 14 November 2022 (UTC)
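
As an illustration of the "lint-like" approach, the following Python sketch reports suspicious lines without stopping and without altering anything. The function name lint_csv_lines and the two checks are assumptions made for this example, not part of the task; the odd-quote check in particular is only a heuristic, since a legitimately quoted field could span several lines.

```python
def lint_csv_lines(lines):
    """Return a list of (line_number, message) pairs; never raises.

    Hypothetical checks, for illustration only:
      * an odd number of double-quotes (possible missing closing quote)
      * a control character other than TAB (e.g. a stray RETURN)
    """
    issues = []
    for number, line in enumerate(lines, start=1):
        if line.count('"') % 2 == 1:
            issues.append((number, "odd number of double-quotes"))
        if any(ord(ch) < 32 and ch != "\t" for ch in line):
            issues.append((number, "control character in line"))
    return issues
```

Such a report could be run alongside a "discard trash" converter, so that the conversion carries on while the discarded material is still brought to someone's attention.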

Handling "nonsense"

The requirement regarding the handling of "nonsense" appears under the heading '"Robustness"' and is supplemented by an example:

'a,"'        => 'a<tab>'    # the trailing nonsense is ignored, but the comma is not

but perhaps more needs to be said, so let me add that the PEG grammar should be consulted to determine where parsing of a line can no longer continue.

For example, consider the following:

a,  "  this line has two valid fields, the first of length 1 ("a") and the second of length 2 ("  ")

In the above line, the "nonsense" can be seen to begin with the double-quote, so in accordance with the "carry on" principle, it and the rest of the line are to be discarded.
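
The "carry on" behaviour for the two examples above can be sketched in Python as follows. This is not the task's PEG grammar: it assumes a simplified grammar in which a field is either a quoted string (with "" standing for one double-quote, and the opening quote directly at the start of the field) or a possibly empty run of characters other than comma and double-quote. The names parse_record and to_tsv are hypothetical, and the escaping of tabs and other special characters within fields is left out.

```python
def _parse_quoted(line, pos):
    """Match a quoted field at pos; return (text, new_pos), or None on failure."""
    if pos >= len(line) or line[pos] != '"':
        return None
    out = []
    i = pos + 1
    while i < len(line):
        if line[i] == '"':
            if i + 1 < len(line) and line[i + 1] == '"':
                out.append('"')          # "" inside quotes is a literal quote
                i += 2
            else:
                return "".join(out), i + 1   # closing quote found
        else:
            out.append(line[i])
            i += 1
    return None                          # no closing quote: not a quoted field


def _parse_field(line, pos):
    """Match a quoted field if possible, else a (possibly empty) unquoted one."""
    quoted = _parse_quoted(line, pos)
    if quoted is not None:
        return quoted
    i = pos
    while i < len(line) and line[i] not in ',"':
        i += 1
    return line[pos:i], i


def parse_record(line):
    """Return the fields recognised before parsing can no longer continue."""
    fields = []
    pos = 0
    while True:
        field, pos = _parse_field(line, pos)
        fields.append(field)
        if pos < len(line) and line[pos] == ",":
            pos += 1                     # a comma means another field follows
        else:
            break                        # end of line, or "nonsense": discard the rest
    return fields


def to_tsv(line):
    """Join the recognised fields with tabs (no escaping in this sketch)."""
    return "\t".join(parse_record(line))
```

With these assumptions, to_tsv('a,"') gives 'a' followed by a tab, and the longer example line yields the two fields "a" and "  ", with everything from the double-quote onwards discarded.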

Can "nonsense" (in the present sense of parsing failure) begin with a character or codepoint other than a double-quote? If the grammar were to disallow some control characters (e.g. RETURN) in unquoted fields, then the possibilities for "nonsense" clearly increase. Perhaps certain wrongly-placed Unicode characters might cause problems as well. --Peak (talk) 10:11, 14 November 2022 (UTC)