Talk:Convert CSV records to TSV
On the general matter of "robustness"
A "robust" system is more likely to arise from failing as loudly and as early as possible than from quietly hiding errors without even logging them [on-screen would be apt for Rosetta Code], or worse still presenting them as valid [omission would be better]. While I accept there may be some [remote] sources over which you have no control, and for which "the robustness principle of processing" might apply, generally "garbage in, garbage out" is beaten by "garbage in, nothing out", which is beaten by "errors/logs", which is beaten by "ask again". If the input contains unexpected whitespace, has missing closing quotes or missing fields (oh, yeah, by the way, every CSV line should have as many fields as the first header line), or has the number "1Q7A", it is a far safer bet to assume the whole thing is kaput: at least the whole line, and quite probably the whole file. Just my personal opinion. --Petelomax (talk) 02:09, 14 November 2022 (UTC)
- I added quotation marks ("Robustness"). Whether the "robustness principle" leads to a "robust system" seems a bit off-topic, no? See https://en.wikipedia.org/wiki/Robustness_principle --Peak (talk) 09:16, 14 November 2022 (UTC)
- That Wikipedia entry states "as long as the meaning is clear", so in delivering half a TSV from half a CSV it is clearly not following that principle anyway. I don't consider this to be off-topic at all, and I am quite concerned about the complete misinterpretation of the low-level advice "don't crash" (log/do nothing) as the higher-level "do the wrong thing" (and if unsure, just make something up). --Petelomax (talk) 20:20, 16 November 2022 (UTC)
- There are indeed different forms of robustness.
- The "fail loudly" mechanism works in the context where the user is sufficiently informed to address the complaint and has sufficient time available to make that a priority.
- The "discard trash" mechanism works in some other contexts.
- Note, also, that a "lint-like" approach might be used to detect and address issues which would be neglected by the "discard trash" mechanism. --Rdm (talk) 16:04, 14 November 2022 (UTC)
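The three stances in this thread ("fail loudly", "discard trash", and a "lint-like" pass) can be put side by side in a small Python sketch. The function name `read_rows` and its `mode` parameter are invented for illustration and are not part of the task; the only check made is the one mentioned above, that every record has as many fields as the header line.

```python
import csv
import io


def read_rows(text, mode="fail"):
    """Return (good_rows, complaints) for CSV text whose first record is a header.

    mode is a hypothetical switch:
      "fail"    - raise ValueError at the first record with the wrong field count
      "discard" - silently drop such records
      "lint"    - drop such records but collect a complaint for each one
    """
    if mode not in ("fail", "discard", "lint"):
        raise ValueError(f"unknown mode: {mode!r}")

    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # raises StopIteration on completely empty input
    good, complaints = [], []

    for row in reader:
        if len(row) == len(header):
            good.append(row)
            continue
        msg = (f"line {reader.line_num}: expected {len(header)} fields, "
               f"got {len(row)}")
        if mode == "fail":
            raise ValueError(msg)
        if mode == "lint":
            complaints.append(msg)
        # mode == "discard": nothing is recorded, the record just disappears

    return good, complaints
```

With the input `"x,y\n1,2\n3\n4,5\n"`, the "fail" mode stops at the third line, whereas "discard" and "lint" both return the two well-formed records; only "lint" also reports what was dropped.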
The requirement regarding the handling of "nonsense" appears under the heading '"Robustness"' and is supplemented by an example:
'a,"' => 'a<tab>' # the trailing nonsense is ignored, but the comma is not
but perhaps more needs to be said, so let me add that the PEG grammar should be consulted to determine where parsing of a line can no longer continue.
For example, consider the following:
a,  " this line has two valid fields, the first of length 1 ("a") and the second of length 2 ("  ")
In the above line, the "nonsense" can be seen to begin with the double-quote, so in accordance with the "carry on" principle, it and the rest of the line are to be discarded.
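To make the "carry on" reading concrete, here is a minimal Python sketch. It is one reading of the two example lines above, not a transcription of the task's PEG grammar: it assumes an unquoted field is a run of characters other than comma and double-quote, a quoted field starts with a double-quote at the very beginning of a field and uses `""` for an embedded quote, and a field must be followed by a comma or the end of the line. The names `parse_fields` and `to_tsv` are hypothetical, and any escaping of tabs or backslashes in the output is left out.

```python
def parse_fields(line):
    """Return (fields, discarded): the fields parsed before parsing could no
    longer continue, and the trailing "nonsense" that was dropped."""
    fields = []
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == '"':
            # Quoted field: scan to the closing quote; "" stands for one quote.
            j, buf, closed = i + 1, [], False
            while j < n:
                if line[j] == '"':
                    if j + 1 < n and line[j + 1] == '"':
                        buf.append('"')
                        j += 2
                    else:
                        closed = True
                        j += 1
                        break
                else:
                    buf.append(line[j])
                    j += 1
            if not closed:
                # Nonsense starts at the opening quote. The preceding comma
                # (or line start) still counts, so an empty field is kept.
                fields.append("")
                return fields, line[i:]
            fields.append("".join(buf))
            i = j
        else:
            # Unquoted field: everything up to the next comma or double-quote.
            j = i
            while j < n and line[j] not in ',"':
                j += 1
            fields.append(line[i:j])
            i = j
        if i == n:
            return fields, ""
        if line[i] == ",":
            i += 1
            continue
        # Anything other than a comma after a field is where nonsense begins.
        return fields, line[i:]


def to_tsv(line):
    """Join the surviving fields with tabs (no escaping is attempted here)."""
    fields, _discarded = parse_fields(line)
    return "\t".join(fields)
```

Under these assumptions `to_tsv('a,"')` gives `'a\t'`, matching the example in the task, and a line beginning `a,` followed by two spaces and a double-quote yields the fields `a` and two spaces, with everything from the double-quote onward discarded.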
Can "nonsense" (in the present sense of parsing failure) begin with a character or codepoint other than a double-quote? If the grammar were to disallow some control characters (e.g. RETURN) in unquoted fields, then the possibilities for "nonsense" clearly increase. Perhaps certain wrongly-placed Unicode characters might cause problems as well. --Peak (talk) 10:11, 14 November 2022 (UTC)
- I imagine that "illegal characters" or "illegal code points" could be a part of a valid implementation.
- However, the task itself specifies:
- "Our starting point will be a character set that includes ASCII", and
- "note any discrepancies with the requirements"
- (For example, the various Unicode encodings include ASCII, but other encodings such as Shift JIS or Latin-1 also include ASCII.)
- So, I imagine that this sort of thing would (a) need to be implementation specific and (b) documented in the implementation notes. --Rdm (talk) 16:11, 14 November 2022 (UTC)
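As an illustration of what such an implementation-specific rule might look like, here is a small Python sketch. The policy is purely hypothetical and would need to be stated in the implementation notes: it rejects ASCII control characters other than TAB (so RETURN is rejected) and leaves every other code point alone. The names `DISALLOWED` and `first_disallowed` are invented for this example.

```python
# Hypothetical policy: C0 control characters except TAB, plus DEL, are "illegal".
DISALLOWED = ({chr(c) for c in range(0x20)} - {"\t"}) | {"\x7f"}


def first_disallowed(field):
    """Return the index of the first disallowed character in field, or None.

    A parser following the "carry on" principle could treat that index as the
    point where the "nonsense" begins; a "fail loudly" parser could raise there.
    """
    for i, ch in enumerate(field):
        if ch in DISALLOWED:
            return i
    return None
```

Whether the rest of the line is then discarded or an error is reported is exactly the design choice debated in the section above.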