Convert CSV records to TSV: Difference between revisions
m
→{{header|Wren}}: Changed to Wren S/H
m (Added {{draft task}} template - should have noticed this was missing earlier.) |
m (→{{header|Wren}}: Changed to Wren S/H) |
||
(20 intermediate revisions by 5 users not shown) | |||
Line 8:
the operating system.
In accordance with the so-called robustness principle of processing, if a string or
input line begins with a valid (possibly empty) sequence of comma-separated
values but is followed by nonsense, then processing should continue based on
Line 14:
; A CSV record
[[Category:PEG]]
Our starting point will be a character set that includes ASCII; the language
of regular expressions (which will be denoted by strings of the form
r'REGEX'); and the following [[:Category:PEG|PEG]] (parsing expression grammar) grammar for a
single CSV record:
<pre>
Line 33 ⟶ 34:
so there is no need to express that as a restriction in the PEG.
; "Robustness"
As mentioned above, if an input line or string begins with a valid sequence of
comma-separated values but is followed by nonsense, then processing should
Line 91 ⟶ 92:
the platform.
Use the following as a test file
requirements
(besides those preceding r'[nrt]) are transformed.
; TEST FILE
The test file should include a line with a RETURN ('\r') character.
One way to change the first RETURN in the following to the '\r' control character would be
to use `sed`, perhaps along the lines of: sed -i.bak $'s/RETURN/\r/' TEST.csv
<pre>
a,"b"
Line 108 ⟶ 112:
a\n\rb
a�b, That is a NUL character
aRETURNb, Be sure to change RETURN to the '\r' control character (#xd)
a\b
</pre>
[[category:CSV]]
[[category:TSV]]
=={{header|ALGOL 68}}==
All input \ characters are doubled in the output.<br>
As with some of the other samples, the input data is stored in an array, not read from a file.
<syntaxhighlight lang="algol68">
BEGIN # convert lines of CSV to TSV #
CHAR nul = REPR 0;
CHAR tab = REPR 9;
CHAR lf = REPR 10;
CHAR cr = REPR 13;
# returns s with some control characters converted to <name> #
PROC show ctrl = ( STRING s )STRING:
BEGIN
STRING result := "";
FOR i FROM LWB s TO UPB s DO
result +:= IF s[ i ] = nul THEN "<nul>"
ELIF s[ i ] = tab THEN "<tab>"
ELIF s[ i ] = cr THEN "<cr>"
ELIF s[ i ] = lf THEN "<lf>"
ELSE s[ i ]
FI
OD;
result
END # show ctrl # ;
# returns csv converted to TSV #
PROC csv2tsv = ( STRING csv )STRING:
BEGIN
BOOL at end := FALSE;
CHAR ch := nul;
# sets ch to the next character in csv, if there is one #
PROC next = VOID: ch := IF c pos < c max
THEN csv[ c pos +:= 1 ]
ELSE at end := TRUE
; nul
FI;
# skips over spaces and returns the count of skipped spaces #
PROC spaces = INT:
BEGIN
INT s count := 0;
WHILE NOT at end AND ch = " " DO s count +:= 1; next OD;
s count
END # spaces # ;
# adds ch to the TSV - converting some characters to escaped form #
PROC add = VOID: tsv +:= IF ch = "\" THEN "\\"
ELIF ch = nul THEN "\0"
ELIF ch = cr THEN "\r"
ELIF ch = lf THEN "\n"
ELIF ch = tab THEN "\t"
ELSE ch
FI;
# parse the csv and generate the tsv #
STRING tsv := "";
INT c pos := LWB csv - 1;
INT c max = UPB csv;
WHILE NOT at end DO
# spaces are not significant around quoted fields but are part of unquoted fields #
next;
INT space count := spaces;
IF ch = """" THEN
# quoted field part #
WHILE next;
WHILE NOT at end AND ch /= """" DO add; next OD;
IF NOT at end THEN
next;
IF ch = """" THEN
# embedded quote #
add
FI
FI;
NOT at end AND ch = """"
DO SKIP OD;
space count := spaces;
IF at end OR ch = "," THEN
# nothing significant after the quoted field part #
space count := 0
FI
FI;
# unquoted field part #
tsv +:= space count * " ";
WHILE NOT at end AND ch /= "," DO add; next OD;
IF ch = "," THEN
# have another field following this one #
tsv +:= tab
FI
OD;
tsv
END # csv2tsv # ;
# task test cases #
[]STRING tests =
( "a,""b"""
, """a"",""b""""c"""
, ""
, ",a"
, "a,"""
, " a , ""b"""
, """12"",34"
, "a" + tab + "b, That is a TAB character"
, "a\tb"
, "a\n\rb"
, "a" + nul + "b, That is a NUL character"
, "a" + cr + "b, Be sure to change RETURN to the '\r' control character (#xd)"
, "a\b"
);
FOR i FROM LWB tests TO UPB tests DO
print( ( " {{", show ctrl( tests[ i ] ), "}}", newline
, " -> {{", show ctrl( csv2tsv( tests[ i ] ) ), "}}", newline
)
)
OD
END
</syntaxhighlight>
{{out}}
<pre>
{{a,"b"}}
-> {{a<tab>b}}
{{"a","b""c"}}
-> {{a<tab>b"c}}
{{}}
-> {{}}
{{,a}}
-> {{<tab>a}}
{{a,"}}
-> {{a<tab>}}
{{ a , "b"}}
-> {{ a <tab>b}}
{{"12",34}}
-> {{12<tab>34}}
{{a<tab>b, That is a TAB character}}
-> {{a\tb<tab> That is a TAB character}}
{{a\tb}}
-> {{a\\tb}}
{{a\n\rb}}
-> {{a\\n\\rb}}
{{a<nul>b, That is a NUL character}}
-> {{a\0b<tab> That is a NUL character}}
{{a<cr>b, Be sure to change RETURN to the '\r' control character (#xd)}}
-> {{a\rb<tab> Be sure to change RETURN to the '\\r' control character (#xd)}}
{{a\b}}
-> {{a\\b}}
</pre>
Line 153 ⟶ 299:
Also, we escape all backslashes, regardless of what character follows, during tsv encoding.
Here, also, we interpret "nonsense" as starting
For csv parsing we first break out fields using <b><tt>[[j:Vocabulary/semico#dyadic|;:]]</tt></b>. Here, each field is preceded by a delimiter. (We discard an optional trailing newline from the csv text and prepend a newline at the beginning so that every field has a preceding delimiter. Also, of course, if we were given a file reference, we work with the text of the file rather than its name.)
Then, these fields are formed into rows (fields which
To translate to tsv form, we would first escape special characters in each field, then insert delimiters between each field and terminate each record with a newline. Thus, mat2tsv takes a two-dimensional matrix and its result is a tsv string. (For utility, mat2tsv also supports numeric matrices, since that was trivial.)
Task example:
Line 181 ⟶ 327:
'''The following program can also be used with gojq, the Go implementation of jq,
but until recently NUL (#x0)
In this entry, the PEG grammar for "record" as defined in the task
Line 187 ⟶ 333:
closely to jq operators, notably PEG's '/' to jq's '//'.
In translating the PEG grammar to a jq program, the main idea is to define a
pipeline for each grammar rule. A JSON object with keys "remainder" and
"result" is passed through this pipeline, consuming the text in .remainder and
Line 225 ⟶ 371:
def record: field | star(consume(",") | field);
def csv2tsv:
{remainder: .} | record | .result | @tsv ;
# Transform an entire file
inputs | csv2tsv
</syntaxhighlight>
{{output}}
As required:
* Backslashes are uniformly duplicated.
* Until recently gojq did not handle NUL (#x0) properly.
=={{header|Julia}}==
{{trans|Phix}}
<syntaxhighlight lang="julia">function csv_tsv(str)
p = split(str, ",")
for (i, f) in enumerate(p)
if count(==('"'), f) > 1
p[i] = replace(strip(f, [' ', '"']), "\"\"" => "\"")
elseif f == "\""
p[i] = ""
end
end
t = join(p,"<TAB>")
s = replace(str, "\\" => "\\\\", "\t" => "\\t", "\0" => "\\0", "\n" => "\\n", "\r" => "\\r")
t = replace(t, "\\" => "\\\\", "\t" => "\\t", "\0" => "\\0", "\n" => "\\n", "\r" => "\\r")
return s, t
end
const testfile = "test.tmp"
fh = open(testfile, "w")
write(fh, """
a,"b"
"a","b""c"
,a
a,"
a , "b"
"12",34
a\tb, TAB
a\\tb
a\\n\\rb
a\0b, NUL
a\rb, RETURN
a\\b""")
close(fh)
for test_string in split(read(testfile, String), "\n")
csv, tsv = csv_tsv(test_string)
println(lpad(csv, 12), " => ", tsv)
end
</syntaxhighlight>{{out}}
<pre>
a,"b" => a<TAB>b
"a","b""c" => a<TAB>b"c
=>
,a => <TAB>a
a," => a<TAB>
a , "b" => a <TAB>b
"12",34 => 12<TAB>34
a\tb, TAB => a\tb<TAB> TAB
a\\tb => a\\tb
a\\n\\rb => a\\n\\rb
a\0b, NUL => a\0b<TAB> NUL
a\rb, RETURN => a\rb<TAB> RETURN
a\\b => a\\b
</pre>
=={{header|Phix}}==
Note that backtick strings are not escaped, whereas double-quoted strings are.
<!--<syntaxhighlight lang="phix">(phixonline)-->
<span style="color: #008080;">with</span> <span style="color: #008080;">javascript_semantics</span>
<span style="color: #008080;">procedure</span> <span style="color: #000000;">csv2tsv</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">s</span><span style="color: #0000FF;">)</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">p</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">split</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s</span><span style="color: #0000FF;">,</span><span style="color: #008000;">','</span><span style="color: #0000FF;">,</span><span style="color: #004600;">false</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">,</span><span style="color: #000000;">f</span> <span style="color: #008080;">in</span> <span style="color: #000000;">p</span> <span style="color: #008080;">do</span>
<span style="color: #008080;">if</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">find_all</span><span style="color: #0000FF;">(</span><span style="color: #008000;">'"'</span><span style="color: #0000FF;">,</span><span style="color: #000000;">f</span><span style="color: #0000FF;">))></span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">p</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">substitute</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">trim</span><span style="color: #0000FF;">(</span><span style="color: #000000;">f</span><span style="color: #0000FF;">,</span><span style="color: #008000;">` "`</span><span style="color: #0000FF;">),</span><span style="color: #008000;">`""`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`"`</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">elsif</span> <span style="color: #000000;">f</span><span style="color: #0000FF;">=</span><span style="color: #008000;">`"`</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">p</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">""</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">t</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">join</span><span style="color: #0000FF;">(</span><span style="color: #000000;">p</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"<tab>"</span><span style="color: #0000FF;">)</span>
<span style="color: #000000;">s</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">substitute_all</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s</span><span style="color: #0000FF;">,{</span><span style="color: #008000;">`\`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\t"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">'\0'</span><span style="color: #0000FF;">,</span><span style="color: #008000;">'\n'</span><span style="color: #0000FF;">},{</span><span style="color: #008000;">`\\`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\\t"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\\0"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\\n"</span><span style="color: #0000FF;">})</span>
<span style="color: #000000;">t</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">substitute_all</span><span style="color: #0000FF;">(</span><span style="color: #000000;">t</span><span style="color: #0000FF;">,{</span><span style="color: #008000;">`\`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\t"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">'\0'</span><span style="color: #0000FF;">,</span><span style="color: #008000;">'\n'</span><span style="color: #0000FF;">},{</span><span style="color: #008000;">`\\`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\\t"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\\0"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\\n"</span><span style="color: #0000FF;">})</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"%12s => %s\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">s</span><span style="color: #0000FF;">,</span><span style="color: #000000;">t</span><span style="color: #0000FF;">})</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span>
<span style="color: #008080;">constant</span> <span style="color: #000000;">test_data</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{</span><span style="color: #008000;">`a,"b"`</span><span style="color: #0000FF;">,</span>
<span style="color: #008000;">`"a","b""c"`</span><span style="color: #0000FF;">,</span>
<span style="color: #008000;">``</span><span style="color: #0000FF;">,</span>
<span style="color: #008000;">`,a`</span><span style="color: #0000FF;">,</span>
<span style="color: #008000;">`a,"`</span><span style="color: #0000FF;">,</span>
<span style="color: #008000;">` a , "b"`</span><span style="color: #0000FF;">,</span>
<span style="color: #008000;">`"12",34`</span><span style="color: #0000FF;">,</span>
<span style="color: #008000;">"a\tb,"</span><span style="color: #0000FF;">,</span> <span style="color: #000080;font-style:italic;">-- That is a TAB character</span>
<span style="color: #008000;">`a\tb`</span><span style="color: #0000FF;">,</span> <span style="color: #000080;font-style:italic;">-- That is not</span>
<span style="color: #008000;">`a\n\rb`</span><span style="color: #0000FF;">,</span>
<span style="color: #008000;">"a\0b"</span><span style="color: #0000FF;">,</span> <span style="color: #000080;font-style:italic;">-- That is a NUL character</span>
<span style="color: #008000;">"a\nb"</span><span style="color: #0000FF;">,</span> <span style="color: #000080;font-style:italic;">-- That is a LF (linefeed) character</span>
<span style="color: #008000;">`a\b`</span><span style="color: #0000FF;">}</span>
<span style="color: #7060A8;">papply</span><span style="color: #0000FF;">(</span><span style="color: #000000;">test_data</span><span style="color: #0000FF;">,</span><span style="color: #000000;">csv2tsv</span><span style="color: #0000FF;">)</span>
<!--</syntaxhighlight>-->
{{out}}
<pre>
a,"b" => a<tab>b
"a","b""c" => a<tab>b"c
=>
,a => <tab>a
a," => a<tab>
a , "b" => a <tab>b
"12",34 => 12<tab>34
a\tb, => a\tb<tab>
a\\tb => a\\tb
a\\n\\rb => a\\n\\rb
a\0b => a\0b
a\nb => a\nb
a\\b => a\\b
</pre>
=={{header|Wren}}==
Line 243 ⟶ 496:
{{libheader|Wren-str}}
Backslashes are only duplicated for escaped \t, \n and \r.
<syntaxhighlight lang="
import "./str" for Str
Line 308 ⟶ 561:
'a\\n\\rb'
'a\0b<tab> That is a NUL character'
'a\rb<tab> That is a
'a\b'
</pre>
|