Convert CSV records to TSV

Convert CSV records to TSV is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

The present task is focused on the conversion of strings of comma-separated values (CSV) to strings of tab-separated values (TSV) subject to the specific set of rules given below. These generally correspond to conventions regarding CSV and TSV files, but this task is primarily concerned with records considered separately. When considering a text file, a "record" should be seen as a line in the file as determined by the NEWLINE conventions of the operating system.

In accordance with the so-called robustness principle, if a string or input line begins with a valid (possibly empty) sequence of comma-separated values but is followed by nonsense, then processing should continue based on the identified values, with the remainder of the line or string being ignored.

A CSV record

Our starting point will be a character set that includes ASCII; the language of regular expressions (which will be denoted by strings of the form r'REGEX'); and the following PEG (parsing expression grammar) grammar for a single CSV record:

  record = field (',' field)*
  field  = (ws* quoted_field ws*) / unquoted_field
  quoted_field = '"' quoted_field_content '"'
  ws     = ' '

  unquoted_field := r'[^,"]*'

  quoted_field_content := r'(("")|([^"]))*'

Since it has already been stipulated that the partitioning of a file into lines is determined by the NEWLINE convention of the computing platform, a NEWLINE cannot appear in either `quoted_field` or `quoted_field_content`, so there is no need to express that restriction in the PEG.

"Robustness"

As mentioned above, if an input line or string begins with a valid sequence of comma-separated values but is followed by nonsense, then processing should continue using the valid values.
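
To make the grammar and the robustness rule concrete, here is a minimal Python sketch (illustrative only; the name parse_record and the use of Python's re module are choices made here, not requirements of the task):

import re

# Sketch: parse one CSV record according to the PEG above, returning the
# list of field values, with "" collapsed to " inside quoted fields.
# Any trailing nonsense is ignored, per the robustness rule.
def parse_record(line):
    fields = []
    i = 0
    while True:
        # field = (ws* quoted_field ws*) / unquoted_field
        m = re.match(r' *"((?:""|[^"])*)" *', line[i:])
        if m:
            fields.append(m.group(1).replace('""', '"'))
        else:
            m = re.match(r'[^,"]*', line[i:])
            fields.append(m.group(0))
        i += m.end()
        if i < len(line) and line[i] == ',':
            i += 1                  # another field follows
        else:
            return fields           # end of record (any nonsense is ignored)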

Files, NEWLINE, and end-of-file

When converting a file of text on a particular platform, the converter should adhere to the conventions regarding lines, NEWLINES and end-of-file for that platform as much as possible, but if the last line in the file is not followed by a NEWLINE, then that line should be processed as though it were followed by a new line.
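
As an illustration only (the helper name read_records and the file handling are assumptions, not part of the task), the following Python sketch splits a file into records on the platform NEWLINE while tolerating a missing final NEWLINE:

import os

# Sketch: split a file into records on the platform NEWLINE (os.linesep).
# A missing final NEWLINE is tolerated: the trailing text becomes one more
# record, as the task requires.
def read_records(path):
    with open(path, newline="") as f:       # keep line endings untranslated
        text = f.read()
    if text == "":
        return []                           # an empty file has no records
    if text.endswith(os.linesep):
        text = text[:-len(os.linesep)]      # drop the final NEWLINE, if any
    return text.split(os.linesep)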

csv2tsv

When converting a CSV record to a TSV record, only the sequence of `unquoted_field` and `quoted_field_content` values is relevant. Before emitting these values as tab-separated values, the following transformations must be performed:

  1. each literal tab (TAB) must be replaced by the two-character string "\t";
  2. each embedded LF (LINEFEED) must be replaced by the two-character string "\n";
  3. each embedded CR (CARRIAGERETURN) must be replaced by the two-character string "\r";
  4. each backslash that occurs in a two-character substring of the form r'\\[nrt]' must be duplicated;
  5. each literal NUL character (#x0) should be replaced by the two-character string "\0".

Duplicating backslashes other than those covered by rule 4 is optional (the sketch below takes that option).
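
For illustration only, here is a Python sketch of these transformations applied to a single field value; it takes the optional route of duplicating every backslash, which makes rule 4 a special case of the general replacement:

# Sketch: escape one field value per rules 1-5 above.  All backslashes are
# duplicated (the optional choice), so rule 4 is covered automatically; the
# backslash replacement must come first so that the backslashes introduced
# by the later replacements are not doubled again.
def escape_field(s):
    return (s.replace('\\', '\\\\')     # optional: duplicate every backslash
             .replace('\t', '\\t')      # rule 1: literal TAB -> "\t"
             .replace('\n', '\\n')      # rule 2: literal LF  -> "\n"
             .replace('\r', '\\r')      # rule 3: literal CR  -> "\r"
             .replace('\0', '\\0'))     # rule 5: literal NUL -> "\0"

# A record is then converted by escaping each field value produced by the
# parse_record sketch given earlier and joining the results with tabs.
def csv2tsv(line):
    return '\t'.join(escape_field(f) for f in parse_record(line))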

EXAMPLES

In these examples, a line of the form:

'input' => 'tsv' # comment

signifies that the input as given on the left should be transformed into the 'tsv' string given on the right, in both cases minus the surrounding quotes. On either side, <tab> signifies the tab character; similarly, <RETURN> and <NUL> signify the CR and NUL control characters.

'a,"b"'      => 'a<tab>b'   # <tab> is the tab character
'"a","b""c"' => 'a<tab>b"c' # double-quotes are normalized
''           => ''          # an empty line is translated to an empty line
',a'         => '<tab>a'    # an empty field is translated to an empty field
'a,"'        => 'a<tab>'    # the trailing nonsense is ignored, but the comma is not
' a , "b" '  => ' a <tab>b' # whitespace around quoted fields is ignored
'"12",34'    => '12<tab>34' # numeric strings are not treated specially
'a<tab>b'    => 'a\tb'      # literal tabs in the CSV must be converted to "\t"
'a\tb'       => 'a\\tb'     # ... and so an escaped t must be distinguished
'a\n\rb'     => 'a\\n\\rb'  # literal backslashes are also duplicated for n, r
'a<NUL>b'    => 'a\0b'      # NUL (#x0) is translated to the two-character string "\0"
'a<RETURN>b' => 'a\rb'      # <RETURN> is the CR control character (#xd)

TASKS

Display a program, procedure or function, preferably named csv2tsv, which, when given a text file of lines or strings beginning with a CSV record as defined above, produces a corresponding stream of TSV records in accordance with all the above requirements.

When run on different platforms, the program should transform a given sequence of records, when presented in accordance with the NEWLINE conventions of each computing platform, in the same way, adjusting for the NEWLINE conventions of the platform.

Use the following as a test file (after changing the RETURN string to the RETURN control character) and note any discrepancies with the requirements; in addition, indicate how strings containing backslashes (besides those preceding r'[nrt]') are transformed.

TEST FILE

The test file should include a line with a RETURN ('\r') character. One way to change the first RETURN in the following to the '\r' control character would be to use `sed`, perhaps along the lines of: sed -i.bak $'s/RETURN/\r/' TEST.csv

a,"b"
"a","b""c"

,a
a,"
 a , "b"
"12",34
a	b, That is a TAB character
a\tb
a\n\rb
a�b, That is a NUL character
aRETURNb, Be sure to change RETURN to the '\r' control character (#xd) 
a\b

ALGOL 68

All input \ characters are doubled in the output.
As with some of the other samples, the input data is stored in an array, not read from a file.

BEGIN # convert lines of CSV to TSV #
    CHAR nul = REPR  0;
    CHAR tab = REPR  9;
    CHAR lf  = REPR 10;
    CHAR cr  = REPR 13;
    # returns s with some control characters converted to <name> #
    PROC show ctrl = ( STRING s )STRING:
         BEGIN
            STRING result := "";
            FOR i FROM LWB s TO UPB s DO
                result +:= IF   s[ i ] = nul THEN "<nul>"
                           ELIF s[ i ] = tab THEN "<tab>"
                           ELIF s[ i ] = cr  THEN "<cr>"
                           ELIF s[ i ] = lf  THEN "<lf>"
                           ELSE s[ i ]
                           FI
            OD;
            result
         END # show ctrl # ;
    # returns csv converted to TSV #
    PROC csv2tsv = ( STRING csv )STRING:
         BEGIN
             BOOL at end := FALSE;
             CHAR ch     := nul;
             # sets ch to the next character in csv, if there is one #
             PROC next = VOID: ch := IF   c pos < c max
                                     THEN csv[ c pos +:= 1 ]
                                     ELSE at end := TRUE
                                        ; nul
                                     FI;
             # skips over spaces and returns the count of skipped spaces #
             PROC spaces = INT:
                  BEGIN
                     INT s count := 0;
                     WHILE NOT at end AND ch = " " DO s count +:= 1; next OD;
                     s count
                  END # spaces # ;
             # adds ch to the TSV - converting some characters to escaped form #
             PROC add = VOID: tsv +:= IF   ch = "\" THEN "\\"
                                      ELIF ch = nul THEN "\0"
                                      ELIF ch = cr  THEN "\r"
                                      ELIF ch = lf  THEN "\n"
                                      ELIF ch = tab THEN "\t"
                                      ELSE ch
                                      FI;
             # parse the csv and generate the tsv #
             STRING tsv := "";
             INT  c pos := LWB csv - 1;
             INT  c max  = UPB csv;
             WHILE NOT at end DO
                 # spaces are not significant around quoted fields but are part of unquoted fields #
                 next;
                 INT space count := spaces;
                 IF ch = """" THEN
                     # quoted field part #
                     WHILE next;
                           WHILE NOT at end AND ch /= """" DO add; next OD;
                           IF NOT at end THEN
                               next;
                               IF ch = """" THEN
                                   # embedded quote #
                                   add
                               FI
                           FI;
                           NOT at end AND ch = """"
                     DO SKIP OD;
                     space count := spaces;
                     IF at end OR ch = "," THEN
                         # nothing significant after the quoted field part #
                         space count := 0
                     FI
                 FI;
                 # unquoted field part #
                 tsv +:= space count * " ";
                 WHILE NOT at end AND ch /= "," DO add; next OD;
                 IF ch = "," THEN
                     # have another field following this one #
                     tsv +:= tab
                 FI
             OD;
             tsv
         END # csv2tsv # ;
    # task test cases #
    []STRING tests =
        ( "a,""b"""
        , """a"",""b""""c"""
        , ""
        , ",a"
        , "a,"""
        , " a , ""b"""
        , """12"",34"
        , "a" + tab + "b, That is a TAB character"
        , "a\tb"
        , "a\n\rb"
        , "a" + nul + "b, That is a NUL character"
        , "a" + cr + "b, Be sure to change RETURN to the '\r' control character (#xd)"
        , "a\b"
        );
    FOR i FROM LWB tests TO UPB tests DO
        print( ( "    {{", show ctrl(          tests[ i ] ),   "}}", newline
               , " -> {{", show ctrl( csv2tsv( tests[ i ] ) ), "}}", newline
               )
             )
    OD
END
Output:
    {{a,"b"}}
 -> {{a<tab>b}}
    {{"a","b""c"}}
 -> {{a<tab>b"c}}
    {{}}
 -> {{}}
    {{,a}}
 -> {{<tab>a}}
    {{a,"}}
 -> {{a<tab>}}
    {{ a , "b"}}
 -> {{ a <tab>b}}
    {{"12",34}}
 -> {{12<tab>34}}
    {{a<tab>b, That is a TAB character}}
 -> {{a\tb<tab> That is a TAB character}}
    {{a\tb}}
 -> {{a\\tb}}
    {{a\n\rb}}
 -> {{a\\n\\rb}}
    {{a<nul>b, That is a NUL character}}
 -> {{a\0b<tab> That is a NUL character}}
    {{a<cr>b, Be sure to change RETURN to the '\r' control character (#xd)}}
 -> {{a\rb<tab> Be sure to change RETURN to the '\\r' control character (#xd)}}
    {{a\b}}
 -> {{a\\b}}

J

Implementation:

tokenize=: (0;(0 10#:10*do@>cutLF{{)n
    1.1 1.1 1.1 1.1 1.1 NB. 0  start here
    1.2 2.0 3.2 1.2 1.0 NB. 1  , or CR or LF starts a new "token"
    2   4   2   2   2   NB. 2  quote toggles "quoted field mode"
    1.2 2.0 1.0 2.2 1.0 NB. 3  CR,LF: 1 "token"  CR,CR and CR,',': 2 "tokens"
    1.2 2.0 3.2 1.2 5.3 NB. 4  closing quote must be followed by a delimiter
    5   5   1.1 1.1 5   NB. 5  resync on newline after encountering nonsense
}});(;/',"',LF,CR);0 _1 0 _1) ;: LF,]
NB. ,   "   CR  LF  ...

unquote=: {{
  txt=. (1+CRLF-:2{.y)}.y  NB. discard leading ',', LF, CR or CRLF
  if. '"'={.txt do.
    rplc&('""';'"') }. txt}.~-'"'={:txt
  else.
    txt
  end.
}}

NUL=: 0{a.

canonical=: {{ NB. read if file and drop optional trailing newline
  txt=. fread^:L. y
  txt}.~-(CRLF-:_2{.txt)+({:txt)e.CRLF
}}

csv2mat=: (e.&CRLF@{.@> unquote each;.1 ])@tokenize@canonical

escape=: rplc&('\';'\\';NUL;'\0';TAB;'\t';LF;'\n';CR;'\r')@":

mat2tsv=: {{ ;LF,~each ([,TAB,])each/"1 escape each y }}

csv2tsv=: mat2tsv@csv2mat

In current J contexts, LF is always a newline character. CR and CRLF are considered obsolete. Here we support CR and CRLF as newline delimiters when reading, but standardize on LF when writing.

Also, we escape all backslashes, regardless of what character follows, during tsv encoding.

Here, also, we interpret "nonsense" as starting immediately after a closing quote which is not followed by a delimiter, and ending immediately before the next newline.

For csv parsing we first break out fields using ;:. Here, each field is preceded by a delimiter. (We discard an optional trailing newline from the csv text and prepend a newline at the beginning so that every field has a preceding delimiter. Also, of course, if we were given a file reference, we work with the text of the file rather than its name.)

Then, these fields are formed into rows (fields which began immediately following newlines start a new row), and each field is stripped of delimiters and non-textual quotes are removed. The result is a two-dimensional matrix: the result of csv2mat.

To translate to tsv form, we first escape special characters in each field, then insert tab delimiters between fields and terminate each record with a newline. Thus, mat2tsv takes a two-dimensional matrix and its result is a tsv string. (For utility, mat2tsv also supports numeric matrices, since that was trivial.)

Task example:

   csv2tsv test
a	b
a	b"c
\n	a
a	\n a , 
12	34
a\tb	 That is a TAB character
a\\tb	
a\\n\\rb	
a\0b	 That is a NUL character
a	
b	 That is a LF (linefeed) character
a\\b

jq

Works with: jq

The following program can also be used with gojq, the Go implementation of jq, although until recently gojq left NUL (#x0) unaltered.

In this entry, the PEG grammar for "record" as defined in the task description is used directly, the point being that PEG operators correspond closely to jq operators, notably PEG's '/' to jq's '//'.

In translating the PEG grammar to a jq program, the main idea is to define a pipeline for each grammar rule. A JSON object with keys "remainder" and "result" is passed through this pipeline, consuming the text in .remainder and building up .result.

## The PEG * operator:
def star(E): (E | star(E)) // . ;

## Helper functions:

# Consume a regular expression rooted at the start of .remainder, or emit empty;
# on success, update .remainder and set .match but do NOT update .result
def consume($re):
  # on failure, match yields empty
  (.remainder | match("^" + $re)) as $match
  | .remainder |= .[$match.length :]
  | .match = $match.string;

def parse($re):
  consume($re)
  | .result = .result + [.match] ;

def ws: consume(" *");

### Parse a string into comma-separated values

def quoted_field_content:
  parse("((\"\")|([^\"]))*")
  | .result[-1] |= gsub("\"\""; "\"");

def unquoted_field: parse("[^,\"]*");

def quoted_field: consume("\"") | quoted_field_content | consume("\"");

def field: (ws | quoted_field | ws) // unquoted_field;

def record: field | star(consume(",") | field);

def csv2tsv:
  {remainder: .} | record | .result | @tsv ;

# Transform an entire file, assuming jq is invoked with the -n, -R (raw input) and -r (raw output) options
inputs | csv2tsv
Output:

As required:

  • Backslashes are uniformly duplicated.
  • Until recently gojq did not handle NUL (#x0) properly.

Julia

Translation of: Phix

function csv_tsv(str)
    p = split(str, ",")
    for (i, f) in enumerate(p)
        if count(==('"'), f) > 1
            p[i] = replace(strip(f, [' ', '"']), "\"\"" => "\"")
        elseif f == "\""
            p[i] = ""
        end
    end
    t = join(p,"<TAB>")
    s = replace(str, "\\" => "\\\\", "\t" => "\\t", "\0" => "\\0", "\n" => "\\n", "\r" => "\\r")
    t = replace(t, "\\" => "\\\\", "\t" => "\\t", "\0" => "\\0", "\n" => "\\n", "\r" => "\\r")
    return s, t
end

const testfile = "test.tmp"
fh = open(testfile, "w")

write(fh, """
a,"b"
"a","b""c"

,a
a,"
 a , "b"
"12",34
a\tb, TAB
a\\tb
a\\n\\rb
a\0b, NUL
a\rb, RETURN
a\\b""")

close(fh)

for test_string in split(read(testfile, String), "\n")
    csv, tsv = csv_tsv(test_string)
    println(lpad(csv, 12), " => ", tsv)
end
Output:
       a,"b" => a<TAB>b
  "a","b""c" => a<TAB>b"c
             =>
          ,a => <TAB>a
         a," => a<TAB>
     a , "b" =>  a <TAB>b
     "12",34 => 12<TAB>34
   a\tb, TAB => a\tb<TAB> TAB
       a\\tb => a\\tb
    a\\n\\rb => a\\n\\rb
   a\0b, NUL => a\0b<TAB> NUL
a\rb, RETURN => a\rb<TAB> RETURN
        a\\b => a\\b

Phix

Note that backtick strings are not escaped, whereas double-quoted strings are.

with javascript_semantics
procedure csv2tsv(string s)
    sequence p = split(s,',',false)
    for i,f in p do
        if length(find_all('"',f))>1 then
            p[i] = substitute(trim(f,` "`),`""`,`"`)
        elsif f=`"` then
            p[i] = ""
        end if
    end for
    string t = join(p,"<tab>")
    s = substitute_all(s,{`\`,"\t",'\0','\n'},{`\\`,"\\t","\\0","\\n"})
    t = substitute_all(t,{`\`,"\t",'\0','\n'},{`\\`,"\\t","\\0","\\n"})
    printf(1,"%12s => %s\n",{s,t})
end procedure

constant test_data = {`a,"b"`,
                      `"a","b""c"`,
                      ``,
                      `,a`,
                      `a,"`,
                      ` a , "b"`,
                      `"12",34`,
                      "a\tb,",      -- That is a TAB character
                      `a\tb`,       -- That is not
                      `a\n\rb`,
                      "a\0b",       -- That is a NUL character
                      "a\nb",       -- That is a LF (linefeed) character
                      `a\b`}
papply(test_data,csv2tsv)
Output:
       a,"b" => a<tab>b
  "a","b""c" => a<tab>b"c
             =>
          ,a => <tab>a
         a," => a<tab>
     a , "b" =>  a <tab>b
     "12",34 => 12<tab>34
       a\tb, => a\tb<tab>
       a\\tb => a\\tb
    a\\n\\rb => a\\n\\rb
        a\0b => a\0b
        a\nb => a\nb
        a\\b => a\\b

Wren

Library: Wren-ioutil
Library: Wren-str

Backslashes are duplicated only when they form the two-character sequences \t, \n or \r.

import "./ioutil" for FileUtil
import "./str" for Str

// These could be given as command line arguments
// but we hard-code them for the purposes of this task.
var fileName1 = "test.csv"
var fileName2 = "test.tsv"

// This next line assumes a line break of "\r\n" for Windows or "\n" otherwise.
var lines = FileUtil.readLines(fileName1)

// Get rid of an empty last line if there is one.
if (lines.count > 1 && lines[-1] == "") lines = lines[0...-1]

var lc = lines.count

// Normalize fields before rejoining with \t.
for (i in 0...lc) {
    // Str.splitCsv treats quoted fields by default as unquoted if there's any
    // leading whitespace, but we can't do that here.
    var fields = Str.splitCsv(lines[i], ",", false)
    for (i in 0...fields.count) {
        var numQuotes = fields[i].count { |c| c == "\"" }
        // Treat it as a quoted field for this task if there's at least two quotes
        // and then remove any surrounding whitespace and the outer quotes.
        if (numQuotes > 1) fields[i] = fields[i].trim().trim("\"")

        fields[i] = fields[i].replace("\"\"", "\"")
                             .replace("\\t", "\\\\t")
                             .replace("\\r", "\\\\r") 
                             .replace("\\n", "\\\\n")
                             .replace("\t", "\\t")
                             .replace("\n", "\\n")
                             .replace("\r", "\\r")                             
                             .replace("\0", "\\0")                              
    }
    // Not sure how 'nonsense' is defined but for now blank
    // the final field if it contains nothing but quotes.
    if (fields[-1].count > 0 && fields[-1].all { |c| c == "\"" }) fields[-1] = ""
    lines[i] = fields.join("\t")
}

// Write lines (as amended) to fileName2.
FileUtil.writeLines(fileName2, lines)

// Write contents of fileName2 to the terminal with tabs replaced by <tab> and
// with each line surrounded with single quotes.
lines = FileUtil.readLines(fileName2)[0...-1]
                .map { |line| "'%(line)'".replace("\t", "<tab>") }
                .join("\n")
System.print(lines)
Output:
'a<tab>b'
'a<tab>b"c'
''
'<tab>a'
'a<tab>'
' a <tab>b'
'12<tab>34'
'a\tb<tab> That is a TAB character'
'a\\tb'
'a\\n\\rb'
'a\0b<tab> That is a NUL character'
'a\rb<tab> That is a RETURN character'
'a\b'