XML/Input: Difference between revisions

Line 1,936:
=={{header|jq}}==
Neither the C nor the Go implementations of jq natively support XML,
so in this entry we present three solutions:
* the first uses `xq`, a jq "wrapper";
* the second uses a third-party XML-to-JSON translator, `knead`;
* the third is a "pure jq" solution based on a Parsing Expression Grammar for XML.
 
===xq===
Line 1,961 ⟶ 1,964:
As above.
 
===PEG-based Parsing===
In this section, a PEG-based XML parser is presented. Its main goal is
to translate valid XML documents into valid JSON losslessly, rather
than to check for validity.
 
In particular, the relative ordering of embedded tags and "text"
fragments is preserved, as is "white space" when significant in
accordance with the XML specification.
 
Being PEG-based, however, the parser should be quite easy to adapt for other purposes.
 
A jq filter, `jsonify`, is also provided for converting hex character codes
of the form `&#x....;` to the corresponding character, e.g. "&#x00C9;mily" -> "Émily".
It also removes strings matching the regex '^\n *$' from the "text" portions of the XML document.
 
Some other noteworthy points:
 
* since "duplicate attribute names within a tag are not permitted with XML", we can group the attributes within a tag as a JSON object, as jq respects key ordering.
 
* since XML tags cannot begin with `@`, the "PROLOG" is rendered as a JSON object with key "@PROLOG" and likewise for "COMMENT", "DTD" and "CDATA".
 
* consecutive attribute-value pairs are grouped together under the key named "@attributes".
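
To make these conventions concrete, here is a small sketch of the JSON shape one would expect for the one-element document `<a x="1">hi</a>`. The JSON literal was traced by hand from the grammar in this section rather than produced by running the parser, so treat it as an illustration. The element name `a` and the attribute `x` are made up for the example.

<syntaxhighlight lang=jq>
# Hand-traced JSON for the XML fragment:  <a x="1">hi</a>
# (an assumption based on reading the grammar, not captured output)
[ {"a": [ {"@attributes": {"x": "1"}}, "hi" ]} ]
# Extract attribute x of every top-level <a> element,
# skipping the "text" fragments, which are plain strings:
| .[] | .a[] | objects | .["@attributes"].x
</syntaxhighlight>

Run with `jq -n -f shape.jq` (the file name is arbitrary), this should print "1". The same `objects` idiom separates an element's children from its text fragments, since text is represented as strings.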
 
The grammar is primarily adapted from:
* (1) https://peerj.com/preprints/1503/
* (2) https://cs.lmu.edu/~ray/notes/xmlgrammar/
====PEG Infrastructure====
<syntaxhighlight lang=jq>
# PEG to jq transcription is based on these equivalences:
#   Sequence:        e1 e2      e1 | e2
#   Ordered choice:  e1 / e2    e1 // e2
#   Zero-or-more:    e*         star(E)
#   One-or-more:     e+         plus(E)
#   Optional:        e?         optional(E)
#   And-predicate:   &e         amp(E)   # no input is consumed
#   Not-predicate:   !e         neg(E)   # no input is consumed
 
# The idea is to pass a JSON object {remainder:_, result:_ } through a
# pipeline, consuming the text in .remainder and building up .result.
 
def star(E): ((E | star(E)) // .) ;
def plus(E): E | (plus(E) // . );
def optional(E): (E // .);
def amp(E): . as $in | E | $in;
def neg(E): select( [E] == [] );
 
### Helper functions:
 
# Consume a regular expression rooted at the start of .remainder, or emit empty;
# on success, update .remainder and set .match but do NOT update .result
def consume($re):
  # on failure, match yields empty
  (.remainder | match("^" + $re)) as $match
  | .remainder |= .[$match.length :]
  | .match = $match.string;

def parse($re):
  consume($re)
  | .result = .result + [.match] ;

def consumeliteral($s):
  select(.remainder | startswith($s))
  | .remainder |= .[$s | length :] ;

def literal($s):
  consumeliteral($s)
  | .result += [$s];
 
# Tagging
def box(E):
  ((.result = null) | E) as $e
  | .remainder = $e.remainder
  | .result += [$e.result]  # the magic sauce
  ;

def box(name; E):
  ((.result = null) | E) as $e
  | .remainder = $e.remainder
  | .result += [{(name): (try ($e.result|join("")) catch $e.result) }]  # the magic sauce
  ;

def objectify(E):
  box(E)
  | .result[-1] |= {(.[0]): .[1:]} ;

def keyvalue(E):
  box(E)
  | .result[-1] |= {(.[0]): .[1]} ;
 
# optional whitespace
def ws: consume("[ \n\r\t]*");
 
def string_except($regex):
box(star(neg( parse($regex) ) | parse("."))) | .result[-1] |= add;
 
</syntaxhighlight>
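As a quick illustration of how these combinators compose, the following sketch parses a comma-separated list of integers. The definitions of `star`, `consume`, `parse` and `ws` are copied from above so that the snippet is self-contained. The rules `Number` and `List` and the sample input are hypothetical, introduced only for this example.

<syntaxhighlight lang=jq>
def star(E): ((E | star(E)) // .) ;

def consume($re):
  (.remainder | match("^" + $re)) as $match
  | .remainder |= .[$match.length :]
  | .match = $match.string;

def parse($re):
  consume($re)
  | .result = .result + [.match] ;

def ws: consume("[ \n\r\t]*");

# Toy PEG (not part of the XML grammar):
#   List   <- Number ("," Number)*
#   Number <- ws [0-9]+ ws
def Number: ws | parse("[0-9]+") | ws;
def List: Number | star(consume(",") | Number);

{remainder: "1, 22 ,333"}
| List
| .result
</syntaxhighlight>

Invoked as `jq -nc -f toy.jq` (the file name is arbitrary), this should emit `["1","22","333"]`. The commas and white space are consumed but never reach `.result`, because only `parse` appends to it.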
====XML Grammar====
<syntaxhighlight lang=jq>
def XML:
def String : ((consume("\"") | parse("[^\"]*") | consume("\"")) //
(consume("'") | parse("[^']*") | consume("'")));
 
def CDataSec : box("@CDATA"; consume("<!\\[CDATA\\[") | string_except("]]") | consume("]]>") ) | ws;
def PROLOG : box("@PROLOG"; consume("<\\?xml") | string_except("\\?>") | consume("\\?>"));
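# Match the whole declaration up to the first ">", but leave "<!--" to COMMENT
def DTD : box("@DTD"; consume("<!") | neg(consume("--")) | parse("[^>]*") | consume(">"));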
def COMMENT : box("@COMMENT"; consume("<!--") | string_except("-->") | consume("-->"));
 
def CharData : parse("[^<]+"); # only `<` is disallowed
 
def Name : parse("[A-Za-z:_][^/=<>\n\r\t ]*");
 
def Attribute : keyvalue(Name | ws | consume("=") | ws | String | ws);
def Attributes: box( plus(Attribute) ) | .result[-1] |= {"@attributes": add} ;
 
# <foo> must be matched with </foo>
def Element :
  def Content : star(Element // CDataSec // CharData // COMMENT);
  objectify( consume("<")
             | Name
             | .result[-1] as $name
             | ws
             | (Attributes // ws)
             | ( (consume("/>")
                  // (consume(">") | Content | consume("</") | consumeliteral($name) | consume(">")))
                 | ws) ) ;
 
{remainder: . }
| ws
| optional(PROLOG) | ws
| optional(DTD) | ws
| star(COMMENT | ws)
| Element | ws # for HTML, one would use star(Element) here
| star(COMMENT | ws)
| .result;
</syntaxhighlight>
====The Task====
<syntaxhighlight lang=jq>
# For handling hex character codes &#x
def hex2i:
  def toi: if . >= 87 then .-87 else . - 48 end;
  reduce ( ascii_downcase | explode | map(toi) | reverse[]) as $i ([1, 0]; # [power, sum]
      .[1] += $i * .[0]
      | .[0] *= 16 )
  | .[1];

def hexcode2json:
  gsub("&#x(?<x>....);" ; .x | [hex2i] | implode) ;

def jsonify:
  walk( if type == "array"
        then map(select(type == "string" and test("^\n *$") | not))
        elif type == "string" then hexcode2json
        else . end);
 
# First convert to JSON ...
XML | jsonify
# ... and then extract Student Names
| .[]
| (.Students[].Student[]["@attributes"] // empty).Name
</syntaxhighlight>
'''Invocation''': jq -Rrs -f xml.jq students.xml
{{output}}
As above.
 
=={{header|Julia}}==