XML/Input: Difference between revisions

@@ Line 1,936: / Line 1,936: @@
 =={{header|jq}}==
 Neither the C nor the Go implementations of jq natively support XML,
-so in this entry we first use `xq`, a jq "wrapper", and second
+so in this entry we present three solutions:
-a third-party XML-to-JSON translator, `knead`.
+* the first uses `xq`, a jq "wrapper";
+* the second uses a third-party XML-to-JSON translator, `knead`;
+* the third is a "pure jq" solution based on a Parsing Expression Grammar for XML.
 ===xq===
@@ Line 1,961: / Line 1,964: @@
 As above.
+===PEG-based Parsing===
+In this section, a PEG-based XML parser is presented. Its main goal is
+to translate valid XML documents into valid JSON losslessly, rather
+than to check for validity.
+In particular, the relative ordering of embedded tags and "text"
+fragments is preserved, as is "white space" when significant in
+accordance with the XML specification.
+Being PEG-based, however, the parser should be quite easy to adapt for other purposes.
+A jq filter, `jsonify`, is also provided for converting hex character codes
+of the form `&#x....;` (exactly four hex digits) to the corresponding character, e.g. "&#x00C9;mily" -> "Émily".
+It also removes strings matching the regex '^\n *$' from the arrays holding the "text" portions of the XML document.
+Some other noteworthy points:
+* since "duplicate attribute names within a tag are not permitted with XML", we can group the attributes within a tag as a JSON object, as jq respects key ordering.
+* since XML tags cannot begin with `@`, the "PROLOG" is rendered as a JSON object with key "@PROLOG" and likewise for "COMMENT", "DTD" and "CDATA".
+* consecutive attribute-value pairs are grouped together under the key named "@attributes".
+The grammar is primarily adapted from:
+* (1) https://peerj.com/preprints/1503/
+* (2) https://cs.lmu.edu/~ray/notes/xmlgrammar/
+====PEG Infrastructure====
+<syntaxhighlight lang=jq>
+# PEG to jq transcription is based on these equivalences:
+# Sequence: e1 e2             e1 | e2
+# Ordered choice: e1 / e2     e1 // e2
+# Zero-or-more: e*            star(E)
+# One-or-more: e+             plus(E)
+# Optional: e?                optional(E)
+# And-predicate: &e           amp(E)      # no input is consumed
+# Not-predicate: !e           neg(E)      # no input is consumed
+# The idea is to pass a JSON object {remainder:_, result:_ } through a
+# pipeline, consuming the text in .remainder and building up .result.
+def star(E): ((E | star(E)) // .) ;
+def plus(E): E | (plus(E) // . );
+def optional(E): (E // .);
+def amp(E): . as $in | E | $in;
+def neg(E): select( [E] == [] );
+### Helper functions:
+# Consume a regular expression rooted at the start of .remainder, or emit empty;
+# on success, update .remainder and set .match but do NOT update .result
+def consume($re):
+  # on failure, match yields empty
+  (.remainder | match("^" + $re)) as $match
+  | .remainder |= .[$match.length :]
+  | .match = $match.string;
+def parse($re):
+  consume($re)
+  | .result = .result + [.match] ;
+def consumeliteral($s):
+  select(.remainder | startswith($s))
+  | .remainder |= .[$s | length :] ;
+def literal($s):
+  consumeliteral($s)
+  | .result += [$s];
+# Tagging
+def box(E):
+  ((.result = null) | E) as $e
+  | .remainder = $e.remainder
+  | .result += [$e.result]  # the magic sauce
+  ;
+def box(name; E):
+  ((.result = null) | E) as $e
+  | .remainder = $e.remainder
+  | .result += [{(name): (try ($e.result|join("")) catch $e.result) }]  # the magic sauce
+  ;
+def objectify(E):
+  box(E)
+  | .result[-1] |= {(.[0]): .[1:]} ;
+def keyvalue(E):
+  box(E)
+  | .result[-1] |= {(.[0]): .[1]} ;
+# optional whitespace
+def ws: consume("[ \n\r\t]*");
+def string_except($regex):
+  box(star(neg( parse($regex) ) | parse("."))) | .result[-1] |= add;
+</syntaxhighlight>
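+As a quick check of the infrastructure, the following sketch defines a toy grammar for comma-separated integers. The names `Integer` and `IntegerList` are hypothetical and not part of the XML grammar; the snippet assumes the definitions above are already in scope and that jq is invoked with the -n option.
+<syntaxhighlight lang=jq>
+# Toy grammar (illustrative only):  IntegerList <- Integer (ws "," ws Integer)*
+def Integer: parse("[0-9]+");
+def IntegerList: Integer | star(ws | consume(",") | ws | Integer);
+# Start with the whole text in .remainder and let the pipeline build .result
+{remainder: "1, 22,333"} | IntegerList | .result
+# => ["1","22","333"]
+</syntaxhighlight>
+Since `consume` leaves .result untouched, the commas and whitespace are dropped, whereas `parse` appends each matched integer to .result.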
+====XML Grammar====
+<syntaxhighlight lang=jq>
+def XML:
+  def String    : ((consume("\"") | parse("[^\"]*") | consume("\"")) //
+                   (consume("'") | parse("[^']*") | consume("'")));
+  def CDataSec  : box("@CDATA";  consume("<!\\[CDATA\\[") | string_except("]]>") | consume("]]>") ) | ws;
+  def PROLOG    : box("@PROLOG"; consume("<\\?xml") | string_except("\\?>") | consume("\\?>"));
+  def DTD       : box("@DTD";    consume("<!(?!--)") | parse("[^>]*") | consume(">"));  # (?!--) ensures a comment is not taken for a DTD
+  def COMMENT   : box("@COMMENT"; consume("<!--") | string_except("-->") | consume("-->"));
+  def CharData  : parse("[^<]+");  # only `<` is disallowed
+  def Name      : parse("[A-Za-z:_][^/=<>\n\r\t ]*");
+  def Attribute : keyvalue(Name | ws | consume("=") | ws | String | ws);
+  def Attributes: box( plus(Attribute) ) | .result[-1] |= {"@attributes": add} ;
+  # <foo> must be matched with </foo>
+  def Element   :
+    def Content : star(Element // CDataSec // CharData // COMMENT);
+    objectify( consume("<")
+         | Name
+         | .result[-1] as $name
+         | ws
+         | (Attributes // ws)
+         | (  (consume("/>")
+           // (consume(">") | Content | consume("</") | consumeliteral($name) | consume(">")))
+         | ws) ) ;
+  {remainder: . }
+  | ws
+  | optional(PROLOG) | ws
+  | optional(DTD) | ws
+  | star(COMMENT | ws)
+  | Element | ws             # for HTML, one would use star(Element) here
+  | star(COMMENT | ws)
+  | .result;
+</syntaxhighlight>
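+The shape of the output can be seen on a very small document. This is only a sketch: it assumes the definitions above are in scope and that jq is invoked with the -n option, and the sample document is made up for illustration.
+<syntaxhighlight lang=jq>
+# A single element with one attribute and one text fragment
+"<a x=\"1\">hi</a>" | XML
+# => [{"a":[{"@attributes":{"x":"1"}},"hi"]}]
+</syntaxhighlight>
+The tag name becomes the key of a JSON object whose value is an array holding the "@attributes" object (if any) followed by the element's children and text fragments in document order.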
+====The Task====
+<syntaxhighlight lang=jq>
+# For handling hex character codes &#x
+def hex2i:
+  def toi: if . >= 87 then .-87 else . - 48 end;
+  reduce ( ascii_downcase | explode | map(toi) | reverse[]) as $i ([1, 0]; # [power, sum]
+    .[1] += $i * .[0]
+    | .[0] *= 16 )
+  | .[1];
+def hexcode2json:
+  gsub("&#x(?<x>....);" ; .x | [hex2i] | implode) ;
+def jsonify:
+  walk( if type == "array"
+        then map(select(type == "string" and test("^\n *$") | not))
+        elif type == "string" then hexcode2json
+        else . end);
+# First convert to JSON ...
+XML | jsonify
+# ... and then extract Student Names
+| .[]
+| (.Students[].Student[]["@attributes"] // empty).Name
+</syntaxhighlight>
+'''Invocation''': jq -Rrs -f xml.jq students.xml
+{{output}}
+As above.
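+The helper filters can also be tried in isolation. The following sketch assumes that the definitions of `hex2i`, `hexcode2json` and `jsonify` above are in scope (without the final `XML | jsonify` pipeline) and that jq is invoked with the -n option; the inputs are made up for illustration.
+<syntaxhighlight lang=jq>
+("1f" | hex2i),                                # => 31
+("&#x00C9;mily" | hexcode2json),               # => "Émily"
+(["\n  ", "x", {"a": "&#x0041;"}] | jsonify)   # => ["x",{"a":"A"}]
+</syntaxhighlight>
+The last line shows both effects of `jsonify`: the whitespace-only string is dropped from the array, and the hex character code is replaced by the corresponding character.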
 =={{header|Julia}}==