Revision as of 19:52, 29 December 2022 by Peak (→{{header|jq}})
Revision as of 22:33, 31 December 2022 by Peak (→{{header|jq}})
Line 1,936:

=={{header|jq}}==

Neither the C nor the Go implementations of jq natively support XML, so in this entry we ~~first use `xq`, a jq "wrapper", and second a third-party XML-to-JSON translator, `knead`.~~ present three solutions:

* the first uses `xq`, a jq "wrapper";
* the second uses a third-party XML-to-JSON translator, `knead`;
* the third is a "pure jq" solution based on a Parsing Expression Grammar for XML.

===xq===

Line 1,961 ⟶ 1,964:

As above.

===PEG-based Parsing===

In this section, a PEG-based XML parser is presented. Its main goal is to translate valid XML documents into valid JSON losslessly, rather than to check for validity.

In particular, the relative ordering of embedded tags and "text" fragments is preserved, as is "white space" when significant in accordance with the XML specification.

Being PEG-based, however, the parser should be quite easy to adapt for other purposes.

A jq filter, `jsonify`, is also provided for converting hex character codes of the form `&#x....;` to the corresponding character, e.g. "&#x00C9;mily" -> "Émily". It also removes strings matching the regex `^\n *$` from the "text" portions of the XML document.

Some other noteworthy points:

* since "duplicate attribute names within a tag are not permitted with XML", we can group the attributes within a tag as a JSON object, as jq respects key ordering.
* since XML tags cannot begin with `@`, the "PROLOG" is rendered as a JSON object with key "@PROLOG" and likewise for "COMMENT", "DTD" and "CDATA".
* consecutive attribute-value pairs are grouped together under the key named "@attributes".

The grammar is primarily adapted from:
* (1) https://peerj.com/preprints/1503/
* (2) https://cs.lmu.edu/~ray/notes/xmlgrammar/

====PEG Infrastructure====
<syntaxhighlight lang=jq>
# PEG to jq transcription is based on these equivalences:
# Sequence: e1 e2             e1 | e2
# Ordered choice: e1 / e2     e1 // e2
# Zero-or-more: e*            star(E)
# One-or-more: e+             plus(E)
# Optional: e?                optional(E)
# And-predicate: &e           amp(E)   # no input is consumed
# Not-predicate: !e           neg(E)   # no input is consumed

# The idea is to pass a JSON object {remainder:_, result:_ } through a
# pipeline, consuming the text in .remainder and building up .result.

def star(E): ((E | star(E)) // .) ;
def plus(E): E | (plus(E) // . );
def optional(E): (E // .);
def amp(E): . as $in | E | $in;
def neg(E): select( [E] == [] );

### Helper functions:

# Consume a regular expression rooted at the start of .remainder, or emit empty;
# on success, update .remainder and set .match but do NOT update .result
def consume($re):
  # on failure, match yields empty
  (.remainder | match("^" + $re)) as $match
  | .remainder |= .[$match.length :]
  | .match = $match.string;

def parse($re):
  consume($re)
  | .result = .result + [.match] ;

def consumeliteral($s):
  select(.remainder | startswith($s))
  | .remainder |= .[$s | length :] ;

def literal($s):
  consumeliteral($s)
  | .result += [$s];

# Tagging
def box(E):
  ((.result = null) | E) as $e
  | .remainder = $e.remainder
  | .result += [$e.result]  # the magic sauce
  ;

def box(name; E):
  ((.result = null) | E) as $e
  | .remainder = $e.remainder
  | .result += [{(name): (try ($e.result|join("")) catch $e.result) }]  # the magic sauce
  ;

def objectify(E):
  box(E)
  | .result[-1] |= {(.[0]): .[1:]} ;

def keyvalue(E):
  box(E)
  | .result[-1] |= {(.[0]): .[1]} ;

# optional whitespace
def ws: consume("[ \n\r\t]*");

def string_except($regex):
  box(star(neg( parse($regex) ) | parse(".")))
  | .result[-1] |= add;
</syntaxhighlight>

====XML Grammar====
<syntaxhighlight lang=jq>
def XML:
  def String : ((consume("\"") | parse("[^\"]*") | consume("\""))
                // (consume("'") | parse("[^']*") | consume("'")));

  def CDataSec : box("@CDATA"; consume("<!\\[CDATA\\[") | string_except("]]") | consume("]]>") ) | ws;
  def PROLOG : box("@PROLOG"; consume("<\\?xml") | string_except("\\?>") | consume("\\?>"));
  def DTD : box("@DTD"; consume("<!") | parse("[^>]*") | consume(">"));
  def COMMENT : box("@COMMENT"; consume("<!--") | string_except("-->") | consume("-->"));

  def CharData : parse("[^<]+");  # only `<` is disallowed

  def Name : parse("[A-Za-z:_][^/=<>\n\r\t ]*");

  def Attribute : keyvalue(Name | ws | consume("=") | ws | String | ws);
  def Attributes: box( plus(Attribute) ) | .result[-1] |= {"@attributes": add} ;

  # <foo> must be matched with </foo>
  def Element :
    def Content : star(Element // CDataSec // CharData // COMMENT);
    objectify( consume("<")
      | Name
      | .result[-1] as $name
      | ws
      | (Attributes // ws)
      | ( (consume("/>")
           // (consume(">") | Content | consume("</") | consumeliteral($name) | consume(">")))
          | ws) ) ;

  {remainder: . }
  | ws
  | optional(PROLOG) | ws
  | optional(DTD) | ws
  | star(COMMENT | ws)
  | Element | ws    # for HTML, one would use star(Element) here
  | star(COMMENT | ws)
  | .result;
</syntaxhighlight>

====The Task====
<syntaxhighlight lang=jq>
# For handling hex character codes &#x
def hex2i:
  def toi: if . >= 87 then .-87 else . - 48 end;
  reduce ( ascii_downcase | explode | map(toi) | reverse[]) as $i ([1, 0]; # [power, sum]
    .[1] += $i * .[0]
    | .[0] *= 16 )
  | .[1];

def hexcode2json:
  gsub("&#x(?<x>....);" ; .x | [hex2i] | implode) ;

def jsonify:
  walk( if type == "array"
        then map(select(type == "string" and test("^\n *$") | not))
        elif type == "string" then hexcode2json
        else . end);

# First convert to JSON ...
XML | jsonify
# ... and then extract Student Names
| .[]
| (.Students[].Student[]["@attributes"] // empty).Name
</syntaxhighlight>

'''Invocation''': jq -Rrs -f xml.jq students.xml

{{output}}

As above.

=={{header|Julia}}==

XML/Input: Difference between revisions
