XML/Input: Difference between revisions
Line 1,936: | Line 1,936: | ||
=={{header|jq}}== |
=={{header|jq}}== |
||
Neither the C nor the Go implementations of jq natively support XML, |
Neither the C nor the Go implementations of jq natively support XML, |
||
so in this entry we |
so in this entry we present three solutions: |
||
a third-party XML-to-JSON translator, `knead`. |
|||
* the first uses `xq`, a jq "wrapper"; |
|||
* the second uses a third-party XML-to-JSON translator, `knead`; |
|||
* the third is a "pure jq" solution based on a Parsing Expression Grammar for XML. |
|||
===xq=== |
===xq=== |
||
Line 1,961: | Line 1,964: | ||
As above. |
As above. |
||
===PEG-based Parsing=== |
|||
In this section, a PEG-based XML parser is presented. Its main goal is |
|||
to translate valid XML documents into valid JSON losslessly, rather |
|||
than to check for validity. |
|||
In particular, the relative ordering of embedded tags and "text" |
|||
fragments is preserved, as is "white space" when significant in |
|||
accordance with the XML specification. |
|||
Being PEG-based, however, the parser should be quite easy to adapt for other purposes. |
|||
A jq filter, `jsonify`, is also provided for converting hex character codes |
|||
of the form `&#x....;` to the corresponding character, e.g. "&#x00C9;mily" -> "Émily". |
|||
It also removes strings of the form '^\n *$' in the "text" portions of the XML document. |
|||
Some other noteworthy points: |
|||
* since "duplicate attribute names within a tag are not permitted with XML", we can group the attributes within a tag as a JSON object, as jq respects key ordering. |
|||
* since XML tags cannot begin with `@`, the "PROLOG" is rendered as a JSON object with key "@PROLOG" and likewise for "COMMENT", "DTD" and "CDATA". |
|||
* consecutive attribute-value pairs are grouped together under the key named "@attributes". |
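To make the representation described in these points concrete, here is a minimal sketch of how the resulting JSON might be navigated. The sample document `<a x="1">hi<b/></a>` and the JSON shown for it are assumptions made for illustration: the JSON was traced by hand from the grammar given later in this entry, not taken from the author's output. The snippet can be run with `jq -n`.

<syntaxhighlight lang=jq>
# Assumed result of parsing <a x="1">hi<b/></a> (hand-traced, illustrative only):
# the element name is the key, and its attributes, text and children follow in document order.
[{"a": [{"@attributes": {"x": "1"}}, "hi", {"b": []}]}] as $doc
| $doc[0].a[]
| objects                       # skip "text" fragments such as "hi"
| select(has("@attributes"))    # skip child elements such as {"b": []}
| .["@attributes"].x
</syntaxhighlight>

For the JSON shown, this yields `"1"`. Because text fragments and child elements share one array, filters generally need a type test such as `objects` before looking up a key.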
|||
The grammar is primarily adapted from: |
|||
* (1) https://peerj.com/preprints/1503/ |
|||
* (2) https://cs.lmu.edu/~ray/notes/xmlgrammar/ |
|||
====PEG Infrastructure==== |
|||
<syntaxhighlight lang=jq> |
|||
# PEG to jq transcription is based on these equivalences: |
|||
# Sequence:        e1 e2     =>  e1 | e2 |
|||
# Ordered choice:  e1 / e2   =>  e1 // e2 |
|||
# Zero-or-more:    e*        =>  star(E) |
|||
# One-or-more:     e+        =>  plus(E) |
|||
# Optional:        e?        =>  optional(E) |
|||
# And-predicate:   &e        =>  amp(E)   # no input is consumed |
|||
# Not-predicate:   !e        =>  neg(E)   # no input is consumed |
|||
# The idea is to pass a JSON object {remainder:_, result:_ } through a |
|||
# pipeline, consuming the text in .remainder and building up .result. |
|||
def star(E): ((E | star(E)) // .) ; |
|||
def plus(E): E | (plus(E) // . ); |
|||
def optional(E): (E // .); |
|||
def amp(E): . as $in | E | $in; |
|||
def neg(E): select( [E] == [] ); |
|||
### Helper functions: |
|||
# Consume a regular expression rooted at the start of .remainder, or emit empty; |
|||
# on success, update .remainder and set .match but do NOT update .result |
|||
def consume($re): |
|||
# on failure, match yields empty |
|||
(.remainder | match("^" + $re)) as $match |
|||
| .remainder |= .[$match.length :] |
|||
| .match = $match.string; |
|||
def parse($re): |
|||
consume($re) |
|||
| .result = .result + [.match] ; |
|||
def consumeliteral($s): |
|||
select(.remainder | startswith($s)) |
|||
| .remainder |= .[$s | length :] ; |
|||
def literal($s): |
|||
consumeliteral($s) |
|||
| .result += [$s]; |
|||
# Tagging |
|||
def box(E): |
|||
((.result = null) | E) as $e |
|||
| .remainder = $e.remainder |
|||
| .result += [$e.result] # the magic sauce |
|||
; |
|||
def box(name; E): |
|||
((.result = null) | E) as $e |
|||
| .remainder = $e.remainder |
|||
| .result += [{(name): (try ($e.result|join("")) catch $e.result) }] # the magic sauce |
|||
; |
|||
def objectify(E): |
|||
box(E) |
|||
| .result[-1] |= {(.[0]): .[1:]} ; |
|||
def keyvalue(E): |
|||
box(E) |
|||
| .result[-1] |= {(.[0]): .[1]} ; |
|||
# optional whitespace |
|||
def ws: consume("[ \n\r\t]*"); |
|||
def string_except($regex): |
|||
box(star(neg( parse($regex) ) | parse("."))) | .result[-1] |= add; |
|||
</syntaxhighlight> |
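As an illustration of how these combinators compose, the following is a minimal sketch that recognises a comma-separated list of integers. The rules `Int` and `List` and the sample input are hypothetical, introduced only for this example; `star`, `consume` and `parse` are repeated from above so that the snippet can be run on its own with `jq -n`.

<syntaxhighlight lang=jq>
# Repeated from the infrastructure above
def star(E): ((E | star(E)) // .) ;

def consume($re):
  (.remainder | match("^" + $re)) as $match
  | .remainder |= .[$match.length :]
  | .match = $match.string;

def parse($re):
  consume($re)
  | .result = .result + [.match] ;

# Hypothetical grammar:  List <- Int ("," Int)*
def Int: parse("[0-9]+");              # recorded in .result
def List: Int | star(consume(",") | Int);   # the commas are consumed but not recorded

{remainder: "10,20,30;rest"}
| List
| .result
</syntaxhighlight>

This should produce `["10","20","30"]`. Parsing stops at the `;` because `star` falls back to its input once `consume(",")` fails, so the unparsed text `;rest` is left in `.remainder`.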
|||
====XML Grammar==== |
|||
<syntaxhighlight lang=jq> |
|||
def XML: |
|||
def String : ((consume("\"") | parse("[^\"]*") | consume("\"")) // |
|||
(consume("'") | parse("[^']*") | consume("'"))); |
|||
def CDataSec : box("@CDATA"; consume("<!\\[CDATA\\[") | string_except("]]") | consume("]]>") ) | ws; |
|||
def PROLOG : box("@PROLOG"; consume("<\\?xml") | string_except("\\?>") | consume("\\?>")); |
|||
def DTD : box("@DTD"; consume("<!(?!--)") | parse("[^>]*") | consume(">")); # (?!--) : a comment is not a DTD |
|||
def COMMENT : box("@COMMENT"; consume("<!--") | string_except("-->") | consume("-->")); |
|||
def CharData : parse("[^<]+"); # only `<` is disallowed |
|||
def Name : parse("[A-Za-z:_][^/=<>\n\r\t ]*"); |
|||
def Attribute : keyvalue(Name | ws | consume("=") | ws | String | ws); |
|||
def Attributes: box( plus(Attribute) ) | .result[-1] |= {"@attributes": add} ; |
|||
# <foo> must be matched with </foo> |
|||
def Element : |
|||
def Content : star(Element // CDataSec // CharData // COMMENT); |
|||
objectify( consume("<") |
|||
| Name |
|||
| .result[-1] as $name |
|||
| ws |
|||
| (Attributes // ws) |
|||
| ( (consume("/>") |
|||
// (consume(">") | Content | consume("</") | consumeliteral($name) | consume(">"))) |
|||
| ws) ) ; |
|||
{remainder: . } |
|||
| ws |
|||
| optional(PROLOG) | ws |
|||
| optional(DTD) | ws |
|||
| star(COMMENT | ws) |
|||
| Element | ws # for HTML, one would use star(Element) here |
|||
| star(COMMENT | ws) |
|||
| .result; |
|||
</syntaxhighlight> |
|||
====The Task==== |
|||
<syntaxhighlight lang=jq> |
|||
# For handling hex character codes &#x |
|||
def hex2i: |
|||
def toi: if . >= 87 then .-87 else . - 48 end; |
|||
reduce ( ascii_downcase | explode | map(toi) | reverse[]) as $i ([1, 0]; # [power, sum] |
|||
.[1] += $i * .[0] |
|||
| .[0] *= 16 ) |
|||
| .[1]; |
|||
def hexcode2json: |
|||
gsub("&#x(?<x>....);" ; .x | [hex2i] | implode) ; |
|||
def jsonify: |
|||
walk( if type == "array" |
|||
then map(select(type == "string" and test("^\n *$") | not)) |
|||
elif type == "string" then hexcode2json |
|||
else . end); |
|||
# First convert to JSON ... |
|||
XML | jsonify |
|||
# ... and then extract Student Names |
|||
| .[] |
|||
| (.Students[].Student[]["@attributes"] // empty).Name |
|||
</syntaxhighlight> |
|||
'''Invocation''': jq -Rrs -f xml.jq students.xml |
|||
{{output}} |
|||
As above. |
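The hex-code handling can be checked in isolation. The sketch below repeats `hex2i` and `hexcode2json` from the program above so that it can be run on its own with `jq -n`; the sample string is the one used in the description of `jsonify`.

<syntaxhighlight lang=jq>
# Repeated from the program above
def hex2i:
  def toi: if . >= 87 then .-87 else . - 48 end;
  reduce ( ascii_downcase | explode | map(toi) | reverse[]) as $i ([1, 0]; # [power, sum]
      .[1] += $i * .[0]
      | .[0] *= 16 )
  | .[1];

def hexcode2json:
  gsub("&#x(?<x>....);" ; .x | [hex2i] | implode) ;

# "00C9" is hexadecimal for 201, the codepoint of "É"
"&#x00C9;mily" | hexcode2json
</syntaxhighlight>

This yields `"Émily"`. Note that the regular expression expects exactly four characters between `&#x` and `;`, so a shorter reference such as `&#xE9;` would be left unchanged by this filter.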
|||
=={{header|Julia}}== |
=={{header|Julia}}== |