URL parser
You are encouraged to solve this task according to the task description, using any language you may know.
URLs are very common strings with a simple syntax:
scheme://[username:password@]domain[:port]/path?query_string#fragment_id
This task (which has nothing to do with URL encoding or URL decoding) is to parse a well-formed URL to retrieve the relevant information: scheme, domain, path, ...
According to the standards, the characters [!*'();:@&=+$,/?%#[]] only need to be percent-encoded in case of possible confusion, so be wary of naive splits and regular expressions. Note also that the path, query and fragment are case sensitive, even if the scheme and domain are not.
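The case-sensitivity rule can be illustrated with a minimal Go sketch. It assumes the standard net/url package; the helper name normalizeCase is made up for this illustration. Only the scheme and host are lowercased, while the path, query and fragment are left exactly as given.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeCase (an illustrative helper, not part of any library) lowercases
// the case-insensitive parts of a URL, the scheme and the host, and leaves
// the case-sensitive path, query and fragment untouched.
func normalizeCase(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	return u.String(), nil
}

func main() {
	s, err := normalizeCase("HTTP://Example.COM/Over/There?Name=Ferret#Nose")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s) // http://example.com/Over/There?Name=Ferret#Nose
}
```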
The way the returned information is provided (set of variables, array, structure, record, object, ...) is language dependent and left to the programmer, but the code should be clear enough to reuse.
Extra credit is given for clear error diagnostics.
- Here is the official standard: https://tools.ietf.org/html/rfc3986,
- and here a simpler BNF http://www.w3.org/Addressing/URL/5_URI_BNF.html.
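As a starting point, RFC 3986 (Appendix B) gives a regular expression that separates a URI reference into its five generic components. The Go sketch below applies that expression; the Parts type and the splitURI function are names invented for this illustration. It only splits: it neither validates the components nor breaks the authority into user, host and port.

```go
package main

import (
	"fmt"
	"regexp"
)

// uriRE is the regular expression from RFC 3986, Appendix B.
var uriRE = regexp.MustCompile(`^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?`)

// Parts holds the five generic components (illustrative type name).
type Parts struct {
	Scheme, Authority, Path, Query, Fragment string
}

// splitURI separates s into components using the Appendix B expression.
// Every group in the expression is optional, so it matches any input;
// components that are absent come back as empty strings.
func splitURI(s string) Parts {
	m := uriRE.FindStringSubmatch(s)
	return Parts{Scheme: m[2], Authority: m[4], Path: m[5], Query: m[7], Fragment: m[9]}
}

func main() {
	fmt.Printf("%+v\n", splitURI("foo://example.com:8042/over/there?name=ferret#nose"))
	fmt.Printf("%+v\n", splitURI("urn:example:animal:ferret:nose"))
}
```

Note that this expression keeps the leading slash in the path (/over/there) and leaves the port inside the authority (example.com:8042).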
Test cases
According to T. Berners-Lee
foo://example.com:8042/over/there?name=ferret#nose should parse into:
- scheme = foo
- domain = example.com
- port = :8042
- path = over/there
- query = name=ferret
- fragment = nose
urn:example:animal:ferret:nose should parse into:
- scheme = urn
- path = example:animal:ferret:nose
Other URLs that must be parsed include:
- jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true
- ftp://ftp.is.co.za/rfc/rfc1808.txt
- http://www.ietf.org/rfc/rfc2396.txt#header1
- ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two
- mailto:John.Doe@example.com
- news:comp.infosystems.www.servers.unix
- tel:+1-816-555-1212
- telnet://192.0.2.16:80/
- urn:oasis:names:specification:docbook:dtd:xml:4.1.2
Go
This uses Go's standard net/url package. The source code for this package (excluding tests) is in a single file of ~720 lines.
<lang go>package main

import (
	"fmt"
	"log"
	"net"
	"net/url"
)

func main() {
	for _, in := range []string{
		"foo://example.com:8042/over/there?name=ferret#nose",
		"urn:example:animal:ferret:nose",
		"jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true",
		"ftp://ftp.is.co.za/rfc/rfc1808.txt",
		"http://www.ietf.org/rfc/rfc2396.txt#header1",
		"ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two",
		"mailto:John.Doe@example.com",
		"news:comp.infosystems.www.servers.unix",
		"tel:+1-816-555-1212",
		"telnet://192.0.2.16:80/",
		"urn:oasis:names:specification:docbook:dtd:xml:4.1.2",

		"ssh://alice@example.com",
		"https://bob:pass@example.com/place",
		"http://example.com/?a=1&b=2+2&c=3&c=4&d=%65%6e%63%6F%64%65%64",
	} {
		fmt.Println(in)
		u, err := url.Parse(in)
		if err != nil {
			log.Println(err)
			continue
		}
		// Report inputs that do not round-trip unchanged.
		if in != u.String() {
			fmt.Printf("Note: reassembles as %q\n", u)
		}
		printURL(u)
	}
}

func printURL(u *url.URL) {
	fmt.Println(" Scheme:", u.Scheme)
	if u.Opaque != "" {
		fmt.Println(" Opaque:", u.Opaque)
	}
	if u.User != nil {
		fmt.Println(" Username:", u.User.Username())
		if pwd, ok := u.User.Password(); ok {
			fmt.Println(" Password:", pwd)
		}
	}
	if u.Host != "" {
		// SplitHostPort fails when there is no port; print the host as-is then.
		if host, port, err := net.SplitHostPort(u.Host); err == nil {
			fmt.Println(" Host:", host)
			fmt.Println(" Port:", port)
		} else {
			fmt.Println(" Host:", u.Host)
		}
	}
	if u.Path != "" {
		fmt.Println(" Path:", u.Path)
	}
	if u.RawQuery != "" {
		fmt.Println(" RawQuery:", u.RawQuery)
		m, err := url.ParseQuery(u.RawQuery)
		if err == nil {
			for k, v := range m {
				fmt.Printf(" Key: %q Values: %q\n", k, v)
			}
		}
	}
	if u.Fragment != "" {
		fmt.Println(" Fragment:", u.Fragment)
	}
}</lang>
- Output:
foo://example.com:8042/over/there?name=ferret#nose
 Scheme: foo
 Host: example.com
 Port: 8042
 Path: /over/there
 RawQuery: name=ferret
 Key: "name" Values: ["ferret"]
 Fragment: nose
urn:example:animal:ferret:nose
 Scheme: urn
 Opaque: example:animal:ferret:nose
jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true
 Scheme: jdbc
 Opaque: mysql://test_user:ouupppssss@localhost:3306/sakila
 RawQuery: profileSQL=true
 Key: "profileSQL" Values: ["true"]
ftp://ftp.is.co.za/rfc/rfc1808.txt
 Scheme: ftp
 Host: ftp.is.co.za
 Path: /rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt#header1
 Scheme: http
 Host: www.ietf.org
 Path: /rfc/rfc2396.txt
 Fragment: header1
ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two
 Scheme: ldap
 Host: [2001:db8::7]
 Path: /c=GB
 RawQuery: objectClass=one&objectClass=two
 Key: "objectClass" Values: ["one" "two"]
mailto:John.Doe@example.com
 Scheme: mailto
 Opaque: John.Doe@example.com
news:comp.infosystems.www.servers.unix
 Scheme: news
 Opaque: comp.infosystems.www.servers.unix
tel:+1-816-555-1212
 Scheme: tel
 Opaque: +1-816-555-1212
telnet://192.0.2.16:80/
 Scheme: telnet
 Host: 192.0.2.16
 Port: 80
 Path: /
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
 Scheme: urn
 Opaque: oasis:names:specification:docbook:dtd:xml:4.1.2
ssh://alice@example.com
 Scheme: ssh
 Username: alice
 Host: example.com
https://bob:pass@example.com/place
 Scheme: https
 Username: bob
 Password: pass
 Host: example.com
 Path: /place
http://example.com/?a=1&b=2+2&c=3&c=4&d=%65%6e%63%6F%64%65%64
 Scheme: http
 Host: example.com
 Path: /
 RawQuery: a=1&b=2+2&c=3&c=4&d=%65%6e%63%6F%64%65%64
 Key: "a" Values: ["1"]
 Key: "b" Values: ["2 2"]
 Key: "c" Values: ["3" "4"]
 Key: "d" Values: ["encoded"]
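The task offers extra credit for clear error diagnostics. With net/url, a failed url.Parse returns a *url.Error whose Op, URL and Err fields can be reported separately. The following is a minimal sketch of that idea; the helper name diagnose and the two malformed inputs are chosen only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// diagnose (an illustrative helper) returns a description of why raw failed
// to parse, or the empty string if url.Parse accepted it.
func diagnose(raw string) string {
	_, err := url.Parse(raw)
	if err == nil {
		return ""
	}
	// url.Parse wraps its failures in *url.Error; unpack it when possible.
	var uerr *url.Error
	if errors.As(err, &uerr) {
		return fmt.Sprintf("cannot %s %q: %v", uerr.Op, uerr.URL, uerr.Err)
	}
	return err.Error()
}

func main() {
	for _, raw := range []string{
		"foo://example.com:8042/over/there?name=ferret#nose",
		"http://[2001:db8::7/c=GB", // closing bracket missing
		"http://example.com:port/", // non-numeric port
	} {
		if msg := diagnose(raw); msg != "" {
			fmt.Println("error:", msg)
		} else {
			fmt.Println("ok:", raw)
		}
	}
}
```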
J
As most errors are contextual (e.g. invalid authority, invalid path, unrecognized scheme), we shall defer error testing to the relevant consumers. This might offend some on the grounds of temporary safety, but consumers already bear responsibility to parse and validate their relevant uri element(s).
Our parsing strategy is fixed format recursive descent. (Please do not criticize this on efficiency grounds without first investigating the implementations of other parsers.)
Implementation:
<lang J>split=:1 :0
  ({. ; ] }.~ 1+[)~ i.&m
)

uriparts=:3 :0
  'server fragment'=. '#' split y
  'sa query'=. '?' split server
  'scheme authpath'=. ':' split sa
  scheme;authpath;query;fragment
)

queryparts=:3 :0
  (0<#y)#('='split);._1 '&',y
)

authpathparts=:3 :0
  if. '//' -: 2{.y do.
    split=. <;.1 y
    (}.1{::split);;2}.split
  else.
    '';y
  end.
)

authparts=:3 :0
  if. '@' e. y do.
    'userinfo hostport'=. '@' split y
  else.
    userinfo=. '' [ hostport=. y
  end.
  if. '[' = {.hostport do.
    'host_t port_t'=. ']' split hostport
    assert. (0=#port_t)+.':'={.port_t
    (':' split userinfo),(host_t,']');}.port_t
  else.
    (':' split userinfo),':' split hostport
  end.
)

taskparts=:3 :0
  'scheme authpath querystring fragment'=. uriparts y
  'auth path'=. authpathparts authpath
  'user pass host port'=. authparts auth
  query=. queryparts querystring
  export=. ;:'scheme user pass host port path query fragment'
  (#~ 0<#@>@{:"1) (,. do each) export
)</lang>
Task examples:
<lang j>   taskparts 'foo://example.com:8042/over/there?name=ferret#nose'
┌────────┬─────────────┐
│scheme  │foo          │
├────────┼─────────────┤
│host    │example.com  │
├────────┼─────────────┤
│port    │8042         │
├────────┼─────────────┤
│path    │/over/there  │
├────────┼─────────────┤
│query   │┌────┬──────┐│
│        ││name│ferret││
│        │└────┴──────┘│
├────────┼─────────────┤
│fragment│nose         │
└────────┴─────────────┘
   taskparts 'urn:example:animal:ferret:nose'
┌──────┬──────────────────────────┐
│scheme│urn                       │
├──────┼──────────────────────────┤
│path  │example:animal:ferret:nose│
└──────┴──────────────────────────┘
   taskparts 'jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true'
┌──────┬──────────────────────────────────────────────────┐
│scheme│jdbc                                              │
├──────┼──────────────────────────────────────────────────┤
│path  │mysql://test_user:ouupppssss@localhost:3306/sakila│
├──────┼──────────────────────────────────────────────────┤
│query │┌──────────┬────┐                                 │
│      ││profileSQL│true│                                 │
│      │└──────────┴────┘                                 │
└──────┴──────────────────────────────────────────────────┘
   taskparts 'ftp://ftp.is.co.za/rfc/rfc1808.txt'
┌──────┬────────────────┐
│scheme│ftp             │
├──────┼────────────────┤
│host  │ftp.is.co.za    │
├──────┼────────────────┤
│path  │/rfc/rfc1808.txt│
└──────┴────────────────┘
   taskparts 'http://www.ietf.org/rfc/rfc2396.txt#header1'
┌────────┬────────────────┐
│scheme  │http            │
├────────┼────────────────┤
│host    │www.ietf.org    │
├────────┼────────────────┤
│path    │/rfc/rfc2396.txt│
├────────┼────────────────┤
│fragment│header1         │
└────────┴────────────────┘
   taskparts 'ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two'
┌──────┬─────────────────┐
│scheme│ldap             │
├──────┼─────────────────┤
│host  │[2001:db8::7]    │
├──────┼─────────────────┤
│path  │/c=GB            │
├──────┼─────────────────┤
│query │┌───────────┬───┐│
│      ││objectClass│one││
│      │├───────────┼───┤│
│      ││objectClass│two││
│      │└───────────┴───┘│
└──────┴─────────────────┘
   taskparts 'mailto:John.Doe@example.com'
┌──────┬────────────────────┐
│scheme│mailto              │
├──────┼────────────────────┤
│path  │John.Doe@example.com│
└──────┴────────────────────┘
   taskparts 'news:comp.infosystems.www.servers.unix'
┌──────┬─────────────────────────────────┐
│scheme│news                             │
├──────┼─────────────────────────────────┤
│path  │comp.infosystems.www.servers.unix│
└──────┴─────────────────────────────────┘
   taskparts 'tel:+1-816-555-1212'
┌──────┬───────────────┐
│scheme│tel            │
├──────┼───────────────┤
│path  │+1-816-555-1212│
└──────┴───────────────┘
   taskparts 'telnet://192.0.2.16:80/'
┌──────┬──────────┐
│scheme│telnet    │
├──────┼──────────┤
│host  │192.0.2.16│
├──────┼──────────┤
│port  │80        │
├──────┼──────────┤
│path  │/         │
└──────┴──────────┘
   taskparts 'urn:oasis:names:specification:docbook:dtd:xml:4.1.2'
┌──────┬───────────────────────────────────────────────┐
│scheme│urn                                            │
├──────┼───────────────────────────────────────────────┤
│path  │oasis:names:specification:docbook:dtd:xml:4.1.2│
└──────┴───────────────────────────────────────────────┘</lang>
Note that the path of the example jdbc uri is itself a uri which may be parsed:
<lang J>   taskparts 'mysql://test_user:ouupppssss@localhost:3306/sakila'
┌──────┬──────────┐
│scheme│mysql     │
├──────┼──────────┤
│user  │test_user │
├──────┼──────────┤
│pass  │ouupppssss│
├──────┼──────────┤
│host  │localhost │
├──────┼──────────┤
│port  │3306      │
├──────┼──────────┤
│path  │/sakila   │
└──────┴──────────┘</lang>
Also, examples borrowed from the Go implementation:
<lang J>   taskparts 'ssh://alice@example.com'
┌──────┬───────────┐
│scheme│ssh        │
├──────┼───────────┤
│user  │alice      │
├──────┼───────────┤
│host  │example.com│
└──────┴───────────┘
   taskparts 'https://bob:pass@example.com/place'
┌──────┬───────────┐
│scheme│https      │
├──────┼───────────┤
│user  │bob        │
├──────┼───────────┤
│pass  │pass       │
├──────┼───────────┤
│host  │example.com│
├──────┼───────────┤
│path  │/place     │
└──────┴───────────┘
   taskparts 'http://example.com/?a=1&b=2+2&c=3&c=4&d=%65%6e%63%6F%64%65%64'
┌──────┬─────────────────────────┐
│scheme│http                     │
├──────┼─────────────────────────┤
│host  │example.com              │
├──────┼─────────────────────────┤
│path  │/                        │
├──────┼─────────────────────────┤
│query │┌─┬─────────────────────┐│
│      ││a│1                    ││
│      │├─┼─────────────────────┤│
│      ││b│2+2                  ││
│      │├─┼─────────────────────┤│
│      ││c│3                    ││
│      │├─┼─────────────────────┤│
│      ││c│4                    ││
│      │├─┼─────────────────────┤│
│      ││d│%65%6e%63%6F%64%65%64││
│      │└─┴─────────────────────┘│
└──────┴─────────────────────────┘</lang>
Note that escape decoding is left to the consumer (as well as decoding things like '+' as a replacement for the space character and determining the absolute significance of relative paths and the details of ip address parsing and so on...). This seems like a good match to the hierarchical nature of uri parsing. See URL decoding for an implementation of escape decoding.
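To illustrate why this decoding is context dependent, here is a small sketch in Go (not J) using the standard net/url functions QueryUnescape and PathUnescape. The helper name decodeBoth is invented for this example. The same raw text decodes differently depending on whether the consumer treats it as a query value, where '+' stands for a space, or as a path segment, where '+' is literal.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeBoth (an illustrative helper) decodes raw twice: once under
// query-string rules and once under path rules.
func decodeBoth(raw string) (asQuery, asPath string, err error) {
	if asQuery, err = url.QueryUnescape(raw); err != nil {
		return "", "", err
	}
	if asPath, err = url.PathUnescape(raw); err != nil {
		return "", "", err
	}
	return asQuery, asPath, nil
}

func main() {
	for _, raw := range []string{"2+2", "%65%6e%63%6F%64%65%64"} {
		q, p, err := decodeBoth(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> as query %q, as path %q\n", raw, q, p)
	}
}
```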
Note that taskparts was engineered specifically for the requirements of this task -- in idiomatic use you should instead expect to call the relevant ____parts routines directly, as illustrated by the first four lines of taskparts.
Note that w3c recommends a handling for query strings which differs from that of RFC-3986. For example, the use of ; as a replacement for the & delimiter, or the use of the query element name as the query element value when the = delimiter is omitted from the name/value pair. We do not implement that here, as it's not a part of this task. But that sort of implementation could be achieved by replacing the definition of queryparts. And, of course, other treatments of query strings are also possible, should that become necessary...
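For illustration only, the two variations mentioned above might be sketched as follows, in Go rather than J. The type Pair and the function splitQueryLenient are hypothetical names. The sketch accepts both & and ; as delimiters and reuses the name as the value when = is absent. It is one reading of the behaviour described here, not a reference implementation of any recommendation.

```go
package main

import (
	"fmt"
	"strings"
)

// Pair is one name/value element of a query string (illustrative type).
type Pair struct{ Name, Value string }

// splitQueryLenient splits q on '&' or ';'. An element without '=' gets its
// own name as its value. Empty elements are skipped; nothing is decoded.
func splitQueryLenient(q string) []Pair {
	var out []Pair
	isDelim := func(r rune) bool { return r == '&' || r == ';' }
	for _, elem := range strings.FieldsFunc(q, isDelim) {
		name, value, found := strings.Cut(elem, "=")
		if !found {
			value = name // bare name: reuse it as the value
		}
		out = append(out, Pair{Name: name, Value: value})
	}
	return out
}

func main() {
	fmt.Println(splitQueryLenient("objectClass=one;objectClass=two&verbose"))
}
```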
Racket
Links: url structure in Racket documentation.
<lang racket>#lang racket/base
(require racket/match net/url)

(define (debug-url-string U)
  (match-define (url s u h p pa? (list (path/param pas prms) ...) q f) (string->url U))
  (printf "URL: ~s~%" U)
  (printf "-----~a~%" (make-string (string-length (format "~s" U)) #\-))
  (when #t (printf "scheme: ~s~%" s))
  (when u (printf "user: ~s~%" u))
  (when h (printf "host: ~s~%" h))
  (when p (printf "port: ~s~%" p))
  ;; From documentation link in text:
  ;; > For Unix paths, the root directory is not included in `path';
  ;; > its presence or absence is implicit in the path-absolute? flag.
  (printf "path-absolute?: ~s~%" pa?)
  (printf "path bits: ~s~%" pas)
  ;; prms will often be a list of lists. this will print iff
  ;; one of the inner lists is not null
  (when (memf pair? prms)
    (printf "param bits: ~s [interleaved with path bits]~%" prms))
  (unless (null? q) (printf "query: ~s~%" q))
  (when f (printf "fragment: ~s~%" f))
  (newline))

(for-each
 debug-url-string
 '("foo://example.com:8042/over/there?name=ferret#nose"
   "urn:example:animal:ferret:nose"
   "jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true"
   "ftp://ftp.is.co.za/rfc/rfc1808.txt"
   "http://www.ietf.org/rfc/rfc2396.txt#header1"
   "ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two"
   "mailto:John.Doe@example.com"
   "news:comp.infosystems.www.servers.unix"
   "tel:+1-816-555-1212"
   "telnet://192.0.2.16:80/"
   "urn:oasis:names:specification:docbook:dtd:xml:4.1.2"))</lang>
- Output:
URL: "foo://example.com:8042/over/there?name=ferret#nose"
---------------------------------------------------------
scheme: "foo"
host: "example.com"
port: 8042
path-absolute?: #t
path bits: ("over" "there")
query: ((name . "ferret"))
fragment: "nose"

URL: "urn:example:animal:ferret:nose"
-------------------------------------
scheme: "urn"
path-absolute?: #f
path bits: ("example:animal:ferret:nose")

URL: "jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true"
------------------------------------------------------------------------------
scheme: "jdbc"
path-absolute?: #f
path bits: ("mysql:" "" "test_user:ouupppssss@localhost:3306" "sakila")
query: ((profileSQL . "true"))

URL: "ftp://ftp.is.co.za/rfc/rfc1808.txt"
-----------------------------------------
scheme: "ftp"
host: "ftp.is.co.za"
path-absolute?: #t
path bits: ("rfc" "rfc1808.txt")

URL: "http://www.ietf.org/rfc/rfc2396.txt#header1"
--------------------------------------------------
scheme: "http"
host: "www.ietf.org"
path-absolute?: #t
path bits: ("rfc" "rfc2396.txt")
fragment: "header1"

URL: "ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two"
----------------------------------------------------------------
scheme: "ldap"
host: "[2001"
path-absolute?: #f
path bits: ("db8::7]" "c=GB")
query: ((objectClass . "one") (objectClass . "two"))

IPv6 URL address parses incorrectly. See issue https://github.com/plt/racket/issues/980

URL: "mailto:John.Doe@example.com"
----------------------------------
scheme: "mailto"
path-absolute?: #f
path bits: ("John.Doe@example.com")

URL: "news:comp.infosystems.www.servers.unix"
---------------------------------------------
scheme: "news"
path-absolute?: #f
path bits: ("comp.infosystems.www.servers.unix")

URL: "tel:+1-816-555-1212"
--------------------------
scheme: "tel"
path-absolute?: #f
path bits: ("+1-816-555-1212")

URL: "telnet://192.0.2.16:80/"
------------------------------
scheme: "telnet"
host: "192.0.2.16"
port: 80
path-absolute?: #t
path bits: ("")

URL: "urn:oasis:names:specification:docbook:dtd:xml:4.1.2"
----------------------------------------------------------
scheme: "urn"
path-absolute?: #f
path bits: ("oasis:names:specification:docbook:dtd:xml:4.1.2")
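The ldap result above shows what happens when an authority is split at its first colon. A bracket-aware split avoids this by looking for the closing ] before looking for the port delimiter. The sketch below shows the idea in Go; the function name splitHostPort is chosen for this example, and it assumes any userinfo has already been removed from the authority.

```go
package main

import (
	"fmt"
	"strings"
)

// splitHostPort (illustrative) separates host from an optional port. A host
// that starts with '[' is taken to run through the matching ']', so colons
// inside an IPv6 literal are not mistaken for the port delimiter.
func splitHostPort(hostport string) (host, port string, err error) {
	if strings.HasPrefix(hostport, "[") {
		end := strings.IndexByte(hostport, ']')
		if end < 0 {
			return "", "", fmt.Errorf("missing ']' in %q", hostport)
		}
		rest := hostport[end+1:]
		if rest != "" && !strings.HasPrefix(rest, ":") {
			return "", "", fmt.Errorf("unexpected %q after ']' in %q", rest, hostport)
		}
		return hostport[:end+1], strings.TrimPrefix(rest, ":"), nil
	}
	// No brackets: the first colon, if any, introduces the port.
	host, port, _ = strings.Cut(hostport, ":")
	return host, port, nil
}

func main() {
	for _, hp := range []string{"[2001:db8::7]", "[2001:db8::7]:389", "example.com:8042", "[2001:db8::7"} {
		host, port, err := splitHostPort(hp)
		fmt.Printf("%q -> host %q, port %q, err %v\n", hp, host, port, err)
	}
}
```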
Tcl
Tcllib's uri package already knows how to decompose many kinds of URIs. The implementation is a quite readable example of this kind of parsing. For this task, we'll use it directly.
Schemes can be added with uri::register, but the rules for this task assume HTTP-style decomposition for unknown schemes, which is done below by reaching into the documented interfaces $::uri::schemes and uri::SplitHttp.
For some URI types (such as urn, news, mailto), this provides more information than the task description demands, which is simply to parse them all as HTTP URIs.
The uri package doesn't presently handle IPv6 syntax as used in the example: a bug and patch will be submitted presently.
<lang Tcl>package require uri
package require uri::urn

# a little bit of trickery to format results:
proc pdict {d} {
    array set \t $d
    parray \t
}

proc parse_uri {uri} {
    regexp {^(.*?):(.*)$} $uri -> scheme rest
    if {$scheme in $::uri::schemes} {
        # uri already knows how to split it:
        set parts [uri::split $uri]
    } else {
        # parse as though it's http:
        set parts [uri::SplitHttp $rest]
        dict set parts scheme $scheme
    }
    dict filter $parts value ?*     ;# omit empty sections
}

set tests {
    foo://example.com:8042/over/there?name=ferret#nose
    urn:example:animal:ferret:nose
    jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true
    ftp://ftp.is.co.za/rfc/rfc1808.txt
    http://www.ietf.org/rfc/rfc2396.txt#header1
    ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two
    mailto:John.Doe@example.com
    news:comp.infosystems.www.servers.unix
    tel:+1-816-555-1212
    telnet://192.0.2.16:80/
    urn:oasis:names:specification:docbook:dtd:xml:4.1.2
}

foreach uri $tests {
    puts \n$uri
    pdict [parse_uri $uri]
}</lang>
- Output:
foo://example.com:8042/over/there?name=ferret#nose
(fragment) = nose
(host) = example.com
(path) = over/there
(port) = 8042
(query) = name=ferret
(scheme) = foo

urn:example:animal:ferret:nose
(nid) = example
(nss) = animal:ferret:nose
(scheme) = urn

jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true
(path) = mysql://test_user:ouupppssss@localhost:3306/sakila
(query) = profileSQL=true
(scheme) = jdbc

ftp://ftp.is.co.za/rfc/rfc1808.txt
(host) = ftp.is.co.za
(path) = rfc/rfc1808.txt
(scheme) = ftp

http://www.ietf.org/rfc/rfc2396.txt#header1
(fragment) = header1
(host) = www.ietf.org
(path) = rfc/rfc2396.txt
(scheme) = http

ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two
(host) = [2001
(scheme) = ldap

mailto:John.Doe@example.com
(host) = example.com
(scheme) = mailto
(user) = John.Doe

news:comp.infosystems.www.servers.unix
(newsgroup-name) = comp.infosystems.www.servers.unix
(scheme) = news

tel:+1-816-555-1212
(path) = +1-816-555-1212
(scheme) = tel

telnet://192.0.2.16:80/
(host) = 192.0.2.16
(port) = 80
(scheme) = telnet

urn:oasis:names:specification:docbook:dtd:xml:4.1.2
(nid) = oasis
(nss) = names:specification:docbook:dtd:xml:4.1.2
(scheme) = urn