URL parser

From Rosetta Code
Revision as of 07:19, 25 July 2015 by Rdm (talk | contribs) ({{header|J}})
Task
URL parser
You are encouraged to solve this task according to the task description, using any language you may know.

URLs are very common strings with a simple syntax:

scheme://[username:password@]domain[:port]/path?query_string#fragment_id

This task (which has nothing to do with URL encoding or URL decoding) is to parse a well-formed URL to retrieve the relevant information: scheme, domain, path, and so on.

According to the standards, the characters [!*'();:@&=+$,/?%#[]] only need to be percent-encoded in case of possible confusion, so be careful with splits and regular expressions. Note also that the path, query and fragment are case sensitive, even if the scheme and domain are not.

The way the returned information is provided (set of variables, array, structure, record, object, ...) is language dependent and left to the programmer, but the code should be clear enough to reuse.

Extra credit is given for clear error diagnostics.



Test cases

According to T. Berners-Lee

foo://example.com:8042/over/there?name=ferret#nose should parse into:

  • scheme = foo
  • domain = example.com
  • port = :8042
  • path = over/there
  • query = name=ferret
  • fragment = nose

urn:example:animal:ferret:nose should parse into:

  • scheme = urn
  • path = example:animal:ferret:nose

Other URLs that must be parsed include:

  • jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true
  • ftp://ftp.is.co.za/rfc/rfc1808.txt
  • http://www.ietf.org/rfc/rfc2396.txt#header1
  • ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two
  • mailto:John.Doe@example.com
  • news:comp.infosystems.www.servers.unix
  • tel:+1-816-555-1212
  • telnet://192.0.2.16:80/
  • urn:oasis:names:specification:docbook:dtd:xml:4.1.2
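
As a language-neutral starting point, the generic component split can be sketched with the regular expression given in RFC 3986, Appendix B. The Go snippet below is only an illustration: the `Parts` type and the `SplitURI` function are names invented here, and the sketch neither validates nor decodes anything and does not split the authority into user, host and port.

```go
package main

import (
	"fmt"
	"regexp"
)

// uriRE is the component-splitting expression from RFC 3986, Appendix B.
// Every group is optional, so it matches any input string.
var uriRE = regexp.MustCompile(`^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?`)

// Parts holds the five generic components (hypothetical type for this sketch).
type Parts struct {
	Scheme, Authority, Path, Query, Fragment string
}

// SplitURI splits s into its generic components without validating
// or percent-decoding any of them.
func SplitURI(s string) Parts {
	m := uriRE.FindStringSubmatch(s)
	return Parts{Scheme: m[2], Authority: m[4], Path: m[5], Query: m[7], Fragment: m[9]}
}

func main() {
	for _, s := range []string{
		"foo://example.com:8042/over/there?name=ferret#nose",
		"urn:example:animal:ferret:nose",
	} {
		fmt.Printf("%+v\n", SplitURI(s))
	}
}
```

Because the scheme group stops at the first colon, a URN ends up with everything after `urn:` in the path, which matches the expected result in the second test case above.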

Go

This uses Go's standard net/url package. The source code for this package (excluding tests) is in a single file of ~720 lines.

<lang go>package main

import (
	"fmt"
	"log"
	"net"
	"net/url"
)

func main() {
	for _, in := range []string{
		"foo://example.com:8042/over/there?name=ferret#nose",
		"urn:example:animal:ferret:nose",
		"jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true",
		"ftp://ftp.is.co.za/rfc/rfc1808.txt",
		"http://www.ietf.org/rfc/rfc2396.txt#header1",
		"ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two",
		"mailto:John.Doe@example.com",
		"news:comp.infosystems.www.servers.unix",
		"tel:+1-816-555-1212",
		"telnet://192.0.2.16:80/",
		"urn:oasis:names:specification:docbook:dtd:xml:4.1.2",

		"ssh://alice@example.com",
		"https://bob:pass@example.com/place",
		"http://example.com/?a=1&b=2+2&c=3&c=4&d=%65%6e%63%6F%64%65%64",
	} {
		fmt.Println(in)
		u, err := url.Parse(in)
		if err != nil {
			log.Println(err)
			continue
		}
		// Report any input that does not round-trip unchanged.
		if in != u.String() {
			fmt.Printf("Note: reassembles as %q\n", u)
		}
		printURL(u)
	}
}

func printURL(u *url.URL) {
	fmt.Println("    Scheme:", u.Scheme)
	if u.Opaque != "" {
		fmt.Println("    Opaque:", u.Opaque)
	}
	if u.User != nil {
		fmt.Println("    Username:", u.User.Username())
		if pwd, ok := u.User.Password(); ok {
			fmt.Println("    Password:", pwd)
		}
	}
	if u.Host != "" {
		// SplitHostPort fails when there is no port; fall back to the raw host.
		if host, port, err := net.SplitHostPort(u.Host); err == nil {
			fmt.Println("    Host:", host)
			fmt.Println("    Port:", port)
		} else {
			fmt.Println("    Host:", u.Host)
		}
	}
	if u.Path != "" {
		fmt.Println("    Path:", u.Path)
	}
	if u.RawQuery != "" {
		fmt.Println("    RawQuery:", u.RawQuery)
		m, err := url.ParseQuery(u.RawQuery)
		if err == nil {
			for k, v := range m {
				fmt.Printf("        Key: %q Values: %q\n", k, v)
			}
		}
	}
	if u.Fragment != "" {
		fmt.Println("    Fragment:", u.Fragment)
	}
}</lang>

Output:
foo://example.com:8042/over/there?name=ferret#nose
    Scheme: foo
    Host: example.com
    Port: 8042
    Path: /over/there
    RawQuery: name=ferret
        Key: "name" Values: ["ferret"]
    Fragment: nose
urn:example:animal:ferret:nose
    Scheme: urn
    Opaque: example:animal:ferret:nose
jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true
    Scheme: jdbc
    Opaque: mysql://test_user:ouupppssss@localhost:3306/sakila
    RawQuery: profileSQL=true
        Key: "profileSQL" Values: ["true"]
ftp://ftp.is.co.za/rfc/rfc1808.txt
    Scheme: ftp
    Host: ftp.is.co.za
    Path: /rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt#header1
    Scheme: http
    Host: www.ietf.org
    Path: /rfc/rfc2396.txt
    Fragment: header1
ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two
    Scheme: ldap
    Host: [2001:db8::7]
    Path: /c=GB
    RawQuery: objectClass=one&objectClass=two
        Key: "objectClass" Values: ["one" "two"]
mailto:John.Doe@example.com
    Scheme: mailto
    Opaque: John.Doe@example.com
news:comp.infosystems.www.servers.unix
    Scheme: news
    Opaque: comp.infosystems.www.servers.unix
tel:+1-816-555-1212
    Scheme: tel
    Opaque: +1-816-555-1212
telnet://192.0.2.16:80/
    Scheme: telnet
    Host: 192.0.2.16
    Port: 80
    Path: /
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
    Scheme: urn
    Opaque: oasis:names:specification:docbook:dtd:xml:4.1.2
ssh://alice@example.com
    Scheme: ssh
    Username: alice
    Host: example.com
https://bob:pass@example.com/place
    Scheme: https
    Username: bob
    Password: pass
    Host: example.com
    Path: /place
http://example.com/?a=1&b=2+2&c=3&c=4&d=%65%6e%63%6F%64%65%64
    Scheme: http
    Host: example.com
    Path: /
    RawQuery: a=1&b=2+2&c=3&c=4&d=%65%6e%63%6F%64%65%64
        Key: "a" Values: ["1"]
        Key: "b" Values: ["2 2"]
        Key: "c" Values: ["3" "4"]
        Key: "d" Values: ["encoded"]
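
The program above only logs whatever error url.Parse returns. For the extra-credit diagnostics, one possible approach (a sketch, not part of the original solution) is to unwrap the *url.Error that net/url returns and report the failed operation, the offending input and the underlying cause separately. The `describe` helper is a name invented for this illustration. It also uses the Hostname and Port methods of url.URL, which strip the brackets from an IPv6 literal and cope with a missing port.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// describe parses raw and returns either a short host/port summary or a
// diagnostic naming the failed operation, the input and the underlying cause.
func describe(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		var uerr *url.Error
		if errors.As(err, &uerr) {
			return "", fmt.Errorf("cannot %s %q: %w", uerr.Op, uerr.URL, uerr.Err)
		}
		return "", err
	}
	return fmt.Sprintf("host=%q port=%q", u.Hostname(), u.Port()), nil
}

func main() {
	for _, raw := range []string{
		"ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two",
		"telnet://192.0.2.16:80/",
		"http://[::1", // unterminated IPv6 literal
		":foo",        // empty scheme
	} {
		if s, err := describe(raw); err != nil {
			fmt.Println("error:", err)
		} else {
			fmt.Println(raw, "->", s)
		}
	}
}
```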

J

As most errors are contextual (e.g. invalid authority, invalid path, unrecognized scheme), we shall defer error testing to the relevant consumers. This might offend some on the grounds of temporary safety, but consumers already bear responsibility to parse and validate their relevant uri element(s).

Our parsing strategy is fixed format recursive descent. (Please do not criticize this on efficiency grounds without first investigating the implementations of other parsers.)
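
For readers who do not read J, the order of the splits can be sketched in Go. This is a loose illustration of the strategy only, not a translation of the J implementation: the `URI` type and the `splitURI` function are hypothetical names, and, like the J code, the sketch performs no validation.

```go
package main

import (
	"fmt"
	"strings"
)

// URI holds the raw, undecoded pieces of a URI. The type and its field
// names are hypothetical, chosen for this illustration only.
type URI struct {
	Scheme, User, Pass, Host, Port, Path, Query, Fragment string
}

// splitURI peels delimiters off in a fixed order: the first '#', then the
// first '?', then the first ':', and only then a leading "//" authority.
func splitURI(s string) URI {
	var u URI
	s, u.Fragment, _ = strings.Cut(s, "#")
	s, u.Query, _ = strings.Cut(s, "?")
	u.Scheme, s, _ = strings.Cut(s, ":")

	auth := ""
	if strings.HasPrefix(s, "//") {
		// The authority runs up to the next '/'; the rest is the path.
		auth = s[2:]
		if i := strings.IndexByte(auth, '/'); i >= 0 {
			auth, u.Path = auth[:i], auth[i:]
		}
	} else {
		u.Path = s
	}

	hostport := auth
	if userinfo, rest, found := strings.Cut(auth, "@"); found {
		hostport = rest
		u.User, u.Pass, _ = strings.Cut(userinfo, ":")
	}
	if strings.HasPrefix(hostport, "[") {
		// Bracketed IPv6 literal: colons inside the brackets are not port separators.
		host, port, _ := strings.Cut(hostport, "]")
		u.Host, u.Port = host+"]", strings.TrimPrefix(port, ":")
	} else {
		u.Host, u.Port, _ = strings.Cut(hostport, ":")
	}
	return u
}

func main() {
	fmt.Printf("%+v\n", splitURI("ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two"))
	fmt.Printf("%+v\n", splitURI("https://bob:pass@example.com/place"))
}
```

Splitting off the fragment and query first means that a ':' or '/' appearing later in the string can never be mistaken for the scheme or authority delimiter.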

Implementation:

<lang J>split=:1 :0
  ({. ; ] }.~ 1+[)~ i.&m
)

uriparts=:3 :0
  'server fragment'=. '#' split y
  'sa query'=. '?' split server
  'scheme authpath'=. ':' split sa
  scheme;authpath;query;fragment
)

queryparts=:3 :0
  (0<#y)#('='split);._1 '&',y
)

authpathparts=:3 :0
  if. '//' -: 2{.y do.
    split=. <;.1 y
    (}.1{::split);;2}.split
  else.
    '';y
  end.
)

authparts=:3 :0
  if. '@' e. y do.
    'userinfo hostport'=. '@' split y
  else.
    userinfo=. '' [ hostport=. y
  end.
  if. '[' = {.hostport do.
     'host_t port_t'=. ']' split hostport
     assert. (0=#port_t)+.':'={.port_t
     (':' split userinfo),(host_t,']');}.port_t
  else.
     (':' split userinfo),':' split hostport
  end.
)

taskparts=:3 :0
  'scheme authpath querystring fragment'=. uriparts y
  'auth path'=. authpathparts authpath
  'user pass host port'=. authparts auth
  query=. queryparts querystring
  export=. ;:'scheme user pass host port path query fragment'
  (#~ 0<#@>@{:"1) (,. do each) export
)</lang>

Task examples:

<lang j>   taskparts 'foo://example.com:8042/over/there?name=ferret#nose'
┌────────┬─────────────┐
│scheme  │foo          │
├────────┼─────────────┤
│host    │example.com  │
├────────┼─────────────┤
│port    │8042         │
├────────┼─────────────┤
│path    │/over/there  │
├────────┼─────────────┤
│query   │┌────┬──────┐│
│        ││name│ferret││
│        │└────┴──────┘│
├────────┼─────────────┤
│fragment│nose         │
└────────┴─────────────┘

   taskparts 'urn:example:animal:ferret:nose'
┌──────┬──────────────────────────┐
│scheme│urn                       │
├──────┼──────────────────────────┤
│path  │example:animal:ferret:nose│
└──────┴──────────────────────────┘

   taskparts 'jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true'
┌──────┬──────────────────────────────────────────────────┐
│scheme│jdbc                                              │
├──────┼──────────────────────────────────────────────────┤
│path  │mysql://test_user:ouupppssss@localhost:3306/sakila│
├──────┼──────────────────────────────────────────────────┤
│query │┌──────────┬────┐                                 │
│      ││profileSQL│true│                                 │
│      │└──────────┴────┘                                 │
└──────┴──────────────────────────────────────────────────┘

   taskparts 'ftp://ftp.is.co.za/rfc/rfc1808.txt'
┌──────┬────────────────┐
│scheme│ftp             │
├──────┼────────────────┤
│host  │ftp.is.co.za    │
├──────┼────────────────┤
│path  │/rfc/rfc1808.txt│
└──────┴────────────────┘

   taskparts 'http://www.ietf.org/rfc/rfc2396.txt#header1'
┌────────┬────────────────┐
│scheme  │http            │
├────────┼────────────────┤
│host    │www.ietf.org    │
├────────┼────────────────┤
│path    │/rfc/rfc2396.txt│
├────────┼────────────────┤
│fragment│header1         │
└────────┴────────────────┘

   taskparts 'ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two'
┌──────┬─────────────────┐
│scheme│ldap             │
├──────┼─────────────────┤
│host  │[2001:db8::7]    │
├──────┼─────────────────┤
│path  │/c=GB            │
├──────┼─────────────────┤
│query │┌───────────┬───┐│
│      ││objectClass│one││
│      │├───────────┼───┤│
│      ││objectClass│two││
│      │└───────────┴───┘│
└──────┴─────────────────┘

   taskparts 'mailto:John.Doe@example.com'
┌──────┬────────────────────┐
│scheme│mailto              │
├──────┼────────────────────┤
│path  │John.Doe@example.com│
└──────┴────────────────────┘

   taskparts 'news:comp.infosystems.www.servers.unix'
┌──────┬─────────────────────────────────┐
│scheme│news                             │
├──────┼─────────────────────────────────┤
│path  │comp.infosystems.www.servers.unix│
└──────┴─────────────────────────────────┘

   taskparts 'tel:+1-816-555-1212'
┌──────┬───────────────┐
│scheme│tel            │
├──────┼───────────────┤
│path  │+1-816-555-1212│
└──────┴───────────────┘

   taskparts 'telnet://192.0.2.16:80/'
┌──────┬──────────┐
│scheme│telnet    │
├──────┼──────────┤
│host  │192.0.2.16│
├──────┼──────────┤
│port  │80        │
├──────┼──────────┤
│path  │/         │
└──────┴──────────┘

   taskparts 'urn:oasis:names:specification:docbook:dtd:xml:4.1.2'
┌──────┬───────────────────────────────────────────────┐
│scheme│urn                                            │
├──────┼───────────────────────────────────────────────┤
│path  │oasis:names:specification:docbook:dtd:xml:4.1.2│
└──────┴───────────────────────────────────────────────┘</lang>

Note that the path of the example jdbc uri is itself a uri which may be parsed:

<lang J>   taskparts 'mysql://test_user:ouupppssss@localhost:3306/sakila'
┌──────┬──────────┐
│scheme│mysql     │
├──────┼──────────┤
│user  │test_user │
├──────┼──────────┤
│pass  │ouupppssss│
├──────┼──────────┤
│host  │localhost │
├──────┼──────────┤
│port  │3306      │
├──────┼──────────┤
│path  │/sakila   │
└──────┴──────────┘</lang>

Also, examples borrowed from the Go implementation:

<lang J>   taskparts 'ssh://alice@example.com'
┌──────┬───────────┐
│scheme│ssh        │
├──────┼───────────┤
│user  │alice      │
├──────┼───────────┤
│host  │example.com│
└──────┴───────────┘

   taskparts 'https://bob:pass@example.com/place'
┌──────┬───────────┐
│scheme│https      │
├──────┼───────────┤
│user  │bob        │
├──────┼───────────┤
│pass  │pass       │
├──────┼───────────┤
│host  │example.com│
├──────┼───────────┤
│path  │/place     │
└──────┴───────────┘

   taskparts 'http://example.com/?a=1&b=2+2&c=3&c=4&d=%65%6e%63%6F%64%65%64'
┌──────┬─────────────────────────┐
│scheme│http                     │
├──────┼─────────────────────────┤
│host  │example.com              │
├──────┼─────────────────────────┤
│path  │/                        │
├──────┼─────────────────────────┤
│query │┌─┬─────────────────────┐│
│      ││a│1                    ││
│      │├─┼─────────────────────┤│
│      ││b│2+2                  ││
│      │├─┼─────────────────────┤│
│      ││c│3                    ││
│      │├─┼─────────────────────┤│
│      ││c│4                    ││
│      │├─┼─────────────────────┤│
│      ││d│%65%6e%63%6F%64%65%64││
│      │└─┴─────────────────────┘│
└──────┴─────────────────────────┘</lang>

Note that escape decoding is left to the consumer (as well as decoding things like '+' as a replacement for the space character and determining the absolute significance of relative paths and the details of ip address parsing and so on...). This seems like a good match to the hierarchical nature of uri parsing. See URL decoding for an implementation of escape decoding.
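
To illustrate why decoding is better left to the consumer, here is a small Go sketch (an illustration only; `decodeBoth` is a name invented here) that decodes the same raw text under query rules and under path rules using net/url. The two disagree on '+', so only the consumer, who knows which element the text came from, can pick the right one.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeBoth percent-decodes raw twice: once with query-string rules,
// where '+' stands for a space, and once with path rules, where '+' is literal.
func decodeBoth(raw string) (asQuery, asPath string, err error) {
	if asQuery, err = url.QueryUnescape(raw); err != nil {
		return "", "", err
	}
	if asPath, err = url.PathUnescape(raw); err != nil {
		return "", "", err
	}
	return asQuery, asPath, nil
}

func main() {
	for _, raw := range []string{"2+2", "%65%6e%63%6F%64%65%64"} {
		q, p, err := decodeBoth(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q: as query %q, as path %q\n", raw, q, p)
	}
}
```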

Note that taskparts was engineered specifically for the requirements of this task -- in idiomatic use you should instead expect to call the relevant ____parts routines directly as illustrated by the first four lines of taskparts.

Note that W3C recommends a handling for query strings which differs from that of RFC 3986: for example, the use of ; as a replacement for the & delimiter, or the use of the query element name as the query element value when the = delimiter is omitted from the name/value pair. We do not implement that here, as it is not a part of this task, but that sort of implementation could be achieved by replacing the definition of queryparts. Other treatments of query strings are also possible, should that become necessary.
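
As a rough picture of what such an alternative query treatment might look like, here is a Go sketch of the two variations mentioned above. It is an assumption-laden illustration, not an implementation of any particular W3C text: the `Pair` type and `splitQuery` function are invented names, empty elements are simply dropped, and nothing is percent-decoded.

```go
package main

import (
	"fmt"
	"strings"
)

// Pair is one name/value element of a query string (hypothetical type).
type Pair struct{ Name, Value string }

// splitQuery accepts both '&' and ';' as element delimiters and, when an
// element has no '=', reuses the element name as its value.
func splitQuery(q string) []Pair {
	isDelim := func(r rune) bool { return r == '&' || r == ';' }
	var pairs []Pair
	for _, elem := range strings.FieldsFunc(q, isDelim) {
		name, value, found := strings.Cut(elem, "=")
		if !found {
			value = name
		}
		pairs = append(pairs, Pair{Name: name, Value: value})
	}
	return pairs
}

func main() {
	fmt.Println(splitQuery("a=1;b=2&flag"))
}
```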

Racket

Links: url structure in Racket documentation.

<lang racket>#lang racket/base
(require racket/match net/url)
(define (debug-url-string U)
  (match-define (url s u h p pa? (list (path/param pas prms) ...) q f) (string->url U))
  (printf "URL: ~s~%" U)
  (printf "-----~a~%" (make-string (string-length (format "~s" U)) #\-))
  (when #t          (printf "scheme:         ~s~%" s))
  (when u           (printf "user:           ~s~%" u))
  (when h           (printf "host:           ~s~%" h))
  (when p           (printf "port:           ~s~%" p))
  ;; From documentation link in text:
  ;; > For Unix paths, the root directory is not included in `path';
  ;; > its presence or absence is implicit in the path-absolute? flag.
  (printf "path-absolute?: ~s~%" pa?)
  (printf "path  bits:     ~s~%" pas)
  ;; prms will often be a list of lists. this will print iff
  ;; one of the inner lists is not null
  (when (memf pair? prms)
    (printf "param bits:     ~s [interleaved with path bits]~%" prms))
  (unless (null? q) (printf "query:          ~s~%" q))
  (when f           (printf "fragment:       ~s~%" f))
  (newline))

(for-each
 debug-url-string
 '("foo://example.com:8042/over/there?name=ferret#nose"
   "urn:example:animal:ferret:nose"
   "jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true"
   "ftp://ftp.is.co.za/rfc/rfc1808.txt"
   "http://www.ietf.org/rfc/rfc2396.txt#header1"
   "ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two"
   "mailto:John.Doe@example.com"
   "news:comp.infosystems.www.servers.unix"
   "tel:+1-816-555-1212"
   "telnet://192.0.2.16:80/"
   "urn:oasis:names:specification:docbook:dtd:xml:4.1.2"))</lang>
Output:
URL: "foo://example.com:8042/over/there?name=ferret#nose"
---------------------------------------------------------
scheme:         "foo"
host:           "example.com"
port:           8042
path-absolute?: #t
path  bits:     ("over" "there")
query:          ((name . "ferret"))
fragment:       "nose"

URL: "urn:example:animal:ferret:nose"
-------------------------------------
scheme:         "urn"
path-absolute?: #f
path  bits:     ("example:animal:ferret:nose")

URL: "jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true"
------------------------------------------------------------------------------
scheme:         "jdbc"
path-absolute?: #f
path  bits:     ("mysql:" "" "test_user:ouupppssss@localhost:3306" "sakila")
query:          ((profileSQL . "true"))

URL: "ftp://ftp.is.co.za/rfc/rfc1808.txt"
-----------------------------------------
scheme:         "ftp"
host:           "ftp.is.co.za"
path-absolute?: #t
path  bits:     ("rfc" "rfc1808.txt")

URL: "http://www.ietf.org/rfc/rfc2396.txt#header1"
--------------------------------------------------
scheme:         "http"
host:           "www.ietf.org"
path-absolute?: #t
path  bits:     ("rfc" "rfc2396.txt")
fragment:       "header1"

URL: "ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two"
----------------------------------------------------------------
scheme:         "ldap"
host:           "[2001"
path-absolute?: #f
path  bits:     ("db8::7]" "c=GB")
query:          ((objectClass . "one") (objectClass . "two"))

Note: the IPv6 address in the URL above is parsed incorrectly. See issue https://github.com/plt/racket/issues/980

URL: "mailto:John.Doe@example.com"
----------------------------------
scheme:         "mailto"
path-absolute?: #f
path  bits:     ("John.Doe@example.com")

URL: "news:comp.infosystems.www.servers.unix"
---------------------------------------------
scheme:         "news"
path-absolute?: #f
path  bits:     ("comp.infosystems.www.servers.unix")

URL: "tel:+1-816-555-1212"
--------------------------
scheme:         "tel"
path-absolute?: #f
path  bits:     ("+1-816-555-1212")

URL: "telnet://192.0.2.16:80/"
------------------------------
scheme:         "telnet"
host:           "192.0.2.16"
port:           80
path-absolute?: #t
path  bits:     ("")

URL: "urn:oasis:names:specification:docbook:dtd:xml:4.1.2"
----------------------------------------------------------
scheme:         "urn"
path-absolute?: #f
path  bits:     ("oasis:names:specification:docbook:dtd:xml:4.1.2")

Tcl

Library: tcllib

Tcllib's uri package already knows how to decompose many kinds of URIs. The implementation is a quite readable example of this kind of parsing. For this task, we'll use it directly.

Schemes can be added with uri::register, but the rules for this task assume HTTP-style decomposition for unknown schemes, which is done below by reaching into the documented interfaces $::uri::schemes and uri::SplitHttp.

For some URI types (such as urn, news, mailto), this provides more information than the task description demands, which is simply to parse them all as HTTP URIs.

The uri package doesn't presently handle IPv6 syntax as used in the example: a bug and patch will be submitted presently.

<lang Tcl>package require uri
package require uri::urn

# a little bit of trickery to format results:
proc pdict {d} {
    array set \t $d
    parray \t
}

proc parse_uri {uri} {
    regexp {^(.*?):(.*)$} $uri -> scheme rest
    if {$scheme in $::uri::schemes} {
        # uri already knows how to split it:
        set parts [uri::split $uri]
    } else {
        # parse as though it's http:
        set parts [uri::SplitHttp $rest]
        dict set parts scheme $scheme
    }
    dict filter $parts value ?* ;# omit empty sections
}

set tests {
    foo://example.com:8042/over/there?name=ferret#nose
    urn:example:animal:ferret:nose
    jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true
    ftp://ftp.is.co.za/rfc/rfc1808.txt
    http://www.ietf.org/rfc/rfc2396.txt#header1
    ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two
    mailto:John.Doe@example.com
    news:comp.infosystems.www.servers.unix
    tel:+1-816-555-1212
    telnet://192.0.2.16:80/
    urn:oasis:names:specification:docbook:dtd:xml:4.1.2
}

foreach uri $tests {
    puts \n$uri
    pdict [parse_uri $uri]
}</lang>

Output:
foo://example.com:8042/over/there?name=ferret#nose
	(fragment) = nose
	(host)     = example.com
	(path)     = over/there
	(port)     = 8042
	(query)    = name=ferret
	(scheme)   = foo

urn:example:animal:ferret:nose
	(nid)    = example
	(nss)    = animal:ferret:nose
	(scheme) = urn

jdbc:mysql://test_user:ouupppssss@localhost:3306/sakila?profileSQL=true
	(path)   = mysql://test_user:ouupppssss@localhost:3306/sakila
	(query)  = profileSQL=true
	(scheme) = jdbc

ftp://ftp.is.co.za/rfc/rfc1808.txt
	(host)   = ftp.is.co.za
	(path)   = rfc/rfc1808.txt
	(scheme) = ftp

http://www.ietf.org/rfc/rfc2396.txt#header1
	(fragment) = header1
	(host)     = www.ietf.org
	(path)     = rfc/rfc2396.txt
	(scheme)   = http

ldap://[2001:db8::7]/c=GB?objectClass=one&objectClass=two
	(host)   = [2001
	(scheme) = ldap

mailto:John.Doe@example.com
	(host)   = example.com
	(scheme) = mailto
	(user)   = John.Doe

news:comp.infosystems.www.servers.unix
	(newsgroup-name) = comp.infosystems.www.servers.unix
	(scheme)         = news

tel:+1-816-555-1212
	(path)   = +1-816-555-1212
	(scheme) = tel

telnet://192.0.2.16:80/
	(host)   = 192.0.2.16
	(port)   = 80
	(scheme) = telnet

urn:oasis:names:specification:docbook:dtd:xml:4.1.2
	(nid)    = oasis
	(nss)    = names:specification:docbook:dtd:xml:4.1.2
	(scheme) = urn