Find URI in text

From Rosetta Code
Jump to: navigation, search
Find URI in text is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Write a function to search plain text for URIs or IRIs.

The function should return a list of URIs or IRIs found in the text.

The definition of a URI is given in RFC 3986. IRI is defined in RFC 3987.

For searching URIs in particular "Appendix C. Delimiting a URI in Context" is noteworthy.

The abbreviation IRI isn't as well known as URI and the short description is that an IRI is just an alternate form of a URI that supports Internationalization and hence Unicode. While many specifications support both forms this isn't universal.

Consider the following issues:

  • . , ; ' ? ( ) are legal characters in a URI, but they are often used in plain text as a delimiter.
  • IRIs allow most (but not all) unicode characters.
  • URIs can be something else besides http:// or https://

sample text:

this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:

Regular expressions to solve the task are fine, but alternative approaches are welcome too. (otherwise, this task would degrade into 'finding and applying the best regular expression')

Extra Credit: implement the parser to match the IRI specification in RFC 3987.

Contents

[edit] Icon and Unicon

This example follows RFC 3986 very closely (see Talk page for discussion). For better IP parsing see Parse_an_IP_Address. This solution doesn't handle IRIs per RFC 3987. Neither Icon nor Unicon natively support Unicode although ObjectIcon does.

Delimited examples of the form <URI> or "URI" will be correctly parse in any event. Handling of other possibly ambiguous examples that include valid URI characters is done by the 'findURItext' and 'disambURI' procedures. All candidate URIs are returned since once information is removed it will be lost and may be difficult for a user to reconstruct. This solution deals with all of the trailing character and balance considerations.

procedure main()
every write(findURItext("this URI contains an illegal character, parentheses_
and a misplaced full stop:\n_
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). _
which is handled by http://mediawiki.org/).\n_
and another one just to confuse the parser: _
http://en.wikipedia.org/wiki/-)\n_
\")\" is handled the wrong way by the mediawiki parser.\n_
ftp://domain.name/path(balanced_brackets)/foo.html\n_
ftp://domain.name/path(balanced_brackets)/ending.in.dot.\n_
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.\n_
leading junk ftp://domain.name/path/embedded?punct/uation.\n_
leading junk ftp://domain.name/dangling_close_paren)\n_
if you have other interesting URIs for testing, please add them here:\n_
blah (foo://domain.hld/))))"
))
end
 
$define GENDELIM ':/?#[]@'
$define SUBDELIM '!$&()*+,;=\''
$define UNRESERVED &letters ++ &digits ++ '-._~'
$define RESERVED GENDELIM++SUBDELIM
$define HEXDIGITS '0123456789aAbBcCdDeEfF'
 
procedure findURItext(s) #: generate all syntatically valid URI's from s
local u,p
s ? while tab(upto(&letters)) || (u := 2(p := &pos, URI())) do {
suspend u # return parsed URI
every suspend disambURI(u,p) # deal with text ambiguities, return many
}
end
 
procedure disambURI(u,p) #: generate disambiguous URIs from parsed
local u2
repeat {
if any('.,;?',u[-1]) then
suspend u := u[1:-1] # remove trailing .,;? from URI
else if u[-1] == "'" == &subject[p-:=1] then
suspend u := u[1:-1] # remove trailing ' from 'URI'
else if any('()',u[-1]) then {
every u ? u2 := tab(bal())
if u ~==:= u2 then suspend u # longest balanced URI wrt ()
}
else break # done
}
end
 
procedure URI() #: match longest URI at cursor
static sc2
initial sc2 := &letters ++ &digits ++ '+-.' # scheme
suspend (
( tab(any(&letters)) || (tab(many(sc2)) |="") || =":" ) || # scheme
( (="//" || authority() || arbsp("/",segment)) | # heir ...
(="/" || ( path_rootless() |="")) |
path_rootless() |
=""
) ||
( ( ="?" || queryfrag() ) |="" ) || # query
( ( ="#" || queryfrag() ) |="" ) # fragment
)
end
 
procedure queryfrag() #: match a query or fragment
static pc
initial pc := UNRESERVED ++ SUBDELIM ++ ':@/?'
suspend arbcp(pc,pctencode)
end
 
procedure segment(n) #: match a pchar segment
static sc
initial sc := UNRESERVED ++ SUBDELIM ++ ':@'
suspend arbcp(sc,pctencode,n)
end
 
procedure segmentnc(n) #: match a pchar--':' segment
static sc
initial sc := UNRESERVED ++ SUBDELIM ++ '@'
suspend arbcp(sc,pctencode,n)
end
 
procedure path_rootless() #: match a rootless path
suspend segment(1) || arbsp("/",segment)
end
 
procedure authority() #: match authority
static uic,rnc
initial {
rnc := UNRESERVED ++ SUBDELIM # regular name
uic := rnc ++ ':' # userinfo
}
suspend ( (arbcp(uic,pctencode) || ="@") |="") || # userinfo
( IPsimple() | arbcp(rnc,pctencode) ) || # host
( (=":" || tab(many(&digits))) |="")
end
 
procedure IPsimple() #: match ip address (trickable )
static i4c,i6c,ifc
initial {
i4c := &digits ++ '.'
i6c := HEXDIGITS ++ '.:'
ifc := UNRESERVED ++ SUBDELIM ++ ':'
}
suspend (
="[" ||
( tab(many(i6c)) |
( ="v"||tab(any(HEXDIGITS))||="."||tab(any(ifc))||tab(many(ifc)) )
) || ="]" ) | tab(many(i4c))
end
 
procedure arbcp(cs,pr,n) #: match arbitrary numbers of (cset|proc,n)
local p,i
/n := 0 # for 0* / 1*
runerr(0 > n,205)
p := &pos
i := 0
while tab(many(cs)) | pr() do i +:= 1
if i >= n then suspend &subject[p:&pos]
&pos := p # restore &pos
end
 
procedure arbsp(st,pr,n) #: match arbitrary numbers of (string || proc,n)
local p,i
/n := 0 # for 0* / 1*
runerr(0 > n,205)
p := &pos
i := 0
while =st || pr() do i +:= 1
if i >= n then suspend &subject[p:&pos]
&pos := p # restore &pos
end
 
procedure pctencode() #: match 1 % encoded single byte character
suspend ="%" || tab(any(HEXDIGITS)) || tab(any(HEXDIGITS))
end
Output:
stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://mediawiki.org/)
http://mediawiki.org/
parser:
http://en.wikipedia.org/wiki/-)
http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/path/embedded?punct/uation
ftp://domain.name/dangling_close_paren)
ftp://domain.name/dangling_close_paren
here:
foo://domain.hld/))))
foo://domain.hld/

[edit] Pike

string uritext = #"this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
\")\" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:"
;
 
array find_uris(string uritext)
{
array uris=({});
int pos=0;
while((pos = search(uritext, ":", pos+1))>0)
{
int prepos = sizeof(array_sscanf(reverse(uritext[pos-20..pos-1]), "%[a-zA-Z0-9+.-]%s")[0]);
int postpos = sizeof(array_sscanf(uritext[pos+1..], "%[^\n\r\t <>\"]%s")[0]);
 
if ((<'.',',','?','!',';'>)[uritext[pos+postpos]])
postpos--;
if (uritext[pos-prepos-1]=='(' && uritext[pos+postpos]==')')
postpos--;
if (uritext[pos-prepos-1]=='\'' && uritext[pos+postpos]=='\'')
postpos--;
uris+= ({ uritext[pos-prepos..pos+postpos] });
}
return uris;
}
 
find_uris(uritext);
Result: ({ /* 11 elements */
"stop:",
"http://en.wikipedia.org/wiki/Erich_K\303\244stner_(camera_designer)",
"http://mediawiki.org/)",
"parser:",
"http://en.wikipedia.org/wiki/-)",
"ftp://domain.name/path(balanced_brackets)/foo.html",
"ftp://domain.name/path(balanced_brackets)/ending.in.dot",
"ftp://domain.name/path(unbalanced_brackets/ending.in.dot",
"ftp://domain.name/path/embedded?punct/uation",
"ftp://domain.name/dangling_close_paren)",
"here:"
})

[edit] Ruby

 
require 'uri'
 
str = 'this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:'

 
 
puts URI.extract(str)
 
puts "\nFiltered for HTTP and HTTPS:"
puts URI.extract(str, ["http", "https"])
 
puts "\nThis is the (extendible) list of supported schemes: #{URI.scheme_list.keys}"
Output:
stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
here:

Filtered for HTTP and HTTPS:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)

This is the (extendible) list of supported schemes: ["FTP", "HTTP", "HTTPS", "LDAP", "LDAPS", "MAILTO"]

[edit] Tcl

This uses regular expressions to do the matching. It doesn't match a URL without a scheme (too problematic in general text) and it requires more than just the scheme too, but apart from that it matches slightly too broad a range of strings (though not usually problematically much). Matches some IRIs correctly too, but does not tackle the <bracketed> form (especially not if it includes extra spaces).

proc findURIs {text args} {
# This is an ERE with embedded comments. Rare, but useful with something
# this complex.
set URI {(?x)
[a-z][-a-z0-9+.]*: # Scheme...
(?=[/\w]) # ... but not just the scheme
(?://[-\w.@:]+)? # Host
[-\w.~/%!$&'()*+,;=]* # Path
(?:\?[-\w.~%!$&'
()*+,;=/?]*)? # Query
(?:[#][-\w.~%!$&'()*+,;=/?]*)? # Fragment
}
regexp -inline -all {*}$args -- $URI $text
}
Demonstrating:

Note that the last line of output is showing that we haven't just extracted the URI substrings, but can also get the match positions within the text.

set sample {
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
}
 
puts [join [findURIs $sample] \n]
puts [findURIs $sample -indices]
Output:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
{80 140} {163 185} {231 261} {317 366} {368 423} {425 481} {496 540} {555 593}

[edit] TXR

@(define path (path))@\
@(local x y)@\
@(cases)@\
(@(path x))@(path y)@(bind path `(@x)@y`)@\
@(or)@\
@{x /[.,;'!?][^ \t\f\v]/}@(path y)@(bind path `@x@y`)@\
@(or)@\
@{x /[^ .,;'!?()\t\f\v]/}@(path y)@(bind path `@x@y`)@\
@(or)@\
@(bind path "")@\
@(end)@\
@(end)
@(define url (url))@\
@(local proto domain path)@\
@{proto /[A-Za-z]+/}://@{domain /[^ \/\t\f\v]+/}@\
@(cases)/@(path path)@\
@(bind url `@proto://@domain/@path`)@\
@(or)@\
@(bind url `@proto://@domain`)@\
@(end)@\
@(end)
@(collect)
@ (all)
@line
@ (and)
@ (coll)@(url url)@(end)@(flatten url)
@ (end)
@(end)
@(output)
LINE
URLS
----------------------
@ (repeat)
@line
@ (repeat)
@url
@ (end)
@ (end)
@(end)

Test file:

$ cat url-data 
Blah blah http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (Handled by http://mediawiki.org/).
Confuse the parser: http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)

Run:

$ txr url.txr url-data 
LINE 
    URLS
----------------------
Blah blah http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (Handled by http://mediawiki.org/).
    http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)
    http://mediawiki.org/
Confuse the parser: http://en.wikipedia.org/wiki/-)
    http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
    ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
    ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
    ftp://domain.name/path
leading junk ftp://domain.name/path/embedded?punct/uation.
    ftp://domain.name/path/embedded?punct/uation
leading junk ftp://domain.name/dangling_close_paren)
    ftp://domain.name/dangling_close_paren
Personal tools
Namespaces

Variants
Actions
Community
Explore
Misc
Toolbox