Find URI in text: Difference between revisions

m
m (grammar)
Line 35:
This example follows RFC 3986 very closely (see Talk page for discussion). For better IP parsing see [[Parse_an_IP_Address]].
This solution doesn't handle IRIs per RFC 3987. Neither Icon nor Unicon natively supports Unicode, although ObjectIcon does.
This solution doesn't currently handle delimitation explicitly. Examples of the form ''<URI>'' or ''"URI"'' aren't needed as they will correctly parse in any event. Ambiguous examples like ''(URI)'' which use valid URI characters will currently parse as ''URI)'' and not ''URI''. URIs are returned per the RFC. For example URIs ending in dots are currently returned with the dot. Once the information is lost the user must guess and reconstruct; however, it's far easier to remove a character if the URI doesn't work.
 
Delimited examples of the form ''<URI>'' or ''"URI"'' will parse correctly in any event. Other possibly ambiguous examples that include valid URI characters are handled by the 'findURItext' and 'disambURI' procedures. All candidate URIs are returned, because information that has been removed is lost and may be difficult for a user to reconstruct. This solution deals with all of the trailing-character and balance considerations.
Filtering of URIs for disambiguation and delineation would be best handled in the 'findURItext' procedure. It might also be a good idea to return both unfiltered and filtered URIs here.
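
For readers who don't know Icon, the trailing-character and balance handling described above can be sketched in Python. This is only an illustration, not the solution's code: the names <code>disambiguate</code> and <code>longest_balanced_prefix</code> are made up for this sketch, and the quote-stripping case is left out because it needs the text surrounding the URI.

```python
TRAILING_PUNCT = ".,;?"  # characters that often end a sentence, not a URI


def longest_balanced_prefix(text):
    """Return the longest proper prefix of text whose parentheses balance."""
    depth = 0
    best = 0
    for i, ch in enumerate(text):
        if depth == 0:
            best = i  # text[:i] is balanced and stops just before text[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break  # an unmatched ")" ends the balanced region
    return text[:best]


def disambiguate(uri):
    """Yield progressively shorter candidates derived from a parsed URI.

    The caller is assumed to keep the URI as parsed; only the trimmed
    alternatives are generated here.
    """
    while uri:
        last = uri[-1]
        if last in TRAILING_PUNCT:
            uri = uri[:-1]  # drop one trailing punctuation character
        elif last in "()":
            shorter = longest_balanced_prefix(uri)
            if not shorter:
                break  # nothing sensible left to return
            uri = shorter
        else:
            break  # last character is unambiguous, stop trimming
        yield uri


if __name__ == "__main__":
    for candidate in disambiguate("http://mediawiki.org/)."):
        print(candidate)
```

For the input ''http://mediawiki.org/).'' the sketch yields ''http://mediawiki.org/)'' and then ''http://mediawiki.org/'', matching the candidates listed in the sample output for that URI.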
 
<lang Icon>procedure main()
Line 62 ⟶ 61:
 
procedure findURItext(s) #: generate all syntactically valid URIs from s
local u,p
s ? while tab(upto(&letters)) || (u := 2(p := &pos, URI())) do {
suspend u # return URI as parsed
every suspend disambURI(u,p) # deal with text ambiguities, return many
}
end
 
procedure disambURI(u,p) #: generate disambiguated URIs from parsed
local u2
repeat {
if any('.,;?',u[-1]) then
suspend u := u[1:-1] # remove trailing .,;? from URI
else if u[-1] == "'" == &subject[p-:=1] then
suspend u := u[1:-1] # remove trailing ' from 'URI'
else if any('()',u[-1]) then {
every u ? u2 := tab(bal())
if u ~==:= u2 then suspend u # longest balanced URI wrt ()
}
else break # done
}
end
procedure URI() #: match longest URI at cursor
Line 158 ⟶ 174:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://mediawiki.org/)
http://mediawiki.org/
parser:
http://en.wikipedia.org/wiki/-)
http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/path/embedded?punct/uation
ftp://domain.name/dangling_close_paren)
ftp://domain.name/dangling_close_paren
here:</pre>
 