Find URI in text: Difference between revisions
m (grammar)
m (→{{header|Icon}} and {{header|Unicon}}: tweak description)
Line 35:
This example follows RFC 3986 very closely (see Talk page for discussion). For better IP parsing see [[Parse_an_IP_Address]].
This solution doesn't handle IRIs per RFC 3987. Neither Icon nor Unicon natively supports Unicode, although ObjectIcon does.
Delimited examples of the form ''<URI>'' or ''"URI"'' will be parsed correctly in any event. Other potentially ambiguous examples that include valid URI characters are handled by the 'findURItext' and 'disambURI' procedures. All candidate URIs are returned, because information that has been removed is lost and may be difficult for a user to reconstruct. This solution deals with all of the trailing character and balance considerations.
<lang Icon>procedure main()
Line 62 ⟶ 61:
procedure findURItext(s)                #: generate all syntactically valid URIs from s
local u,p
   s ? while tab(upto(&letters)) || (u := 2(p := &pos, URI())) do {
      suspend u
      every suspend disambURI(u,p)      # deal with text ambiguities, return many
      }
end
procedure disambURI(u,p)                #: generate disambiguated URIs from parsed URI u
local u2
   repeat {
      if any('.,;?',u[-1]) then
         suspend u := u[1:-1]           # remove trailing .,;? from URI
      else if u[-1] == "'" == &subject[p-:=1] then
         suspend u := u[1:-1]           # remove trailing ' from 'URI'
      else if any('()',u[-1]) then {
         every u ? u2 := tab(bal())
         if u ~==:= u2 then suspend u   # longest balanced URI wrt ()
         }
      else break                        # done
      }
end
procedure URI() #: match longest URI at cursor
Line 158 ⟶ 174:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://mediawiki.org/)
http://mediawiki.org/
parser:
http://en.wikipedia.org/wiki/-)
http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/path/embedded?punct/uation
ftp://domain.name/dangling_close_paren)
ftp://domain.name/dangling_close_paren
here:</pre>
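As a rough illustration of the trailing-character handling in this solution, the sketch below peels trailing punctuation off a candidate one character at a time and generates every intermediate form. The procedure name ''peeltrailing'' is hypothetical and not part of the solution. Unlike ''disambURI'', it strips a closing parenthesis unconditionally rather than checking balance with ''bal()'', and it ignores quotes, so it should be read as a simplification only.

<lang Icon>procedure main()
   every write(peeltrailing("http://mediawiki.org/)."))
end

procedure peeltrailing(u)               #: hypothetical helper, not part of the solution
   suspend u                            # the candidate exactly as parsed
   while any('.,;?)', u[-1]) do         # last character is trailing punctuation
      suspend u := u[1:-1]              # drop it and generate the shorter candidate
end</lang>

Run on ''http://mediawiki.org/).'' this should write the candidate first with and then without the trailing ''.'' and '')'', in line with the corresponding entries in the sample output.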