Find URI in text: Difference between revisions
m (grammar) |
m (→{{header|Icon}} and {{header|Unicon}}: tweak description) |
||
Line 35: | Line 35: | ||
This example follows RFC 3986 very closely (see Talk page for discussion). For better IP parsing see [[Parse_an_IP_Address]]. |
This example follows RFC 3986 very closely (see Talk page for discussion). For better IP parsing see [[Parse_an_IP_Address]]. |
||
This solution doesn't handle IRIs per RFC 3987. Neither Icon nor Unicon natively supports Unicode, although ObjectIcon does. |
This solution doesn't handle IRIs per RFC 3987. Neither Icon nor Unicon natively supports Unicode, although ObjectIcon does. |
||
This solution doesn't currently handle delimitation explicitly. Examples of the form ''<URI>'' or ''"URI"'' aren't needed as they will correctly parse in any event. Ambiguous examples like ''(URI)'' which use valid URI characters will currently parse as ''URI)'' and not ''URI''. URIs are returned per the RFC. For example URIs ending in dots are currently returned with the dot. Once the information is lost the user must guess and reconstruct; however, it's far easier to remove a character if the URI doesn't work. |
|||
Delimited examples of the form ''<URI>'' or ''"URI"'' will be parsed correctly in any event. Other possibly ambiguous examples that include valid URI characters are handled by the 'findURItext' and 'disambURI' procedures. All candidate URIs are returned, because once information is removed it is lost and may be difficult for a user to reconstruct. This solution deals with all of the trailing-character and balance considerations. |
|||
Filtering of URIs for disambiguation and delineation would be best handled in the 'findURItext' procedure. It might also be a good idea to return both unfiltered and filtered URIs here. |
|||
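The idea of returning every candidate, rather than guessing a single "right" one, can be illustrated with a minimal sketch. It is written in Python rather than Icon, purely for illustration, and is not a translation of 'disambURI': the function name trailing_candidates is hypothetical, parenthesis balance is approximated by counting characters instead of using Icon's bal(), and the quoted ''<nowiki>'URI'</nowiki>'' case is omitted.

```python
def trailing_candidates(uri):
    """Yield progressively shorter candidates for a parsed URI.

    Simplified, hypothetical analogue of the disambiguation step:
    - a trailing '.', ',', ';' or '?' may be sentence punctuation;
    - a trailing ')' with no matching '(' may belong to the surrounding text.
    The caller still keeps the original URI as the first candidate.
    """
    while uri:
        last = uri[-1]
        if last in ".,;?":
            uri = uri[:-1]          # drop possible sentence punctuation
            yield uri
        elif last == ")" and uri.count("(") < uri.count(")"):
            uri = uri[:-1]          # drop an unmatched closing parenthesis
            yield uri
        else:
            break                   # nothing ambiguous left at the end


if __name__ == "__main__":
    for text in ("http://mediawiki.org/).",
                 "ftp://domain.name/path(balanced_brackets)/ending.in.dot."):
        print(text)
        for candidate in trailing_candidates(text):
            print("   ", candidate)
```

Under these assumptions, a URI such as ''<nowiki>http://mediawiki.org/).</nowiki>'' yields two further candidates, one without the dot and one without the parenthesis as well, and the user picks whichever one works.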
<lang Icon>procedure main() |
<lang Icon>procedure main() |
||
Line 62: | Line 61: | ||
procedure findURItext(s) #: generate all syntactically valid URIs from s |
procedure findURItext(s) #: generate all syntactically valid URIs from s |
||
local u |
local u,p |
||
s ? while tab(upto(&letters)) || (u := URI()) do |
s ? while tab(upto(&letters)) || (u := 2(p := &pos, URI())) do { |
||
suspend u |
suspend u # return parsed URI |
||
every suspend disambURI(u,p) # deal with text ambiguities, return many |
|||
} |
|||
end |
end |
||
procedure disambURI(u,p) #: generate disambiguated URIs from parsed URI |
|||
local u2 |
|||
repeat { |
|||
if any('.,;?',u[-1]) then |
|||
suspend u := u[1:-1] # remove trailing .,;? from URI |
|||
else if u[-1] == "'" == &subject[p-:=1] then |
|||
suspend u := u[1:-1] # remove trailing ' from 'URI' |
|||
else if any('()',u[-1]) then { |
|||
every u ? u2 := tab(bal()) |
|||
if u ~==:= u2 then suspend u # longest balanced URI wrt () |
|||
} |
|||
else break # done |
|||
} |
|||
end |
|||
procedure URI() #: match longest URI at cursor |
procedure URI() #: match longest URI at cursor |
||
Line 158: | Line 174: | ||
http://en.wikipedia.org/wiki/Erich_K |
http://en.wikipedia.org/wiki/Erich_K |
||
http://mediawiki.org/). |
http://mediawiki.org/). |
||
http://mediawiki.org/) |
|||
http://mediawiki.org/ |
|||
parser: |
parser: |
||
http://en.wikipedia.org/wiki/-) |
http://en.wikipedia.org/wiki/-) |
||
http://en.wikipedia.org/wiki/- |
|||
ftp://domain.name/path(balanced_brackets)/foo.html |
ftp://domain.name/path(balanced_brackets)/foo.html |
||
ftp://domain.name/path(balanced_brackets)/ending.in.dot. |
ftp://domain.name/path(balanced_brackets)/ending.in.dot. |
||
ftp://domain.name/path(balanced_brackets)/ending.in.dot |
|||
ftp://domain.name/path(unbalanced_brackets/ending.in.dot. |
ftp://domain.name/path(unbalanced_brackets/ending.in.dot. |
||
ftp://domain.name/path(unbalanced_brackets/ending.in.dot |
|||
ftp://domain.name/path/embedded?punct/uation. |
ftp://domain.name/path/embedded?punct/uation. |
||
ftp://domain.name/path/embedded?punct/uation |
|||
ftp://domain.name/dangling_close_paren) |
ftp://domain.name/dangling_close_paren) |
||
ftp://domain.name/dangling_close_paren |
|||
here:</pre> |
here:</pre> |
||