Find URI in text: Difference between revisions

m (→‎{{header|Icon}} and {{header|Unicon}}: added other example text)
(→‎Tcl: Added implementation)
Line 242: Line 242:
})</lang>
})</lang>


=={{header|Tcl}}==
This uses regular expressions to do the matching. It does not match a URL without a scheme (too error-prone in general text), and it requires more than ''just'' the scheme. Apart from that, it matches a slightly too broad range of strings, though rarely to a problematic degree. It also matches some IRIs correctly, but it does not handle the <tt>&lt;bracketed&gt;</tt> form (especially not when that form includes extra spaces).
<lang tcl>proc findURIs {text args} {
    set URI {(?x)
        [a-z][-a-z0-9+.]*:                  # Scheme...
        (?=[/\w])                           # ... but not just the scheme
        (?://[-\w.@:]+)?                    # Host
        [-\w.~/%!$&'()*+,;=]*               # Path
        (?:\?[-\w.~%!$&'()*+,;=/?]*)?       # Query
        (?:[#][-\w.~%!$&'()*+,;=/?]*)?      # Fragment
    }
    regexp -inline -all {*}$args -- $URI $text
}</lang>
;Demonstrating<nowiki>:</nowiki>
The last line of output shows that, besides extracting the URI substrings, we can also obtain the match positions within the text by passing <code>-indices</code>.
<lang tcl>set sample {
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
}


puts [join [findURIs $sample] \n]
puts [findURIs $sample -indices]</lang>
{{out}}
<pre>
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
{80 140} {163 185} {231 261} {317 366} {368 423} {425 481} {496 540} {555 593}
</pre>
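The index pairs can be turned back into substrings with <code>string range</code>, whose end index is inclusive, just like the pairs reported by <code>regexp -indices</code>. The following is only a minimal sketch and not part of the solution above: the helper name <code>sliceMatches</code>, the short sample text and the deliberately simplified pattern are assumptions made for illustration.
<lang tcl># Hypothetical helper: map a list of {start end} index pairs to substrings.
proc sliceMatches {text indexPairs} {
    set result {}
    foreach pair $indexPairs {
        lassign $pair start end
        # string range treats both indices as inclusive
        lappend result [string range $text $start $end]
    }
    return $result
}

# Illustrative text and simplified pattern (assumptions, not the URI regex above)
set text {see http://example.com/a now}
set pairs [regexp -inline -all -indices -- {http://\S+} $text]
puts [sliceMatches $text $pairs]</lang>
With the definitions from the demonstration, <code>sliceMatches $sample [findURIs $sample -indices]</code> should yield the same list as <code>findURIs $sample</code>.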

=={{header|TXR}}==
<lang txr>@(define path (path))@\
<lang txr>@(define path (path))@\
@(local x y)@\
@(local x y)@\