Find URI in text: Difference between revisions

→‎{{header|REXX}}: added the REXX language. -- ~~~~
(→‎{{header|Ruby}}: Added Ruby Header and sample)
Line 241:
"here:"
})</lang>
 
=={{header|REXX}}==
<lang rexx>/*REXX program scans a text (contained within REXX pgm) to extract URIs.*/
text='this URI contains an illegal character, parentheses and a misplaced full stop:',
'http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).',
'and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)',
'")" is handled the wrong way by the mediawiki parser.',
'ftp://domain.name/path(balanced_brackets)/foo.html',
'ftp://domain.name/path(balanced_brackets)/ending.in.dot.',
'ftp://domain.name/path(unbalanced_brackets/ending.in.dot.',
'leading junk ftp://domain.name/path/embedded?punct/uation.',
'leading junk ftp://domain.name/dangling_close_paren)',
'if you have other interesting URIs for testing, please add them here:'
 
@abc='abcdefghijklmnopqrstuvwxyz'; @abcs=@abc||translate(@abc)
@scheme=@abcs || 0123456789 || '+-.'
@unreserved=@abcs || 0123456789 || '-._~'
@reserved=@unreserved"/?#[]@!$&)(*+,;=\'"
t=space(text)' ' /*variable T is a working copy.*/
#=0 /*count of URIs found so far. */
/*▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄*/
do while t\='' /*scan text for multiple URIs. */
y=pos(':',t) /*locate a colon in the text body*/
if y==0 then leave /*Colon found? No, we're done. */
if y==1 then do /*handle a bare colon by itself. */
parse var t . t /*ignore the bare colon (:). */
iterate /*go & keep scanning for a colon.*/
end /* [↑] a rare special case. */
sr=reverse(left(t,y-1)) /*extract the scheme and reverse.*/
se=verify(sr,@scheme) /*locate the end of the scheme. */
t=substr(t,y+1) /*assign an adjusted new text. */
if se\==0 then sr=left(sr,se-1) /*possibly crop the scheme name. */
s=reverse(sr) /*reverse again to rectify name. */
he=verify(t,@reserved) /*locate the end of the hier-part*/
s=s':'left(t,he-1) /*extract & append the hier-part.*/
t=substr(t,he) /*assign an adjusted new text. */
#=#+1 /*bump the URI counter. */
!.#=s /*assign the URI to an array. */
end /*while t\='' */ /* [↑] scan the text for URIs. */
/*▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀*/
do k=1 for #; say !.k; end /*stick a fork in it, we're done.*/</lang>
'''output'''
<pre>
stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
here:
</pre>
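
The bare entries <tt>stop:</tt>, <tt>parser:</tt> and <tt>here:</tt> in the output above arise because the scan records a hit at every colon, even when nothing URI-like follows it. As a rough illustration only, the same scan can be restated with those hits skipped. The sketch below is in Ruby rather than REXX; the name <tt>find_uris</tt> and the constants are made up for the example, and the two skip conditions (empty scheme, empty hier-part) are an added assumption, not part of the REXX program.

```ruby
# Illustrative restatement of the REXX scan above (all names are hypothetical).
LETTERS    = [*'a'..'z', *'A'..'Z'].join
DIGITS     = '0123456789'
SCHEME     = LETTERS + DIGITS + '+-.'
UNRESERVED = LETTERS + DIGITS + '-._~'
RESERVED   = UNRESERVED + "/?#[]@!$&)(*+,;=\\'"

def find_uris(text)
  uris = []
  t = text.split.join(' ') + ' '        # collapse blanks, like REXX space(text)' '
  while (y = t.index(':'))              # locate the next colon
    # walk left from the colon while the characters are legal in a scheme
    s = y
    s -= 1 while s > 0 && SCHEME.include?(t[s - 1])
    # walk right from the colon while the characters are in the allowed set
    e = y + 1
    e += 1 while e < t.length && RESERVED.include?(t[e])
    # assumption added for this sketch: skip hits whose scheme or hier-part is empty
    uris << t[s...e] if s < y && e > y + 1
    t = t[e..-1]                        # continue with the rest of the text
  end
  uris
end
```

For instance, <tt>find_uris('see http://mediawiki.org/). here:')</tt> should give just <tt>http://mediawiki.org/).</tt>, trailing punctuation included, since a character test alone cannot tell that the parenthesis and full stop are not part of the URI.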
 
=={{header|Ruby}}==
Line 285 ⟶ 340:
This is the (extendible) list of supported schemes: ["FTP", "HTTP", "HTTPS", "LDAP", "LDAPS", "MAILTO"]
</pre>
 
=={{header|Tcl}}==
This uses regular expressions to do the matching. It does not match a URL that lacks a scheme (too error-prone in general text), and it requires more than ''just'' the scheme. Apart from that, it matches a slightly too broad range of strings, though rarely by enough to cause problems. It also matches some IRIs correctly, but it does not handle the <tt>&lt;bracketed&gt;</tt> form (especially not when that form includes extra spaces).