Find URI in text: Difference between revisions

'''Extra Credit:'''   implement the parser to match the IRI specification in RFC 3987.
<br><br>
=={{header|Delphi}}==
{{libheader| System.RegularExpressions}}
{{Trans|Go}}
<syntaxhighlight lang="delphi">
program Find_URI_in_text;
 
{$APPTYPE CONSOLE}
 
uses
System.SysUtils,
System.RegularExpressions;
 
const
pattern = '(*UTF)(*UCP)' + // Make \w unicode aware
'(?<URI>[a-z][-a-z0-9+.]*:' + // Scheme...
'(?=[/\w])' + // ... but not just the scheme
'(?://[-\w.@:]+)?)' + // Host
'[-\w.~/%!$&''()*+,;=]*' + // Path
'(?:\?[-\w.~%!$&''()*+,;=/?]*)?' + // Query
'(?:\#[-\w.~%!$&''()*+,;=/?]*)?'; // Fragment
 
Text =
'this URI contains an illegal character, parentheses and a misplaced full stop:' + #13#10 +
'http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). ' +
'(which is handled by http://mediawiki.org/).' + #13#10 +
'and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)' + #13#10 +
'")" is handled the wrong way by the mediawiki parser.' + #13#10 +
'ftp://domain.name/path(balanced_brackets)/foo.html' + #13#10 +
'ftp://domain.name/path(balanced_brackets)/ending.in.dot.' + #13#10 +
'ftp://domain.name/path(unbalanced_brackets/ending.in.dot.' + #13#10 +
'leading junk ftp://domain.name/path/embedded?punct/uation.' + #13#10 +
'leading junk ftp://domain.name/dangling_close_paren)' + #13#10 +
'if you have other interesting URIs for testing, please add them here:' + #13#10 +
'http://www.example.org/foo.html#includes_fragment' + #13#10 +
'http://www.example.org/foo.html#enthält_Unicode-Fragment';
 
var
reg: TRegEx;
Match: TMatch;
IRIs: string = '';
URIs: string = '';
 
begin
reg := TRegEx.Create(pattern);
for Match in reg.Matches(Text) do
begin
URIs := URIs + #10 + Match.Groups['URI'].Value;
IRIs := IRIs + #10 + Match.Value;
end;
 
Write('URIs:-');
Writeln(URIs, #10);
 
Write('IRIs:-');
Writeln(IRIs, #10);
 
Readln;
end.</syntaxhighlight>
{{out}}
<pre>
URIs:-
http://en.wikipedia.org
http://mediawiki.org
http://en.wikipedia.org
ftp://domain.name
ftp://domain.name
ftp://domain.name
ftp://domain.name
ftp://domain.name
http://www.example.org
http://www.example.org
 
IRIs:-
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
</pre>
=={{header|Go}}==
{{trans|Kotlin}}
Line 42 ⟶ 125:
<br>
The regexp package in the Go standard library is not fully compatible with PCRE and cannot compile the regular expression used here. A third-party library, a PCRE shim for Go, has therefore been used instead.
<syntaxhighlight lang="go">package main
 
import (
Line 89 ⟶ 172:
fmt.Println()
}
}</syntaxhighlight>
 
{{out}}
Line 124 ⟶ 207:
Delimited examples of the form ''<URI>'' or ''"URI"'' will be correctly parsed in any event. Handling of other possibly ambiguous examples that include valid URI characters is done by the 'findURItext' and 'disambURI' procedures. All candidate URIs are returned, since once information is removed it is lost and may be difficult for a user to reconstruct. This solution deals with all of the trailing character and balance considerations.
 
<syntaxhighlight lang="icon">procedure main()
every write(findURItext("this URI contains an illegal character, parentheses_
and a misplaced full stop:\n_
Line 256 ⟶ 339:
procedure pctencode() #: match 1 % encoded single byte character
suspend ="%" || tab(any(HEXDIGITS)) || tab(any(HEXDIGITS))
end</syntaxhighlight>
 
Output:<pre>stop:
Line 284 ⟶ 367:
 
The following uses essentially the same regular expression as is used in the [[#Tcl]] article (as of June 2015), and the results using the given input text are identical. Note in particular that scheme-only strings such as "stop:" are not extracted.
<syntaxhighlight lang="jq"># input: a JSON string
# output: a stream of URIs
# Each input string may contain more than one URI.
Line 301 ⟶ 384:
# Example: read in a file of arbitrary text and
# produce a stream of the URIs that are identified.
split("\n")[] | findURIs</syntaxhighlight>
 
{{out}}
<syntaxhighlight lang="sh">$ jq -R -r -f Find_URI_in_text.jq Find_URI_in_text.txt
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
Line 312 ⟶ 395:
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)</syntaxhighlight>
 
=={{header|Julia}}==
The Julia URI parser treats stop: and here: as schemes with an empty path. Looking at the RFC this seems
technically correct except that the schemes "stop:" and "here:" do not exist, whereas http: and ftp: do.
<syntaxhighlight lang="julia">using URIParser, HTTP
 
function findvalidURI(txt)
results = String[]
# whitespace not allowed in URI, so split on whitespace
for str in split(txt, r"\s+")
# convert escaped chars to %dd format
s = replace(replace(str, r"\&\#x([\d\w]{2})\;" => s"\%\1"), "?" => "x")
try
if isvalid(parse(HTTP.URI, s))
push!(results, str)
end
catch
continue
end
end
return results
end
 
testtext = """
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
"""
for t in strip.(split(testtext, "\n")), result in findvalidURI(t)
println(result)
end
</syntaxhighlight>{{out}}
<pre>
stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
here:
</pre>
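
The point about scheme-only strings such as "stop:" and "here:" can be illustrated with a small Python sketch. It is not a translation of the Julia entry: it assumes whitespace-delimited candidates, uses the standard library's <code>urllib.parse.urlsplit</code>, and filters with a hypothetical allowlist named <code>KNOWN_SCHEMES</code> that is not part of the task.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist, introduced only for this illustration.
KNOWN_SCHEMES = {"http", "https", "ftp"}

def plausible_uris(text):
    """Return whitespace-delimited tokens that have a known scheme and a non-empty remainder."""
    results = []
    for token in text.split():
        parts = urlsplit(token)
        # "stop:" splits into scheme "stop" with nothing after it,
        # so it is rejected by the allowlist (and by the emptiness check).
        if parts.scheme in KNOWN_SCHEMES and (parts.netloc or parts.path):
            results.append(token)
    return results

print(plausible_uris("a misplaced full stop: http://mediawiki.org/ add them here:"))
```

An allowlist trades generality for fewer false positives: any scheme missing from it is silently ignored, so it suits cases where the expected schemes are known in advance.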
 
=={{header|Kotlin}}==
The regular expression used here is essentially the same as the one in the Tcl entry. However, the flag expression (?U) is needed to enable matching of Unicode characters. Without this only ASCII characters are matched.
<syntaxhighlight lang="scala">// version 1.2.21
 
val pattern =
Line 351 ⟶ 487:
println()
}
}</syntaxhighlight>
 
{{out}}
Line 380 ⟶ 516:
</pre>
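
The effect of Unicode-aware matching mentioned for the Kotlin entry can be reproduced in Python as a rough analogue. In Python the situation is reversed: <code>\w</code> is Unicode-aware by default for text patterns, and the <code>re.ASCII</code> flag restricts it. The pattern below is a simplified, hypothetical one written for this sketch, not the expression used by the Kotlin or Tcl entries.

```python
import re

# Hypothetical, simplified pattern (not the one used by any entry here):
# a scheme, "://", then a run of word characters and common URI punctuation.
PATTERN = r"[a-z][-a-z0-9+.]*://[-\w.~/%!$&'()*+,;=:@]*"

def first_match(text, ascii_only=False):
    """Return the first candidate in text, or None if nothing matches."""
    flags = re.ASCII if ascii_only else 0
    m = re.search(PATTERN, text, flags)
    return m.group() if m else None

sample = "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)."
print(first_match(sample))                   # \w is Unicode-aware by default
print(first_match(sample, ascii_only=True))  # \w limited to [a-zA-Z0-9_]
```

With ASCII-only matching the candidate ends just before the "ä", which is the same truncation seen in the outputs of the entries that handle only URIs.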
 
=={{header|Mathematica}}/{{header|Wolfram Language}}==
Using the built-in text parser
<syntaxhighlight lang="mathematica">TextCases[" this URI contains an illegal character, parentheses and a misplaced
full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which
Line 395 ⟶ 531:
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them
here:", "URL"]</syntaxhighlight>
 
{{out}}
<pre>{"http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which",
Line 406 ⟶ 541:
=={{header|Objeck}}==
Used a regex instead of writing a parser.
<syntaxhighlight lang="objeck">
use RegEx;
 
Line 429 ⟶ 564:
};
}
}</syntaxhighlight>
 
<pre>
Line 443 ⟶ 578:
</pre>
 
=={{header|Perl}}==
Only covers whatever Regexp::Common::URI supports.
<syntaxhighlight lang="perl"># 20200821 added Perl programming solution

use strict;
use warnings;

use Regexp::Common qw /URI/; # https://metacpan.org/pod/Regexp::Common::URI

while ( my $line = <DATA> ) {
    chomp $line;
    my @URIs = $line =~ /$RE{URI}/g and print "URI(s) found.\n";
    foreach my $uri (@URIs) { print "URI : $uri\n" }
}

__DATA__
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)</syntaxhighlight>
{{out}}
<pre>URI(s) found.
URI : http://en.wikipedia.org/wiki/Erich_K
URI : http://mediawiki.org/).
URI(s) found.
URI : http://en.wikipedia.org/wiki/-)
URI(s) found.
URI : ftp://domain.name/path(balanced_brackets)/foo.html
URI(s) found.
URI : ftp://domain.name/path(balanced_brackets)/ending.in.dot.
URI(s) found.
URI : ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
URI(s) found.
URI : ftp://domain.name/path/embedded
URI(s) found.
URI : ftp://domain.name/dangling_close_paren)</pre>
 
=={{header|Phix}}==
The following is based on the scanForUrls() routine in demo\edita\src\easynclr.e, which is used/tested on a daily basis and may get additional bugfixes, though it is quite strongly coupled with syntax colouring and other editor gubbins. It now handles a dangling ")" and a trailing "." to match the mediawiki handling, with Edita (1.0.4) similarly updated. Note that Edita does not highlight a quoted text literal in the same manner as mediawiki, but does so for comments.<br>
Deliberately handles IRIs rather than just URIs; in other words, no attempt is made to prohibit Unicode characters.
<!--<syntaxhighlight lang="phix">(phixonline)-->
<span style="color: #008080;">with</span> <span style="color: #008080;">javascript_semantics</span>
<span style="color: #008080;">constant</span> <span style="color: #000000;">schemes</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{</span><span style="color: #008000;">`ftp`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`gopher`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`http`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`https`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`mailto`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`news`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`nntp`</span><span style="color: #0000FF;">,</span>
<span style="color: #008000;">`telnet`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`wais`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`file`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`prospero`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`edit`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`tel`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`urn`</span><span style="color: #0000FF;">}</span>
<span style="color: #008080;">function</span> <span style="color: #000000;">scan_for_urls</span><span style="color: #0000FF;">(</span><span style="color: #004080;">sequence</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">)</span>
<span style="color: #000080;font-style:italic;">-- such as http::\\wikipedia.org</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">chidx</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">1</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">1</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">lt</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">text</span><span style="color: #0000FF;">),</span> <span style="color: #000000;">ch2</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span>
<span style="color: #008080;">while</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">lt</span> <span style="color: #008080;">do</span>
<span style="color: #000000;">ch2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">>=</span><span style="color: #008000;">'a'</span> <span style="color: #008080;">and</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;"><=</span><span style="color: #008000;">'z'</span> <span style="color: #008080;">then</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;">-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">></span><span style="color: #000000;">chidx</span> <span style="color: #008080;">or</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx</span><span style="color: #0000FF;">]<=</span><span style="color: #008000;">' '</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">chidx</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">chidx2</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">while</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">lt</span> <span style="color: #008080;">do</span>
<span style="color: #000000;">ch2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;"><</span><span style="color: #008000;">'a'</span> <span style="color: #008080;">or</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">></span><span style="color: #008000;">'z'</span> <span style="color: #008080;">then</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">while</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">oneword</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx</span><span style="color: #0000FF;">..</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;">></span><span style="color: #000000;">lt</span> <span style="color: #008080;">then</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000000;">ch2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #0000FF;">(</span><span style="color: #000000;">ch2</span><span style="color: #0000FF;">=</span><span style="color: #008000;">':'</span> <span style="color: #008080;">and</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">oneword</span><span style="color: #0000FF;">,</span><span style="color: #000000;">schemes</span><span style="color: #0000FF;">))</span>
<span style="color: #008080;">or</span> <span style="color: #0000FF;">(</span><span style="color: #000000;">ch2</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'.'</span> <span style="color: #008080;">and</span> <span style="color: #7060A8;">equal</span><span style="color: #0000FF;">(</span><span style="color: #000000;">oneword</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"www"</span><span style="color: #0000FF;">))</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">chidx0</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">chidx2</span>
<span style="color: #004080;">bool</span> <span style="color: #000000;">isUrl</span> <span style="color: #0000FF;">=</span> <span style="color: #004600;">false</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">bstack</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">""</span>
<span style="color: #008080;">while</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">lt</span> <span style="color: #008080;">do</span>
<span style="color: #000000;">ch2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'\"'</span> <span style="color: #008080;">and</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;">=</span><span style="color: #000000;">chidx0</span> <span style="color: #008080;">then</span>
<span style="color: #008080;">while</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;"><</span><span style="color: #000000;">lt</span> <span style="color: #008080;">do</span>
<span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span>
<span style="color: #000000;">ch2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'\"'</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span>
<span style="color: #000000;">isUrl</span> <span style="color: #0000FF;">=</span> <span style="color: #004600;">true</span>
<span style="color: #008080;">exit</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">while</span>
<span style="color: #008080;">exit</span>
<span style="color: #008080;">elsif</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">ch2</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"(&lt;[{"</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">bstack</span> <span style="color: #0000FF;">&=</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">+</span><span style="color: #7060A8;">iff</span><span style="color: #0000FF;">(</span><span style="color: #000000;">ch2</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'('</span><span style="color: #0000FF;">?</span><span style="color: #000000;">1</span><span style="color: #0000FF;">:</span><span style="color: #000000;">2</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">elsif</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">ch2</span><span style="color: #0000FF;">,</span><span style="color: #008000;">")&gt;]}"</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span>
<span style="color: #008080;">if</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">bstack</span><span style="color: #0000FF;">)=</span><span style="color: #000000;">0</span> <span style="color: #008080;">or</span> <span style="color: #000000;">bstack</span><span style="color: #0000FF;">[$]!=</span><span style="color: #000000;">ch2</span> <span style="color: #008080;">then</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000000;">bstack</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">bstack</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">..$-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">></span><span style="color: #000000;">255</span> <span style="color: #008080;">or</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;"><=</span><span style="color: #008000;">' '</span> <span style="color: #008080;">then</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000000;">isUrl</span> <span style="color: #0000FF;">=</span> <span style="color: #004600;">true</span>
<span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">while</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">isUrl</span> <span style="color: #008080;">then</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">oneurl</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx</span><span style="color: #0000FF;">..</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">oneurl</span><span style="color: #0000FF;">[$]=</span><span style="color: #008000;">'.'</span> <span style="color: #008080;">then</span> <span style="color: #000000;">oneurl</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">oneurl</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">..$-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">,</span><span style="color: #000000;">oneurl</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000000;">chidx</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">chidx2</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;">></span><span style="color: #000000;">lt</span> <span style="color: #008080;">then</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">else</span>
<span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">-=</span> <span style="color: #000000;">1</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">while</span>
<span style="color: #008080;">return</span> <span style="color: #000000;">res</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<span style="color: #008080;">constant</span> <span style="color: #000000;">txt</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">"""
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
http://192.168.0.1/admin/?hackme=%%%%%%%%%true
blah (foo://domain.hld/))))
https://haxor.ur:4592/~mama/####&?foo
ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
"""</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"%s\n"</span><span style="color: #0000FF;">,{</span><span style="color: #7060A8;">join</span><span style="color: #0000FF;">(</span><span style="color: #000000;">scan_for_urls</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">),</span><span style="color: #008000;">"\n"</span><span style="color: #0000FF;">)})</span>
<!--</syntaxhighlight>-->
{{out}}
<pre>
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)
http://mediawiki.org/
http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot
ftp://domain.name/path/embedded?punct/uation
ftp://domain.name/dangling_close_paren
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
http://192.168.0.1/admin/?hackme=%%%%%%%%%true
https://haxor.ur:4592/~mama/####&?foo
ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
</pre>
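
The handling of a dangling ")" and a trailing "." described for the Phix entry can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a port of scan_for_urls(): the helper name <code>trim_candidate</code> is hypothetical, and it expects a candidate that has already been cut at whitespace.

```python
def trim_candidate(candidate: str) -> str:
    """Cut a candidate at the first unmatched closing bracket, then drop one trailing full stop."""
    pairs = {")": "(", "]": "[", "}": "{", ">": "<"}
    stack = []
    end = len(candidate)
    for i, ch in enumerate(candidate):
        if ch in pairs.values():
            stack.append(ch)          # remember the opening bracket
        elif ch in pairs:
            if not stack or stack[-1] != pairs[ch]:
                end = i               # closing bracket with no partner: stop here
                break
            stack.pop()
    trimmed = candidate[:end]
    if trimmed.endswith("."):
        trimmed = trimmed[:-1]        # a final "." is treated as sentence punctuation
    return trimmed

for token in ("http://mediawiki.org/).",
              "ftp://domain.name/path(unbalanced_brackets/ending.in.dot.",
              "ftp://domain.name/dangling_close_paren)"):
    print(trim_candidate(token))
```

An opening bracket that is never closed is kept, as in the unbalanced_brackets example above; only a closing bracket without a partner ends the candidate.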
 
=={{header|PHP}}==
Trivial example using PHP's built-in filter_var() function (which does not support IRIs).
<syntaxhighlight lang="php">$tests = array(
'this URI contains an illegal character, parentheses and a misplaced full stop:',
'http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).',
Line 514 ⟶ 766:
}
}
</syntaxhighlight>
{{Out}}
<pre>
Line 530 ⟶ 782:
 
=={{header|Pike}}==
<syntaxhighlight lang="pike">string uritext = #"this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
which is handled by http://mediawiki.org/).
Line 575 ⟶ 827:
"ftp://domain.name/dangling_close_paren)",
"here:"
})</syntaxhighlight>
 
=={{header|Racket}}==
Line 581 ⟶ 833:
{{trans|Tcl}}
 
<syntaxhighlight lang="racket">#lang racket
 
(define sample
Line 616 ⟶ 868:
;; ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
(unless (= 228 (char->integer #\ä))
(error "a-umlaut is not 228, and therefore might be an ALPHA")))</syntaxhighlight>
 
{{out}}
Line 635 ⟶ 887:
"ftp://domain.name/dangling_close_paren)"
((79 . 115) (162 . 185) (230 . 261) (316 . 366) (367 . 423) (424 . 481) (495 . 540) (554 . 593))</pre>
 
=={{header|Raku}}==
(formerly Perl 6)
This needs an installed URI distribution. {{works with|Rakudo|2018.03}}
<syntaxhighlight lang="raku" line>use v6;
use IETF::RFC_Grammar::URI;
 
say q:to/EOF/.match(/ <IETF::RFC_Grammar::URI::absolute-URI> /, :g).list.join("\n");
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
EOF
 
say $/[*-1];
say "We matched $/[*-1], which is a $/[*-1].^name() at position $/[*-1].from() to $/[*-1].to()"
</syntaxhighlight>
 
Like most of the solutions here, it complies only with the URI specification, not with the IRI specification:
 
<pre>stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded
ftp://domain.name/dangling_close_paren)
「ftp://domain.name/dangling_close_paren)」
IETF::RFC_Grammar::URI::absolute_URI => 「ftp://domain.name/dangling_close_paren)」
scheme => 「ftp」
We matched ftp://domain.name/dangling_close_paren), which is a Match, at position 554 to 593</pre>
 
The last lines show that we get Match objects back, which we can query for all kinds of information.
We can even see which subrules matched, and since these are also Match objects, we can obtain
their match positions in the text.
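
For comparison, the same idea of querying a match object for sub-matches and positions can be sketched in Python. This is only an illustration under stated assumptions: the pattern below is a simplified, ASCII-only character class rather than the RFC 3986 grammar used by the Raku module, it requires at least one character after the colon (so a bare <code>stop:</code> is not reported), and the names <code>URI_RE</code> and <code>find_uris</code> are hypothetical.

```python
import re

# Simplified, ASCII-only pattern (an assumption for illustration, not the
# full RFC 3986 grammar): a scheme, a colon, then one or more URI characters.
URI_RE = re.compile(
    r"(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*):"
    r"(?P<rest>[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+)"
)


def find_uris(text):
    """Return (uri, scheme, start, end) for every match found in text."""
    return [
        (m.group(0), m.group("scheme"), m.start(), m.end())
        for m in URI_RE.finditer(text)
    ]


sample = "leading junk ftp://domain.name/dangling_close_paren)"
for uri, scheme, start, end in find_uris(sample):
    # Each match object exposes the named sub-match and its position.
    print(f"{uri} (scheme {scheme}) at position {start} to {end}")
```

As in the Raku output, the non-ASCII <code>ä</code> ends a match, because this sketch only admits ASCII characters.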
 
=={{header|REXX}}==
<syntaxhighlight lang="rexx">/*REXX program scans a text (contained within the REXX program) to extract URIs.        */
$$= 'this URI contains an illegal character, parentheses and a misplaced full stop:',
'http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).',
'and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)',
'")" is handled the wrong way by the mediawiki parser.',
'ftp://domain.name/path(balanced_brackets)/foo.html',
'ftp://domain.name/path(balanced_brackets)/ending.in.dot.',
'ftp://domain.name/path(unbalanced_brackets/ending.in.dot.',
'leading junk ftp://domain.name/path/embedded?punct/uation.',
'leading junk ftp://domain.name/dangling_close_paren)',
'if you have other interesting URIs for testing, please add them here:'

@abc= 'abcdefghijklmnopqrstuvwxyz'               /*construct lowercase (Latin) alphabet.*/
@abcU= @abc;  upper @abcU;  @abcs= @abc || @abcU /*    "     lower & uppercase     "    */
@scheme=     @abcs || 0123456789 || '+-.'        /*add decimal digits & some punctuation*/
@unreserved= @abcs || 0123456789 || '-._~'       /* "     "       "   "   "       "     */
@reserved=   @unreserved"/?#[]@!$&)(*+,;=\'"     /*add other punctuation & special chars*/
$= space($$)' '                                  /*variable  $  is a working copy of $$ */
#= 0                                             /*the count of URI's found (so far).   */
                                                 /*▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄*/
  do while $\='';    y= pos(':', $)              /*locate a colon (:) in the text body. */
  if y==0  then leave                            /*Was a colon found? Nope, we're done. */
  if y==1  then do;  parse var $ . $             /*handle a bare colon by itself.       */
                     iterate                     /*go and keep scanning for a colon.    */
                end                              /* [↑]  (a rare special case.)         */
  sr= reverse( left($, y - 1) )                  /*extract the scheme and reverse it.   */
  se= verify(sr, @scheme)                        /*locate the end of the scheme.        */
  $= substr($, y + 1)                            /*assign an adjusted new text.         */
  if se\==0  then sr= left(sr, se - 1)           /*possibly "crop" the scheme name.     */
  s= reverse(sr)                                 /*reverse it again to rectify the name.*/
  he= verify($, @reserved)                       /*locate the end of hierarchical part. */
  s= s':'left($, he - 1)                         /*extract and append    "         "    */
  $= substr($, he)                               /*assign an adjusted new part of text. */
  #= # + 1                                       /*bump the URI counter.                */
  !.#= s                                         /*assign the URI to an array  (!.)     */
  end   /*while*/                                /* [↑]  scan the text for URI's.       */
                                                 /*▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀*/
  do k=1  for #;   say !.k;   end                /*stick a fork in it, we're all done.  */</syntaxhighlight>
{{out|output|text=&nbsp; when using the internal default inputs:}}
<pre>
stop:
 
=={{header|Ruby}}==
<syntaxhighlight lang="ruby">
require 'uri'

puts URI.extract(str, ["http", "https"])

puts "\nThis is the (extendible) list of supported schemes: #{URI.scheme_list.keys}"</syntaxhighlight>
{{Output}}
<pre>
=={{header|Tcl}}==
This uses regular expressions to do the matching. It doesn't match a URL without a scheme (too problematic in general text), and it requires more than ''just'' the scheme. Apart from that, it matches a slightly too broad range of strings, though rarely to a problematic degree. It also matches some IRIs correctly, but does not tackle the <tt>&lt;bracketed&gt;</tt> form (especially not if it includes extra spaces).
<syntaxhighlight lang="tcl">proc findURIs {text args} {
# This is an ERE with embedded comments. Rare, but useful with something
# this complex.
}
regexp -inline -all {*}$args -- $URI $text
}</syntaxhighlight>
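
The two constraints described above (a scheme is mandatory, and something must follow it) can be illustrated with a lookahead. The following Python sketch is not a translation of the Tcl expression; the pattern and the names <code>URI_RE</code> and <code>find_uris_with_scheme</code> are assumptions made for illustration, and the body is deliberately permissive, so it shares the "slightly too broad" behaviour.

```python
import re

# Hypothetical simplified pattern:
#   \b[a-z][a-z0-9+.-]*:   a mandatory scheme followed by a colon
#   (?=[/\w])              lookahead: something must follow the scheme
#   [^\s<>"]*              permissive body, up to whitespace, <, > or "
URI_RE = re.compile(r'\b[a-z][a-z0-9+.-]*:(?=[/\w])[^\s<>"]*', re.IGNORECASE)


def find_uris_with_scheme(text):
    """Return every substring that has a scheme and more than just the scheme."""
    return URI_RE.findall(text)


print(find_uris_with_scheme("mailto:John.Doe@example.com and a full stop: here"))
print(find_uris_with_scheme("(which is handled by http://mediawiki.org/)."))
```

The second call returns the URI together with the trailing <code>).</code>, which shows how a permissive body over-matches in running text.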
;Demonstrating<nowiki>:</nowiki>
Note that the last line of output shows that we have not just extracted the URI substrings, but can also get the match positions within the text.
<syntaxhighlight lang="tcl">set sample {
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).

puts [join [findURIs $sample] \n]
puts [findURIs $sample -indices]</syntaxhighlight>
{{out}}
<pre>
 
=={{header|TXR}}==
<syntaxhighlight lang="txr">@(define path (path))@\
@(local x y)@\
@(cases)@\
@ (end)
@ (end)
@(end)</syntaxhighlight>
 
Test file:
leading junk ftp://domain.name/dangling_close_paren)
ftp://domain.name/dangling_close_paren</pre>
 
=={{header|Wren}}==
{{libheader|Wren-pattern}}
Wren's simple pattern matcher lacks the sophistication of regular expressions and I've had to make considerable simplifications to the search pattern needed for complete URI/IRI matching whilst still doing enough to identify the ones embedded in the sample text for this task.
<syntaxhighlight lang="wren">import "./pattern" for Pattern
 
var text = """
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
"""
 
var i = Pattern.lower + Pattern.digit + "-+."
var j = Pattern.alpha + "_-.@:"
var k = j + "~\%!$&'()*+,;=?/#"
var e = "/l+0/i:////+1/j//+1/k"
var p = Pattern.new(e, Pattern.within, i, j, k)
var matches = p.findAll(text)
System.print("URI's found:\n")
for (m in matches) System.print(m.text)
k = k + "ä"
p = Pattern.new(e, Pattern.within, i, j, k)
System.print("\nIRI's found:\n")
matches = p.findAll(text)
for (m in matches) System.print(m.text)</syntaxhighlight>
 
{{out}}
<pre>
URI's found:
 
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enth
 
IRI's found:
 
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
</pre>
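
The two result lists above differ only in whether a non-ASCII letter is admitted to the character set. A minimal Python sketch of that distinction follows, for comparison only. It assumes flat character classes in place of the RFC 3986 and RFC 3987 grammars, requires <code>://</code> after the scheme, and uses Unicode-aware <code>\w</code> as a rough stand-in for the IRI character ranges; the names <code>ASCII_TAIL</code>, <code>URI_RE</code> and <code>IRI_RE</code> are hypothetical.

```python
import re

# Characters accepted after "scheme://" in this simplified sketch (assumption:
# a flat ASCII class stands in for the real RFC 3986 grammar).
ASCII_TAIL = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%"
SCHEME = r"[A-Za-z][A-Za-z0-9+.\-]*://"

# URI variant: ASCII characters only.
URI_RE = re.compile(SCHEME + "[" + ASCII_TAIL + "]+")

# IRI variant: additionally accept Unicode word characters such as "ä"
# (a rough approximation of the RFC 3987 character ranges).
IRI_RE = re.compile(SCHEME + "[" + ASCII_TAIL + r"\w]+")

text = "http://www.example.org/foo.html#enthält_Unicode-Fragment"
print(URI_RE.findall(text))  # match stops before the non-ASCII letter
print(IRI_RE.findall(text))  # match covers the whole string
```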