Find URI in text: Difference between revisions

Revision as of 12:07, 5 December 2023


Revision as of 15:08, 8 January 2012 by rosettacode>Dgamey (m: annotation and comments pertaining to delineation, IRIs, Unicode). Latest revision as of 12:07, 5 December 2023 by PureFox (m: {{header|Wren}}: Minor tidy).
(42 intermediate revisions by 19 users not shown)
{{Draft task|Text processing}}

;Task:
Write a function to search plain text for URIs or IRIs.

The function should return a list of URIs or IRIs found in the text.

The definition of a URI is given in RFC 3986. IRI is defined in RFC 3987.

For searching URIs in particular "Appendix C. Delimiting a URI in Context" is noteworthy.

The abbreviation IRI isn't as well known as URI; in short, an IRI is an alternate form of a URI that supports internationalization and hence Unicode. While many specifications support both forms, this isn't universal.

Consider the following issues:
:*   <big> <code>. , ; ' ? ( )</code> </big>   are legal characters in a URI, but they are often used in plain text as a delimiter.
:*   IRIs allow most (but not all) Unicode characters.
:*   A user may type a URI as seen in the browser location-bar with non-ASCII characters (which are not legal).
:*   URIs can be something else besides http:// or https://

;Sample text:
<nowiki>
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
</nowiki>

Regular expressions to solve the task are fine, but alternative approaches are welcome too. (otherwise, this task would degrade into 'finding and applying the best regular expression')

'''Extra Credit:'''   implement the parser to match the IRI specification in RFC 3987.
<br><br>

=={{header|Delphi}}==
{{libheader| System.RegularExpressions}}
{{Trans|Go}}
<syntaxhighlight lang="delphi">
program Find_URI_in_text;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.RegularExpressions;

const
  pattern = '(*UTF)(*UCP)' + // Make \w unicode aware
    '(?<URI>[a-z][-a-z0-9+.]*:' + // Scheme...
    '(?=[/\w])' + // ... but not just the scheme
    '(?://[-\w.@:]+)?)' + // Host
    '[-\w.~/%!$&''()*+,;=]*' + // Path
    '(?:\?[-\w.~%!$&''()*+,;=/?]*)?' + // Query
    '(?:\#[-\w.~%!$&''()*+,;=/?]*)?'; // Fragment

  Text = 'this URI contains an illegal character, parentheses and a misplaced full stop:' + #13#10 +
    'http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). ' +
    '(which is handled by http://mediawiki.org/).' + #13#10 +
    'and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)' + #13#10 +
    '")" is handled the wrong way by the mediawiki parser.' + #13#10 +
    'ftp://domain.name/path(balanced_brackets)/foo.html' + #13#10 +
    'ftp://domain.name/path(balanced_brackets)/ending.in.dot.' + #13#10 +
    'ftp://domain.name/path(unbalanced_brackets/ending.in.dot.'
    + #13#10 + 'leading junk ftp://domain.name/path/embedded?punct/uation.' + #13#10 +
    'leading junk ftp://domain.name/dangling_close_paren)' + #13#10 +
    'if you have other interesting URIs for testing, please add them here:' + #13#10 +
    'http://www.example.org/foo.html#includes_fragment' + #13#10 +
    'http://www.example.org/foo.html#enthält_Unicode-Fragment';

var
  reg: TRegEx;
  Match: TMatch;
  IRIs: string = '';
  URIs: string = '';

begin
  reg := TRegEx.Create(pattern);
  for Match in reg.Matches(Text) do
  begin
    URIs := URIs + #10 + Match.Groups['URI'].Value;
    IRIs := IRIs + #10 + Match.Value;
  end;

  Write('URIs:-');
  Writeln(URIs, #10);

  Write('IRIs:-');
  Writeln(IRIs, #10);

  Readln;
end.</syntaxhighlight>
{{out}}
<pre>
URIs:-
http://en.wikipedia.org
http://mediawiki.org
http://en.wikipedia.org
ftp://domain.name
ftp://domain.name
ftp://domain.name
ftp://domain.name
ftp://domain.name
http://www.example.org
http://www.example.org

IRIs:-
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
</pre>

=={{header|Go}}==
{{trans|Kotlin}}
{{libheader|golang-pkg-pcre}}
<br>
The regexp package in the Go standard library is not fully compatible with PCRE and is unable to compile the regular expression used here (it does not support lookahead). A third-party library, a PCRE shim for Go, has therefore been used instead.
<syntaxhighlight lang="go">package main

import (
    "fmt"
    "github.com/glenn-brown/golang-pkg-pcre/src/pkg/pcre"
)

var pattern = "(*UTF)(*UCP)" + // Make \w unicode aware
    "[a-z][-a-z0-9+.]*:" + // Scheme...
    "(?=[/\\w])" + // ... but not just the scheme
    "(?://[-\\w.@:]+)?" + // Host
    "[-\\w.~/%!$&'()*+,;=]*" + // Path
    "(?:\\?[-\\w.~%!$&'()*+,;=/?]*)?" + // Query
    "(?:\\#[-\\w.~%!$&'()*+,;=/?]*)?" // Fragment

func main() {
    text := `
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
`
    descs := []string{"URIs:-", "IRIs:-"}
    // pattern[12:] drops the 12-character "(*UTF)(*UCP)" prefix, so \w matches ASCII only
    patterns := []string{pattern[12:], pattern}
    for i := 0; i <= 1; i++ {
        fmt.Println(descs[i])
        re := pcre.MustCompile(patterns[i], 0)
        t := text
        for {
            se := re.FindIndex([]byte(t), 0)
            if se == nil {
                break
            }
            fmt.Println(t[se[0]:se[1]])
            t = t[se[1]:]
        }
        fmt.Println()
    }
}</syntaxhighlight>
{{out}}
<pre>
URIs:-
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enth

IRIs:-
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
</pre>

=={{header|Icon}} and {{header|Unicon}}==
This example follows RFC 3986 very closely (see Talk page for discussion). For better IP parsing see [[Parse_an_IP_Address]]. This solution doesn't handle IRIs per RFC 3987. Neither Icon nor Unicon natively supports Unicode, although ObjectIcon does.

Delimited examples of the form ''<URI>'' or ''"URI"'' will be parsed correctly in any event. Handling of other possibly ambiguous examples that include valid URI characters is done by the 'findURItext' and 'disambURI' procedures. All candidate URIs are returned, since once information is removed it is lost and may be difficult for a user to reconstruct.

This solution deals with all of the trailing character and balance considerations.
<syntaxhighlight lang="icon">procedure main()
   every write(findURItext("this URI contains an illegal character, parentheses_
                  and a misplaced full stop:\n_
Line 47 ⟶ 220:
                  leading junk ftp://domain.name/path/embedded?punct/uation.\n_
                  leading junk ftp://domain.name/dangling_close_paren)\n_
                  if you have other interesting URIs for testing, please add them here:\n_
                  blah (foo://domain.hld/))))"))
end
Line 57 ⟶ 231:
procedure findURItext(s)                #: generate all syntactically valid URIs from s
local u,p
   s ? while tab(upto(&letters)) || (u := 2(p := &pos, URI())) do {
      suspend u                         # return parsed URI
      every suspend disambURI(u,p)      # deal with text ambiguities, return many
      }
end

procedure disambURI(u,p)                #: generate disambiguous URIs from parsed
local u2
   repeat {
      if any('.,;?',u[-1]) then
         suspend u := u[1:-1]           # remove trailing .,;? from URI
      else if u[-1] == "'" == &subject[p-:=1] then
         suspend u := u[1:-1]           # remove trailing ' from 'URI'
      else if any('()',u[-1]) then {
         every u ? u2 := tab(bal())
         if u ~==:= u2 then suspend u   # longest balanced URI wrt ()
         }
      else break                        # done
      }
end

procedure URI()                         #: match longest URI at cursor
Line 148 ⟶ 339:
procedure pctencode()                   #: match 1 % encoded single byte character
   suspend ="%" || tab(any(HEXDIGITS)) || tab(any(HEXDIGITS))
end</syntaxhighlight>

Output:<pre>stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://mediawiki.org/)
http://mediawiki.org/
parser:
http://en.wikipedia.org/wiki/-)
http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/path/embedded?punct/uation
ftp://domain.name/dangling_close_paren)
ftp://domain.name/dangling_close_paren
here:
foo://domain.hld/))))
foo://domain.hld/
</pre>

=={{header|jq}}==
{{works with|jq|with regex}}

The following uses essentially the same regular expression as is used in the [[#Tcl]] article (as of June 2015), and the results using the given input text are identical. Note in particular that scheme-only strings such as "stop:" are not extracted.
<syntaxhighlight lang="jq"># input: a JSON string
# output: a stream of URIs
# Each input string may contain more than one URI.
def findURIs:
  match( "
    [a-z][-a-z0-9+.]*:                 # Scheme...
    (?=[/\\w])                         # ... but not just the scheme
    (?://[-\\w.@:]+)?                  # Host
    [-\\w.~/%!$&'()*+,;=]*             # Path
    (?:\\?[-\\w.~%!$&'()*+,;=/?]*)?    # Query
    (?:[#][-\\w.~%!$&'()*+,;=/?]*)?    # Fragment
    "; "gx")
  | .string ;

# Example: read in a file of arbitrary text and
# produce a stream of the URIs that are identified.
split("\n")[] | findURIs</syntaxhighlight>
{{out}}
<syntaxhighlight lang="sh">$ jq -R -r -f Find_URI_in_text.jq Find_URI_in_text.txt
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)</syntaxhighlight>

=={{header|Julia}}==
The Julia URI parser treats "stop:" and "here:" as schemes with an empty path. Looking at the RFC this seems technically correct, except that the schemes "stop:" and "here:" do not exist, whereas http: and ftp: do.
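The scheme-only behaviour is not specific to Julia. As a hedged illustration in Go (using the standard <code>net/url</code> package rather than the Julia packages of this entry; the helper name <code>looksLikeURI</code> and the sample sentence are made up for this sketch), a word can be rejected when the parser finds a scheme but nothing after it:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// looksLikeURI is a hypothetical helper, not part of any entry on this page.
// It accepts a word only if net/url finds a scheme AND something after it
// (a host, a path, or an opaque part such as the address in mailto:).
func looksLikeURI(word string) bool {
	u, err := url.Parse(word)
	if err != nil || u.Scheme == "" {
		return false
	}
	// "stop:" parses without error but has an empty host, path and opaque part.
	return u.Host != "" || u.Path != "" || u.Opaque != ""
}

func main() {
	text := "a misplaced full stop: see http://mediawiki.org/ or mailto:user@example.org here:"
	// Whitespace is not allowed in a URI, so splitting on it yields candidates.
	for _, word := range strings.Fields(text) {
		if looksLikeURI(word) {
			fmt.Println(word)
		}
	}
}
```

This only filters out bare schemes; it makes no attempt to check whether a scheme is registered, and it does not trim trailing punctuation.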
<syntaxhighlight lang="julia">using URIParser, HTTP

function findvalidURI(txt)
    results = String[]
    # whitespace not allowed in URI, so split on whitespace
    for str in split(txt, r"\s+")
        # convert escaped chars to %dd format
        s = replace(replace(str, r"\&\#x([\d\w]{2})\;" => s"\%\1"), "?" => "x")
        try
            if isvalid(parse(HTTP.URI, s))
                push!(results, str)
            end
        catch
            continue
        end
    end
    return results
end

testtext = """
    this URI contains an illegal character, parentheses and a misplaced full stop:
    http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
    and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
    ")" is handled the wrong way by the mediawiki parser.
    ftp://domain.name/path(balanced_brackets)/foo.html
    ftp://domain.name/path(balanced_brackets)/ending.in.dot.
    ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
    leading junk ftp://domain.name/path/embedded?punct/uation.
    leading junk ftp://domain.name/dangling_close_paren)
    if you have other interesting URIs for testing, please add them here:
    """

for t in strip.(split(testtext, "\n")), result in findvalidURI(t)
    println(result)
end
</syntaxhighlight>{{out}}
<pre>
stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
here:
</pre>

=={{header|Kotlin}}==
The regular expression used here is essentially the same as the one in the Tcl entry. However, the flag expression (?U) is needed to enable matching of Unicode characters. Without it, only ASCII characters are matched.
<syntaxhighlight lang="scala">// version 1.2.21

val pattern =
    "(?U)" +                               // Enable matching of non-ascii characters
    "[a-z][-a-z0-9+.]*:" +                 // Scheme...
    "(?=[/\\w])" +                         // ... but not just the scheme
    "(?://[-\\w.@:]+)?" +                  // Host
    "[-\\w.~/%!\$&'()*+,;=]*" +            // Path
    "(?:\\?[-\\w.~%!\$&'()*+,;=/?]*)?" +   // Query
    "(?:\\#[-\\w.~%!\$&'()*+,;=/?]*)?"     // Fragment

fun main(args: Array<String>) {
    val text = """
    |this URI contains an illegal character, parentheses and a misplaced full stop:
    |http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
    |and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
    |")" is handled the wrong way by the mediawiki parser.
    |ftp://domain.name/path(balanced_brackets)/foo.html
    |ftp://domain.name/path(balanced_brackets)/ending.in.dot.
    |ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
    |leading junk ftp://domain.name/path/embedded?punct/uation.
    |leading junk ftp://domain.name/dangling_close_paren)
    |if you have other interesting URIs for testing, please add them here:
    |http://www.example.org/foo.html#includes_fragment
    |http://www.example.org/foo.html#enthält_Unicode-Fragment
    """.trimMargin()
    // pattern.drop(4) removes the "(?U)" flag, so \w matches ASCII only
    val patterns = listOf(pattern.drop(4), pattern)
    val descs = listOf("URIs:-", "IRIs:-")
    for (i in 0..1) {
        println(descs[i])
        val regex = Regex(patterns[i])
        val matches = regex.findAll(text)
        matches.forEach { println(it.value) }
        println()
    }
}</syntaxhighlight>
{{out}}
<pre>
URIs:-
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enth

IRIs:-
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
</pre>

=={{header|Mathematica}}/{{header|Wolfram Language}}==
Using the built-in text parser
<syntaxhighlight lang="mathematica">TextCases[" this URI contains an illegal character, parentheses and a misplaced full stop: http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/). and another one just to confuse the parser: http://en.wikipedia.org/wiki/-) \")\" is handled the wrong way by the mediawiki parser. ftp://domain.name/path(balanced_brackets)/foo.html ftp://domain.name/path(balanced_brackets)/ending.in.dot. ftp://domain.name/path(unbalanced_brackets/ending.in.dot. leading junk ftp://domain.name/path/embedded?punct/uation. leading junk ftp://domain.name/dangling_close_paren) if you have other interesting URIs for testing, please add them here:", "URL"]</syntaxhighlight>
{{out}}
<pre>{"http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which", "ftp://domain.name/path(balanced_brackets)/foo.html", "ftp://domain.name/path(balanced_brackets)/ending.in.dot", "ftp://domain.name/path(unbalanced_brackets/ending.in.dot", "ftp://domain.name/path/embedded?punct/uation"}</pre>

=={{header|Objeck}}==
Used a regex instead of writing a parser.
<syntaxhighlight lang="objeck">
use RegEx;

class FindUri {
  function : Main(args : String[]) ~ Nil {
    text := "this URI contains an illegal character, parentheses and a misplaced full stop: http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-) \")\" is handled the wrong way by the mediawiki parser. ftp://domain.name/path(balanced_brackets)/foo.html ftp://domain.name/path(balanced_brackets)/ending.in.dot. ftp://domain.name/path(unbalanced_brackets/ending.in.dot. leading junk ftp://domain.name/path/embedded?punct/uation. leading junk ftp://domain.name/dangling_close_paren) if you have other interesting URIs for testing, please add them here:";

    found := RegEx->New("\\w*://(\\w|\\(|\\)|/|,|;|'|\\?|\\.)*")->Find(text);
    count := found->Size();
    "Found: {$count}"->PrintLine();
    each(i : found) {
      found->Get(i)->As(String)->PrintLine();
    };
  }
}</syntaxhighlight>

<pre>
Count: 8
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
</pre>

=={{header|Perl}}==
Only covers whatever Regexp::Common::URI supports.
<syntaxhighlight lang="perl"># 20200821 added Perl programming solution

use strict;
use warnings;

use Regexp::Common qw /URI/; # https://metacpan.org/pod/Regexp::Common::URI

while ( my $line = <DATA> ) {
   chomp $line;
   my @URIs = $line =~ /$RE{URI}/g and print "URI(s) found.\n";
   foreach my $uri (@URIs) { print "URI : $uri\n" }
}

__DATA__
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)</syntaxhighlight>
{{out}}
<pre>URI(s) found.
URI : http://en.wikipedia.org/wiki/Erich_K
URI : http://mediawiki.org/).
URI(s) found.
URI : http://en.wikipedia.org/wiki/-)
URI(s) found.
URI : ftp://domain.name/path(balanced_brackets)/foo.html
URI(s) found.
URI : ftp://domain.name/path(balanced_brackets)/ending.in.dot.
URI(s) found.
URI : ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
URI(s) found.
URI : ftp://domain.name/path/embedded
URI(s) found.
URI : ftp://domain.name/dangling_close_paren)</pre>

=={{header|Phix}}==
The following is based on scanForUrls() in demo\edita\src\easynclr.e, which is used and tested on a daily basis and may get additional bugfixes, though it is quite strongly coupled with syntax colouring and other editor gubbins. It now handles a dangling ")" and a trailing "." to match the mediawiki handling, with Edita (1.0.4) similarly updated. Note that Edita does not highlight a quoted text literal in the same manner as mediawiki, but does with comments.<br>
Deliberately handles IRI but not URI; in other words, no attempt is made to prohibit Unicode characters.
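The handling of a dangling ")" and a trailing "." can be illustrated separately from the editor code. The following is a minimal Go sketch of that idea only, not a translation of the Phix routine or of Edita; the function name <code>trimCandidate</code> and the exact set of trimmed characters are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// trimCandidate is a hypothetical helper: it strips characters from the end
// of a candidate URI that are more likely sentence punctuation than part of
// the URI. A closing parenthesis is removed only when the candidate contains
// more ")" than "(", so balanced pairs inside the path are kept.
func trimCandidate(s string) string {
	for len(s) > 0 {
		last := s[len(s)-1]
		switch {
		case strings.IndexByte(".,;?'", last) >= 0:
			s = s[:len(s)-1] // trailing punctuation
		case last == ')' && strings.Count(s, ")") > strings.Count(s, "("):
			s = s[:len(s)-1] // dangling close parenthesis
		default:
			return s
		}
	}
	return s
}

func main() {
	candidates := []string{
		"http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).",
		"http://mediawiki.org/).",
		"http://en.wikipedia.org/wiki/-)",
		"ftp://domain.name/path(balanced_brackets)/foo.html",
		"ftp://domain.name/dangling_close_paren)",
	}
	for _, c := range candidates {
		fmt.Println(trimCandidate(c))
	}
}
```

Trimming is a heuristic: a URI that legitimately ends in "." or ")" is shortened as well, which is why some entries on this page return both the untrimmed and the trimmed candidates.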
<!--<syntaxhighlight lang="phix">(phixonline)--> <span style="color: #008080;">with</span> <span style="color: #008080;">javascript_semantics</span> <span style="color: #008080;">constant</span> <span style="color: #000000;">schemes</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{</span><span style="color: #008000;">`ftp`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`gopher`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`http`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`https`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`mailto`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`news`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`nntp`</span><span style="color: #0000FF;">,</span> <span style="color: #008000;">`telnet`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`wais`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`file`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`prospero`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`edit`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`tel`</span><span style="color: #0000FF;">,</span><span style="color: #008000;">`urn`</span><span style="color: #0000FF;">}</span> <span style="color: #008080;">function</span> <span style="color: #000000;">scan_for_urls</span><span style="color: #0000FF;">(</span><span style="color: #004080;">sequence</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">)</span> <span style="color: #000080;font-style:italic;">-- such as http::\\wikipedia.org</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">chidx</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">1</span><span style="color: 
#0000FF;">,</span> <span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">1</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">lt</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">text</span><span style="color: #0000FF;">),</span> <span style="color: #000000;">ch2</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">while</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">lt</span> <span style="color: #008080;">do</span> <span style="color: #000000;">ch2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">>=</span><span style="color: #008000;">'a'</span> <span style="color: #008080;">and</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;"><=</span><span style="color: #008000;">'z'</span> <span style="color: #008080;">then</span> <span style="color: #008080;">if</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;">-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">></span><span style="color: #000000;">chidx</span> <span style="color: #008080;">or</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx</span><span style="color: #0000FF;">]<=</span><span style="color: #008000;">' '</span> <span style="color: #008080;">then</span> <span style="color: 
#000000;">chidx</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">chidx2</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">while</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">lt</span> <span style="color: #008080;">do</span> <span style="color: #000000;">ch2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;"><</span><span style="color: #008000;">'a'</span> <span style="color: #008080;">or</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">></span><span style="color: #008000;">'z'</span> <span style="color: #008080;">then</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span> <span style="color: #008080;">end</span> <span style="color: #008080;">while</span> <span style="color: #004080;">string</span> <span style="color: #000000;">oneword</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx</span><span style="color: #0000FF;">..</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">if</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;">></span><span style="color: #000000;">lt</span> <span style="color: #008080;">then</span> <span style="color: 
#008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">if</span> <span style="color: #0000FF;">(</span><span style="color: #000000;">ch2</span><span style="color: #0000FF;">=</span><span style="color: #008000;">':'</span> <span style="color: #008080;">and</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">oneword</span><span style="color: #0000FF;">,</span><span style="color: #000000;">schemes</span><span style="color: #0000FF;">))</span> <span style="color: #008080;">or</span> <span style="color: #0000FF;">(</span><span style="color: #000000;">ch2</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'.'</span> <span style="color: #008080;">and</span> <span style="color: #7060A8;">equal</span><span style="color: #0000FF;">(</span><span style="color: #000000;">oneword</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"www"</span><span style="color: #0000FF;">))</span> <span style="color: #008080;">then</span> <span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">chidx0</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">chidx2</span> <span style="color: #004080;">bool</span> <span style="color: #000000;">isUrl</span> <span style="color: #0000FF;">=</span> <span style="color: #004600;">false</span> <span style="color: #004080;">string</span> <span style="color: #000000;">bstack</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">""</span> <span style="color: 
#008080;">while</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">lt</span> <span style="color: #008080;">do</span> <span style="color: #000000;">ch2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'\"'</span> <span style="color: #008080;">and</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;">=</span><span style="color: #000000;">chidx0</span> <span style="color: #008080;">then</span> <span style="color: #008080;">while</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;"><</span><span style="color: #000000;">lt</span> <span style="color: #008080;">do</span> <span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span> <span style="color: #000000;">ch2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'\"'</span> <span style="color: #008080;">then</span> <span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span> <span style="color: #000000;">isUrl</span> <span style="color: #0000FF;">=</span> <span style="color: #004600;">true</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: 
#008080;">while</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">elsif</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">ch2</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"(<[{"</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span> <span style="color: #000000;">bstack</span> <span style="color: #0000FF;">&=</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">+</span><span style="color: #7060A8;">iff</span><span style="color: #0000FF;">(</span><span style="color: #000000;">ch2</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'('</span><span style="color: #0000FF;">?</span><span style="color: #000000;">1</span><span style="color: #0000FF;">:</span><span style="color: #000000;">2</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">elsif</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">ch2</span><span style="color: #0000FF;">,</span><span style="color: #008000;">")>]}"</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span> <span style="color: #008080;">if</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">bstack</span><span style="color: #0000FF;">)=</span><span style="color: #000000;">0</span> <span style="color: #008080;">or</span> <span style="color: #000000;">bstack</span><span style="color: #0000FF;">[$]!=</span><span style="color: #000000;">ch2</span> <span style="color: #008080;">then</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #000000;">bstack</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">bstack</span><span style="color: 
#0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">..$-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">if</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;">></span><span style="color: #000000;">255</span> <span style="color: #008080;">or</span> <span style="color: #000000;">ch2</span><span style="color: #0000FF;"><=</span><span style="color: #008000;">' '</span> <span style="color: #008080;">then</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #000000;">isUrl</span> <span style="color: #0000FF;">=</span> <span style="color: #004600;">true</span> <span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span> <span style="color: #008080;">end</span> <span style="color: #008080;">while</span> <span style="color: #008080;">if</span> <span style="color: #000000;">isUrl</span> <span style="color: #008080;">then</span> <span style="color: #004080;">string</span> <span style="color: #000000;">oneurl</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">text</span><span style="color: #0000FF;">[</span><span style="color: #000000;">chidx</span><span style="color: #0000FF;">..</span><span style="color: #000000;">chidx2</span><span style="color: #0000FF;">-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">if</span> <span style="color: #000000;">oneurl</span><span style="color: #0000FF;">[$]=</span><span style="color: #008000;">'.'</span> <span style="color: #008080;">then</span> <span style="color: #000000;">oneurl</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">oneurl</span><span style="color: 
#0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">..$-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">,</span><span style="color: #000000;">oneurl</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #000000;">chidx</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">chidx2</span> <span style="color: #008080;">if</span> <span style="color: #000000;">chidx2</span><span style="color: #0000FF;">></span><span style="color: #000000;">lt</span> <span style="color: #008080;">then</span> <span style="color: #008080;">exit</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">else</span> <span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">-=</span> <span style="color: #000000;">1</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #000000;">chidx2</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span> <span style="color: #008080;">end</span> <span style="color: #008080;">while</span> <span style="color: #008080;">return</span> <span style="color: #000000;">res</span> <span style="color: #008080;">end</span> <span style="color: #008080;">function</span> <span style="color: #008080;">constant</span> <span style="color: #000000;">txt</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">""" this URI contains an 
illegal character, parentheses and a misplaced full stop: http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/). and another one just to confuse the parser: http://en.wikipedia.org/wiki/-) ")" is handled the wrong way by the mediawiki parser. ftp://domain.name/path(balanced_brackets)/foo.html ftp://domain.name/path(balanced_brackets)/ending.in.dot. ftp://domain.name/path(unbalanced_brackets/ending.in.dot. leading junk ftp://domain.name/path/embedded?punct/uation. leading junk ftp://domain.name/dangling_close_paren) if you have other interesting URIs for testing, please add them here: http://www.example.org/foo.html#includes_fragment http://www.example.org/foo.html#enthält_Unicode-Fragment http://192.168.0.1/admin/?hackme=%%%%%%%%%true blah (foo://domain.hld/)))) https://haxor.ur:4592/~mama/####&?foo ftp://ftp.is.co.za/rfc/rfc1808.txt http://www.ietf.org/rfc/rfc2396.txt mailto:John.Doe@example.com news:comp.infosystems.www.servers.unix tel:+1-816-555-1212 telnet://192.0.2.16:80/ urn:oasis:names:specification:docbook:dtd:xml:4.1.2 """</span> <span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"%s\n"</span><span style="color: #0000FF;">,{</span><span style="color: #7060A8;">join</span><span style="color: #0000FF;">(</span><span style="color: #000000;">scan_for_urls</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">),</span><span style="color: #008000;">"\n"</span><span style="color: #0000FF;">)})</span> <!--</syntaxhighlight>--> {{out}} <pre> http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer) http://mediawiki.org/ http://en.wikipedia.org/wiki/- ftp://domain.name/path(balanced_brackets)/foo.html ftp://domain.name/path(balanced_brackets)/ending.in.dot ftp://domain.name/path(unbalanced_brackets/ending.in.dot 
ftp://domain.name/path/embedded?punct/uation
ftp://domain.name/dangling_close_paren
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
http://192.168.0.1/admin/?hackme=%%%%%%%%%true
https://haxor.ur:4592/~mama/####&?foo
ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
</pre>

=={{header|PHP}}==
Trivial example using PHP's built-in filter_var() function (which does not support IRIs).
<syntaxhighlight lang="php">$tests = array(
    'this URI contains an illegal character, parentheses and a misplaced full stop:',
    'http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).',
    'and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)',
    '")" is handled the wrong way by the mediawiki parser.',
    'ftp://domain.name/path(balanced_brackets)/foo.html',
    'ftp://domain.name/path(balanced_brackets)/ending.in.dot.',
    'ftp://domain.name/path(unbalanced_brackets/ending.in.dot.',
    'leading junk ftp://domain.name/path/embedded?punct/uation.',
    'leading junk ftp://domain.name/dangling_close_paren)',
    'if you have other interesting URIs for testing, please add them here:',
    'http://www.example.org/foo.html#includes_fragment',
    'http://www.example.org/foo.html#enthält_Unicode-Fragment',
    ' http://192.168.0.1/admin/?hackme=%%%%%%%%%true',
    'blah (foo://domain.hld/))))',
    'https://haxor.ur:4592/~mama/####&?foo'
);

foreach ( $tests as $test ) {
    foreach( explode( ' ', $test ) as $uri ) {
        if ( filter_var( $uri, FILTER_VALIDATE_URL ) )
            echo $uri, PHP_EOL;
    }
}
</syntaxhighlight>

{{Out}}
<pre>
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://192.168.0.1/admin/?hackme=%%%%%%%%%true
https://haxor.ur:4592/~mama/####&?foo
</pre>

=={{header|Pike}}==
<syntaxhighlight lang="pike">string uritext = #"this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). which is handled by http://mediawiki.org/).
Line 208 ⟶ 827:
"ftp://domain.name/dangling_close_paren)",
"here:"
})</syntaxhighlight>

=={{header|Racket}}==
{{trans|Tcl}}
<syntaxhighlight lang="racket">#lang racket
(define sample
#<<EOS
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
EOS
  )

(define uri-ere-bits
  '("[a-z][-a-z0-9+.]*:" ; Scheme...
    "(?=[/\\w])" ; ... but not just the scheme
    "(?://[-\\w.@:]+)?" ; Host
    "[-\\w.~/%!$&'()*+,;=]*" ; Path
    "(?:\\?[-\\w.~%!$&'()*+,;=/?]*)?" ; Query
    "(?:[#][-\\w.~%!$&'()*+,;=/?]*)?"
    ; Fragment
    ))

(define uri-re (pregexp (apply string-append uri-ere-bits)))

(for-each (compose displayln ~s) (regexp-match* uri-re sample))
(regexp-match-positions* uri-re sample)

(module+ test
  ;; "ABNF for Syntax Specifications" http://tools.ietf.org/html/rfc2234
  ;; defines ALPHA as:
  ;; ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
  (unless (= 228 (char->integer #\ä))
    (error "a-umlaut is not 228, and therefore might be an ALPHA")))</syntaxhighlight>

{{out}}
Tcl's \w matches non-ASCII alphabetic characters, whereas Racket's matches only ASCII ones. The Kästner match therefore ends after the K, because a-umlaut is not an ASCII character.

Match positions differ from the [[#Tcl]] version because:
* sample does not start with a newline in Racket (the here string handles that differently from Tcl braces)
* the cdr of each pair is the index AFTER the last character of the match
<pre>"http://en.wikipedia.org/wiki/Erich_K"
"http://mediawiki.org/)."
"http://en.wikipedia.org/wiki/-)"
"ftp://domain.name/path(balanced_brackets)/foo.html"
"ftp://domain.name/path(balanced_brackets)/ending.in.dot."
"ftp://domain.name/path(unbalanced_brackets/ending.in.dot."
"ftp://domain.name/path/embedded?punct/uation."
"ftp://domain.name/dangling_close_paren)"
((79 . 115) (162 . 185) (230 . 261) (316 . 366) (367 . 423) (424 . 481) (495 . 540) (554 . 593))</pre>

=={{header|Raku}}==
(formerly Perl 6)
This needs an installed URI distribution.
{{works with|Rakudo|2018.03}}
<syntaxhighlight lang="raku" line>use v6;
use IETF::RFC_Grammar::URI;

say q:to/EOF/.match(/ <IETF::RFC_Grammar::URI::absolute-URI> /, :g).list.join("\n");
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
EOF

say $/[-1];
say "We matched $/[-1], which is a $/[-1].^name() at position $/[-1].from() to $/[-1].to()"
</syntaxhighlight>

Like most of the solutions here, it handles only URIs, not IRIs:
<pre>stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded
ftp://domain.name/dangling_close_paren)
｢ftp://domain.name/dangling_close_paren)｣
 IETF::RFC_Grammar::URI::absolute_URI => ｢ftp://domain.name/dangling_close_paren)｣
  scheme => ｢ftp｣
We matched ftp://domain.name/dangling_close_paren), which is a Match, at position 554 to 593</pre>

The last lines show that we get Match objects back, which we can query for all kinds of information. We can even see which subrules matched, and since these are also Match objects, we can obtain their positions in the text.

=={{header|REXX}}==
<syntaxhighlight lang="rexx">/*REXX program scans a text (contained within the REXX program) to extract URIs and IRIs*/
$$= 'this URI contains an illegal character, parentheses and a misplaced full stop:',
    'http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).',
    'and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)',
    '")" is handled the wrong way by the mediawiki parser.',
    'ftp://domain.name/path(balanced_brackets)/foo.html',
    'ftp://domain.name/path(balanced_brackets)/ending.in.dot.',
    'ftp://domain.name/path(unbalanced_brackets/ending.in.dot.',
    'leading junk ftp://domain.name/path/embedded?punct/uation.',
    'leading junk ftp://domain.name/dangling_close_paren)',
    'if you have other interesting URIs for testing, please add them here:'
@abc= 'abcdefghijklmnopqrstuvwxyz'               /*construct lowercase (Latin) alphabet.*/
@abcU= @abc; upper @abcU; @abcs= @abc || @abcU   /*    "     lower & uppercase     "    */
@scheme= @abcs || 0123456789 || '+-.'            /*add decimal digits & some punctuation*/
@unreserved= @abcs || 0123456789 || '-._~'       /* "     "      "    "   "       "     */
@reserved= @unreserved"/?#[]@!$&)(*+,;=\'"       /*add other punctuation & special chars*/
$= space($$)' '                                  /*variable $ is a working copy of $$   */
#= 0                                             /*the count of URI's found (so far).   */
/*▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄*/
  do while $\='';  y= pos(':', $)                /*locate a colon (:) in the text body. */
  if y==0  then leave                            /*Was a colon found? Nope, we're done. */
  if y==1  then do;  parse var $ . $             /*handle a bare colon by itself.       */
                     iterate                     /*go and keep scanning for a colon.    */
                end                              /* [↑] (a rare special case.)          */
  sr= reverse( left($, y - 1) )                  /*extract the scheme and reverse it.   */
  se= verify(sr, @scheme)                        /*locate the end of the scheme.        */
  $= substr($, y + 1)                            /*assign an adjusted new text.         */
  if se\==0  then sr= left(sr, se - 1)           /*possibly "crop" the scheme name.     */
  s= reverse(sr)                                 /*reverse it again to rectify the name.*/
  he= verify($, @reserved)                       /*locate the end of hierarchical part. */
  s= s':'left($, he - 1)                         /*extract and append    "        "     */
  $= substr($, he)                               /*assign an adjusted new part of text. */
  #= # + 1                                       /*bump the URI counter.                */
  !.#= s                                         /*assign the URI to an array (!.)      */
  end   /*while*/                                /* [↑] scan the text for URI's.        */
/*▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀*/
  do k=1  for #;  say !.k;  end                  /*stick a fork in it, we're all done.  */</syntaxhighlight>
{{out|output|text=  when using the internal default inputs:}}
<pre>
stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
here:
</pre>

=={{header|Ruby}}==
<syntaxhighlight lang="ruby">
require 'uri'

str = 'this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:'

puts URI.extract(str)

puts "\nFiltered for HTTP and HTTPS:"
puts URI.extract(str, ["http", "https"])

puts "\nThis is the (extendible) list of supported schemes: #{URI.scheme_list.keys}"</syntaxhighlight>
{{Output}}
<pre>
stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
here:

Filtered for HTTP and HTTPS:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)

This is the (extendible) list of supported schemes: ["FTP", "HTTP", "HTTPS", "LDAP", "LDAPS", "MAILTO"]
</pre>

=={{header|Tcl}}==
This uses regular expressions to do the matching. It doesn't match a URL without a scheme (too problematic in general text), and it requires more than ''just'' the scheme. Apart from that, it matches a slightly too broad range of strings, though rarely problematically so. It matches some IRIs correctly too, but does not tackle the <tt>&lt;bracketed&gt;</tt> form (especially not if it includes extra spaces).
<syntaxhighlight lang="tcl">proc findURIs {text args} {
    # This is an ERE with embedded comments. Rare, but useful with something
    # this complex.
    set URI {(?x)
        [a-z][-a-z0-9+.]*:                  # Scheme...
        (?=[/\w])                           # ... but not just the scheme
        (?://[-\w.@:]+)?                    # Host
        [-\w.~/%!$&'()*+,;=]*               # Path
        (?:\?[-\w.~%!$&'()*+,;=/?]*)?       # Query
        (?:[#][-\w.~%!$&'()*+,;=/?]*)?      # Fragment
    }
    regexp -inline -all {*}$args -- $URI $text
}</syntaxhighlight>
;Demonstrating<nowiki>:</nowiki>
Note that the last line of output shows that we haven't just extracted the URI substrings, but can also get the match positions within the text.
<syntaxhighlight lang="tcl">set sample {
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
}
puts [join [findURIs $sample] \n]
puts [findURIs $sample -indices]</syntaxhighlight>
{{out}}
<pre>
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
{80 140} {163 185} {231 261} {317 366} {368 423} {425 481} {496 540} {555 593}
</pre>

=={{header|TXR}}==
<syntaxhighlight lang="txr">@(define path (path))@\
@(local x y)@\
@(cases)@\
Line 250 ⟶ 1,112:
@ (end)
@ (end)
@(end)</syntaxhighlight>

Test file:
Line 283 ⟶ 1,145:
leading junk ftp://domain.name/dangling_close_paren)
ftp://domain.name/dangling_close_paren</pre>

=={{header|Wren}}==
{{libheader|Wren-pattern}}
Wren's simple pattern matcher lacks the sophistication of regular expressions, so the search pattern below is considerably simplified compared with what complete URI/IRI matching would need. It still does enough to identify the URIs and IRIs embedded in the sample text for this task.
<syntaxhighlight lang="wren">import "./pattern" for Pattern

var text = """
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
"""

var i = Pattern.lower + Pattern.digit + "-+."
var j = Pattern.alpha + "_-.@:"
var k = j + "~\%!$&'()+,;=?/#"
var e = "/l+0/i:////+1/j//+1/k"
var p = Pattern.new(e, Pattern.within, i, j, k)
var matches = p.findAll(text)
System.print("URI's found:\n")
for (m in matches) System.print(m.text)

k = k + "ä"
p = Pattern.new(e, Pattern.within, i, j, k)
System.print("\nIRI's found:\n")
matches = p.findAll(text)
for (m in matches) System.print(m.text)</syntaxhighlight>

{{out}}
<pre>
URI's found:

http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enth

IRI's found:

http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
</pre>
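
Several of the outputs on this page keep a trailing full stop or an unmatched closing parenthesis, because those characters are legal inside a URI. One way to delimit a URI in running text is to find a loose candidate first and then trim its tail. The following is a minimal Python sketch of that idea, not taken from any entry above. The names <code>CANDIDATE</code>, <code>trim_candidate</code> and <code>find_uris</code> are hypothetical, the candidate pattern is deliberately looser than the RFC 3986 grammar, and the set of trailing punctuation is an assumption.

```python
import re

# Loose candidate: a scheme, a colon, then a run of characters that are not
# whitespace, angle brackets or double quotes. This is NOT the RFC 3986
# grammar; it only locates text that might be a URI or IRI.
CANDIDATE = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*:[^\s<>"]+')

# Closing brackets and the openers that would balance them.
CLOSERS = {")": "(", "]": "[", "}": "{"}

# Punctuation assumed to belong to the surrounding sentence when it ends a URI.
TRAILING = ".,;:!?'"


def trim_candidate(candidate):
    """Drop trailing sentence punctuation and unmatched closing brackets."""
    while candidate:
        last = candidate[-1]
        if last in TRAILING:
            candidate = candidate[:-1]
        elif last in CLOSERS and candidate.count(last) > candidate.count(CLOSERS[last]):
            # More closers than openers: the last one belongs to the prose.
            candidate = candidate[:-1]
        else:
            break
    return candidate


def find_uris(text):
    """Return the trimmed URI/IRI candidates found in text, in order."""
    found = []
    for match in CANDIDATE.finditer(text):
        uri = trim_candidate(match.group())
        # Trimming can eat the colon itself (e.g. "a:."); skip such leftovers.
        if ":" in uri:
            found.append(uri)
    return found


if __name__ == "__main__":
    sample = ("stop: http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). "
              "(which is handled by http://mediawiki.org/). "
              "parser: http://en.wikipedia.org/wiki/-)")
    print("\n".join(find_uris(sample)))
```

With this sketch the balanced parentheses in the Kästner URI are kept, while the closing parenthesis and full stop after <code>http://mediawiki.org/</code> are dropped. A word followed directly by a colon and more text, such as <code>Note:this</code>, would still be reported, so a real solution would also check the scheme against a list of known schemes.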