Find URI in text

Task

Write a function to search plain text for URIs or IRIs.

The function should return a list of URIs or IRIs found in the text.

The definition of a URI is given in RFC 3986. IRI is defined in RFC 3987.

For searching URIs in particular, "Appendix C. Delimiting a URI in Context" of RFC 3986 is noteworthy.
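As a rough illustration of the Appendix C idea, the sketch below extracts only URIs that the writer has explicitly delimited with angle brackets or double quotes, drops an optional "URL:" prefix, and removes whitespace that was added to break a long URI across lines. The names find_delimited_uris and SCHEME_AND_REST are made up for this example, and the pattern is a simplification rather than the full RFC 3986 grammar.

```python
import re

# Assumption: a "URI" here is just scheme ":" followed by anything that is
# not a delimiter character; this is far looser than the RFC 3986 grammar.
SCHEME_AND_REST = r'[A-Za-z][A-Za-z0-9+.-]*:[^<>"]*'
DELIMITED = re.compile(
    r'<(?:URL:)?(' + SCHEME_AND_REST + r')>'   # <URI> or <URL:URI>
    r'|"(' + SCHEME_AND_REST + r')"'           # "URI"
)

def find_delimited_uris(text):
    """Return URIs wrapped in <...> or "...", with embedded whitespace removed."""
    found = []
    for m in DELIMITED.finditer(text):
        raw = m.group(1) or m.group(2)
        found.append(re.sub(r"\s+", "", raw))
    return found

sample = ('See <http://example.com/a/very/long/\n    path>, '
          '"ftp://domain.name/file" and <URL:mailto:John.Doe@example.com>.')
print(find_delimited_uris(sample))
```

Undelimited URIs, which make up most of the sample text in this task, need the heuristics discussed further down.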

The abbreviation IRI is less well known than URI. In short, an IRI is an alternate form of a URI that supports internationalization and hence Unicode. Many specifications support both forms, but this isn't universal.
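To make the relationship concrete, here is a minimal Python sketch of turning an IRI into a URI by percent-encoding the UTF-8 bytes of its non-ASCII characters. The function name iri_to_uri is hypothetical, and the sketch assumes the host part is already ASCII; RFC 3987 has more detailed rules (for example for internationalized host names) that are not covered here.

```python
from urllib.parse import quote

# Characters to leave untouched: the RFC 3986 reserved characters plus "%".
# quote() never encodes ASCII letters, digits, or "-", ".", "_", "~".
_SAFE = ":/?#[]@!$&'()*+,;=%"

def iri_to_uri(iri):
    """Percent-encode the UTF-8 bytes of every character outside the safe set."""
    return quote(iri, safe=_SAFE, encoding="utf-8")

print(iri_to_uri("http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)"))
# http://en.wikipedia.org/wiki/Erich_K%C3%A4stner_(camera_designer)
```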

Consider the following issues:

. , ; ' ? ( ) are legal characters in a URI, but they are often used in plain text as delimiters.
IRIs allow most (but not all) Unicode characters.
URIs can use schemes other than http:// or https://
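One possible way to deal with the first issue is a post-processing step that trims a matched candidate. The sketch below is a heuristic, not something mandated by either RFC: it drops trailing punctuation and drops a closing parenthesis only when it has no matching opening parenthesis inside the candidate. The name trim_candidate is invented for this example.

```python
def trim_candidate(candidate):
    """Strip trailing punctuation and unmatched ')' from a URI candidate."""
    while candidate:
        last = candidate[-1]
        if last in ".,;'?":
            # Most likely sentence punctuation rather than part of the URI.
            candidate = candidate[:-1]
        elif last == ")" and candidate.count(")") > candidate.count("("):
            # A ")" without a partner "(" probably closes surrounding prose.
            candidate = candidate[:-1]
        else:
            break
    return candidate

for raw in ["http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).",
            "http://mediawiki.org/).",
            "ftp://domain.name/path(unbalanced_brackets/ending.in.dot."]:
    print(trim_candidate(raw))
```

Because the trimmed characters are legal in a URI, this can also cut off characters that really belonged to it; that trade-off is why some solutions below return several candidates instead.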

Sample text

     this URI contains an illegal character, parentheses and a misplaced full stop:
     http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
     and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
     ")" is handled the wrong way by the mediawiki parser.
     ftp://domain.name/path(balanced_brackets)/foo.html
     ftp://domain.name/path(balanced_brackets)/ending.in.dot.
     ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
     leading junk ftp://domain.name/path/embedded?punct/uation.
     leading junk ftp://domain.name/dangling_close_paren)
     if you have other interesting URIs for testing, please add them here:

Regular expressions to solve the task are fine, but alternative approaches are welcome too (otherwise, this task would degrade into 'finding and applying the best regular expression').

Extra Credit: implement the parser to match the IRI specification in RFC 3987.

Delphi

Library: System.RegularExpressions

Translation of: Go

program Find_URI_in_text;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.RegularExpressions;

const
  pattern = '(*UTF)(*UCP)' +                    // Make \w unicode aware
    '(?<URI>[a-z][-a-z0-9+.]*:' +              // Scheme...
    '(?=[/\w])' +                      // ... but not just the scheme
    '(?://[-\w.@:]+)?)' +               // Host
    '[-\w.~/%!$&''()*+,;=]*' +          // Path
    '(?:\?[-\w.~%!$&''()*+,;=/?]*)?' + // Query
    '(?:\#[-\w.~%!$&''()*+,;=/?]*)?';   // Fragment

  Text =
    'this URI contains an illegal character, parentheses and a misplaced full stop:' + #13#10 + 
    'http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). ' +
    '(which is handled by http://mediawiki.org/).' + #13#10 +
    'and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)' + #13#10 + 
    '")" is handled the wrong way by the mediawiki parser.' + #13#10 +
    'ftp://domain.name/path(balanced_brackets)/foo.html' + #13#10 +
    'ftp://domain.name/path(balanced_brackets)/ending.in.dot.' + #13#10 +
    'ftp://domain.name/path(unbalanced_brackets/ending.in.dot.' + #13#10 +
    'leading junk ftp://domain.name/path/embedded?punct/uation.' + #13#10 +
    'leading junk ftp://domain.name/dangling_close_paren)' + #13#10 +
    'if you have other interesting URIs for testing, please add them here:' + #13#10 + 
    'http://www.example.org/foo.html#includes_fragment' + #13#10 +
    'http://www.example.org/foo.html#enthält_Unicode-Fragment';

var
  reg: TRegEx;
  Match: TMatch;
  IRIs: string = '';
  URIs: string = '';

begin
  reg := TRegEx.Create(pattern);
  for Match in reg.Matches(Text) do
  begin
    URIs := URIs + #10 + Match.Groups['URI'].Value;
    IRIs := IRIs + #10 + Match.Value;
  end;

  Write('URIs:-');
  Writeln(URIs, #10);

  Write('IRIs:-');
  Writeln(IRIs, #10);

  Readln;
end.

Output:

URIs:-
http://en.wikipedia.org
http://mediawiki.org
http://en.wikipedia.org
ftp://domain.name
ftp://domain.name
ftp://domain.name
ftp://domain.name
ftp://domain.name
http://www.example.org
http://www.example.org

IRIs:-
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment

Go

Translation of: Kotlin

Library: golang-pkg-pcre

The regexp package in the Go standard library is not fully compatible with PCRE and is unable to compile the regular expression used here. A third-party library, which is a PCRE shim for Go, has therefore been used instead.

package main

import (
    "fmt"
    "github.com/glenn-brown/golang-pkg-pcre/src/pkg/pcre"
)

var pattern = 
    "(*UTF)(*UCP)" +                    // Make \w unicode aware
    "[a-z][-a-z0-9+.]*:" +              // Scheme...
    "(?=[/\\w])" +                      // ... but not just the scheme
    "(?://[-\\w.@:]+)?" +               // Host
    "[-\\w.~/%!$&'()*+,;=]*" +          // Path
    "(?:\\?[-\\w.~%!$&'()*+,;=/?]*)?" + // Query
    "(?:\\#[-\\w.~%!$&'()*+,;=/?]*)?"   // Fragment

func main() {
    text := `
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
`
    descs := []string{"URIs:-", "IRIs:-"}
    patterns := []string{pattern[12:], pattern}
    for i := 0; i <= 1; i++ {
        fmt.Println(descs[i])
        re := pcre.MustCompile(patterns[i], 0)
        t := text
        for {
            se := re.FindIndex([]byte(t), 0)
            if se == nil {
                break
            }
            fmt.Println(t[se[0]:se[1]])
            t = t[se[1]:]
        }
        fmt.Println()
    }
}

Output:

URIs:-
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enth

IRIs:-
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment

Icon and Unicon

This example follows RFC 3986 very closely (see Talk page for discussion). For better IP parsing see Parse_an_IP_Address. This solution doesn't handle IRIs per RFC 3987. Neither Icon nor Unicon natively supports Unicode, although ObjectIcon does.

Delimited examples of the form <URI> or "URI" will be correctly parsed in any event. Handling of other possibly ambiguous examples that include valid URI characters is done by the 'findURItext' and 'disambURI' procedures. All candidate URIs are returned, since once information is removed it is lost and may be difficult for a user to reconstruct. This solution deals with all of the trailing character and balance considerations.

procedure main()
   every write(findURItext("this URI contains an illegal character, parentheses_
               and a misplaced full stop:\n_
               http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). _
               which is handled by http://mediawiki.org/).\n_
               and another one just to confuse the parser: _
               http://en.wikipedia.org/wiki/-)\n_
               \")\" is handled the wrong way by the mediawiki parser.\n_
               ftp://domain.name/path(balanced_brackets)/foo.html\n_
               ftp://domain.name/path(balanced_brackets)/ending.in.dot.\n_
               ftp://domain.name/path(unbalanced_brackets/ending.in.dot.\n_
               leading junk ftp://domain.name/path/embedded?punct/uation.\n_
               leading junk ftp://domain.name/dangling_close_paren)\n_
               if you have other interesting URIs for testing, please add them here:\n_
               blah (foo://domain.hld/))))"))
end

$define GENDELIM   ':/?#[]@'
$define SUBDELIM   '!$&()*+,;=\''
$define UNRESERVED &letters ++ &digits ++ '-._~'
$define RESERVED   GENDELIM++SUBDELIM
$define HEXDIGITS  '0123456789aAbBcCdDeEfF'

procedure findURItext(s)      #: generate all syntactically valid URIs from s
   local u,p
   s ? while tab(upto(&letters)) || (u := 2(p := &pos, URI())) do  { 
      suspend u                     # return parsed URI 
      every suspend disambURI(u,p)  # deal with text ambiguities, return many
      }
end

procedure disambURI(u,p)      #: generate disambiguated URIs from parsed URI
   local u2
   repeat  {
      if any('.,;?',u[-1]) then 
         suspend u := u[1:-1]             # remove trailing .,;? from URI
      else if u[-1] == "'" == &subject[p-:=1] then 
         suspend u := u[1:-1]             # remove trailing ' from 'URI'
      else if any('()',u[-1]) then   {    
         every u ? u2 := tab(bal())          
         if u ~==:= u2 then suspend u     # longest balanced URI wrt ()          
         }
      else break                          # done
      }       
end  
                  
procedure URI()               #: match longest URI at cursor
   static sc2
   initial sc2 := &letters ++ &digits ++ '+-.'                    # scheme 
   suspend (
      ( tab(any(&letters)) || (tab(many(sc2)) |="") || =":" ) ||  # scheme
      ( (="//" || authority() || arbsp("/",segment)) |            # hier ...
        (="/" || ( path_rootless() |="")) |
        path_rootless() |
        ="" 
      ) ||         
      ( ( ="?" || queryfrag() ) |="" ) ||                         # query
      ( ( ="#" || queryfrag() ) |="" )                            # fragment
      )
end

procedure queryfrag()         #: match a query or fragment
   static pc
   initial pc := UNRESERVED ++ SUBDELIM ++ ':@/?'
   suspend arbcp(pc,pctencode)   
end

procedure segment(n)          #: match a pchar segment
   static sc
   initial sc := UNRESERVED ++ SUBDELIM ++ ':@'
   suspend arbcp(sc,pctencode,n)
end

procedure segmentnc(n)        #: match a pchar--':' segment
   static sc
   initial sc := UNRESERVED ++ SUBDELIM ++ '@'
   suspend arbcp(sc,pctencode,n)
end

procedure path_rootless()     #: match a rootless path
   suspend segment(1) || arbsp("/",segment)
end

procedure authority()         #: match authority
   static uic,rnc
   initial {
      rnc := UNRESERVED ++ SUBDELIM    # regular name
      uic := rnc ++ ':'                # userinfo      
      }
   suspend  ( (arbcp(uic,pctencode) || ="@") |="")  ||      # userinfo
            ( IPsimple() | arbcp(rnc,pctencode) )   ||      # host
            ( (=":" || tab(many(&digits))) |="")
end  
      
procedure IPsimple()          #: match ip address (trickable )
   static i4c,i6c,ifc
   initial {
      i4c := &digits ++ '.'
      i6c := HEXDIGITS ++ '.:'
      ifc := UNRESERVED ++ SUBDELIM ++ ':'
      }
   suspend ( 
      ="[" || 
         (  tab(many(i6c)) |  
            ( ="v"||tab(any(HEXDIGITS))||="."||tab(any(ifc))||tab(many(ifc)) )
      ) || ="]" ) | tab(many(i4c))
end  

procedure arbcp(cs,pr,n)      #: match arbitrary numbers of (cset|proc,n)
   local p,i
   /n := 0                    # for 0* / 1*
   runerr(0 > n,205)
   p := &pos
   i := 0
   while tab(many(cs)) | pr() do i +:= 1
   if i >= n then suspend &subject[p:&pos]
   &pos := p                  # restore &pos
end

procedure arbsp(st,pr,n)      #: match arbitrary numbers of (string || proc,n)
   local p,i
   /n := 0                    # for 0* / 1*
   runerr(0 > n,205)
   p := &pos
   i := 0
   while =st || pr() do i +:= 1 
   if i >= n then suspend &subject[p:&pos]
   &pos := p                  # restore &pos
end

procedure pctencode()         #: match 1 % encoded single byte character
   suspend ="%" || tab(any(HEXDIGITS)) || tab(any(HEXDIGITS))
end

Output:

stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://mediawiki.org/)
http://mediawiki.org/
parser:
http://en.wikipedia.org/wiki/-)
http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/path/embedded?punct/uation
ftp://domain.name/dangling_close_paren)
ftp://domain.name/dangling_close_paren
here:
foo://domain.hld/))))
foo://domain.hld/
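For readers less familiar with Icon, the character classes used above can be sketched in Python. The sketch builds the RFC 3986 "pchar" rule (unreserved, percent-encoded, sub-delims, ":" and "@") as a regular expression and returns the longest path segment at a given position. The helper names longest_segment and _one_of are invented for this illustration; the code only mirrors the segment() and pctencode() procedures and is not a translation of the whole solution.

```python
import re
import string

UNRESERVED = string.ascii_letters + string.digits + "-._~"
SUB_DELIMS = "!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"          # "%" followed by two hex digits

def _one_of(chars):
    """Build a regex character class that matches any single char in chars."""
    return "[" + re.escape(chars) + "]"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = "(?:" + _one_of(UNRESERVED + SUB_DELIMS + ":@") + "|" + PCT_ENCODED + ")"
SEGMENT = re.compile(PCHAR + "*")

def longest_segment(text, start=0):
    """Return the longest run of pchar characters beginning at start."""
    return SEGMENT.match(text, start).group(0)

print(longest_segment("Erich_Kästner_(camera_designer)"))  # Erich_K
print(longest_segment("a%20b/c"))                          # a%20b
```

The first call stops at "ä" because that character is neither unreserved nor percent-encoded, which is the same reason the output above contains the truncated http://en.wikipedia.org/wiki/Erich_K.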

jq

Works with: jq version with regex

The following uses essentially the same regular expression as is used in the #Tcl article (as of June 2015), and the results using the given input text are identical. Note in particular that scheme-only strings such as "stop:" are not extracted.

# input: a JSON string
# output: a stream of URIs
# Each input string may contain more than one URI.
def findURIs:
    match( "
	[a-z][-a-z0-9+.]*:		# Scheme...
	(?=[/\\w])			# ... but not just the scheme
	(?://[-\\w.@:]+)?		# Host
	[-\\w.~/%!$&'()*+,;=]*		# Path
	(?:\\?[-\\w.~%!$&'()*+,;=/?]*)?	# Query
	(?:[#][-\\w.~%!$&'()*+,;=/?]*)?	# Fragment
 
     "; "gx")
    | .string ;

# Example: read in a file of arbitrary text and
# produce a stream of the URIs that are identified.
split("\n")[] | findURIs

Output:

$ jq -R -r -f Find_URI_in_text.jq Find_URI_in_text.txt
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
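The same regular expression can be tried out in Python, whose re module also supports the lookahead and a verbose mode with comments. This is only a sketch for experimentation; the names URI_RE and find_uris are made up here. Note that Python 3 treats \w as Unicode-aware by default, so this version accepts IRI characters as well.

```python
import re

URI_RE = re.compile(r"""
    [a-z][-a-z0-9+.]*:                  # Scheme...
    (?=[/\w])                           # ... but not just the scheme
    (?://[-\w.@:]+)?                    # Host
    [-\w.~/%!$&'()*+,;=]*               # Path
    (?:\?[-\w.~%!$&'()*+,;=/?]*)?       # Query
    (?:[#][-\w.~%!$&'()*+,;=/?]*)?      # Fragment
""", re.VERBOSE)

def find_uris(text):
    """Return every non-overlapping match of URI_RE in text."""
    return URI_RE.findall(text)

text = ("a misplaced full stop:\n"
        "http://mediawiki.org/). leading junk "
        "ftp://domain.name/path/embedded?punct/uation.")
print(find_uris(text))
```

The lookahead is what keeps "stop:" out of the result: the character after the colon is a newline, which is neither "/" nor a word character.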

Julia

The Julia URI parser treats stop:, parser: and here: as schemes with an empty path. Looking at the RFC, this seems technically correct, except that these schemes do not exist, whereas http: and ftp: do.

using URIParser, HTTP

function findvalidURI(txt)
    results = String[]
    # whitespace not allowed in URI, so split on whitespace
    for str in split(txt, r"\s+")
        # convert escaped chars to %dd format
        s = replace(replace(str, r"\&\#x([\d\w]{2})\;" => s"\%\1"), "?" => "x")
        try
            if isvalid(parse(HTTP.URI, s))
                push!(results, str)
            end
        catch
            continue
        end
    end
    return results
end

testtext = """
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
"""
        
for t in strip.(split(testtext, "\n")), result in findvalidURI(t)
    println(result)
end

Output:

stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
here:
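Python's standard URL splitter shows the same behaviour for scheme-only tokens, and one possible way to suppress them is to require a known scheme plus something after it. In the sketch below, KNOWN_SCHEMES is an arbitrary illustrative list and plausible_uri is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlsplit

# Assumption: a small hand-picked list; a real tool might consult the IANA
# registry of URI schemes instead.
KNOWN_SCHEMES = {"http", "https", "ftp", "mailto", "news", "tel", "telnet", "urn"}

def plausible_uri(token):
    """Accept a token only if it has a known scheme and a non-empty remainder."""
    parts = urlsplit(token)
    return parts.scheme in KNOWN_SCHEMES and bool(parts.netloc or parts.path)

print(urlsplit("stop:"))   # scheme is 'stop', every other component is empty
for token in ["stop:", "http://mediawiki.org/).", "mailto:John.Doe@example.com"]:
    print(token, plausible_uri(token))
```

A whitelist trades completeness for precision: tokens such as foo://domain.hld/ would be rejected even though they are syntactically valid.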

Kotlin

The regular expression used here is essentially the same as the one in the Tcl entry. However, the flag expression (?U) is needed to enable matching of Unicode characters. Without this only ASCII characters are matched.

// version 1.2.21

val pattern =
    "(?U)" +                              // Enable matching of non-ascii characters
    "[a-z][-a-z0-9+.]*:" +	          // Scheme...
    "(?=[/\\w])" +                        // ... but not just the scheme
    "(?://[-\\w.@:]+)?" +                 // Host
    "[-\\w.~/%!\$&'()*+,;=]*" +           // Path
    "(?:\\?[-\\w.~%!\$&'()*+,;=/?]*)?" +  // Query
    "(?:\\#[-\\w.~%!\$&'()*+,;=/?]*)?"    // Fragment

fun main(args: Array<String>) {
    val text = """
        |this URI contains an illegal character, parentheses and a misplaced full stop:
        |http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
        |and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
        |")" is handled the wrong way by the mediawiki parser.
        |ftp://domain.name/path(balanced_brackets)/foo.html
        |ftp://domain.name/path(balanced_brackets)/ending.in.dot.
        |ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
        |leading junk ftp://domain.name/path/embedded?punct/uation.
        |leading junk ftp://domain.name/dangling_close_paren)
        |if you have other interesting URIs for testing, please add them here:
        |http://www.example.org/foo.html#includes_fragment
        |http://www.example.org/foo.html#enthält_Unicode-Fragment
    """.trimMargin()
    val patterns = listOf(pattern.drop(4), pattern)
    val descs = listOf("URIs:-", "IRIs:-")
    for (i in 0..1) {
        println(descs[i])
        val regex = Regex(patterns[i])
        val matches = regex.findAll(text)
        matches.forEach { println(it.value) }
        println()
    }
}

Output:

URIs:-
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enth

IRIs:-
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
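The difference between the two result lists comes down to whether \w is Unicode-aware. A comparable experiment in Python, where the default is the opposite of Kotlin's (Unicode matching is on unless re.ASCII is given), might look like this. The pattern is shortened to scheme, host and path for brevity.

```python
import re

PATTERN = r"[a-z][-a-z0-9+.]*:(?=[/\w])(?://[-\w.@:]+)?[-\w.~/%!$&'()*+,;=]*"

ascii_re = re.compile(PATTERN, re.ASCII)   # \w limited to [A-Za-z0-9_]
unicode_re = re.compile(PATTERN)           # \w also matches letters such as "ä"

text = "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)"
print(ascii_re.findall(text))    # match ends after "Erich_K"
print(unicode_re.findall(text))  # match covers the whole string
```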

Mathematica/Wolfram Language

Using the built-in text parser

TextCases[" this URI contains an illegal character, parentheses and a misplaced 
full stop:
 http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which 
is handled by http://mediawiki.org/).
 and another one just to confuse the parser: 
http://en.wikipedia.org/wiki/-)
 \")\" is handled the wrong way by the mediawiki parser.
 ftp://domain.name/path(balanced_brackets)/foo.html
 ftp://domain.name/path(balanced_brackets)/ending.in.dot.
 ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
 leading junk ftp://domain.name/path/embedded?punct/uation.
 leading junk ftp://domain.name/dangling_close_paren)
 if you have other interesting URIs for testing, please add them 
here:", "URL"]

Output:

{"http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which", 
"ftp://domain.name/path(balanced_brackets)/foo.html", 
"ftp://domain.name/path(balanced_brackets)/ending.in.dot", 
"ftp://domain.name/path(unbalanced_brackets/ending.in.dot", 
"ftp://domain.name/path/embedded?punct/uation"}

Objeck

Used a regex instead of writing a parser.

use RegEx;

class FindUri {
  function : Main(args : String[]) ~ Nil {
    text := "this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
\")\" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:";

    found := RegEx->New("\\w*://(\\w|\\(|\\)|/|,|;|'|\\?|\\.)*")->Find(text);
    count := found->Size();
    "Found: {$count}"->PrintLine();
    each(i : found) {
      found->Get(i)->As(String)->PrintLine();
    };
  }
}

Output:

Found: 8
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)

Perl

Only covers whatever Regexp::Common::URI supports.

# 20200821 added Perl programming solution

use strict;
use warnings;

use Regexp::Common qw /URI/; # https://metacpan.org/pod/Regexp::Common::URI

while ( my $line = <DATA> ) {
   chomp $line;
   my @URIs = $line =~ /$RE{URI}/g and print "URI(s) found.\n";
   foreach my $uri (@URIs) { print "URI : $uri\n" }
}

__DATA__
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)

Output:

URI(s) found.
URI : http://en.wikipedia.org/wiki/Erich_K
URI : http://mediawiki.org/).
URI(s) found.
URI : http://en.wikipedia.org/wiki/-)
URI(s) found.
URI : ftp://domain.name/path(balanced_brackets)/foo.html
URI(s) found.
URI : ftp://domain.name/path(balanced_brackets)/ending.in.dot.
URI(s) found.
URI : ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
URI(s) found.
URI : ftp://domain.name/path/embedded
URI(s) found.
URI : ftp://domain.name/dangling_close_paren)

Phix

The following is based on scanForUrls() in demo\edita\src\easynclr.e, which is used/tested on a daily basis and may get additional bugfixes, though it is quite strongly coupled with syntax colouring and other editor gubbins. It now handles dangling ")" and trailing "." to match the mediawiki handling, with Edita (1.0.4) similarly updated. Note that Edita does not highlight a quoted text literal in the same manner as mediawiki, but does so with comments.
It deliberately handles IRIs rather than only URIs; in other words, no attempt is made to prohibit Unicode characters.

with javascript_semantics
constant schemes = {`ftp`,`gopher`,`http`,`https`,`mailto`,`news`,`nntp`,
                    `telnet`,`wais`,`file`,`prospero`,`edit`,`tel`,`urn`}

function scan_for_urls(sequence text)
    -- such as http::\\wikipedia.org
    integer chidx = 1, chidx2 = 1, lt = length(text), ch2
    sequence res = {}
    while chidx2<=lt do
        ch2 = text[chidx2]
        if ch2>='a' and ch2<='z' then
            if chidx2-1>chidx or text[chidx]<=' ' then
                chidx = chidx2
            end if
            while chidx2<=lt do
                ch2 = text[chidx2]
                if ch2<'a' or ch2>'z' then exit end if
                chidx2 += 1
            end while                   
            string oneword = text[chidx..chidx2-1]
            if chidx2>lt then exit end if
            ch2 = text[chidx2]
            if (ch2=':' and find(oneword,schemes))
            or (ch2='.' and equal(oneword,"www")) then
                chidx2 += 1
                integer chidx0 = chidx2
                bool isUrl = false
                string bstack = ""
                while chidx2<=lt do
                    ch2 = text[chidx2]
                    if ch2='\"' and chidx2=chidx0 then
                        while chidx2<lt do
                            chidx2 += 1
                            ch2 = text[chidx2]
                            if ch2='\"' then
                                chidx2 += 1
                                isUrl = true
                                exit
                            end if
                        end while
                        exit
                    elsif find(ch2,"(<[{") then
                        bstack &= ch2+iff(ch2='('?1:2)
                    elsif find(ch2,")>]}") then
                        if length(bstack)=0 or bstack[$]!=ch2 then exit end if
                        bstack = bstack[1..$-1]
                    end if
                    if ch2>255 or ch2<=' ' then exit end if
                    isUrl = true
                    chidx2 += 1
                end while
                if isUrl then
                    string oneurl = text[chidx..chidx2-1]
                    if oneurl[$]='.' then oneurl = oneurl[1..$-1] end if
                    res = append(res,oneurl)
                end if
                chidx = chidx2
                if chidx2>lt then exit end if
            else
                chidx2 -= 1
            end if
        end if
        chidx2 += 1
    end while
    return res
end function

constant txt = """
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:   
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
 http://192.168.0.1/admin/?hackme=%%%%%%%%%true
blah (foo://domain.hld/))))
https://haxor.ur:4592/~mama/####&?foo
  ftp://ftp.is.co.za/rfc/rfc1808.txt
  http://www.ietf.org/rfc/rfc2396.txt
  mailto:John.Doe@example.com
  news:comp.infosystems.www.servers.unix
  tel:+1-816-555-1212
  telnet://192.0.2.16:80/
  urn:oasis:names:specification:docbook:dtd:xml:4.1.2
"""
printf(1,"%s\n",{join(scan_for_urls(txt),"\n")})

Output:

http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)
http://mediawiki.org/
http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot
ftp://domain.name/path/embedded?punct/uation
ftp://domain.name/dangling_close_paren
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
http://192.168.0.1/admin/?hackme=%%%%%%%%%true
https://haxor.ur:4592/~mama/####&?foo
ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2

PHP

Trivial example using PHP's built-in filter_var() function (which does not support IRIs).

$tests = array(
    'this URI contains an illegal character, parentheses and a misplaced full stop:',
    'http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).',
    'and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)',
    '")" is handled the wrong way by the mediawiki parser.',
    'ftp://domain.name/path(balanced_brackets)/foo.html',
    'ftp://domain.name/path(balanced_brackets)/ending.in.dot.',
    'ftp://domain.name/path(unbalanced_brackets/ending.in.dot.',
    'leading junk ftp://domain.name/path/embedded?punct/uation.',
    'leading junk ftp://domain.name/dangling_close_paren)',
    'if you have other interesting URIs for testing, please add them here:',
    'http://www.example.org/foo.html#includes_fragment',
    'http://www.example.org/foo.html#enthält_Unicode-Fragment',
    ' http://192.168.0.1/admin/?hackme=%%%%%%%%%true',
    'blah (foo://domain.hld/))))',
    'https://haxor.ur:4592/~mama/####&?foo'
);

foreach ( $tests as $test ) {
    foreach( explode( ' ', $test ) as $uri ) {
        if ( filter_var( $uri, FILTER_VALIDATE_URL ) )
            echo $uri, PHP_EOL;
    }
}

Output:

http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://192.168.0.1/admin/?hackme=%%%%%%%%%true
https://haxor.ur:4592/~mama/####&?foo

Pike

string uritext = #"this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). 
which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
\")\" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:";

array find_uris(string uritext)
{
    array uris=({}); 
    int pos=0; 
    while((pos = search(uritext, ":", pos+1))>0)
    { 
        int prepos = sizeof(array_sscanf(reverse(uritext[pos-20..pos-1]), "%[a-zA-Z0-9+.-]%s")[0]); 
        int postpos = sizeof(array_sscanf(uritext[pos+1..], "%[^\n\r\t <>\"]%s")[0]); 

        if ((<'.',',','?','!',';'>)[uritext[pos+postpos]])
            postpos--;
        if (uritext[pos-prepos-1]=='(' && uritext[pos+postpos]==')')
            postpos--;
        if (uritext[pos-prepos-1]=='\'' && uritext[pos+postpos]=='\'')
            postpos--;  
        uris+= ({ uritext[pos-prepos..pos+postpos] });
    }
    return uris;
}

find_uris(uritext);
Result: ({ /* 11 elements */
            "stop:",
            "http://en.wikipedia.org/wiki/Erich_K\303\244stner_(camera_designer)",
            "http://mediawiki.org/)",
            "parser:",
            "http://en.wikipedia.org/wiki/-)",
            "ftp://domain.name/path(balanced_brackets)/foo.html",
            "ftp://domain.name/path(balanced_brackets)/ending.in.dot",
            "ftp://domain.name/path(unbalanced_brackets/ending.in.dot",
            "ftp://domain.name/path/embedded?punct/uation",
            "ftp://domain.name/dangling_close_paren)",
            "here:"
        })
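
The colon-scanning idea used in the Pike entry can be sketched in Python as follows. This is only a loose illustration, not a translation: the names find_uris_by_colon, SCHEME_CHARS, STOP_CHARS and TRAILING are made up for this sketch, the single-quote rule is left out, the scan resumes after the end of each candidate, and a non-empty scheme is required.

```python
import string

SCHEME_CHARS = set(string.ascii_letters + string.digits + "+.-")
STOP_CHARS = set(' \n\r\t<>"')
TRAILING = set(".,?!;")

def find_uris_by_colon(text):
    """Return candidate URIs found by growing outwards from each ':'."""
    found = []
    pos = text.find(":")
    while pos != -1:
        # Walk left over scheme characters.
        start = pos
        while start > 0 and text[start - 1] in SCHEME_CHARS:
            start -= 1
        # Walk right until whitespace or a quoting character.
        end = pos + 1
        while end < len(text) and text[end] not in STOP_CHARS:
            end += 1
        # Drop one trailing sentence-punctuation character.
        if end - 1 > pos and text[end - 1] in TRAILING:
            end -= 1
        # Drop a closing parenthesis only if the candidate was opened by one.
        if start > 0 and text[start - 1] == "(" and text[end - 1] == ")":
            end -= 1
        if start < pos:  # require a non-empty scheme
            found.append(text[start:end])
        pos = text.find(":", end)
    return found

print(find_uris_by_colon("see (http://a.example/x). and ftp://b.example/y(z)/w."))
```

As in the Pike result, a word that merely ends in a colon (such as "stop:") is still reported, because nothing here checks that anything follows the scheme.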

Racket

Translation of: Tcl

#lang racket

(define sample
  #<<EOS
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
EOS
  )

(define uri-ere-bits
  '("[a-z][-a-z0-9+.]*:"              ; Scheme...
    "(?=[/\\w])"                      ; ... but not just the scheme
    "(?://[-\\w.@:]+)?"               ; Host
    "[-\\w.~/%!$&'()*+,;=]*"          ; Path
    "(?:\\?[-\\w.~%!$&'()*+,;=/?]*)?" ; Query
    "(?:[#][-\\w.~%!$&'()*+,;=/?]*)?" ; Fragment
    ))

(define uri-re (pregexp (apply string-append uri-ere-bits)))

(for-each (compose displayln ~s) (regexp-match* uri-re sample))
(regexp-match-positions* uri-re sample)

(module+ test
  ;; "ABNF for Syntax Specifications" http://tools.ietf.org/html/rfc2234
  ;; defines ALPHA as:
  ;;   ALPHA = %x41-5A / %x61-7A   ; A-Z / a-z
  (unless (= 228 (char->integer #\ä))
    (error "a-umlaut is not 228, and therefore might be an ALPHA")))

Output:

Tcl's \w matches non-ASCII alphabetic characters, whereas Racket's \w matches only ASCII word characters. The Kästner match therefore ends after the K, because a-umlaut is not an ASCII character.

Match positions differ from the Tcl version because:

sample does not start with a newline in Racket (the here string handles that differently from Tcl braces)
the cdr of each pair is the index AFTER the last character of the match

"http://en.wikipedia.org/wiki/Erich_K"
"http://mediawiki.org/)."
"http://en.wikipedia.org/wiki/-)"
"ftp://domain.name/path(balanced_brackets)/foo.html"
"ftp://domain.name/path(balanced_brackets)/ending.in.dot."
"ftp://domain.name/path(unbalanced_brackets/ending.in.dot."
"ftp://domain.name/path/embedded?punct/uation."
"ftp://domain.name/dangling_close_paren)"
((79 . 115) (162 . 185) (230 . 261) (316 . 366) (367 . 423) (424 . 481) (495 . 540) (554 . 593))
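
Both points can be reproduced with Python's re module, shown here as a sketch rather than as part of the Racket entry. The pattern pieces are copied from the Racket/Tcl entries; the helper name matches_with_spans and the one-line sample are assumptions made for this illustration. With re.ASCII, \w is limited to ASCII word characters, and match ends are exclusive, like the cdr of the Racket pairs.

```python
import re

# Same pattern pieces as the Racket/Tcl entries.
URI_PATTERN = (
    r"[a-z][-a-z0-9+.]*:"               # scheme
    r"(?=[/\w])"                        # ... but not just the scheme
    r"(?://[-\w.@:]+)?"                 # host
    r"[-\w.~/%!$&'()*+,;=]*"            # path
    r"(?:\?[-\w.~%!$&'()*+,;=/?]*)?"    # query
    r"(?:[#][-\w.~%!$&'()*+,;=/?]*)?"   # fragment
)

ascii_re = re.compile(URI_PATTERN, re.ASCII)   # \w is [A-Za-z0-9_] only
unicode_re = re.compile(URI_PATTERN)           # \w also matches letters such as a-umlaut

def matches_with_spans(regex, text):
    """Return (matched_text, start, end) triples; end is exclusive."""
    return [(m.group(0), m.start(), m.end()) for m in regex.finditer(text)]

line = "see http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). ok"

for found, start, end in matches_with_spans(ascii_re, line):
    print(repr(found), start, end)      # stops after the K
for found, start, end in matches_with_spans(unicode_re, line):
    print(repr(found), start, end)      # runs on through the a-umlaut
```

Because the end index is exclusive, line[start:end] gives back exactly the matched text.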

Raku

(formerly Perl 6)

This needs an installed URI distribution.

Works with: Rakudo version 2018.03

use v6;
use IETF::RFC_Grammar::URI;

say q:to/EOF/.match(/ <IETF::RFC_Grammar::URI::absolute-URI> /, :g).list.join("\n");
    this URI contains an illegal character, parentheses and a misplaced full stop:
    http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
    and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
    ")" is handled the wrong way by the mediawiki parser.
    ftp://domain.name/path(balanced_brackets)/foo.html
    ftp://domain.name/path(balanced_brackets)/ending.in.dot.
    ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
    leading junk ftp://domain.name/path/embedded?punct/uation.
    leading junk ftp://domain.name/dangling_close_paren)
    EOF

say $/[*-1];
say "We matched $/[*-1], which is a $/[*-1].^name(), at position $/[*-1].from() to $/[*-1].to()"

Like most of the solutions here, it complies only with the URI specification, not with IRI:

stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded
ftp://domain.name/dangling_close_paren)
｢ftp://domain.name/dangling_close_paren)｣
 IETF::RFC_Grammar::URI::absolute_URI => ｢ftp://domain.name/dangling_close_paren)｣
  scheme => ｢ftp｣
We matched ftp://domain.name/dangling_close_paren), which is a Match, at position 554 to 593

The last lines show that we get Match objects back, which we can query for all kinds of information. We can even find out which subrules matched, and since these are also Match objects we can obtain their match positions in the text.
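
For comparison, a similar kind of query can be sketched with Python's re match objects and named groups. The pattern below is a deliberately simplified assumption for illustration, not the RFC 3986 grammar used by the Raku module, and the group names scheme and rest are made up for this sketch.

```python
import re

# Simplified illustration pattern (an assumption, not the RFC 3986 grammar).
URI_RE = re.compile(r'(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*):(?P<rest>//[^\s<>"]+)')

text = "leading junk ftp://domain.name/dangling_close_paren)"
m = URI_RE.search(text)

# The match object knows what matched and where (end index is exclusive).
print(m.group(0), m.start(), m.end())
# Each named group can be queried for its own text and position.
print(m.group("scheme"), m.span("scheme"))
print(m.group("rest"), m.span("rest"))
```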

REXX

/*REXX program scans a text (contained within the REXX program) to extract URIs and IRIs*/
$$= 'this URI contains an illegal character, parentheses and a misplaced full stop:',
    'http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).',
    'and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)',
    '")" is handled the wrong way by the mediawiki parser.',
    'ftp://domain.name/path(balanced_brackets)/foo.html',
    'ftp://domain.name/path(balanced_brackets)/ending.in.dot.',
    'ftp://domain.name/path(unbalanced_brackets/ending.in.dot.',
    'leading junk ftp://domain.name/path/embedded?punct/uation.',
    'leading junk ftp://domain.name/dangling_close_paren)',
    'if you have other interesting URIs for testing, please add them here:'

@abc=        'abcdefghijklmnopqrstuvwxyz'        /*construct lowercase (Latin) alphabet.*/
@abcU= @abc;  upper @abcU;  @abcs= @abc || @abcU /*    "     lower & uppercase     "    */
@scheme=     @abcs || 0123456789 || '+-.'        /*add decimal digits & some punctuation*/
@unreserved= @abcs || 0123456789 || '-._~'       /* "     "      "    "   "       "     */
@reserved=   @unreserved"/?#[]@!$&)(*+,;=\'"     /*add other punctuation & special chars*/
$= space($$)' '                                  /*variable  $  is a working copy of $$ */
#= 0                                             /*the count of  URI's  found  (so far).*/
                                                 /*▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄*/
  do  while  $\='';            y= pos(':', $)    /*locate a colon  (:) in the text body.*/
  if y==0  then leave                            /*Was a colon found?  Nope, we're done.*/
  if y==1  then do;    parse var   $   .  $      /*handle a  bare colon  by itself.     */
                       iterate                   /*go and keep scanning for a colon.    */
                end                              /* [↑]   (a rare special case.)        */
  sr= reverse( left($, y - 1) )                  /*extract the  scheme  and reverse it. */
  se= verify(sr, @scheme)                        /*locate the  end  of the  scheme.     */
  $= substr($, y + 1)                            /*assign an adjusted new text.         */
  if se\==0  then sr= left(sr, se - 1)           /*possibly  "crop"  the  scheme  name. */
  s= reverse(sr)                                 /*reverse it again to rectify the name.*/
  he= verify($, @reserved)                       /*locate the end of  hierarchical part.*/
  s= s':'left($, he - 1)                         /*extract and append      "         "  */
  $=   substr($, he)                             /*assign an adjusted new part of text. */
  #= # + 1                                       /*bump the  URI  counter.              */
  !.#= s                                         /*assign the  URI  to an array  (!.)   */
  end   /*while*/                                /* [↑]  scan the text for URI's.       */
                                                 /*▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀*/
 do k=1  for #;     say !.k;      end            /*stick a fork in it,  we're all done. */

output when using the internal default inputs:

stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
here:

Ruby

require  'uri'

str = 'this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:'


puts URI.extract(str) 

puts "\nFiltered for HTTP and HTTPS:"
puts URI.extract(str, ["http", "https"])

puts "\nThis is the (extendible) list of supported schemes: #{URI.scheme_list.keys}"

Output:

stop:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
parser:
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
here:

Filtered for HTTP and HTTPS:
http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)

This is the (extendible) list of supported schemes: ["FTP", "HTTP", "HTTPS", "LDAP", "LDAPS", "MAILTO"]
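
The scheme filter shown above can be imitated in a few lines of Python. This is a sketch under stated assumptions: the function name extract and the pattern CANDIDATE_RE are hypothetical and do not reproduce the pattern that Ruby's URI.extract uses internally.

```python
import re

# Rough candidate pattern: a scheme, a colon, then everything up to
# whitespace or a quoting character (an assumption for this sketch).
CANDIDATE_RE = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*:[^\s<>"]*')

def extract(text, schemes=None):
    """Return candidates, optionally restricted to the given schemes."""
    wanted = None if schemes is None else {s.lower() for s in schemes}
    result = []
    for m in CANDIDATE_RE.finditer(text):
        candidate = m.group(0)
        scheme = candidate.split(":", 1)[0].lower()
        if wanted is None or scheme in wanted:
            result.append(candidate)
    return result

sample = "stop: http://a.example/ and ftp://b.example/x"
print(extract(sample))
print(extract(sample, ["http", "https"]))
```

Filtering by scheme is also a cheap way to get rid of false positives such as "stop:", since ordinary words are unlikely to be in the list of wanted schemes.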

Tcl

This uses regular expressions to do the matching. It does not match a URL without a scheme (too error-prone in general text), and it requires something after the scheme as well. Apart from that, it matches a slightly broader range of strings than it should, though rarely by enough to cause problems. It matches some IRIs correctly too, but does not handle the <bracketed> form (especially when that includes extra spaces).

proc findURIs {text args} {
    # This is an ARE (advanced regular expression) with embedded comments.
    # Rare, but useful with something this complex.
    set URI {(?x)
	[a-z][-a-z0-9+.]*:		# Scheme...
	(?=[/\w])			# ... but not just the scheme
	(?://[-\w.@:]+)?		# Host
	[-\w.~/%!$&'()*+,;=]*		# Path
	(?:\?[-\w.~%!$&'()*+,;=/?]*)?	# Query
	(?:[#][-\w.~%!$&'()*+,;=/?]*)?	# Fragment
    }
    regexp -inline -all {*}$args -- $URI $text
}

Demonstrating:

Note that the last line of output shows that we can obtain not only the URI substrings but also the match positions within the text.

set sample {
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
}

puts [join [findURIs $sample] \n]
puts [findURIs $sample -indices]

Output:

http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
{80 140} {163 185} {231 261} {317 366} {368 423} {425 481} {496 540} {555 593}
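
The <bracketed> form that the Tcl solution leaves out is described in RFC 3986, Appendix C, which says that whitespace added inside the delimiters should be ignored when the URI is extracted, and which mentions the older "URL:" prefix. A minimal Python sketch of that idea follows; the names BRACKETED_RE and find_bracketed_uris are assumptions made for this illustration, and the pattern accepts far more than a valid URI between the brackets.

```python
import re

# '<', optional whitespace, a scheme-like prefix, then anything up to '>'.
BRACKETED_RE = re.compile(r"<\s*([A-Za-z][A-Za-z0-9+.-]*:[^<>]*)>")

def find_bracketed_uris(text):
    """Return URIs written in angle brackets, with embedded whitespace removed."""
    uris = []
    for m in BRACKETED_RE.finditer(text):
        body = re.sub(r"\s+", "", m.group(1))   # ignore line breaks and spaces
        if body.startswith("URL:"):             # older "<URL:...>" convention
            body = body[4:]
        uris.append(body)
    return uris

sample = "Yes (see <http://example.com/a\n   /long/path>), or <URL:ftp://example.com/rfc/>."
print(find_bracketed_uris(sample))
```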

TXR

@(define path (path))@\
  @(local x y)@\
  @(cases)@\
    (@(path x))@(path y)@(bind path `(@x)@y`)@\
  @(or)@\
    @{x /[.,;'!?][^ \t\f\v]/}@(path y)@(bind path `@x@y`)@\
  @(or)@\
    @{x /[^ .,;'!?()\t\f\v]/}@(path y)@(bind path `@x@y`)@\
  @(or)@\
    @(bind path "")@\
  @(end)@\
@(end)
@(define url (url))@\
  @(local proto domain path)@\
  @{proto /[A-Za-z]+/}://@{domain /[^ \/\t\f\v]+/}@\
  @(cases)/@(path path)@\
    @(bind url `@proto://@domain/@path`)@\
  @(or)@\
    @(bind url `@proto://@domain`)@\
  @(end)@\
@(end)
@(collect)
@  (all)
@line
@  (and)
@     (coll)@(url url)@(end)@(flatten url)
@  (end)
@(end)
@(output)
LINE 
    URLS
----------------------
@  (repeat)
@line
@    (repeat)
    @url
@    (end)
@  (end)
@(end)

Test file:

$ cat url-data 
Blah blah http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (Handled by http://mediawiki.org/).
Confuse the parser: http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)

Run:

$ txr url.txr url-data 
LINE 
    URLS
----------------------
Blah blah http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (Handled by http://mediawiki.org/).
    http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)
    http://mediawiki.org/
Confuse the parser: http://en.wikipedia.org/wiki/-)
    http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
    ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
    ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
    ftp://domain.name/path
leading junk ftp://domain.name/path/embedded?punct/uation.
    ftp://domain.name/path/embedded?punct/uation
leading junk ftp://domain.name/dangling_close_paren)
    ftp://domain.name/dangling_close_paren
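
The trimming visible in this output (no trailing full stop, no dangling closing parenthesis) can also be done as a post-processing step on candidates found by a greedier matcher. The Python sketch below is not a translation of the TXR code: the name trim_candidate is hypothetical, and an unmatched opening parenthesis is simply kept, so the "unbalanced_brackets" line comes out longer than in the TXR output.

```python
def trim_candidate(candidate):
    """Strip trailing sentence punctuation and unmatched ')' from a candidate."""
    trailing = ".,;'!?"
    while candidate:
        last = candidate[-1]
        if last in trailing:
            candidate = candidate[:-1]
        elif last == ")" and candidate.count(")") > candidate.count("("):
            # More closers than openers: the last ')' belongs to the prose.
            candidate = candidate[:-1]
        else:
            break
    return candidate

for raw in ["http://mediawiki.org/).",
            "http://en.wikipedia.org/wiki/-)",
            "ftp://domain.name/path(balanced_brackets)/ending.in.dot.",
            "ftp://domain.name/path(unbalanced_brackets/ending.in.dot."]:
    print(trim_candidate(raw))
```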

Wren

Library: Wren-pattern

Wren's simple pattern matcher lacks the sophistication of regular expressions, so the search pattern used here is considerably simplified compared with what complete URI/IRI matching would need. It still does enough to identify the URIs and IRIs embedded in the sample text for this task.

import "./pattern" for Pattern

var text = """
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
"""

var i = Pattern.lower + Pattern.digit + "-+."
var j = Pattern.alpha + "_-.@:"
var k = j + "~\%!$&'()*+,;=?/#"
var e = "/l+0/i:////+1/j//+1/k"
var p = Pattern.new(e, Pattern.within, i, j, k)
var matches = p.findAll(text)
System.print("URI's found:\n")
for (m in matches) System.print(m.text)
k = k + "ä"
p = Pattern.new(e, Pattern.within, i, j, k)
System.print("\nIRI's found:\n")
matches = p.findAll(text)
for (m in matches) System.print(m.text)

Output:

URI's found:

http://en.wikipedia.org/wiki/Erich_K
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enth

IRI's found:

http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).
http://mediawiki.org/).
http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
ftp://domain.name/path/embedded?punct/uation.
ftp://domain.name/dangling_close_paren)
http://www.example.org/foo.html#includes_fragment
http://www.example.org/foo.html#enthält_Unicode-Fragment
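
The Wren code turns its URI pattern into an IRI pattern by adding a single character (ä) to the allowed set. A more general way to widen the set is to add the "ucschar" ranges from RFC 3987, sketched here in Python. This is an illustration only: the names URI_CHARS, UCS_CHARS, uri_re and iri_re are assumptions, only the Basic Multilingual Plane ranges are included, and the pattern is as loose as the Wren one rather than a full grammar.

```python
import re

# Characters accepted after "scheme://" in this simplified sketch.
URI_CHARS = r"A-Za-z0-9\-._~%!$&'()*+,;=:@/?#"
# BMP part of RFC 3987 "ucschar"; the supplementary-plane ranges are omitted.
UCS_CHARS = "\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF"

SCHEME = r"[A-Za-z][A-Za-z0-9+.\-]*://"
uri_re = re.compile(SCHEME + "[" + URI_CHARS + "]+")
iri_re = re.compile(SCHEME + "[" + URI_CHARS + UCS_CHARS + "]+")

sample = "http://www.example.org/foo.html#enthält_Unicode-Fragment"
print(uri_re.findall(sample))   # stops before the a-umlaut
print(iri_re.findall(sample))   # covers the whole IRI
```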