Find URI in text: Difference between revisions

m (added whitespace to the task's preamble, added a ;Task:, added whitespace before the TOC.)
Line 1:
{{Draft task|Text processing}}
 
;Task:
Write a function to search plain text for URIs or IRIs.
 
Line 10 ⟶ 12:
 
The abbreviation IRI isn't as well known as URI. In short, an IRI is an alternate form of a URI that supports internationalization and hence Unicode. While many specifications support both forms, this isn't universal.
 
 
Consider the following issues:
:* &nbsp; <big> <code>. , ; ' ? ( )</code> </big> &nbsp; are legal characters in a URI, but they are often used in plain text as delimiters.
:* &nbsp; IRIs allow most (but not all) Unicode characters.
:* &nbsp; URIs can use schemes other than http:// or https://
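
The first issue above can be illustrated with a minimal sketch. It assumes a candidate URI has already been cut out of the text at whitespace; the helper name <code>trimTrailing</code> is hypothetical and the rule it applies is only one possible heuristic, not part of the task. It drops trailing punctuation that is more likely to belong to the sentence, and keeps a closing parenthesis only when an opening one balances it.

```go
package main

import (
	"fmt"
	"strings"
)

// trimTrailing is a hypothetical helper (an assumption, not part of the task).
// It removes trailing characters that are legal in a URI but are more likely
// sentence punctuation. A closing parenthesis is kept only when the candidate
// contains at least as many "(" as ")".
func trimTrailing(candidate string) string {
	for len(candidate) > 0 {
		last := candidate[len(candidate)-1]
		switch {
		case strings.IndexByte(".,;'?", last) >= 0:
			// Trailing delimiter: assume it belongs to the surrounding text.
			candidate = candidate[:len(candidate)-1]
		case last == ')' && strings.Count(candidate, "(") < strings.Count(candidate, ")"):
			// Unbalanced closing parenthesis: assume it closes a text aside.
			candidate = candidate[:len(candidate)-1]
		default:
			return candidate
		}
	}
	return candidate
}

func main() {
	for _, s := range []string{
		"http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer).",
		"http://mediawiki.org/).",
		"ftp://domain.name/dangling_close_paren)",
	} {
		fmt.Println(trimTrailing(s))
	}
}
```

With this heuristic the balanced <code>(camera_designer)</code> survives while the full stop after it is dropped. The trade-off is that a URI that genuinely ends in a dot loses it, so the rule is a judgment call rather than a correct answer.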
 
 
;Sample text:
<nowiki>
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
</nowiki>
 
Regular expressions to solve the task are fine, but alternative approaches are welcome too (otherwise, this task would degrade into 'finding and applying the best regular expression').
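
As a rough regular-expression baseline, the sketch below collects candidate URIs from plain text. The function name <code>findCandidates</code> and the pattern itself are assumptions chosen for illustration: the pattern accepts any scheme made of a letter followed by letters, digits, <code>+</code>, <code>.</code> or <code>-</code>, then <code>://</code>, then everything up to the next whitespace, angle bracket or double quote. It does not try to follow RFC 3986 or RFC 3987.

```go
package main

import (
	"fmt"
	"regexp"
)

// candidateRE is an illustrative pattern (an assumption, not a full RFC 3986
// grammar): a scheme, "://", then a run of characters that are not
// whitespace, angle brackets or double quotes.
var candidateRE = regexp.MustCompile(`[A-Za-z][A-Za-z0-9+.-]*://[^\s<>"]+`)

// findCandidates is a hypothetical helper that returns every substring of
// text matching candidateRE, in order of appearance.
func findCandidates(text string) []string {
	return candidateRE.FindAllString(text, -1)
}

func main() {
	text := "leading junk ftp://domain.name/dangling_close_paren) and http://mediawiki.org/)."
	for _, c := range findCandidates(text) {
		fmt.Println(c)
	}
}
```

Each match still carries any trailing punctuation from the surrounding sentence, and non-ASCII characters such as <code>ä</code> are accepted without checking them against the IRI rules, so a baseline like this needs further filtering before it solves the task.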
 
'''Extra Credit:''' &nbsp; implement the parser to match the IRI specification in RFC 3987.
<br><br>
 
=={{header|Go}}==