Find URI in text: Difference between revisions

Revision as of 17:53, 25 November 2019

110 bytes added , 4 years ago

m

added whitespace to the task's preamble, added a ;Task:, added whitespace before the TOC.

rosettacode>Gerard Schildberger

Revision as of 16:46, 25 November 2019 rosettacode>JonMcLoone (→{{header|Mathematica}})

Revision as of 17:53, 25 November 2019 rosettacode>Gerard Schildberger m (added whitespace to the task's preamble, added a ;Task:, added whitespace before the TOC.)
Line 1:

{{Draft task|Text processing}}

;Task:
Write a function to search plain text for URIs or IRIs.

Line 10 ⟶ 12:

The abbreviation IRI isn't as well known as URI and the short description is that an IRI is just an alternate form of a URI that supports Internationalization and hence Unicode. While many specifications support both forms this isn't universal.

Consider the following issues:
:*   <big> <code>. , ; ' ? ( )</code> </big>   are legal characters in a URI, but they are often used in plain text as a delimiter.
:*   IRIs allow most (but not all) unicode characters.
:*   URIs can be something else besides http:// or https://

;Sample text:
<nowiki>
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:
</nowiki>

Regular expressions to solve the task are fine, but alternative approaches are welcome too. (otherwise, this task would degrade into 'finding and applying the best regular expression')

'''Extra Credit:'''   implement the parser to match the IRI specification in RFC 3987.
<br><br>

=={{header|Go}}==
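The following is a minimal sketch of one possible approach, not the entry that appears on the task page. It assumes that every URI of interest has the form <code>scheme://...</code> (so forms such as <code>mailto:</code> are not found), it keeps non-ASCII characters such as the "ä" in the sample text (that is, it leans towards IRIs without implementing RFC 3987), and it treats a trailing <code>. , ; ' ?</code> or an unmatched closing parenthesis as plain-text punctuation. The names <code>candidate</code>, <code>trimTrailing</code> and <code>FindURIs</code> are hypothetical and chosen for illustration only.

```go
package main

import (
	"regexp"
	"strings"
)

// candidate is an assumed, deliberately loose pattern: a scheme as defined in
// RFC 3986 (a letter followed by letters, digits, "+", "." or "-"), then "://",
// then a run of characters that are not whitespace, angle brackets or quotes.
var candidate = regexp.MustCompile(`[A-Za-z][A-Za-z0-9+.\-]*://[^\s<>"]+`)

// trimTrailing removes characters at the end of a candidate that are more
// likely to be plain-text punctuation than part of the URI: the delimiters
// . , ; ' ? and a closing parenthesis that has no matching opening one.
func trimTrailing(s string) string {
	for len(s) > 0 {
		last := s[len(s)-1]
		switch {
		case strings.IndexByte(".,;'?", last) >= 0:
			s = s[:len(s)-1]
		case last == ')' && strings.Count(s, "(") < strings.Count(s, ")"):
			s = s[:len(s)-1]
		default:
			return s
		}
	}
	return s
}

// FindURIs returns the trimmed candidates found in text, in order of
// appearance. Candidates that shrink to a bare "scheme://" are dropped.
func FindURIs(text string) []string {
	var found []string
	for _, m := range candidate.FindAllString(text, -1) {
		u := trimTrailing(m)
		if u != "" && !strings.HasSuffix(u, "://") {
			found = append(found, u)
		}
	}
	return found
}
```

Calling <code>FindURIs</code> on the sample text above from a <code>main</code> function and printing the result one entry per line is enough to try it out. The parenthesis rule is a heuristic: it keeps <code>path(balanced_brackets)</code> intact and drops the dangling <code>)</code>, but it cannot know whether the author of a text intended a trailing full stop to be part of the URI.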