Find URI in text
Revision as of 12:58, 3 January 2012
Write a function to search plain text for URIs.
The function should return a list of URIs found in the text.
The definition of a URI is given in RFC 3986.
For searching URIs, "Appendix C. Delimiting a URI in Context" is particularly noteworthy.
Consider the following issues:
- . , ; ' ? ( ) are legal characters in a URI, but they are often used in plain text as delimiters.
- A user may type a URI as seen in the browser location bar, with non-ASCII characters (which are not legal).
- URIs can use schemes other than http:// or https://
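The second issue can be illustrated with a minimal sketch, assuming the goal is to turn a URI typed with non-ASCII characters into a legal one by percent-encoding those characters as UTF-8. The helper name `to_ascii_uri` and the constant `RESERVED` are hypothetical and not part of the task; `quote` is the standard `urllib.parse.quote`.

```python
from urllib.parse import quote

# RFC 3986 reserved characters, plus "%" so that existing escapes are left alone.
# (Hypothetical constant for this sketch.)
RESERVED = ":/?#[]@!$&'()*+,;=%"

def to_ascii_uri(uri):
    """Percent-encode, as UTF-8, every character that is neither unreserved nor reserved."""
    return quote(uri, safe=RESERVED)

print(to_ascii_uri("http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)"))
```

This only covers the path, query and fragment. A non-ASCII host name would need IDNA encoding instead, which the sketch does not attempt.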
Sample text:
this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://rosettacode.org/wiki/User:EMBee/Because_we_can_:-)
Regular expressions to solve the task are fine, but alternative approaches are welcome too; otherwise, this task would degrade into 'how to apply a regular expression'.
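To make the issues above concrete, here is a rough sketch of one possible approach, not a reference solution. It assumes a candidate URI is any RFC 3986 scheme followed by ":" and a run of characters up to the next whitespace, angle bracket or double quote, and it then trims characters that more likely belong to the surrounding sentence. The names `CANDIDATE`, `TRAILING_PUNCTUATION` and `find_uris` are chosen for this sketch, and the parenthesis rule is a heuristic of the sketch, not something the RFC prescribes.

```python
import re

# Scheme per RFC 3986: a letter, then letters, digits, "+", "-" or ".".
# The body pattern is an assumption of this sketch, not the RFC grammar.
CANDIDATE = re.compile(r'\b[A-Za-z][A-Za-z0-9+.-]*:[^\s<>"]+')
TRAILING_PUNCTUATION = ".,;?!'"

def find_uris(text):
    """Return candidate URIs in text, trimmed of likely sentence punctuation."""
    uris = []
    for match in CANDIDATE.finditer(text):
        uri = match.group(0)
        before = text[:match.start()]
        # True when the prose before the URI has a "(" that is not yet closed.
        open_paren_pending = before.count("(") > before.count(")")
        while uri:
            last = uri[-1]
            if last in TRAILING_PUNCTUATION:
                uri = uri[:-1]
            elif (last == ")" and open_paren_pending
                  and uri.count(")") > uri.count("(")):
                # The ")" most likely closes the parenthesis opened in the prose.
                uri = uri[:-1]
                open_paren_pending = False
            else:
                break
        uris.append(uri)
    return uris

sample = (
    "this URI contains an illegal character, parentheses and a misplaced full stop:\n"
    "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). "
    "(which is handled by http://mediawiki.org/).\n"
    "and another one just to confuse the parser: "
    "http://rosettacode.org/wiki/User:EMBee/Because_we_can_:-)"
)
for uri in find_uris(sample):
    print(uri)
```

On the sample text this keeps `(camera_designer)` and the final `:-)` intact while dropping the full stops and the parenthesis after `http://mediawiki.org/`. Because any scheme is accepted, a token such as `note:this` in ordinary prose would also be reported; restricting the result to known schemes is left open.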
Pike
<lang Pike>string uritext = "this URI contains an illegal character, parentheses and a misplaced full stop:\n"
                 "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).\n"
                 "and another one just to confuse the parser: http://rosettacode.org/wiki/User:EMBee/Because_we_can_:-)";

array find_uris(string uritext)
{
    array uris=({});
    int pos=0;
    // Every "://" marks a candidate URI; scan outwards from it in both directions.
    while((pos = search(uritext, "://", pos+1))>0)
    {
        // Scheme: the scheme characters directly before "://" (at most 20 are examined).
        int prepos = sizeof(array_sscanf(reverse(uritext[pos-20..pos-1]), "%[a-zA-Z0-9+.-]%s")[0]);
        // Remainder: everything up to the next whitespace, angle bracket or double quote.
        int postpos = sizeof(array_sscanf(uritext[pos+3..], "%[^\n\r\t <>\"]%s")[0]);

        // Drop one trailing punctuation character that probably belongs to the sentence.
        if ((<'.',',','?','!',';'>)[uritext[pos+postpos+2]])
            postpos--;
        // Drop a closing parenthesis or quote only if the matching opener directly precedes the URI.
        if (uritext[pos-prepos-1]=='(' && uritext[pos+postpos+2]==')')
            postpos--;
        if (uritext[pos-prepos-1]=='\'' && uritext[pos+postpos+2]=='\'')
            postpos--;
        uris+= ({ uritext[pos-prepos..pos+postpos+2] });
    }
    return uris;
}</lang>