Find URI in text: Difference between revisions

From Rosetta Code

Revision as of 14:50, 3 January 2012

Find URI in text is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Write a function to search plain text for URIs.

The function should return a list of URIs found in the text.

The definition of a URI is given in RFC 3986.

For finding URIs in running text, "Appendix C. Delimiting a URI in Context" is particularly noteworthy.
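Appendix C describes URIs delimited by double quotes, angle brackets or plain whitespace, and suggests that software accepting typed URIs strip both the delimiters and any embedded whitespace. The following is a minimal Python sketch of the angle-bracket case only. The function names are hypothetical, the example.org addresses are purely illustrative, and the scheme check follows the `scheme` rule of RFC 3986 section 3.1.

```python
import string

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_CHARS = set(string.ascii_letters + string.digits + "+.-")


def has_scheme(candidate):
    """Return True if candidate starts with a syntactically valid scheme and ':'."""
    scheme, sep, rest = candidate.partition(":")
    return bool(
        sep
        and rest
        and scheme
        and scheme[0] in string.ascii_letters
        and set(scheme) <= SCHEME_CHARS
    )


def find_bracketed_uris(text):
    """Collect the contents of <...> spans that look like URIs (hypothetical helper)."""
    uris = []
    start = text.find("<")
    while start != -1:
        end = text.find(">", start + 1)
        if end == -1:
            break
        # Remove embedded whitespace, e.g. from a URI wrapped across lines.
        candidate = "".join(text[start + 1:end].split())
        # Drop the formerly recommended "URL:" prefix if present.
        if candidate.startswith("URL:"):
            candidate = candidate[4:]
        if has_scheme(candidate):
            uris.append(candidate)
        start = text.find("<", end + 1)
    return uris
```

This sketch ignores quoted and whitespace-delimited URIs entirely, so it complements rather than replaces a general search.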

Consider the following issues:

  • . , ; ' ? ( ) are legal characters in a URI, but they are often used as delimiters in plain text.
  • A user may type a URI as seen in the browser location bar, with non-ASCII characters (which are not legal in a URI).
  • URIs can use schemes other than http:// or https://
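The three issues above can be sketched with a regular expression plus a trimming step. This is a heuristic Python sketch, not a full RFC 3986 parser, and it makes three assumptions. It requires "://" after the scheme, so forms such as mailto: are missed. It lets non-ASCII characters through unchanged. It treats a closing parenthesis as a text delimiter only when it has no matching opening parenthesis inside the candidate. The name `find_uris_regex` is hypothetical.

```python
import re

# Any RFC 3986 scheme followed by "://", then everything up to the next
# whitespace, angle bracket or double quote. Non-ASCII characters are allowed
# through on purpose, because users paste them from the location bar.
CANDIDATE = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*://[^\s<>"]+')

# Legal URI characters that usually act as punctuation at the end of a match.
TRAILING = ".,;'?"


def find_uris_regex(text):
    """Return URI candidates found in text, with trailing delimiters trimmed."""
    uris = []
    for match in CANDIDATE.finditer(text):
        uri = match.group()
        while uri:
            if uri[-1] in TRAILING:
                uri = uri[:-1]
            elif uri[-1] == ")" and uri.count(")") > uri.count("("):
                # Unbalanced closing parenthesis: assume it belongs to the text.
                uri = uri[:-1]
            else:
                break
        uris.append(uri)
    return uris
```

With these rules, `http://mediawiki.org/).` is trimmed to `http://mediawiki.org/` and the balanced `_(camera_designer)` suffix is kept. A genuine URI that ends in an unmatched parenthesis, such as `http://en.wikipedia.org/wiki/-)`, is truncated, which shows that the parenthesis rule is a guess rather than a reliable test.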

Sample text:

this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)

Regular expressions to solve the task are fine, but alternative approaches are welcome too. (Otherwise, this task would degrade into 'finding and applying the best regular expression'.)
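As one possible approach without regular expressions, the text can be split on whitespace and each token handed to a URI parser. The Python sketch below uses `urllib.parse.urlsplit` from the standard library. It assumes that only tokens with both a scheme and an authority count as URIs, so mailto: style URIs are skipped, and it leaves parentheses untouched. The name `find_uris_by_tokens` is hypothetical.

```python
from urllib.parse import urlsplit

# Legal URI characters that usually act as punctuation at the end of a token.
TRAILING = ".,;'?"


def find_uris_by_tokens(text):
    """Return whitespace-separated tokens that parse as scheme://authority URIs."""
    uris = []
    for token in text.split():
        # Drop trailing punctuation first, then surrounding brackets or quotes.
        candidate = token.rstrip(TRAILING).strip('<>"')
        try:
            parts = urlsplit(candidate)
        except ValueError:
            # urlsplit rejects some malformed hosts, e.g. a broken IPv6 literal.
            continue
        if parts.scheme and parts.netloc:
            uris.append(candidate)
    return uris
```

On the sample text this keeps the closing parenthesis in both `http://mediawiki.org/)` and `http://en.wikipedia.org/wiki/-)`, so it errs in the opposite direction from a rule that strips unbalanced parentheses.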

Pike

<lang Pike>string uritext = "this URI contains an illegal character, parentheses and a misplaced full stop:\n"
    "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).\n"
    "and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)\n";

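// Find every "://" and grow the match left over scheme characters and
// right up to the next whitespace, angle bracket or double quote.
array find_uris(string uritext)
{
   array uris=({});
   int pos=0;
   while((pos = search(uritext, "://", pos+1))>0)
   {
       // length of the scheme: scan at most 20 characters backwards from "://"
       int prepos = sizeof(array_sscanf(reverse(uritext[pos-20..pos-1]), "%[a-zA-Z0-9+.-]%s")[0]);
       // length of everything after "://" up to the first delimiter
       int postpos = sizeof(array_sscanf(uritext[pos+3..], "%[^\n\r\t <>\"]%s")[0]);
       // drop one trailing punctuation character
       if ((<'.',',','?','!',';'>)[uritext[pos+postpos+2]])
           postpos--;
       // drop a closing parenthesis or quote only if the URI is wrapped in a matching pair
       if (uritext[pos-prepos-1]=='(' && uritext[pos+postpos+2]==')')
           postpos--;
       if (uritext[pos-prepos-1]=='\'' && uritext[pos+postpos+2]=='\'')
           postpos--;
       uris+= ({ uritext[pos-prepos..pos+postpos+2] });
   }
   return uris;
}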

find_uris(uritext);
Result: ({ /* 3 elements */
                "http://en.wikipedia.org/wiki/Erich_K\303\244stner_(camera_designer)",
                "http://mediawiki.org/)",
                "http://en.wikipedia.org/wiki/-)"
       })</lang>