Find URI in text: Difference between revisions
Revision as of 18:55, 3 January 2012
Write a function to search plain text for URIs.
The function should return a list of URIs found in the text.
The definition of a URI is given in RFC 3986.
For finding URIs in running text, "Appendix C. Delimiting a URI in Context" is particularly relevant.
Consider the following issues:
- The characters . , ; ' ? ( ) are legal in a URI, but they are often used in plain text as delimiters.
- A user may type a URI as seen in the browser location bar, with non-ASCII characters (which are not legal).
- URIs can use schemes other than http:// or https://
sample text:
this URI contains an illegal character, parentheses and a misplaced full stop: http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/). and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
Regular expressions to solve the task are fine, but alternative approaches are welcome too (otherwise, this task would degrade into 'finding and applying the best regular expression').
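The issues above can be illustrated with a minimal Python sketch. It is not one of the solutions on this page, only one possible heuristic: take a candidate up to the next whitespace, then strip trailing punctuation and any closing parenthesis that has no opening partner inside the URI. The names `CANDIDATE`, `TRAILING`, `trim` and `find_uris` are made up for this example, and the scheme pattern is a simplification of RFC 3986.

```python
import re

# Candidate: a scheme name, "://", then everything up to whitespace, <, > or ".
# Non-ASCII characters are accepted on purpose, since users paste them.
CANDIDATE = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*://[^\s<>"]+')

# Characters that are legal in a URI but usually belong to the sentence
# when they appear at the very end of a candidate.
TRAILING = ".,;'?!"


def trim(uri):
    """Strip trailing delimiters that most likely belong to the surrounding text."""
    while uri:
        last = uri[-1]
        if last in TRAILING:
            uri = uri[:-1]
        elif last == ")" and uri.count(")") > uri.count("("):
            # Unbalanced closing parenthesis: assume it closes a
            # parenthesis opened in the text, not in the URI.
            uri = uri[:-1]
        else:
            break
    return uri


def find_uris(text):
    """Return the list of (trimmed) URI candidates found in text."""
    return [trim(m.group()) for m in CANDIDATE.finditer(text)]


if __name__ == "__main__":
    sample = ("this URI contains an illegal character, parentheses and a "
              "misplaced full stop: "
              "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). "
              "(which is handled by http://mediawiki.org/). and another one "
              "just to confuse the parser: http://en.wikipedia.org/wiki/-)")
    for uri in find_uris(sample):
        print(uri)
```

Like any such heuristic, this one guesses: it reports the last sample URI without its final parenthesis, so it cannot recover a real URL that genuinely ends in an unbalanced ")".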
Pike
<lang Pike>string uritext = "this URI contains an illegal character, parentheses and a misplaced full stop:\n"
                 "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).\n"
                 "and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)\n";

array find_uris(string uritext)
{
    array uris=({});
    int pos=0;
    while((pos = search(uritext, "://", pos+1))>0)
    {
        // Scan backwards (at most 20 characters) for the scheme name.
        int prepos = sizeof(array_sscanf(reverse(uritext[pos-20..pos-1]), "%[a-zA-Z0-9+.-]%s")[0]);
        // Scan forwards up to whitespace, an angle bracket or a double quote.
        int postpos = sizeof(array_sscanf(uritext[pos+3..], "%[^\n\r\t <>\"]%s")[0]);

        // Drop one trailing punctuation character.
        if ((<'.',',','?','!',';'>)[uritext[pos+postpos+2]])
            postpos--;
        // Drop a closing parenthesis or quote only if the URI is wrapped in a matching pair.
        if (uritext[pos-prepos-1]=='(' && uritext[pos+postpos+2]==')')
            postpos--;
        if (uritext[pos-prepos-1]=='\'' && uritext[pos+postpos+2]=='\'')
            postpos--;
        uris+= ({ uritext[pos-prepos..pos+postpos+2] });
    }
    return uris;
}

find_uris(uritext);
Result: ({ /* 3 elements */
    "http://en.wikipedia.org/wiki/Erich_K\303\244stner_(camera_designer)",
    "http://mediawiki.org/)",
    "http://en.wikipedia.org/wiki/-)"
})</lang>
TXR
<lang txr>@(define path (path))@\
  @(local x y)@\
  @(cases)@\
    (@(path x))@(path y)@(bind path `(@x)@y`)@\
  @(or)@\
    @{x /[.,;'!?][^ \t\f\v]/}@(path y)@(bind path `@x@y`)@\
  @(or)@\
    @{x /[^ .,;'!?()\t\f\v]/}@(path y)@(bind path `@x@y`)@\
  @(or)@\
    @(bind path "")@\
  @(end)@\
@(end)
@(define url (url))@\
  @(local proto domain path)@\
  @{proto /[A-Za-z]+/}://@{domain /[^ \/\t\f\v]+/}@\
  @(cases)/@(path path)@\
    @(bind url `@proto://@domain/@path`)@\
  @(or)@\
    @(bind url `@proto://@domain`)@\
  @(end)@\
@(end)
@(collect)
@ (all)
@line
@ (and)
@ (coll)@(url url)@(end)@(flatten url)
@ (end)
@(end)
@(output)
LINE
    URLS
----------------------
@ (repeat)
@line
@ (repeat)
    @url
@ (end)
@ (end)
@(end)</lang>
Test file:
<pre>$ cat url-data
Blah blah http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (Handled by http://mediawiki.org/).
Confuse the parser: http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)</pre>
Run:
<pre>$ txr url.txr url-data
LINE
    URLS
----------------------
Blah blah http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (Handled by http://mediawiki.org/).
    http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)
    http://mediawiki.org/
Confuse the parser: http://en.wikipedia.org/wiki/-)
    http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
    ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
    ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
    ftp://domain.name/path
leading junk ftp://domain.name/path/embedded?punct/uation.
    ftp://domain.name/path/embedded?punct/uation
leading junk ftp://domain.name/dangling_close_paren)
    ftp://domain.name/dangling_close_paren</pre>