Talk:Find URI in text

From Rosetta Code

Unicode Chars

My hunch is just to leave Unicode characters alone. This can be regarded as a matter of conversion before the URL is used. It depends on the purpose of extracting URLs from text. (Are they headed for a processing stage that deals with those characters fine?) 24.85.131.247 19:01, 3 January 2012 (UTC)

that's the intention exactly. non-ascii characters are mentioned because they should be included. a parser that only accepts legal characters would not do that.--eMBee 02:14, 4 January 2012 (UTC)
So, since spaces can be entered in a browser, they can be accepted as part of a URI, here? --Rdm 18:17, 5 January 2012 (UTC)
i suppose if the url has a delimiter like quotes or <http://go.here/to this place>, then i don't see why not. it depends on the ability to figure out the user's intent, and on the application. depending on where the parser is used there might even be an opportunity to verify that a url actually exists. (now that would actually be an interesting feature: you type some text on some website, and the browser or server tells you that the url you typed does not exist)--eMBee 03:38, 6 January 2012 (UTC)
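The delimiter idea above can be sketched in a few lines. This is a minimal, hypothetical heuristic (the regexes are simplified, not a full RFC 3986 parser): angle-bracket-delimited URIs are allowed to contain spaces because the delimiters make the user's intent explicit, while bare URIs stop at whitespace.

```python
import re

def find_uris(text):
    """Illustrative heuristic: extract URIs from free text.

    <...>-delimited URIs may contain spaces (the delimiters signal
    intent); bare URIs end at whitespace or common delimiters.
    """
    uris = []
    # <scheme:...> -- accept anything up to the closing bracket
    for m in re.finditer(r'<([a-zA-Z][a-zA-Z0-9+.-]*:[^<>]*)>', text):
        uris.append(m.group(1))
    # bare URIs: lookbehind keeps us from re-matching inside <...>
    for m in re.finditer(r'(?<![<\w])([a-zA-Z][a-zA-Z0-9+.-]*://[^\s<>"]+)', text):
        uris.append(m.group(1))
    return uris
```

With this sketch, `find_uris('see <http://go.here/to this place> and http://example.com/page')` yields both the spaced and the bare URI.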

What actually parses as a URI may not be what was intended by the task author

I just had a look at this against the cited RFC. I believe the task description is inconsistent with the syntax spelled out in the RFC. For instance,

  • Even though it doesn't make sense to a person, "stop:" is a valid scheme and parses as a URI through 'path-empty'. To rule it out you must know what the valid schemes are.
  • Nothing in the RFC indicates that parentheses must be balanced, and the characters are allowed via the 'segment' parts of URIs.

Based on this, solutions that yield "stop:" or URIs containing unbalanced parentheses are technically valid but probably not what the author intended. --Dgamey 02:17, 8 January 2012 (UTC)
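The "stop:" case can be checked directly against the grammar. A minimal sketch, assuming only the RFC 3986 productions scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) followed by ":" and the empty 'path-empty' alternative:

```python
import re

# scheme ":" with an empty hier-part ('path-empty') is a complete URI
# per the RFC 3986 ABNF, which is why "stop:" parses as one.
scheme = r'[A-Za-z][A-Za-z0-9+.\-]*'
minimal_uri = re.compile(rf'^{scheme}:$')

assert minimal_uri.match('stop:')       # grammatically a valid URI
assert not minimal_uri.match('9stop:')  # scheme must start with a letter
```

Nothing in the grammar itself rules "stop:" out; only a whitelist of known schemes could.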

not sure about "stop:". for one, new schemes can be made up, and some applications have internal schemes that are known to us. the task only asks to find URIs, not process them, so the decision to deal with "stop:" or not can be handled in the processing stage. for example, in some cases you may only be interested in http, https, and maybe ftp. in such a case you'd go through the list of matches and remove anything that is not of interest. of course one could write the parser in a way that it takes a list of schemes to decide which should be found, but by default there is no harm in finding too much.
nothing in the task indicates that parentheses must be balanced either. unbalanced parentheses are certainly valid and are what the author intended too. please look at the live example i found on wikipedia: http://en.wikipedia.org/wiki/-) (and note how mediawiki parses it wrong :-).--eMBee 03:57, 8 January 2012 (UTC)
I had another look, and the RFC definitions also allow '-' and '.' through 'unreserved', 'pchar', and 'segment', so "http://en.wikipedia.org/wiki/-)" and "http://en.wikipedia.org/wiki/-" are valid as you indicated, as is "http://mediawiki.org/).". Also, a URI with an illegal character is valid up until that character, so "http://en.wikipedia.org/wiki/Erich_K" is valid. Appendix C doesn't help much as none of the sample URIs are cleanly delineated. --Dgamey 04:37, 8 January 2012 (UTC)
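The point about '-', '.', and parentheses being legal can be verified against the 'pchar' production. A sketch assuming the RFC 3986 definitions pchar = unreserved / pct-encoded / sub-delims / ":" / "@", where unreserved covers "-", ".", "_", "~" and sub-delims covers "(" and ")":

```python
import re

# one pchar: unreserved, sub-delims, ":" or "@", or a percent-escape
pchar = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
segment = re.compile(rf'^{pchar}*$')   # segment = *pchar

for s in ['-)', '-', ').', 'Erich_K']:
    assert segment.match(s)            # all legal path segments
assert not segment.match('Kä')         # 'ä' is not allowed by RFC 3986
```

So the cut-off "Erich_K" really is the longest valid URI path prefix once the raw 'ä' appears.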

Expected Output Needed

A list of expected output should be given to avoid confusion. Some of the examples are clearly wrong.

  • Pike is incomplete and includes the illegal char
  • TXR also includes the illegal character

At this time, that means all of the examples are wrong. --Dgamey 04:37, 8 January 2012 (UTC)

this is maybe not clear from the task description, but handling unicode characters is intentional, in order to allow a user to write a url as they see it. (look at how http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer) is displayed in the browser when you follow the link.)
it is not necessary to copy the example input exactly. if you can think of other examples that are worth testing, please include them too.
as for the expected output, this is a question of the balance between following the rfc and handling user expectations. for example, a . or , at the end of a URI is most likely not part of the URI according to user expectation, but it is a legal character in the RFC. which rule is better? i don't know. until someone can show a live URI that has . or , at the end i am inclined to remove them. in contrast, the () case is somewhat easier to decide: if there is a ( before the URI, then clearly the ) at the end is also not part of the URI, but there are edge-cases too.--eMBee 06:58, 8 January 2012 (UTC)
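Both rules discussed above (strip trailing . or ,; strip a trailing ) only when an opening ( preceded the URI and the URI doesn't balance it) can be sketched as one small trimming pass. The function name and the `opened_paren` parameter are hypothetical, introduced only for illustration:

```python
def trim_uri(uri, opened_paren=False):
    """Heuristically trim characters that are legal per RFC 3986 but
    probably sentence punctuation, not part of the intended URI.

    opened_paren: True if a '(' appeared just before the URI in the text.
    """
    # trailing . or , is usually punctuation, not path
    while uri and uri[-1] in '.,':
        uri = uri[:-1]
    # drop a trailing ')' only if the surrounding text opened a '('
    # and the URI itself doesn't balance it (the Erich_Kästner case)
    if opened_paren and uri.endswith(')') and uri.count(')') > uri.count('('):
        uri = uri[:-1]
    return uri
```

Under these rules, `(see http://example.com/page)` loses the closing paren, while `http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)` keeps it because the parentheses balance inside the URI.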

Unicode and URIs

RFC 3986 defines URIs and does not allow Unicode; however, the IETF addresses this in RFC 3987 via the IRI mechanism, which is related but separate. The syntactic definitions are very similar, with most of the elements extended. Two lower-level elements are added, 'iprivate' and 'ucschar', which are specific ranges of Unicode code points (carried in a URI as percent-encoded UTF-8). These elements percolate up through most of the higher syntax elements such as the authority, paths, and segments, which have i-versions. Other elements such as 'scheme' and the IP address elements are left alone. There is also no 'ireserved' element. --Dgamey 14:50, 8 January 2012 (UTC)
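The IRI-to-URI mapping that connects the two RFCs is straightforward to sketch: percent-encode each non-ASCII code point as UTF-8. This is a simplified illustration of RFC 3987's mapping idea (a real converter must handle each URI component separately and respect existing percent-escapes):

```python
from urllib.parse import quote

def iri_to_uri(iri):
    """Naive IRI -> URI sketch: percent-encode non-ASCII as UTF-8.

    Real converters work component-by-component; this treats the
    whole string uniformly, which is enough for path characters.
    """
    return ''.join(
        ch if ord(ch) < 128 else quote(ch, safe='')
        for ch in iri
    )

print(iri_to_uri('http://en.wikipedia.org/wiki/Kästner'))
# http://en.wikipedia.org/wiki/K%C3%A4stner
```

This is why "Erich_Kästner_(camera_designer)" can legitimately appear with a raw 'ä' in text (as an IRI) even though the corresponding URI must carry %C3%A4.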

Having worked on a couple of projects that involve parsing things defined by RFCs, I've found that, unless it's a use-once, throwaway solution, straying from the RFCs or reinterpreting them is generally asking for trouble. --Dgamey 14:50, 8 January 2012 (UTC)