Talk:Find URI in text: Difference between revisions

 
* TXR also includes the illegal character
At this time that would mean all of the examples are wrong. --[[User:Dgamey|Dgamey]] 04:37, 8 January 2012 (UTC)
: this is maybe not clear from the task description, handling unicode characters is intentional in order to allow a user to write an url as they see it. (look at how http://en.wikipedia.org/wiki/Erich_Kästner_Erich_Kästner_(camera_designer) is displayed in the browser when you follow the link.)
:: This gets a bit into the details. The link is encoded with ä which is allowed in a URI. If it's Unicode then it is not technically a URI but an IRI (see below). --[[User:Dgamey|Dgamey]] 15:19, 8 January 2012 (UTC)
::: (no it isn't (ok, i don't have a german keyboard and i was just lazy ;-)) i am not talking about the encoding here in the text, but the display in the browser address bar. (imagine looking at a screenshot). it is conceivable and to be expected that a person would type such an address as she sees it, and expect it to work.--[[User:EMBee|eMBee]] 15:29, 8 January 2012 (UTC)
:::: That would be the wiki then. The character I get back is a single byte extended ASCII value of 228 or xe4. --[[User:Dgamey|Dgamey]] 15:32, 8 January 2012 (UTC)
::::: sorry, i was intentionally opaque. it was <code>&amp;auml;</code> because i was too lazy to copy-paste a real <code>ä</code> until i fixed it. my point though was that the encoding in the text is not relevant to what i am talking about, but the way it is displayed in the address bar. anyways, the whole argument is moot because RFC 3987 covers exactly what i mean and i have updated the task accordingly (thanks again for that pointer).--[[User:EMBee|eMBee]] 16:05, 8 January 2012 (UTC)
:::::: Thanks. I was going to start a discussion on task description but I think you covered it. Even though it's possible to have a good gut feel about what happens, some of these things get picky when you wade into the details. Also this is the price one pays for dabbling in draft tasks :). I may tweak the task description some for elaboration as many may not know what an IRI is. --[[User:Dgamey|Dgamey]] 17:52, 8 January 2012 (UTC)
::::::: that would be great, thank you!--[[User:EMBee|eMBee]] 02:56, 9 January 2012 (UTC)
:it is not necessary to copy the example input exactly. if you can think of other examples that are worth testing, please include them too.
:as for the expected output, this is a question of the balance between following the rfc and handling user expectations. for example, a <code> . </code> or <code> , </code> at the end of a URI is most likely not part of the URI according to user expectation, but it is a legal character in the RFC. which rule is better? i don't know. until someone can show a live URI that has <code> . </code> or <code> , </code> at the end i am inclined to remove them. in contrast the <code>()</code> case is somewhat easier to decide. if there is a <code>(</code> before the URI, then clearly the <code>)</code> at the end is also not part of the URI, but there are edge-cases too.--[[User:EMBee|eMBee]] 06:58, 8 January 2012 (UTC)
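
The trimming heuristic described above could be sketched roughly as follows. This is only an illustration in Python, not part of the task description: the function name <code>trim_candidate</code> and its <code>preceding_char</code> parameter are made up for this example, and the rule it applies (drop trailing <code>.</code> and <code>,</code>, then drop one closing <code>)</code> if the candidate was directly preceded by <code>(</code>) is one possible reading of the comment.

```python
def trim_candidate(candidate: str, preceding_char: str = "") -> str:
    """Trim trailing characters that are legal in a URI but probably
    belong to the surrounding text (hypothetical helper)."""
    # A trailing "." or "," is legal per the RFC but rarely intended.
    trimmed = candidate.rstrip(".,")
    # If the candidate was opened by "(" in the text, treat one closing ")"
    # at its end as part of the text rather than of the URI.
    if preceding_char == "(" and trimmed.endswith(")"):
        trimmed = trimmed[:-1].rstrip(".,")
    return trimmed


if __name__ == "__main__":
    print(trim_candidate("http://example.com/page."))
    print(trim_candidate("http://example.com/page)", "("))
    print(trim_candidate("http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)"))
```

One of the edge cases mentioned above shows up directly: a URI that legitimately ends in <code>)</code> and is itself opened by <code>(</code>, with the closing bracket of the surrounding text separated by whitespace, would lose its final <code>)</code> under this rule.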
 
== Unicode and URIs ==
[http://www.ietf.org/rfc/rfc3986.txt RFC 3986] defines URIs and does not allow Unicode; however, the IETF addresses this in [http://www.ietf.org/rfc/rfc3987.txt RFC 3987] via the IRI mechanism, which is related but separate. The syntactic definitions are very similar, with most of the elements extended. Two lower-level elements are added, 'iprivate' and 'ucschar', which are specific ranges of two byte percent encoded values. These elements percolate up through most of the higher syntax elements such as the authority, paths, and segments, which have i-versions. Other elements such as 'scheme' and the IP address elements are left alone. There is also no 'ireserved' element. --[[User:Dgamey|Dgamey]] 14:50, 8 January 2012 (UTC)
: Having worked on a couple of projects that involve parsing things defined by RFCs I've found that, unless it's a use once and throw away solution, straying from the RFC or reinterpreting them is generally asking for trouble. --[[User:Dgamey|Dgamey]] 14:50, 8 January 2012 (UTC)
:: there is also the general rule: be strict in what you produce, but be liberal in what you accept. i believe this applies here. but thank you for pointing to RFC 3987. looks like that is exactly what i meant, and i wouldn't mind if that is used as a base to decide what is valid and what not. however, i believe that using "any text" except <code>" < > </code> and whitespace as delimiters at the end of a URI is sufficient for most use cases.
::as for once off or throwaway code, i see rosettacode not as a place to provide finalized libraries but code that anyone can use as a starting point to implement their own. for that i favor simpler code that is easier to understand and modify rather than complete code that solves all edge cases which a user may not even be interested in.--[[User:EMBee|eMBee]] 15:55, 8 January 2012 (UTC)
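
As a starting point in the spirit of the comment above, the "liberal" approach (a scheme, a colon, then any text up to <code>"</code>, <code><</code>, <code>></code> or whitespace) might look something like the following in Python. This is a minimal sketch, not a validating RFC 3986 or RFC 3987 parser; the names <code>CANDIDATE</code> and <code>find_candidates</code> are invented for the example, and only the scheme part follows the RFC 3986 grammar (a letter followed by letters, digits, <code>+</code>, <code>-</code> or <code>.</code>).

```python
import re

# Scheme as in RFC 3986: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ),
# followed by ":" and then anything that is not whitespace, '"', "<" or ">".
# The tail is deliberately liberal, so Unicode characters are accepted as typed.
CANDIDATE = re.compile(r'[A-Za-z][A-Za-z0-9+.\-]*:[^\s"<>]+')


def find_candidates(text: str) -> list[str]:
    """Return every substring that looks like a URI/IRI candidate (hypothetical helper)."""
    return CANDIDATE.findall(text)


if __name__ == "__main__":
    sample = ('see <http://example.com/a b> and "ftp://host/file" '
              'plus mailto:someone@example.com.')
    for candidate in find_candidates(sample):
        print(candidate)
```

Because <code>.</code> and <code>,</code> are not delimiters here, a sentence-final period stays attached to the last candidate; whether to trim it afterwards is the separate question of user expectation versus the RFC.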