Talk:Web scraping

== Choice of page to scrape ==
 
I hope I've chosen a page that will be around for some time and in the same form. If not I could switch to extracting the last modified time of a Rosetta page I suppose, but we wouldn't be guaranteed a page that is simple to parse and that you can extract slightly changing answers from. --[[User:Paddy3118|Paddy3118]] 20:53, 20 August 2008 (UTC)
 
:That URL has been broken for many months and this task should be considered unimplementable. [[User:Eliasen]] 7th May 2020
 
::Comment above moved from task description to here. My vote would be to allow extracting Paddy's "7-minutes-to-9pm on the 20th of August 2008" timestamp above as a permitted alternative, i.e. the first "UTC" on this page. To that end I just removed a "UTC" from one of the headings, and the Phix entry now does precisely that, with the old code left in as comments. --[[User:Petelomax|Pete Lomax]] ([[User talk:Petelomax|talk]]) 10:27, 7 May 2020 (UTC)
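
For illustration only, a minimal Python sketch of the "first UTC on this page" alternative might look like the following. It assumes the timestamp has the usual signature form "HH:MM, D Month YYYY (UTC)"; the names <code>fetch_page</code>, <code>first_utc_timestamp</code> and <code>SAMPLE</code> are hypothetical, and this is not the Phix entry's code.

```python
import re
import urllib.request

# Assumed signature format: "20:53, 20 August 2008 (UTC)".
SIGNATURE_STAMP = re.compile(r"(\d{1,2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4}) \(UTC\)")

# Hypothetical stand-in for the page text, so the example runs offline.
SAMPLE = (
    "I hope I've chosen a page that will be around for some time. "
    "--Paddy3118 20:53, 20 August 2008 (UTC)\n"
    "::Comment above moved from task description to here. "
    "--Pete Lomax 10:27, 7 May 2020 (UTC)\n"
)


def fetch_page(url):
    """Download a page and return its text (the URL is supplied by the caller)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def first_utc_timestamp(text):
    """Return the first signature timestamp followed by "(UTC)", or None."""
    match = SIGNATURE_STAMP.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(first_utc_timestamp(SAMPLE))  # prints: 20:53, 20 August 2008
```

To run it against a live page, pass the result of <code>fetch_page(url)</code> to <code>first_utc_timestamp</code>; the URL is deliberately left to the caller.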
 
== Criticism ==
I noticed that several solutions anchored UTC to the end of line, and the page now outputs "UTC Universal Time" instead. --[[User:Glennj|glennj]] 15:32, 12 August 2009 (UTC)
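
As a rough illustration of that breakage, the Python sketch below compares a pattern anchored to the end of the line with one that only requires a word boundary after "UTC". The two sample lines are hypothetical stand-ins; the real page's exact layout is an assumption.

```python
import re

# Hypothetical sample lines; only the trailing text differs.
OLD_LINE = "Aug. 12, 15:32:07 UTC"
NEW_LINE = "Aug. 12, 15:32:07 UTC Universal Time"

# Anchored to end of line: stops matching once text follows "UTC".
ANCHORED = re.compile(r"(\d{2}:\d{2}:\d{2}) UTC$")
# Word boundary only: tolerates anything after "UTC".
TOLERANT = re.compile(r"(\d{2}:\d{2}:\d{2}) UTC\b")


def extract_time(pattern, line):
    """Return the HH:MM:SS part matched by pattern, or None if there is no match."""
    match = pattern.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_time(ANCHORED, OLD_LINE))  # prints: 15:32:07
    print(extract_time(ANCHORED, NEW_LINE))  # prints: None
    print(extract_time(TOLERANT, NEW_LINE))  # prints: 15:32:07
```

The looser pattern is more robust to this kind of change, at the cost of accepting lines the stricter one would reject.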
 
== “Just the UTC time” – clarification ==
 
I've noticed that some examples disagree on what exactly should be printed. Some print, literally, just the timestamp without any date. Some include the date, some include the time zone name, and some print the complete line. I think the task should clarify what exactly should be returned. I read the task as asking for just the time (e.g. 01:22:25). The complete line isn't of much use anyway since it lacks the current year. —[[User:Hypftier|Johannes Rössel]] 00:22, 21 December 2009 (UTC)
: If you're going to start making the task more exact, don't forget to correct each language's implementation or mark it as needing attention. Or let sleeping dogs lie; it's just a web-scraping task and all of the implementations achieve the basic requirement. (After all, disagreement over what should be scraped is normal for programmers doing web scraping…) –[[User:Dkf|Donal Fellows]] 07:38, 21 December 2009 (UTC)
 
:A big aim of the task is showing what can be done with the libraries easily available with the language. I would not want to be more exact on what should be scraped if that would merely add more code to parse text. Obviously returning the whole page would be wrong, as would returning time from another zone, or time that was not from the web page, or the use of some obscure library. (I'm with the sleeping dogs, AAaaoooooowwww...) --[[User:Paddy3118|Paddy3118]] 09:08, 21 December 2009 (UTC)
 
== xPath and HTML web scraping ==
You should have a chance with xPath working on XHTML, but HTML in general can be notoriously badly formed and can contain [http://www.w3schools.com/tags/tag_br.asp tags without terminators] such as <nowiki><br></nowiki>.
I can't remember how well formed this site is though. --[[User:Paddy3118|Paddy3118]] 16:49, 13 May 2011 (UTC)
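
A small Python sketch of the problem, using only the standard library: a strict XML parser rejects a snippet containing an unterminated <nowiki><br></nowiki>, while the lenient <code>html.parser</code> still yields the text. The snippet and the helper names (<code>parses_as_xml</code>, <code>TextCollector</code>, <code>text_chunks</code>) are made up for illustration and say nothing about how well formed any particular site is.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet with an unterminated <br>, legal in HTML but not in XML.
SNIPPET = "<p>12:00:00 UTC<br>Universal Time</p>"
# The same snippet written as well-formed XHTML.
XHTML_SNIPPET = "<p>12:00:00 UTC<br/>Universal Time</p>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Lenient HTML parser that records every run of text it sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_chunks(markup):
    """Return the text runs found by the lenient parser, in document order."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


if __name__ == "__main__":
    print(parses_as_xml(SNIPPET))        # prints: False
    print(parses_as_xml(XHTML_SNIPPET))  # prints: True
    print(text_chunks(SNIPPET))          # prints: ['12:00:00 UTC', 'Universal Time']
```

XPath tooling built on a strict XML parser would fail at the first step on the HTML form, which is why a lenient parser (or a tidy-up pass) usually has to come first.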