Talk:Web scraping

Revision as of 23:00, 8 September 2008 by rosettacode>JimD

Choice of page to scrape

I hope I've chosen a page that will be around for some time and in the same form. If not, I could switch to extracting the last-modified time of a Rosetta page, I suppose, but then we wouldn't be guaranteed a page that is simple to parse and that yields slightly changing answers. --Paddy3118 20:53, 20 August 2008 (UTC)

Criticism

The task as described, and the examples so far, are extremely weak compared with what one normally expects "web scraping" to mean. The examples simply pull a page and extract a line of text using simple regular expressions.

When developers talk about "web scraping" they usually mean much more than fetching a page and doing a trivial extraction with a simple regular expression. The task usually implies more sophisticated parsing of the page's HTML, and it frequently involves encoding a request as a query string (RESTful sites) or as an HTTP POST-able form.

Thus I would expect the task to describe fetching and parsing a web page in HTML form, followed by encoding selected results into a new query (and/or a posted form). This would give a far more realistic example of what "web scraping" means to most people who would use the phrase. JimD 23:00, 8 September 2008 (UTC)
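
To make the suggestion above concrete, here is a rough Python sketch of the parse-then-encode step, using only the standard library. It is an illustration under stated assumptions, not a proposed task solution: the sample HTML, the field names, the `example.com` URL, and the helper names `FormParser` and `build_request` are all hypothetical, and the fetch itself is left out so the sketch runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Hypothetical page content. A real scraper would fetch this first,
# for example with urllib.request.urlopen(base_url).read().
SAMPLE_HTML = """
<html><body>
<form action="/search" method="get">
  <input type="hidden" name="session" value="abc123">
  <input type="text" name="q" value="">
  <input type="submit" value="Go">
</form>
</body></html>
"""


class FormParser(HTMLParser):
    """Collect the action, method and named inputs of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Unnamed inputs (such as the submit button here) are skipped.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_request(base_url, html, overrides):
    """Return (url, body) for submitting the page's first form.

    body is None for a GET form, or URL-encoded bytes for a POST form.
    """
    parser = FormParser()
    parser.feed(html)
    fields = dict(parser.fields)
    fields.update(overrides)  # fill in the values we want to send
    url = urljoin(base_url, parser.action or "")
    encoded = urlencode(fields)
    if parser.method == "post":
        return url, encoded.encode("ascii")
    return url + "?" + encoded, None


if __name__ == "__main__":
    url, body = build_request(
        "http://example.com/index.html", SAMPLE_HTML, {"q": "web scraping"}
    )
    print(url, body)
```

The returned URL (and body, for a POST form) could then be passed to `urllib.request.urlopen` to issue the follow-up request. A real page would likely also need handling for `select` and `textarea` elements, which this sketch ignores.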
