Talk:Web scraping

Choice of page to scrape

I hope I've chosen a page that will be around for some time and in the same form. If not, I could switch to extracting the last-modified time of a Rosetta page, I suppose, but then we wouldn't be guaranteed a page that is simple to parse and that yields slightly changing answers. --Paddy3118 20:53, 20 August 2008 (UTC)

Criticism

The task as described, and the examples so far, are extremely weak compared with one's normal expectations of what "web scraping" means. The examples just pull a page and extract a line of text using simple regular expressions.

When developers talk about "web scraping", they usually mean much more than fetching a page and doing a trivial extraction with a simple regular expression. Usually the task implies more sophisticated parsing of the page's HTML and frequently involves encoding the request into a query string (RESTful sites) or an HTTP POST-able form.

Thus I would expect the task to describe the fetching and parsing of a web page, in HTML form ... with a subsequent encoding of selected results into a new query (and/or posted form). This would give a far more realistic example of what "web scraping" means to most people who would employ the phrase. JimD 23:00, 8 September 2008 (UTC)
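
For illustration only, here is a minimal Python sketch of the contrast being drawn: a one-line regular-expression extraction next to a parse-then-encode step. Everything specific in it is an assumption made for the example and not part of the task: the inlined SAMPLE_HTML stands in for a fetched page, and the names extract_time, LinkCollector, build_query and the query parameter "q" are hypothetical. Only the standard library is used, and no request is actually sent.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical page content, inlined so the sketch runs without a network fetch.
SAMPLE_HTML = """
<html><body>
<p>Last updated: 12:34:56 UTC</p>
<ul>
  <li><a href="/item?id=1">First item</a></li>
  <li><a href="/item?id=2">Second &amp; third</a></li>
</ul>
</body></html>
"""


def extract_time(html):
    """The simple approach: one regular expression, one piece of text."""
    match = re.search(r"(\d{2}:\d{2}:\d{2}) UTC", html)
    return match.group(1) if match else None


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs by parsing the HTML structure."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_query(links):
    """Encode the scraped link texts into a query string for a follow-up request."""
    return urlencode([("q", text) for _, text in links])


parser = LinkCollector()
parser.feed(SAMPLE_HTML)
parser.close()

print(extract_time(SAMPLE_HTML))
print(parser.links)
print(build_query(parser.links))
```

The same encoded string could serve as the body of a POSTed form instead of a query string; which of the two a real site expects is outside what this sketch assumes.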

Hi Jim,

I have read your criticism and looked at the introduction to the definition of web scraping here. It seems we disagree on the size of example appropriate to R.C., but not really on what web scraping is. With a larger example, the central idea of extracting data from a live web page may be lost in the details of how the data is extracted from HTML, or what is done subsequently with that data. A lot of the tasks on R.C. are small, and I thought this would fit that mould.

You could always add a separate task involving extracting data from HTML files? --Paddy3118 12:56, 9 September 2008 (UTC)