Talk:Yahoo! search interface

TOS violation

This task violates Google's TOS.

5. Use of the Services by you

5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.

--Kevin Reid 10:51, 3 May 2009 (UTC)

I'd say just put a warning at the top of the page that says something along the lines of "Usage of this code violates section 5.3 of Google's Terms of Service, unless you have special arrangements with Google", but I'm pretty sure that folks who have such special arrangements also get access to a different API, and probably an authentication key or other token.
So what's the purpose of the task? Is it to call an API and perform some sort of scraping operation on the result? Can someone whip up a low-footprint web script I can put on Rosetta Code's server as an alternative? Or is the task's value specifically in that it helps automate Google searches, so that redirecting to a different server and API makes it useless? If the latter is the case, then I would suggest we delete it. Not because I believe it's necessarily inappropriate to describe how to do something that a TOS or other rule says one isn't allowed to do, but because if we piss Google off and they remove us from their search index, that's 68% of our traffic lost.
Again, if the problem is the need to interact with an API on a remote server and do some web scraping on the results, then if someone provides a suitable, lightweight script, I'll put it up on the server for people to test against.
In the meantime, I'm erasing the code, and modifying robots.txt to not index anything but /wiki/*. (I should have done that ages ago, anyway; Google indexes all the results of clicking on the Edit links, as it stands.) --Short Circuit 12:34, 3 May 2009 (UTC)
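For reference, a robots.txt along these lines would express the "index nothing but /wiki/*" rule described above. The actual file on the Rosetta Code server isn't shown here, so this is only a sketch; note that Allow is a Google extension to the original robots.txt convention, which is enough for the crawler in question:

    User-agent: *
    Allow: /wiki/
    Disallow: /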
I could turn my Vanity Search blog entry into an RC task, but don't we already have a task that extracts info from RC stats pages? Maybe we should just let this other "search a web site that needs user input and gives answers on multiple pages" type of task die? --Paddy3118 12:56, 3 May 2009 (UTC)
Actually, I'd prefer to avoid too many more tasks that pull from MediaWiki-supplied content; they're unkind to the DB backend, and my slice doesn't have enough RAM for memcached or Squid to be particularly helpful when dealing with a high request-per-minute rate. Apache processes fill 256MB pretty quickly when they pile up waiting for a response from MySQL, which gets slower as it gets starved for RAM. (I can't afford to spend more on RC server performance unless RC can pay for itself.) If scraping and list navigation is the ultimate goal, something much more lightweight can be provided, one that doesn't even touch MySQL. --Short Circuit 15:40, 3 May 2009 (UTC)
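To illustrate what a lightweight, database-free stand-in for the task could look like, here is a hypothetical sketch, not anything actually deployed on Rosetta Code: a small Python CGI program that serves canned, paginated results from memory, which task solutions could fetch and scrape instead of hitting Google or the MediaWiki/MySQL backend. Every name and parameter below is invented for illustration.

    #!/usr/bin/env python3
    # Hypothetical stand-in "search" endpoint: serves canned, paginated
    # results from memory, so scrapers never touch MySQL or MediaWiki.
    import cgi

    RESULTS = ["Result %d" % i for i in range(1, 51)]  # static data only
    PER_PAGE = 10

    form = cgi.FieldStorage()
    try:
        page = max(1, int(form.getfirst("page", "1")))
    except (TypeError, ValueError):
        page = 1

    start = (page - 1) * PER_PAGE
    chunk = RESULTS[start:start + PER_PAGE]

    # CGI header, blank line, then a simple HTML list a scraper can parse.
    print("Content-Type: text/html\n")
    print("<html><body><ol start='%d'>" % (start + 1))
    for item in chunk:
        print("<li>%s</li>" % item)
    print("</ol>")
    if start + PER_PAGE < len(RESULTS):
        print('<a href="?page=%d">next</a>' % (page + 1))
    print("</body></html>")

A task solution would then request ?page=1, scrape the list items, follow the "next" link, and repeat, which exercises the same "user input plus multi-page answers" pattern without involving any third-party terms of service or the wiki's database.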