ImplSearchBot is running again on a maintained basis. It wasn't buggy in itself, but its traffic pattern did not mesh well with the massive increase in traffic that came from StumbleUpon. Load averages of 20-50 do not make a server with four logical processors happy.

The initial fix for ImplSearchBot was to change its behavior to pause periodically whenever the server's load average exceeded a threshold value. When I saw that Rosetta Code visitors were continuing to receive HTTP 500 errors, I dug into matters a bit more deeply.

Part of the problem was using FastCGI to handle the traffic load, and its behavior and the related problems are documented in my previous post. The solution available to me was to switch from FastCGI back to mod_php. Conveniently, mod_php doesn't time out within any time frame that the average Web user will notice. Inconveniently, mod_php will not work with Apache's mpm_worker (would someone mind making mod_php and extensions thread-safe?), so the server is still stuck with a separate copy of all of PHP's static runtime data for each process, rather than sharing between threads. So, memory-wise, we're no more efficient than with FastCGI and php-cgi.

Also inconveniently, it required additional configuration to limit the number of concurrent Apache processes running; the default settings for mpm_prefork on Ubuntu don't see any problem with having twenty-five or so concurrent processes running. Within ten minutes of restarting Apache with the new settings, the server bogged down under enough concurrent processes that it became totally unresponsive to external stimuli; the last thing I could get out of the console before a hard reboot was an OOM message regarding MySQL. To this end, I configured Apache in a way that should limit the number of concurrent processes that get spawned.

Then we ran into a problem with search engine crawler bots.
They were requesting pages so quickly that they were either causing too many Apache processes to be spawned, or, after I reconfigured Apache, filling the "waiting clients" queue, causing legitimate users to have to wait an undue amount of time. I tested with my browser, and my browser timed out.

First to bat was our old nemesis, Yahoo! Slurp. I had long suspected Slurp of ignoring robots.txt, or at least the Crawl-delay directive, since the primary culprit behind most reports I've had of HTTP 500 errors turned out to be a large number of rapid-fire requests from Slurp's user agent. Well, I've finally discovered that Slurp doesn't exactly ignore Crawl-delay. Rather, rosettacode.org was getting hit by five concurrent crawls by Slurp bots coming from different IP addresses. Whatever value I set for Crawl-delay was effectively being divided by five on average, and nothing prevents those bots from all requesting different pages within the same half-second time span.

I increased Crawl-delay. If Slurp is identified as the cause of problems again, I won't have much of a choice but to disallow it entirely.

Next to bat was 80legs. Their distributed crawl system, clever as it is, was hitting the server with a request every second. Their crawler bot does not support Crawl-delay, but they responded quickly to my request for throttling.

Finally, I had to slow down Twiceler, which, while not as heavy as Slurp or 80legs, was still heavy. Twiceler supports Crawl-delay. So do Google's indexer, Bing's and a fair number of others. I added Crawl-delay to the global section of robots.txt.

Things seem to be running fairly snappily now. Not as well as they could, but they're running.

While all this was happening, I received three different offers of hosting: one from someone in #rosettacode who I don't know very well, one from someone in #perl who I don't know at all, and one from Boise On-Call IT, which is a small company run by someone I know both socially and professionally.
For now, I'm trying to work things out with Boise On-Call IT (there's a potential for technical incompatibility; Rosetta Code has had some heavy configuration done to be able to run reasonably well in a small footprint, and more is needed, if only for the sake of aiding expandability). If that doesn't work out, I'll revisit the other offers, as well as consider the possibility of moving to a larger dedicated host, but that's going to cost.

I've not yet found good commercial colocation facilities local to me (Grand Rapids, Michigan), and running things out of my home carries with it a large number of unpleasant problems with infrastructure quality and cost complications.

Finally, I still need to tune MySQL. A large portion of it is sitting in swap, and that needs to be fixed. As with Apache, MySQL's default settings are more suitable for a system with far more memory than the 256MB slice Rosetta Code is running on currently. I've also discovered that memcached has only 3MB of its allocated 64MB in physical memory, rather than swap. If I can tune MySQL to not require so much RAM, then it's conceivable that memcached might be able to keep more of it there.

With the system running as well as it is at the moment (admittedly, it feels a tad slower than, say, a couple of weeks ago), I've changed ImplSearchBot's run schedule. Rather than running every four hours as it did as recently as two weeks ago, or even once per day as I rescheduled it during the StumbleUpon influx, it's running continually. Every time it finishes its cycle, it starts over. With the way it's currently coded, and with the current non-internal server load, every page that ImplSearchBot regularly touches is being updated within an hour and a half. Future updates should make that a worst-case scenario, with the normal case being on the inside of fifteen minutes, or even within a minute if there's nothing else to do.

One of the most common and recommended things to do for MediaWiki is to use Squid as an accelerator cache.
Unfortunately, that's not really an option for Rosetta Code, as it requires patching and doing a custom build of Squid for full effectiveness, and it's possible that the site may need to move to a server where custom builds of such software aren't an option. There's also the question of compatibility with other software packages which don't support the Vary HTTP header, etc.

Another problem I've witnessed in MediaWiki is Monobook's usage of PHP-driven CSS and JavaScript files. Install Firebug, navigate to any page on Rosetta Code's wiki, and watch the transferred files as you do a full refresh. Anything that pulls from index.php requires the server to process your request with PHP, which means hitting the database again, which blocks the Apache process, which either (pre-mpm-reconfigure) spawns another Apache process, potentially causing overuse of swap, or (post-mpm-reconfigure) holds up the client request queue longer. I don't mind Common.css so much; that's rather important whenever things need to change. The other requests for PHP-provided styling and client-side scripting currently return empty files, which means database utilization without providing any utility to most end-users. (The returned files may be edited on a per-user basis in their preferences.)

It's been suggested (and I've even broached the subject myself in the past) that Rosetta Code move away from MediaWiki. While it could potentially ease the pain of some of our current problems, I know of no other affordable CMS with as strong a core developer base, as strong an install base, as familiar an editing interface, as smooth an upgrade path history, as strong a modding community, or as long-viewed in probable funding and continued development.
In other words, while it can be a pain to bend to our needs, it's at least stable, codewise, and I'm usually more than happy to bend things to do what they weren't originally intended to do.

However, there is one piece of software that's currently public-facing that I would like to do without. I want to get rid of WordPress. Yes, it has a large developer and modding community. Yes, it likely has a long future as a software package. However, its upgrade path is painful, it's computationally expensive to run (it hasn't had a good built-in cache, and the previous plugin we were using was abandoned by its developer), and I'd like to find something better. Most promising is Movable Type; I like the idea of serving up static files from disk unless things are actually being posted.

Finally, one of the barriers to getting and maintaining better hosting is lifting slightly; within a few months, Rosetta Code should start being able to sell books derived from its content. The plan is to allow users to request a book with content chosen based on a rule set, have that book be generated programmatically, and have the PDF and printed version be available at the same time. The PDF would be a free download, and the printed version would be sold via a Print-On-Demand (POD) service. Money gained through sales via the POD service would pay for the manual component of producing the books (I never found a POD service I could comfortably fully automate things with), as well as pay for upgraded hosting and other improvements to the site. (It would be nice, for example, to have a paid server admin, or backend developer, or...) Additionally, the rule set that chose the book's content would be saved, and the book would be re-released in subsequent editions periodically. I might even be able to automate submitting the PDF versions to archive.org.
Recently in implsearchbot Category
ImplSearchBot is disabled, and will continue to be disabled until I fix an urgent bug. As a result, the "Tasks Unimplemented in X" pages will not be updated again until it's dealt with. This is related to the 400%+ increase in normal load we've been seeing since Sunday.

The load average on the slice was as high as 22 when I checked it in response to an email. ISB itself won't drive the load average above 1; its operation is completely serial. If, however, the load average is already above 1 due to incoming traffic, it can push the load average above 2, which causes a nasty cycle that explodes the server load.
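One way to keep a serial bot like ISB from feeding that cycle is to have it hold off while the server is already busy. The following is only a minimal sketch in Python, not ISB's actual code: the function name, the threshold of 2.0, and the 30-second pause are all assumptions chosen for illustration, and `os.getloadavg` is only available on Unix-like systems.

```python
import os
import time


def wait_for_quiet_server(threshold=2.0, pause_seconds=30,
                          getloadavg=None, sleep=time.sleep):
    """Pause until the 1-minute load average drops below `threshold`.

    Returns the number of pauses taken. `getloadavg` and `sleep` are
    injectable so the loop can be exercised without a loaded server.
    The default values are illustrative assumptions, not tuned settings.
    """
    if getloadavg is None:
        getloadavg = os.getloadavg  # Unix only
    pauses = 0
    while True:
        one_minute = getloadavg()[0]  # (1-min, 5-min, 15-min) averages
        if one_minute < threshold:
            return pauses
        sleep(pause_seconds)
        pauses += 1
```

A bot built this way would call `wait_for_quiet_server()` before each page fetch, so its own requests are only added when the load average is below the chosen threshold.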
Update: See ImplSearchBot Fate and Replacement.
- User asks for a page
- Apache asks fcgid to run a MediaWiki PHP script. fcgid hangs briefly while waiting on the script to spawn, since there are already other processes waiting for CPU time.
- Script spawns, queries database, waits for a bit.
- Script spits out HTML content, Apache serves it up.
- User asks for another page.
- fcgid times out, terminates PHP script. MySQL transaction has to be aborted, user gets an HTTP 500 Internal Server Error message.
- User hits reload. See "User asks for a page" above.
- Server is taking a long time to return a result. User gets impatient.
- User hits reload, or opens another tab while the first is still loading. See "User asks for a page" above, with the added caveat that there's still another instance of a PHP script being waited on by fcgid and Apache; the user will have to wait a bit longer still.
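The feedback loop in the steps above can be illustrated with a toy model. This is a rough sketch under stated assumptions, not a measurement of the real server: time advances in fixed ticks, the server finishes a fixed number of requests per tick, any request that has waited `timeout_ticks` is dropped as an HTTP 500, and (optionally) every dropped request is immediately reloaded by the user. All names and numbers are hypothetical.

```python
from collections import deque


def simulate_reload_cycle(arrivals_per_tick, capacity_per_tick,
                          timeout_ticks, ticks, users_reload=True):
    """Toy model of timeouts plus reloads; returns (served, errors, backlog)."""
    queue = deque()  # each entry is how many ticks that request has waited
    served = 0
    errors = 0
    for _ in range(ticks):
        # New page requests arrive at the back of the queue.
        queue.extend([0] * arrivals_per_tick)
        # The server finishes as many of the oldest requests as it can.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
            served += 1
        # Everything still waiting ages by one tick; stale requests time out.
        still_waiting = deque()
        reloads = 0
        for waited in queue:
            waited += 1
            if waited >= timeout_ticks:
                errors += 1  # the user sees an HTTP 500
                if users_reload:
                    reloads += 1
            else:
                still_waiting.append(waited)
        # Each reload re-enters the queue as a brand-new request.
        still_waiting.extend([0] * reloads)
        queue = still_waiting
    return served, errors, len(queue)
```

In this model, a server with spare capacity never builds a queue, while an overloaded one accumulates a backlog that reloads make larger, since every timed-out request comes straight back instead of going away.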
Update: See ImplSearchBot Fate and Replacement.
ImplSearchBot's source code has long been available on the wiki, but it's now also available via Git:
git://implsearchbot.git.sourceforge.net/gitroot/implsearchbot

As Mwn3d has pointed out, it's not as efficient as it could be. Additionally, there's a growing list of bugs that I haven't had time to address and fix.