Recently in Weekly update Category

Change of hosting, other updates

| No Comments
Rosetta Code is now hosted at Linode, and has roughly twice the physical server capacity that it had at Slicehost. That extra space allowed me to properly tune MySQL, as well as improve usage of memcached and install php5-xcache. I'm still not using Squid, though I was able to pull 5-7Mb/s worth of pages when using HTTP KeepAlives.

ImplSearchBot has been down since Labor Day, and will likely remain down until it is replaced.  More on that later.

You might have noticed either that the blog was down for much of the week following Labor Day, or, alternately, that the blog looks significantly different.  For a combination of performance and security reasons, I've migrated the Rosetta Code blog from WordPress to Movable Type.  Old posts have formatting issues.  I may go back and correct them as I have time, but none of the traffic data I have suggests that it would be worthwhile.

The Rosetta Code planet is not being updated at the moment, but that will be rectified this weekend.  Hopefully, I will also be able to finish the infrastructure to provide faster and simpler access to the data that ImplSearchBot depends on.  And I will likely write a couple more blog posts.

News, notes and plans

| 2 Comments
ImplSearchBot is running again on a maintained basis.  It wasn't buggy in itself, but its traffic pattern did not mesh well with the massive increase in traffic that came from StumbleUpon.  Load averages of 20-50 do not make a server with four logical processors happy.

The initial fix for ImplSearchBot was to change its behavior to pause periodically whenever the server's load average exceeded a threshold value.  When I saw that Rosetta Code visitors were continuing to receive HTTP 500 errors, I dug into matters a bit more deeply.

Part of the problem was using FastCGI to handle the traffic load; its behavior and the related problems are documented in my previous post.  The solution available to me was to switch from FastCGI back to mod_php.  Conveniently, mod_php doesn't time out within any time frame that the average Web user will notice.  Inconveniently, mod_php will not work with Apache's mpm_worker (Would someone mind making mod_php and extensions thread-safe?), so the server is still stuck with a separate copy of all of PHP's static runtime data for each process, rather than sharing between threads.  So, memory-wise, we're no more efficient than with FastCGI and php-cgi.

Also inconveniently, it required additional configuration to limit the number of concurrent Apache processes running; the default settings for mpm_prefork on Ubuntu don't see any problem with having twenty-five or so concurrent processes running.  Within ten minutes of restarting Apache with the new settings, the server bogged down under enough concurrent processes that it became totally unresponsive to external stimuli; the last thing I could get out of the console before a hard reboot was an OOM message regarding MySQL.  To this end, I configured Apache in a way that should limit the number of concurrent processes that get spawned.

Then we ran into a problem with search engine crawler bots.  
They were requesting pages so quickly that they were either causing too many Apache threads to be spawned, or, after I reconfigured Apache, filling the "waiting clients" queue, causing legitimate users to have to wait an undue amount of time.  I tested with my browser, and my browser timed out.

First to bat was our old nemesis, Yahoo! Slurp.  I long suspected Slurp of ignoring robots.txt, or at least the Crawl-delay directive, as the primary culprit of most reports I've had of HTTP 500 errors turned out to be a large number of rapid-fire requests from Slurp's user agent.  Well, I've finally discovered that Slurp doesn't exactly ignore Crawl-delay.  Rather, rosettacode.org was getting hit by five concurrent crawls by Slurp bots coming from different IP addresses.  Whatever value I set for Crawl-delay was effectively being divided by five on average, and nothing prevents those bots from all requesting different pages within the same half-second time span.

I increased Crawl-delay.  If Slurp is identified as the cause of problems again, I won't have much of a choice but to disallow it entirely.

Next to bat was 80legs.  Their distributed crawl system, clever as it is, was hitting the server with a request every second.  Their crawler bot does not support Crawl-delay, but they responded quickly to my request for throttling.

Finally, I had to slow down Twiceler, which, while not as heavy as Slurp or 80legs, was still heavy.  Twiceler supports Crawl-delay.  So do Google's indexer, Bing's and a fair number of others.  I added Crawl-delay to the global section of robots.txt.

Things seem to be running fairly snappy, now.  Not as good as they could, but it's running.

While all these were happening, I received three different offers of hosting: one from someone in #rosettacode whom I don't know very well, one from someone in #perl whom I don't know at all, and one from Boise On-Call IT, which is a small company run by someone I know both socially and professionally.  
For now, I'm trying to work things out with Boise On-Call IT services (there's a potential for technical incompatibility; Rosetta Code has had some heavy configuration done to be able to run reasonably well in a small footprint, and more is needed, if only for the sake of aiding expandability).  If that doesn't work out, I'll revisit the other offers, as well as consider the possibility of moving to a larger dedicated host, but that's going to cost.

I've not yet found good commercial colocation facilities local to me (Grand Rapids, Michigan), and running things out of my home carries with it a large number of unpleasant problems with infrastructure quality and cost complications.

Finally, I still need to tune MySQL.  A large portion of it is sitting in swap, and that needs to be fixed.  As with Apache, MySQL's default settings are more suitable for a system with far more memory than the 256MB slice Rosetta Code is running on currently.  I've also discovered   If I can tune MySQL to not require so much RAM, then it's conceivable that memcached might be able to have more than 3MB of its allocated 64MB in physical memory, rather than swap...

With the system running as well as it is at the moment (admittedly, it feels a tad slower than, say, a couple weeks ago), I've changed things with ImplSearchBot's run schedule.  Rather than running every four hours as it did as recently as two weeks ago, or even once per day as I rescheduled during the StumbleUpon influx, it's running continually.  Every time it finishes its cycle, it starts over.  With the way it's currently coded, and with the current non-internal server load, every page that ImplSearchBot regularly touches is being updated within an hour and a half.  
Future updates should make that a worst-case scenario, with the normal case being on the inside of fifteen minutes, or even within a minute if there's nothing else to do.

One of the most common and recommended things to do for MediaWiki is to use Squid as an accelerator cache.  Unfortunately, that's not really an option for Rosetta Code, as it requires patching and doing a custom build of Squid for full effectiveness, and it's possible that the site may need to move to a server where custom builds of such software aren't an option.  There's also the question of compatibility with other software packages which don't support the Vary HTTP header, etc.

Another problem I've witnessed in MediaWiki is Monobook's usage of PHP-driven CSS and JavaScript files.  Install Firebug, navigate to any page on Rosetta Code's wiki, and watch the transferred files as you do a full refresh.  Anything that pulls from index.php requires the server to process your request with PHP, which means hitting the database again, which blocks the Apache process, which either (pre-mpm-reconfigure) spawns another Apache process, potentially causing overuse of swap, or (post-mpm-reconfigure) holds up the client request queue longer.  I don't mind Common.css so much; that's rather important whenever things need to change.  The other requests for PHP-provided styling and client-side scripting currently return empty files, which means database utilization without providing any utility to most end-users.  (The returned files may be edited on a per-user basis in their preferences.)

It's been suggested (and I've even broached the subject myself in the past) that Rosetta Code move away from MediaWiki.  
While it could potentially ease the pain of some of our current problems, I know of no other affordable CMS with as strong a core developer base, as strong an install base, as familiar an editing interface, as smooth an upgrade path history, as strong a modding community, or as long-viewed in probable funding and continued development.  In other words, while it can be a pain to bend to our needs, it's at least stable, codewise, and I'm usually more than happy to bend things to do what they weren't originally intended to do.

However, there is one piece of software that's currently public-facing that I would like to do without.  I want to get rid of WordPress.  Yes, it has a large developer and modding community.  Yes, it likely has a long future as a software package.  However, its upgrade path is painful, it's computationally expensive to run (it hasn't had a good built-in cache, and the previous plugin we were using was abandoned by its developer), and I'd like to find something better.  Most promising is Movable Type; I like the idea of serving up static files from disk unless things are actually being posted.

Finally, one of the barriers to getting and maintaining better hosting is lifting slightly; within a few months, Rosetta Code should start being able to sell books derived from its content.  The plan is to allow users to request a book with content chosen based on a rule set, have that book be generated programmatically, and have the PDF and printed version be available at the same time: the PDF as a free download, and the printed version via a Print-On-Demand service.  Money gained through sales via the POD service would pay for the manual component of producing the books (I never found a POD service I could comfortably fully automate things with), as well as pay for upgraded hosting and other improvements to the site. (It would be nice, for example, to have a paid server admin, or backend developer, or...) 
Additionally, the ruleset that chose the book's content would be saved, and the book would be re-released in subsequent editions periodically.  I might even be able to automate submitting the PDF versions to archive.org.
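
The first fix described in this post, having ImplSearchBot pause whenever the server's load average exceeds a threshold, can be sketched roughly as follows. This is a minimal illustration rather than the bot's actual code: the function name, the threshold, and the pause length are hypothetical, and it assumes the bot runs on the same Unix-like host as the wiki, so that the local load average is the one that matters.

```python
import os
import time


def wait_for_quiet_server(threshold, pause_seconds=30.0,
                          read_load=lambda: os.getloadavg()[0],
                          sleep=time.sleep):
    """Block until the 1-minute load average is at or below `threshold`.

    `read_load` and `sleep` are injectable so the loop can be exercised
    without a real server under load. Returns the number of pauses taken.
    """
    pauses = 0
    while read_load() > threshold:
        sleep(pause_seconds)  # back off and let the server catch up
        pauses += 1
    return pauses
```

A bot would call something like `wait_for_quiet_server(4.0)` before each batch of page edits. Note that this only keeps the bot from adding to an existing overload; it does nothing about load caused by crawlers or visitors.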

Planet Rosetta Code

| No Comments
It's about time to announce this.  Rosetta Code has a planet.  Its focus is on programming and things generally related to Rosetta Code.  If you have a blog, leave a note with an RSS or Atom feed URL, and I'll look at including it.

ImplSearchBot source code

| No Comments
ImplSearchBot's source code has long been available on the wiki, but it's now also available via Git:
git://implsearchbot.git.sourceforge.net/gitroot/implsearchbot
As Mwn3d has pointed out, it's not as efficient as it could be.  Additionally, there's a growing list of bugs that I haven't had time to address and fix.

What is ImplSearchBot?

From a practical standpoint, ImplSearchBot maintains the lists of tasks not implemented in the various languages found on Rosetta Code.

From a simplified technical standpoint, ImplSearchBot takes two categories, and finds those pages which are in one category, but not in another, and builds lists of those pages.  For our purposes, those categories are Programming Tasks and Programming Languages.  An additional set of categories is used to organize the lists, the Omit Categories.  There is one omit category per language, and each of those categories is used to identify what tasks are inappropriate for that particular language.  Those tasks aren't removed from the listing, but merely identified as being less likely to be accomplished.

ImplSearchBot also calculates a language's penetration rate, as well as tracks the total number of languages and tasks.

Finally, it's been keeping its cache under version control, so that other scripts and processes may trigger on differences in cache versions, rather than querying MediaWiki's API.
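
The category comparison described above is essentially a set difference. Here is a small sketch of the idea, not ImplSearchBot's actual code: the function names and sample page titles are made up, and "penetration rate" is assumed to mean the fraction of tasks that have an implementation in the language, since the post does not define it.

```python
def unimplemented_tasks(all_tasks, language_pages, omitted=frozenset()):
    """Split the tasks missing from a language's category into two lists.

    Returns (todo, unlikely): `unlikely` holds missing tasks that are also
    in the language's omit category. They stay listed, just flagged.
    """
    missing = set(all_tasks) - set(language_pages)
    unlikely = missing & set(omitted)
    todo = missing - unlikely
    return sorted(todo), sorted(unlikely)


def penetration_rate(all_tasks, language_pages):
    """Fraction of tasks implemented in the language (assumed definition)."""
    tasks = set(all_tasks)
    if not tasks:
        return 0.0
    return len(tasks & set(language_pages)) / len(tasks)
```

Pages in the language category that are not tasks (for example, a page about a compiler) fall out of both calculations, because only members of the task category are ever considered.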

So...What?

Honestly, I need your help.  I wrote ImplSearchBot months ago to deal with a recurring question: "What tasks haven't been implemented in my language?"  Unfortunately, the problem is more complicated than that, and as it's seen more and more use, it's needed more and more work.  I simply don't have enough time available to make all the changes and fix all the problems, much less time to do that and fix and improve other areas of the site.

So, what I'd like you to do is grab a copy of the source code, look at it, see what it does, figure out how it might be made more efficient, more effective, more useful, more flexible, and send me a patch.

Some more updates

| No Comments
The Village Pump was moved to the Rosetta Code namespace. Nothing was discarded, but any bots may need to be updated.

A page was created for documenting how to programmatically interact with Rosetta Code, as I found the documentation elsewhere to be lacking in structure and specificity, and felt that having a place on Rosetta Code to reorganize and annotate it would benefit other users as much as or more than myself. That should hopefully begin to fill out this week.

The sidebar for the site was compacted, with some links dropped, some links moved, and some links given shorter names. The main wiki page was given a makeover with a rewritten introduction, the movement of the feeds above the link boxes, and the return of the Recent Changes and Blog Comments feeds.

4th of July Site Updates

| No Comments
Over the weekend of July 4th, Rosetta Code went through some long-overdue software updates, and some interesting site layout changes.  Here is a detailed summary of those changes.

Rosetta Code TODO list

| 1 Comment
I've been averaging 70-90 hours of work per week for a few weeks now, and a lot of work on Rosetta Code has had to be put off.  So here's a TODO list of accumulated things that need to be done on Rosetta Code, and have either been in the works for a long time or have been planned. (And by "planned", I mean that some of these things have been ideas that just won't go away.)

At the top of the list are things that are either urgent or already in the "pipeline":
  • Mod Alias needs to be set up on RC, as mod_rewrite's '+' handling became broken in the switch from mod_php to fcgi, and we're back in the bad old days of C++ pointing to C.  At least there's something we can do about it now...
  • ImplSearchBot needs to be fixed.  It's editing almost 400 pages every four hours, when it only needs to be editing between one and ten, on average.
  • ImplSearchBot's Subversion repository (where it keeps the JSON caches of category contents) needs to be opened up for general consumption.
  • ImplSearchBot's Subversion repository needs to be abused to generate RSS feeds containing interesting events per language.
  • There are some bugs in the way Rosetta Code's syntax highlighting deals with leading whitespace.  Details are in the relevant Village Pump page.  There also appears to be a bug breaking Unicode support with at least some languages when dealing with the string "møøse".  Not sure why this would be.
  • Need to finish RC promo video.  Looking for suitable audio to sync.
  • Find out what causes Recent Changes RSS feed to spit out batches of duplicate items a couple times a week.
  • Rewrite the Rosetta theme from scratch.
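
The ImplSearchBot item in the list above (editing almost 400 pages per run when only a handful have changed) comes down to comparing what the bot is about to write with what it wrote last time. A minimal sketch of that check, assuming the bot keeps the previous run's page text in a dictionary keyed by title; the function name and data layout are hypothetical, not taken from the bot's source:

```python
def pages_needing_edit(previous, generated):
    """Return {title: text} for pages whose new text differs from last run.

    `previous` maps page title -> text saved by the last run (the cache);
    `generated` maps page title -> text produced by this run.
    Pages absent from the cache count as changed.
    """
    return {
        title: text
        for title, text in generated.items()
        if previous.get(title) != text
    }
```

Only the pages returned here would be submitted as edits; everything else is skipped, which also keeps the noise out of Recent Changes.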
Things that I want to start on:

I'd like to see a bit of a shift away from theoretical tasks to practical tasks, and move from explicitly contrasting languages to identifying where a language's abilities can be taken advantage of for things that programmers often need to do.

If anyone wants to give these a try (especially creating more tasks, creating RC promo material, or anything that requires a bot), go ahead, give it a shot!  I haven't had a whole lot of time of late.

(Yes, I know there are a lot of links; it comes from having a lot of proper nouns and other interesting concepts...)

Downtime resolved

| No Comments
I figured out what killed the server Saturday: it was ImageMagick.  The ALGOL 68 Dragon Curve animated GIF is fairly large.  Someone went to visit the GIF's page on RC, and MediaWiki ran 'convert' to create a thumbnail.

The command MediaWiki ran (approximately; the paths and filenames have been changed to protect the server, and because I tested it locally) was:
convert -background white -size 781 ALGOL_68_Dragon_curve_animated.gif -coalesce -thumbnail '781x599!' -depth 8 out.gif
That command, with that data, takes several seconds to run on my Phenom 9650 desktop at home, and my machine has a darn sight more CPU available to programs than anything running within RC's VPS slice.  As a result, when MediaWiki ran convert, it, via mod_php, via Apache, spent several seconds trying to generate a thumbnail.  It would have eventually finished, except that whoever was looking at the page got bored, and refreshed.  Several times.  When I discovered that the server was having issues, the server load average was up around 16.  RC's slice usually hovers between 0.00 and 0.15.

Three things have been implemented to fix this problem.  First, I've switched from mod_php to FastCGI, with the assistance of some of the folks in #mediawiki on FreeNode.  As a result, we get back an HTTP 500 ISE when commands run by the server take longer than expected to complete. (For whatever reason, mod_php was simply hanging.)  Second, I've turned off thumbnails.  None of the images on the site are large enough to make them worthwhile, for the time being.  I'll look into backgrounding convert processes in a way that doesn't take down the site, but that's going to be low-priority for now.  The third bit came as part of my debugging.  All external commands run by MediaWiki will now be logged, to leave some sort of trace for when this kind of problem happens again.
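
The two ideas at the end of this post, logging every external command and not letting a slow one run forever, can be illustrated outside of MediaWiki. The sketch below is in Python rather than PHP and is not what was configured on the server; the function name, logger name, and timeout values are hypothetical.

```python
import logging
import subprocess
import sys

log = logging.getLogger("external_commands")


def run_logged(argv, timeout_seconds):
    """Run an external command with a time limit, logging what happened.

    Returns True only if the command exited with status 0 in time.
    On timeout, subprocess.run kills the child before raising.
    """
    log.info("running: %s", argv)
    try:
        result = subprocess.run(argv, timeout=timeout_seconds,
                                capture_output=True)
    except subprocess.TimeoutExpired:
        log.warning("timed out after %ss: %s", timeout_seconds, argv)
        return False
    log.info("exit status %d: %s", result.returncode, argv)
    return result.returncode == 0


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stand-in for a thumbnail command: a child process that exits at once.
    print(run_logged([sys.executable, "-c", "pass"], timeout_seconds=30))
```

With a wrapper like this, a thumbnail job that takes too long leaves a log line and a clean failure instead of a pile of hung processes, though repeated page refreshes could still start new jobs.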

Downtime

| 1 Comment
Sorry about the downtime. I'm not entirely certain what the cause was, but the fix has been to switch from mod_php to fcgi, and correct a few caching settings in MediaWiki. My best guess is that a high-traffic site linked to (or embedded) the Algol 68 Dragon Curve animated GIF thumbnail, which was apparently causing several hung instances of ImageMagick's convert tool. I'll know more when I have time to look at the logs and analytics data tonight.

Weekly Update

| No Comments
Sorry I missed last week's update, but I sort of had a good excuse (my birthday was Tuesday :P).

Activity

Mwn3d added a few Prolog examples, and created some pages relating to Prolog and its implementations. NevilleDNZ added some more ALGOL 68 examples, yet again :). I added a few more Modula-3 examples just a few minutes ago. ShinTakezou was also active in a handful of tasks and discussions on assorted task talk pages.

There were a lot of minor edits to fix lang tags, and the old <code> tags work properly now (that is, they function the same way they do in standard HTML).

New Tasks & Misc.

Kevin Reid created a new task called Find Unimplemented Tasks, and provided an E example.

Mwn3d created the new task Count Programming Examples, which counts the number of examples a given task has.

Short Circuit created a "Proggit" button to submit certain tasks to the programming subreddit.

Still Needed

Check the Village Pump for the latest discussion of topics regarding Rosetta Code and what's currently going on/needing attention.

About this Archive

This page is an archive of recent entries in the Weekly update category.
