Rosetta Code:Village Pump/Fight spam
This is one of many discussion threads concerning Rosetta Code.
Summary
On dealing with spam
Discussion
How can we fight the spam that is attacking RC these days? I don't much like the blacklisting of netblocks suggested elsewhere, since it could also block ordinary users. I spotted a brand-new spam user after posting to Named Argument, and discovered I can't do much more than say I've seen it. I've also noticed that the names of these spammers follow a pattern which could be identified (but will likely change...) --ShinTakezou 11:17, 29 June 2009 (UTC)
- On the Tcler's Wiki, we block problem netblocks from making updates (well, we actually show the spammer a preview page but never commit it to the database, which is a nicer solution since they think they've spammed you successfully), but without seeing the logs for the addresses those spam accounts are created from, it's hard to tell whether that will work here. It's a fairly stupid spammer, though, since external links are all nofollow-marked. Maybe simple techniques will work for now, plus visibly blocking that netblock from creating new users. —Donal Fellows 13:51, 29 June 2009 (UTC)
- I didn't want to block the IPs because we had previously had a problem with an IP collision with a legitimate user. I'm not really sure what else I can do. We do have a CAPTCHA, but maybe it's not good enough. --Mwn3d 13:57, 29 June 2009 (UTC)
- Since I think it's not robotic spam, I can't see that a CAPTCHA would help. —Donal Fellows 14:13, 29 June 2009 (UTC)
- Yeah, and I wouldn't suggest turning off anonymous edits, because we've had a recent surge of legitimate anonymous editors (and some people would probably think that was inconvenient). We may just have to keep up the old-fashioned delete-and-block strategy. --Mwn3d 14:25, 29 June 2009 (UTC)
- Gah. Drop off the face of the planet for a weekend and come back to another spam influx. It could very well be robotic spam if they have a human being sign up the account; CAPTCHAs are only presented for anonymous edits, account creation, and login failures. Those settings have worked well for us for the better part of two years. Roboticizing after account creation was an eventuality, but it depended on someone deciding that RC was a big enough target to go the extra steps. (And extra steps are something the spam economic model tends to avoid; they'd rather hit more weak targets than fewer higher-profile ones.) I'm not going to have time to tweak the server settings for a few days, at least. In the meantime, let's watch to see if the problem is going to be bad enough to warrant significant attention. (Unless they've broken reCAPTCHA, it's roughly 1:1 manual labor, which is uneconomic for spammers.) If need be, it might be possible to do a halfway block: rather than an outright ban on a user or IP, force all edits from them to go through reCAPTCHA. But that will likely require modding an extension, which I don't have time for right now. --Short Circuit 16:17, 29 June 2009 (UTC)
- I don't know if it's possible, but we could deny accounts containing "buy" in their names. --Guga360 16:35, 29 June 2009 (UTC)
- If the accounts are manually created, this won't gain you much. As soon as the spammer gets the error message, he'll just change the account name to something that works. A better idea would be to special-case edits that add hyperlinks and demand a captcha for those even from logged-in users. That would stop bots adding links while not affecting normal users much (few legitimate edits contain external links, so having to solve a captcha in those cases would not be much of a burden). You could also maintain a whitelist of URLs not protected by captchas (e.g. everything on wikipedia.org) to minimize the impact on legitimate edits (a configuration sketch appears at the end of this thread). --Ce 09:10, 30 June 2009 (UTC)
- For Wikipedia there's the special wp: link domain. —Donal Fellows 11:01, 30 June 2009 (UTC)
- Don't give an error message to the spammy accounts. Just silently fail to commit any changes they make. (Better would be giving them their own view of the world, but that's more work.) —Donal Fellows 11:13, 30 June 2009 (UTC)
- Then diagnosing and resolving false positives would be a PITA. --Short Circuit 14:55, 30 June 2009 (UTC)
- If links trigger captchas, then the bots will just post raw URLs. I've seen that one before... --Short Circuit 14:55, 30 June 2009 (UTC)
- A raw URL is a link in the wiki, so naturally it should trigger the captcha, too. --PauliKL 09:14, 2 July 2009 (UTC)
- Test: http://m-w.com/ http://news.google.com/ http://slashdot.org --Short Circuit 16:02, 2 July 2009 (UTC)
- Odd. That must have been added during some upgrade since the site started; MW didn't use to do that. Another test: ht tp://broken.com http:// broken.com http://odd-fish . com --Short Circuit 16:02, 2 July 2009 (UTC)
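Ce's link-triggered captcha idea above maps directly onto settings the ConfirmEdit extension already exposes. A minimal LocalSettings.php sketch, assuming the option names ConfirmEdit used in this era (the whitelisted domain is only an example):

```php
# Challenge only edits that add new external links; skip the captcha
# when the added link matches the whitelist regex.
$wgCaptchaTriggers['edit']   = false; // don't challenge ordinary edits
$wgCaptchaTriggers['create'] = false; // ...or plain page creation
$wgCaptchaTriggers['addurl'] = true;  // do challenge edits adding URLs
$wgCaptchaWhitelist = '#^https?://([a-z0-9-]+\.)?wikipedia\.org(/|$)#i';
```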
Upgrades
It may be about time for another MW upgrade while we're messing around under the hood. We're on 1.13.3 and they're up to 1.15.0. --Mwn3d 17:23, 2 July 2009 (UTC)
- I was going to do it last weekend, along with ImplSearchBot and fixing my desktop machine, but I instead spent the whole weekend cleaning in search of a missing $500 MSRP phone... --Short Circuit 19:03, 2 July 2009 (UTC)
- Updating WP would be good too. That's just hitting a button right? --Mwn3d 19:05, 2 July 2009 (UTC)
- Yes, if the FTP port was open, and if I was running an ftpd. The WP upgrade mechanism is stupid that way. July 4th is going to get in the way, but I'll see what I can do about it this weekend. --Short Circuit 22:21, 2 July 2009 (UTC)
Timing
The creation of each of those accounts requires some form of manual attention. Keep an eye out for a pattern in when they appear. For someone to go to that much work to spam a site like this is rather odd. --Short Circuit 00:26, 1 July 2009 (UTC)
- We're seeing the same sort of spam at the erights.org wiki, also a MediaWiki; if you want to do analysis looking there as well might be useful. (Feel free to help with the deleting, of course :-) ) --Kevin Reid 00:42, 1 July 2009 (UTC)
Looks like CAPTCHAs don't work
We're still getting spammed even with annoying levels of CAPTCHAs. Looks like this is some ass doing it manually or they've broken reCAPTCHA, though the fairly low rate of spamming indicates that this is probably manual. Time to ban some netblocks from doing updates to the database, given that nuking from orbit isn't an option. (When spam is a problem, there's no point trying half-measures first. They won't work. Spammers are the scum of the earth and have a financial incentive to boot.) —Donal Fellows 11:18, 1 July 2009 (UTC)
- Read this. It seems likely to be a manual effort in an attempt to create landing pages. Banning netblocks isn't really going to help, as Tor makes for an easy workaround. At this point, I'm thinking of either utilizing an RBL blacklist or coming up with a Bayes-based edit filter built on the ConfirmEdit extension (a sketch of the scoring follows below). (Ham gets marked via MediaWiki's patrol mechanism, while spam gets marked by page deletion.)
- The other thought is that spammers are putting manual effort into creating landing pages for email campaigns and the like. We could conceivably #REDIRECT the spam pages to a common target page for the time being. --Short Circuit 18:01, 1 July 2009 (UTC)
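For the Bayes-based edit filter floated above, here is a toy sketch of the scoring half only. The token counts are illustrative; in the scheme described they would be built from patrolled ("ham") and deleted ("spam") revisions, and the ConfirmEdit integration is omitted entirely:

```php
<?php
// Illustrative ham/spam token counts (placeholder data, not real logs).
$ham  = array( 'task' => 40, 'language' => 35, 'output' => 30 );
$spam = array( 'buy' => 50, 'cheap' => 45, 'pills' => 40 );

// Crude naive-Bayes score: sum the log-odds of each token in the edit
// (with add-one smoothing), then squash to a 0..1 probability.
function spamScore( $text, $ham, $spam ) {
    $hamTotal  = array_sum( $ham );
    $spamTotal = array_sum( $spam );
    $logOdds = 0.0;
    preg_match_all( '/[a-z]+/', strtolower( $text ), $m );
    foreach ( $m[0] as $w ) {
        $pS = ( ( isset( $spam[$w] ) ? $spam[$w] : 0 ) + 1 ) / ( $spamTotal + 2 );
        $pH = ( ( isset( $ham[$w] )  ? $ham[$w]  : 0 ) + 1 ) / ( $hamTotal + 2 );
        $logOdds += log( $pS / $pH );
    }
    return 1.0 / ( 1.0 + exp( -$logOdds ) );
}

// Edits scoring above a threshold would be forced through reCAPTCHA
// (the "halfway block" mentioned earlier) rather than rejected outright.
echo spamScore( 'buy cheap pills', $ham, $spam ) > 0.9 ? "challenge\n" : "allow\n";
```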
- Neither blacklists nor Bayesian filters are very effective ways to filter spam, and they create false positives. Good spam filtering is based on what the spammers are actually selling: their contact information (e-mail address, web address, etc.). I would think there are only one or a few spammers who bother to manually create pages here on Rosetta Code, so it should be possible to add their contact information to the spam filter manually. --PauliKL 09:59, 2 July 2009 (UTC)
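Filtering on what the spam advertises is what core MediaWiki's $wgSpamRegex already does: any edit whose text matches the regex is refused. A sketch with placeholder addresses (not real spammer data):

```php
# Reject any edit containing the spammers' advertised contact details.
# The domains and address below are placeholders, not real entries.
$wgSpamRegex = '/(example-pills\.example|cheap-meds\.example|spammer@example\.com)/i';
```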
- You could also try simply turning off the creation of new accounts for a while (e.g., a couple of weeks) to encourage the spammers to go elsewhere. The number of new genuine users turned off by this is probably going to be quite small, and the problem does at least seem to be confined to user pages. —Donal Fellows 13:23, 2 July 2009 (UTC)
- They're still using identifiable names that could be caught by a regex like "[0-9]+\s*buy" (a hook sketch follows below); if stopped this way they can certainly change their approach, but if these are landing pages, the username presumably has to follow an easily recognizable pattern, and changing it means changing the links they advertise elsewhere, so I don't think it's a bad way to fight them. I like the idea of the silent failure and the honeypot page, even though I haven't the slightest idea how it could be done in MediaWiki. --ShinTakezou 13:42, 2 July 2009 (UTC)
- That was the approach I had in mind. Need time to implement it. Shouldn't take too long, but it'll require learning how to extend the ConfirmEdit extension. --Short Circuit 15:47, 2 July 2009 (UTC)
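A hedged sketch of the username check, using the AbortNewAccount hook that MediaWiki of this era provided (it was later removed in 1.27); the regex is the one suggested above and would need updating as the spammers adapt:

```php
# LocalSettings.php: veto account names matching the spammy pattern.
$wgHooks['AbortNewAccount'][] = 'rcRejectSpammyUsername';

function rcRejectSpammyUsername( $user, &$message ) {
    if ( preg_match( '/[0-9]+\s*buy/i', $user->getName() ) ) {
        $message = 'This username is not allowed.';
        return false; // abort account creation
    }
    return true; // allow other hooks to run
}
```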
Couldn't?
Couldn't CAPTCHA be disabled for already registered and "tested" users? --ShinTakezou 15:03, 2 July 2009 (UTC)
- That's where we were earlier this week. I might reduce the captcha conditions this weekend if the spam doesn't drop off. At any rate, I'm going to have to mod things a bit to deal with spammers, and it might be possible to drop the CAPTCHAs altogether at that point. --Short Circuit 15:46, 2 July 2009 (UTC)
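ConfirmEdit already supports this via the skipcaptcha right; combined with core's autoconfirm thresholds it exempts established accounts. A minimal sketch (the thresholds here are illustrative, not RC's actual values):

```php
# "Tested" users: accounts at least 4 days old with 10 edits.
$wgAutoConfirmAge   = 86400 * 4; // seconds
$wgAutoConfirmCount = 10;
# Let them edit without ever seeing a captcha (sysops and bots
# already have this right by default).
$wgGroupPermissions['autoconfirmed']['skipcaptcha'] = true;
```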
Done?
It looks like it's done for now. Did anyone do anything or did they just give up? --Mwn3d 18:25, 9 July 2009 (UTC)
Block anon users from uploading images?
We've just had another influx of spam, this time from (a) non-logged-in user(s?) with many IP addresses, all within the USA. (Yay for http://ip-lookup.net/ which makes confirming this sort of thing much easier.) Analyzing the spammer's technique leads to the conclusion that they're putting the bulk of their spam in an image and then wrapping a page around it. While there are many parts that should be blocked, I'd suggest the big one is to block uploading of new images so that only logged-in users can do it; hardly any legitimate images (well, any at all?) are ever uploaded by anonymous users (unlike contributions to solving tasks), so blocking won't hurt site growth much. It will also make it easier to clean things up; only a single page will need clearing instead of several. Blocking non-logged-in users from all page creation might be a nice extension, if possible, but it's more intrusive (and more likely to encourage irritating changes of tactics on the part of the scumbag gits spammers). –Donal Fellows 10:50, 8 February 2011 (UTC)
- I know how you feel about the scumbag gits spammers, and support your call for upload restrictions. However, these particular scumbag gits then went on to create a user account, which originally left me confused until I saw that Mwn3d had been cleaning their crap earlier this morning. --Paddy3118 13:39, 8 February 2011 (UTC)
- FWIW, I did change the settings on uploads. MW now requires registered, autoconfirmed credentials. --Michael Mol 13:42, 16 February 2011 (UTC)
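For reference, the change described above corresponds to a few lines in LocalSettings.php (a sketch; the exact group layout on RC may differ):

```php
# Only autoconfirmed accounts may upload files.
$wgGroupPermissions['*']['upload']             = false;
$wgGroupPermissions['user']['upload']          = false;
$wgGroupPermissions['autoconfirmed']['upload'] = true;
```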
Possible spam
- User talk:RepairToolbox
- File:Dog hiking backpacks 4877.jpg irrelevant image, submitter was blocked some time ago
- File:AP Specialties - CPP-3086 - Lowe Promotions 4662.jpg irrelevant image, submitter was blocked some time ago
Qualifying questions
A long time ago I registered with an online quantum random bit generator service. I remember they had a qualifying question for registering new users: http://random.irb.hr/signup.php
I suppose they have a pool of questions that are randomly picked.
We could use something similar, but with programming-related questions. It could even be the subject of a Rosetta Code task. --Grondilu 21:50, 20 November 2012 (UTC)
PS. A new user candidate could choose a question about his preferred language. Here is a simple example, assuming the user is a Perl6 adept:
"Which virtual machine hosts a very famous implementation of Perl6, and has the same name as a very talkative kind of bird?" --Grondilu 22:04, 20 November 2012 (UTC)
- RC uses an extension based on the SimpleCaptcha MediaWiki extension. Show me some options that build on SimpleCaptcha, and we can look at changing up the captchas. Keep in mind that many of RC's users are not very good with English. --Michael Mol 22:32, 20 November 2012 (UTC)
- Ahh, I failed to realise that the captcha system is part of the MediaWiki software, and not some custom code that we might easily tweak to fit our needs. My bad. --Grondilu 02:26, 21 November 2012 (UTC)
- Isn't QuestyCaptcha a standard feature of the MediaWiki captcha system? I think it would be a much better alternative than an image-based captcha. It is both easier for normal users and more effective against spammers. There are captcha-breaking services in India, where thousands of workers routinely provide answers to captchas. It is very easy for them to type in the text they see in an image (especially since they have lots of practice). But even a simple question such as "what is the name of this forum" may be impossible for them to answer if they only see the captcha. --PauliKL
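QuestyCaptcha does ship with ConfirmEdit, so the idea above needs only configuration. A sketch in the era's loading style, using Grondilu's sample question; a real pool would hold many questions, ideally one set per language:

```php
require_once "$IP/extensions/ConfirmEdit/ConfirmEdit.php";
require_once "$IP/extensions/ConfirmEdit/QuestyCaptcha.php";
$wgCaptchaClass = 'QuestyCaptcha';
$wgCaptchaQuestions[] = array(
    'question' => 'Which virtual machine hosts a very famous implementation '
                . 'of Perl6, and has the same name as a very talkative kind of bird?',
    'answer'   => 'parrot',
);
```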