Suddenly Programming!

Mon, October 9, 2006

First, I offer my sincere apologies to any of you who received a "new comments" e-mail from onebee last week as a result of comment spam. What began months ago as a slapdash spam-detection scheme to weed out the occasional spam comment has gradually become more and more sophisticated, as attacks have increased. Until last week, it was blind to a certain kind of spam comment, and we got around 750 of that kind late Tuesday night. The system didn't recognize them as spam, so they were published to the site, which meant you may have received erroneous e-mail if they happened to be posted to a thread you were commenting on. Fortunately, this was a very easy type of spam to recognize and weed out, and I immediately shut it down. This is fortunate indeed, because we've been hit with approximately 8,000 more in the days since.

Which means I have been doing some programming, which is something I have basically avoided since last December when I was unceremoniously dismissed from my former position as a web developer. It took a few hours to get back in the swing of things, but since I do most of my programming by the seat of my pants anyway, it wasn't a huge adjustment.

Dealing with spam of any variety is a complex issue, which, simply, is how spam is able to exist. Spammers create messages that flow through normal communication channels and attempt to look like normal messages. To eradicate them, you have to strike a balance between clamping down on all communication (and possibly shutting off a few innocent messages) and opening up communication (and allowing spam in). At onebee, we're big on everyone having a voice, especially in the comments, which makes it a stickier wicket.

First of all, I decided that if the comment system detected a spam message, it should still store it in the database in case it's a false positive. I wouldn't want to lose someone's real comment just because they happened to use a bad keyword by accident. It just stores the comment with its publish flag turned off, so it won't show up on the site until I've had a chance to look at it and approve or delete it. I didn't want to go to comments that required a login or moderated comments because the first means more work for readers and the second means more work for me. So, my solution is a sort of auto-moderator that only flags the most spammish messages for my attention, and lets the rest through.
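For the curious, the "store it, but don't publish it" idea can be sketched in a few lines of Python. This is only an illustration, not onebee's actual code: the table layout, the column names, and every function name here are assumptions made up for the example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical comments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments ("
        "id INTEGER PRIMARY KEY, author TEXT, body TEXT, published INTEGER)"
    )
    return conn


def save_comment(conn, author, body, looks_like_spam):
    """Store every comment; suspected spam is kept with its publish flag off."""
    published = 0 if looks_like_spam else 1
    conn.execute(
        "INSERT INTO comments (author, body, published) VALUES (?, ?, ?)",
        (author, body, published),
    )
    conn.commit()
    return bool(published)


def visible_comments(conn):
    """Only published comments appear on the site."""
    rows = conn.execute(
        "SELECT author, body FROM comments WHERE published = 1 ORDER BY id"
    )
    return rows.fetchall()


def held_comments(conn):
    """Comments waiting for a human to approve or delete them."""
    rows = conn.execute(
        "SELECT id, author, body FROM comments WHERE published = 0 ORDER BY id"
    )
    return rows.fetchall()


def approve(conn, comment_id):
    """Publish a held comment that turned out to be a false positive."""
    conn.execute("UPDATE comments SET published = 1 WHERE id = ?", (comment_id,))
    conn.commit()
```

The point of the design is that a false positive costs nothing but a short delay: the comment is still in the table, and approving it is a one-field update.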

The trouble is how to educate that moderator. I gave it some keywords that occur far more often in spam than in normal comments, and I taught it to recognize someone posting a big list of links to other sites. (This represents the bulk of spam, because comment spammers don't want you and me to click their links; they know we're wise. They want Google to find their links while it's automatically scanning onebee's pages, on the assumption that Google knows onebee.com to be innocent and therefore it'll trust links from our pages and give them a higher ranking in its index.) But guessing what spammers will try next is never a perfect system, as we learned last Tuesday. Plus, I don't mind going through and deleting three or four spam messages now and then, but when it gets to thousands, I have to log in to the database and rip them out rather than deleting them one by one. (This is a pain in the ass.)
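A bare-bones version of that kind of moderator might look like the sketch below. The keyword list, the link threshold, and the function names are all placeholders invented for illustration; the real lists and limits aren't given here, and a real filter would need far more care.

```python
import re

# Hypothetical examples of words that show up far more often in spam.
SPAM_KEYWORDS = {"viagra", "casino", "poker"}

# Hypothetical threshold: more links than this counts as "a big list of links".
MAX_LINKS = 3

LINK_RE = re.compile(r"https?://", re.IGNORECASE)
WORD_RE = re.compile(r"[a-z']+")


def count_links(text):
    """Count how many http:// or https:// links appear in the comment."""
    return len(LINK_RE.findall(text))


def looks_like_spam(text):
    """Flag a comment if it uses a spam keyword or carries too many links."""
    words = set(WORD_RE.findall(text.lower()))
    if words & SPAM_KEYWORDS:
        return True
    return count_links(text) > MAX_LINKS
```

A comment with one link and no flagged words passes straight through; a comment with four links gets held for review. Matching on whole words rather than substrings keeps an innocent word that merely contains a keyword from tripping the filter.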

When I got home Saturday night, there were another hundred or so spam messages waiting to be deleted. As I was deleting them, I realized that new ones were popping up right that very instant! Since onebee gets low traffic on weekends, I quickly turned comments off site-wide for a few minutes while I finished deleting. Curious, I checked the web server logs to see what was going on. I saw that all of the evening's comments seemed to be originating from the same IP, and I also noted that whichever computer was retrieving those pages was not loading any images or anything. Normally, when someone visits a page, the server log will show multiple GET requests from their browser. (GET is where the "client" – your web browser – asks the server to send it a certain file.) The browser will GET the page itself, and then GET any individual images, JavaScript code, CSS documents, etc., that are referenced by code in that page. Visiting even the simplest page on onebee results in about ten lines in the server log. But in this case, it was just "GET /whatever/page" then "POST /whatever/page," over and over. (POST is when the client sends information back to the server through a web form, like when you submit the comments or do a search.) Interestingly, the offending spam machine seemed to stop requesting new pages once I shut off the comment entry form. I don't know if this is a coincidence or if it actually abandons a site once it stops finding comment forms there.

Two things happened here: one, I realized that I needed to come up with a stronger system, because I don't want to spend time deleting a hundred or so comments a day; two, I started thinking about spammer behavior. Up to now, I'd only been focusing on clues in the message (keywords, links, etc.); but I realized that's just a fraction of the information I get from a comment spammer. There are a lot of behavioral clues I'd been ignoring. I grepped through the last two days of server logs, and identified 80 or so IPs that had been sending lots of POSTs. Since we'd only had a handful of legitimate comments over that period, I assumed these were spammers. (The server log doesn't record the content of a POST, only that it happened. That's one of the reasons I use POSTs – I'm protecting your privacy, even from me!) I compared that list with an online resource which aggregates the IPs of reported spammers, and found about a dozen of my IPs were on that list. Aha! Let's block the fuckers!
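I did this step with grep, but the same tally is easy to sketch in Python. The example assumes the server writes Apache's Common Log Format (client IP first, request line in quotes); the function names and the cutoff of ten POSTs are made up for illustration.

```python
import re
from collections import Counter

# Matches the start of a Common Log Format line:
#   ip ident user [timestamp] "METHOD /path ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "([A-Z]+) (\S+)')


def posts_per_ip(lines):
    """Count POST requests per client IP; lines that don't parse are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == "POST":
            counts[match.group(1)] += 1
    return counts


def heavy_posters(lines, min_posts=10):
    """Return the IPs that sent at least min_posts POSTs (hypothetical cutoff)."""
    counts = posts_per_ip(lines)
    return sorted(ip for ip, n in counts.items() if n >= min_posts)
```

Feeding it the last two days of logs, for example with `heavy_posters(open("access.log"))`, gives a list of suspect IPs to compare against a list of reported spammers. The file name is a placeholder.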

The problem is, I want to go about blocking them in the right way. Completely shutting down their access to onebee.com is pretty drastic. What if they spoofed their IPs, or used dynamic ones that might be reassigned to other computers later? I'd hate for a legitimate reader to be turned away by an overzealous blocking policy. I decided rather than block, I would redirect: sending traffic from those IPs to an explanatory page so that if by some chance a real person were trying to get in, they'd have a way to help me fix it. This took about an hour of frustrated screwing around with Apache's mod_rewrite, which Apache's own documentation says is too complex to understand in one day. (Ah, the drawbacks of seat-of-the-pants web programming.) Having succeeded at bending rewrite to my will during the re-bee (read: I stumbled onto a solution that produced the desired result; God knows how clumsy it is), I figured I could fight my way through again. I nearly gave up hope a few times, but in the end I succeeded.

Of course this is just a list of the handful of spammers that happened to attack onebee.com in the last couple of days and show up on a listing of known spammers. I will probably have to be more restrictive than this if I want to make a dent. So, as of today, the comment system is storing the following information in a database if an incoming comment fails the spam test: the originating computer's IP, and the time it submitted the comment. Additionally, it is using this information to keep a running tally of how many comments from that IP have failed the spam test and how far apart they were submitted. For now, we're just monitoring. But I suspect that soon it will be necessary to automate the process of restricting access, and I'll use this information to block the IPs that submit the greatest number of spam comments and submit them the most often. (If a computer submits a lot of spam comments, or seems to submit them more than once a minute, it's reasonable to conclude that it's a baddie and shut it down.) I'm sure there are a few good ways to continue marginalizing the comment spam without creating undue restriction or making a lot of extra work for me. Maybe I'll crack down on IPs that load pages without loading images, or maybe I'll block any IP that submits even one spam comment, and create an automated process for removing false positives after the fact. For now, it's a "wait and see" approach; I'll devise my next move after I see how this last one works.
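To make the tally idea concrete, here is a rough Python sketch of what such bookkeeping could look like. It is an illustration under assumptions, not the system as built: the class name, the in-memory storage (the real thing lives in a database), and the limit of five failures are invented; only the once-a-minute figure comes from the reasoning above.

```python
from collections import defaultdict

MAX_FAILURES = 5          # hypothetical: "a lot" of spam comments from one IP
MIN_SECONDS_BETWEEN = 60  # spam arriving more than once a minute looks automated


class SpamTally:
    """Track when each IP submitted a comment that failed the spam test."""

    def __init__(self):
        self.times = defaultdict(list)  # ip -> list of timestamps in seconds

    def record_failure(self, ip, timestamp):
        """Remember that a comment from this IP failed the spam test."""
        self.times[ip].append(timestamp)

    def failures(self, ip):
        """How many comments from this IP have failed so far."""
        return len(self.times.get(ip, []))

    def shortest_gap(self, ip):
        """Smallest number of seconds between two failures, or None if under two."""
        stamps = sorted(self.times.get(ip, []))
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        return min(gaps) if gaps else None

    def should_block(self, ip):
        """Block on sheer volume, or on failures arriving too close together."""
        gap = self.shortest_gap(ip)
        too_many = self.failures(ip) >= MAX_FAILURES
        too_fast = gap is not None and gap < MIN_SECONDS_BETWEEN
        return too_many or too_fast
```

While the system is only monitoring, `should_block` would just be logged rather than acted on, which gives a chance to see how often it would have caught a real reader before trusting it.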

Anyway... that's what I spent my weekend on!
