The New York Times has an article on the growth of spam. Interesting facts include: 9/10 of all email is spam, spam source identification is nearly useless due to botnet spam senders, and image based spam (emails which consist of an image only) is on the rise.
Estimates of the cost of spam are almost certainly far too low, because they do not account for the cost in time lost by people.
The image based spam which is currently penetrating many filters should be catchable with a more sophisticated application of machine learning technology. For the spam I see, the rendered images come in only a few formats, which would be easy to recognize via a support vector machine (with RBF kernel), neural network, or even nearest-neighbor architecture. The mechanics of setting this up to run efficiently are the only real challenge. This is the next step in the spam war.
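As a rough illustration of the nearest-neighbor option, here is a minimal sketch. It assumes each rendered image has already been scaled to a fixed size and flattened into a list of grayscale values; the function name and the toy data are made up for the example, and a real filter would need far more examples and a faster lookup.

```python
from math import dist

def nearest_label(query, labeled_examples):
    """Return the label of the stored example closest to `query` (1-NN)."""
    best_label, best_distance = None, float("inf")
    for pixels, label in labeled_examples:
        d = dist(query, pixels)  # Euclidean distance between pixel vectors
        if d < best_distance:
            best_label, best_distance = label, d
    return best_label

# Toy 2x2 "images" flattened to 4 grayscale values in [0, 1] (made-up data).
training = [
    ([0.9, 0.9, 0.1, 0.1], "spam"),
    ([0.8, 1.0, 0.0, 0.2], "spam"),
    ([0.1, 0.2, 0.9, 0.8], "ham"),
    ([0.0, 0.1, 1.0, 0.9], "ham"),
]

print(nearest_label([0.85, 0.95, 0.05, 0.1], training))
```

The linear scan over all stored examples is exactly the part that would need engineering to run efficiently at mail-server scale.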
The response to this system is to make the image based spam even more random. We should (essentially) expect to see Captcha spam, and our inability to recognize captcha spam should persist as long as the vision problem is not solved. This hopefully degrades the value of spam to the spammers, but it may not drive the value of spam to zero.
Solutions beyond machine learning may be necessary. One simple economic solution is to transfer from first time sender to receiver a small amount (10 cents?) in a verifiable manner. If the receiver classifies the email as spam then the charge repeats on the next receipt, and otherwise it goes away.
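The charging rule can be sketched as a small piece of bookkeeping. This is only an illustration under stated assumptions: the class and method names are hypothetical, the 10 cent fee is the guess from the text, and the "verifiable" transfer itself (the hard part) is not modeled at all.

```python
class FirstContactLedger:
    """Tracks which (sender, receiver) pairs still owe a first-contact fee."""

    def __init__(self, fee_cents=10):
        self.fee_cents = fee_cents
        self._accepted = set()  # pairs whose mail the receiver has accepted

    def fee_due(self, sender, receiver):
        """Fee in cents that the sender's next email to this receiver carries."""
        return 0 if (sender, receiver) in self._accepted else self.fee_cents

    def record_verdict(self, sender, receiver, is_spam):
        """A non-spam verdict makes the charge go away; a spam verdict leaves it."""
        if not is_spam:
            self._accepted.add((sender, receiver))

ledger = FirstContactLedger()
print(ledger.fee_due("alice@example.org", "bob@example.org"))
ledger.record_verdict("alice@example.org", "bob@example.org", is_spam=False)
print(ledger.fee_due("alice@example.org", "bob@example.org"))
```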
There are several difficulties with this approach: How do you change a huge system in heavy use which no one controls? How do you deal with mailing lists? These problems appear surmountable. For example, we could extend the mail protocol to include a payment system (using the “X-” lines) and use the existence of a payment as a feature in existing spam-or-not prediction systems. Over time, this feature may become the most useful feature encouraging every legitimate email user to offer a small payment with the first email to a recipient.
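To make the "payment as a feature" idea concrete, here is a minimal sketch that reads a payment amount from a message header and turns it into a number a spam classifier could consume. The header name X-Payment-Cents is purely hypothetical (no such standard header exists as far as I know), and the sketch does not verify that a payment was actually made.

```python
from email import message_from_string

PAYMENT_HEADER = "X-Payment-Cents"  # hypothetical header name, not a standard

def payment_feature(raw_message):
    """Return the offered payment in cents, or 0 if absent or malformed."""
    msg = message_from_string(raw_message)
    value = msg.get(PAYMENT_HEADER)  # header lookup is case-insensitive
    if value is None:
        return 0
    try:
        cents = int(value.strip())
    except ValueError:
        return 0  # treat unparseable amounts as "no payment offered"
    return max(cents, 0)

raw = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "X-Payment-Cents: 10\n"
    "Subject: hello\n"
    "\n"
    "First contact.\n"
)
print(payment_feature(raw))
```

Old mail software simply ignores unknown "X-" lines, which is what makes this kind of extension backwards compatible.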
I’ve already seen Captcha spam. It was a stock pump-n-dump spam last week sometime. I delete my spam pretty quickly, or I’d post the image somewhere. It was pretty hard to read, so I suspect its value as a spam was really low.
Regarding payment, check out Kuipers, et al’s Zero-sum payment system for spam control.
A student in my department, Thede Loder, left to start a company based on this idea: boxbe. He published the basic idea in Loder, Van Alstyne, and Wash, “An Economic Answer to Unsolicited Communication” (published in ACM’s conference on EC).
If you think recognizing rendered images in spam is a trivial problem, you should do it and write a paper. Often real-world problems are less trivial than at first glance.
Dan
I’ve gotten a few of those.
It’s been interesting watching them evolve over the past few months. At first it was just an image, with the strange Burroughsian cut-up anti-Bayes text (it’s fun to google the phrases to find out where they come from–small blogs). Last month they started varying the baseline of the letters and stepping up the random text. Recently they’ve engaged full captcha mode, as illustrated above. (I don’t think there is any OCR-based antispam currently in use; I assume they’re using their own anti-captcha software to test it against for future-proofing, or something like that.)
In the case of this particular spam, the captcha changes have definitely made it less effective. Originally there were actual exhortations to buy a specific stock, with a five-day price forecast, quotes from analysts, etc. Now, I’m not even sure what they want me to do.
I’m almost leery to mark it as spam, since the only content is human-sounding text. (Almost perfect human-sounding text, as a matter of fact; if a friend wrote a poem in that form and sent it to me, I’d want it to come through.) I think it has already increased the amount of false-positives in my email… yarr. Ah well, I’m sure it’ll be a solved problem soon enough, right?
I’ve certainly noticed the increase (I’ve heard it attributed to botnets several times) but I’ve also noticed some other trends with regard to spam other than just the increase in image-based attempts. I’ve gotten lots of messages that contain literary excerpts as the actual text of the message (and the actual advertisement in an attached image), with the text probably retrieved from Project Gutenberg or other similar sources. I received a big chunk of George Bernard Shaw’s “Geneva” once, and have seen subject lines and message bodies involving what are clearly characters’ names and excerpts from Isaac Asimov’s “Foundation” series. This makes life difficult for anyone hoping to base a discriminant function on recognizing sensible natural language (and leads to the rather humorous predicament of training our filters to reject some of the better examples of written language our culture has produced).
I’ve also seen the 21st century cousin of an old style magazine clipping ransom note used in spam to spell out words. It amused me the first time I saw it (in that screenshot, too, you can see text taken from Anton Chekhov’s “The Schoolmaster”, available on Project Gutenberg).
Joshua Goodman gave an invited talk at ACL (or HLT or something) this year on spam. There are ways of getting around the money problem that he talked about… I’m sure I don’t remember them all, but the basic setup was that instead of charging the sender real money, the sender can be charged computation time. I.e., if your email client believes that an email might be spam, it can send back a challenge like “factor 3472984732805372052738021”… the numbers should be hard enough that a spammer couldn’t reasonably solve hundreds of thousands of them, but easy enough that someone sending a reasonable amount of email could perform the factorization in a minute or so. The whole setup around “challenge” problems seems attractive because many of the challenges can be performed behind the scenes. Only when all of those fail would you request that the sender solve a captcha in order to allow their mail to be sent.
It was more convincing in the talk, but the point is that there’s a lot you can do without completely overhauling the email protocol.
One problem with challenges (either economic or computational) is that the spammers rarely ever use their address in the “From” field since they’re only pushing a message, not engaging in communication. Instead, they use other addresses from their email address database or make up usernames at an arbitrarily chosen domain.
To see why this is a problem for challenge-based solutions, consider a spammer who uses netizen A’s address to send a piece of spam to netizen B’s address. B would send a challenge to A asking that they factor a number or pay a nominal fee, and would ignore the spam until there is a valid response. However, A (who may not even know B) would get an email asking for a response.
A can’t just filter these challenges, otherwise the system won’t work. Furthermore, if the challenge email quotes the original piece of spam then the spammer has still successfully sent a message to A (though the quoted form might highlight the fact that B believes it to be spam).
Unless the challenges are sent out selectively there will be as much unsolicited email flying around the internet. To send the challenges out selectively you have to identify spam and so we’ve reduced the challenge-response scenario to identifying spam unless there’s something I’ve missed.
Perhaps I’m misunderstanding you, but I don’t see the problem here. A’s mail server gets a challenge from B with regards to message M, observes that A has not sent message M to B and chucks the challenge in the garbage. B’s mail server doesn’t get a response and so chucks the spam in the garbage. Sounds like everyone is happy at this point except the spammers.
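The check I have in mind could look something like the sketch below. It assumes A's mail server keeps a log of (message id, recipient) pairs for outgoing mail; the class and method names are hypothetical, and a real server would also need to expire old entries.

```python
class SentLog:
    """Remembers which messages this server actually sent, and to whom."""

    def __init__(self):
        self._sent = set()

    def record(self, message_id, recipient):
        """Call when an outgoing message leaves the server."""
        self._sent.add((message_id, recipient.lower()))

    def should_answer(self, message_id, challenger):
        """Answer a challenge only if we really sent that message to the challenger."""
        return (message_id, challenger.lower()) in self._sent

log = SentLog()
log.record("<m1@a.example>", "bob@b.example")
print(log.should_answer("<m1@a.example>", "bob@b.example"))      # genuine mail
print(log.should_answer("<forged@a.example>", "bob@b.example"))  # forged sender
```

Challenges for forged messages are dropped by the server without A ever seeing them.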
From the viewpoint of modifying the existing system, it is nice to avoid a multiround interaction so that mail transport agents from the old system continue to function. The simple way to do this is for the ‘new’ MTAs to (a) publish a public key and (b) encrypt the access information for first-time-send payments. Done properly this should imply that only the receiver can access the payment while preserving backwards compatibility with older systems.
These ideas aren’t particularly new—there are numerous papers proposing various schemes (including the ones pointed out above). What is new is that the mail system is nearing a point of collapse which provides an impetus for change.
I tend to agree that this is fundamentally an economic problem and not a machine learning one. I would suggest (although I’m sure I’m not the first) that the key isn’t to make an expiring micropayment, but rather a micropayment contract. That is, the sender agrees to pay some (likely nominal) fee *if* the user chooses to collect. One doesn’t collect on mails one actually wanted, and does collect on mails one didn’t. Collecting on non-spam will quickly lead to you not receiving mails. I believe white-lists remove the need for these contracts on every email.
I think this helps get over the unfortunate fact that spam is often in the eye of the beholder.
For reference: one of the primary “computational payment” schemes is hashcash. Their FAQ covers a lot of the issues involved here.
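For a feel of how such a scheme works, here is a simplified hashcash-style sketch: the sender searches for a counter that makes the hash of a stamp start with a given number of zero bits, and the receiver checks it with a single hash. This is not the actual hashcash stamp format; the "resource:counter" layout, the use of SHA-256, and the function names are assumptions made for the example.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest):
    """Count the zero bits at the start of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def mint(resource, difficulty_bits):
    """Sender side: brute-force a counter until the stamp hash is small enough."""
    for nonce in count():
        stamp = f"{resource}:{nonce}"
        digest = hashlib.sha256(stamp.encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return stamp

def verify(stamp, resource, difficulty_bits):
    """Receiver side: one hash, plus a check that the stamp names this recipient."""
    if not stamp.startswith(resource + ":"):
        return False
    digest = hashlib.sha256(stamp.encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits

stamp = mint("bob@example.org", 12)  # about 2**12 hashes on average
print(stamp, verify(stamp, "bob@example.org", 12))
```

Each extra bit of difficulty doubles the sender's expected work while the receiver's cost stays at one hash. Binding the stamp to the recipient address keeps a spammer from reusing one stamp for many recipients.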
If somebody I don’t already know sends me an email that contains mostly an image I classify it as spam. I’m not fond of non-ascii email anyways… 😉
Everyone’s happy except the spammers and the people who haven’t yet moved to the new mail systems. I suppose after I had received enough challenge emails from friends and colleagues I’d probably end up upgrading, if only to make the challenge “spam” stop.
I think John is right in his comment below, the situation has become critical. All it will take is one company or project to produce an easy to use solution that gains enough attention. This will hopefully force most ISPs, freemail services and mail clients to agree to start looking at similar sorts of ideas.
Not a Machine Learning or economic solution, but:
One attractive idea would be to force the use of some computationally costly operation for the sender, such as compression or encryption.
Tilting the computational burden towards the sender would be desirable of course, but even if the burden is balanced, spam senders would have to compress millions of messages whereas receivers would have to uncompress only a few hundred messages a day.
Also, lossless compression is such a mature technology that there is no reason not to use it really.
This also requires minimal change to existing protocols, as in a first stage, we could default to no encryption/compression and “upgrade” to more costly operations as available computational power increases. Synchronising such upgrades with major OS releases could make their adoption very fast.
I am personally skeptical of solutions which involve adding computational constraints on the message sender. I see little evidence that spam senders are computationally bounded relative to average senders—they have already acquired large botnets which could be put to work as a computation farm in addition to their current role.
The text is actually from Bulgakov’s “Master and Margarita”
It appears you’re correct. Hmm, I wonder what I searched for that led me to that conclusion. Maybe I was looking at another message (there’s no shortage of them, after all).