Monday, December 18, 2006

Fighting spam mail collectively

(Note: this post may describe an already existing technology. If it doesn't exist, then it's high time it did...)

As far as I know, there are two major techniques used by spam filters:

  1. List - the filter has a list of indicators that allow the classification of a message as spam. This list is typically updated periodically, and usually the users can add items to it when they mark a mail as spam. Outlook 2003's Junk Mail Filter is a good example, which, as far as I know, is primarily based on domains and email addresses.
  2. Adaptive - the system starts with an initial classification mechanism, sometimes as simple as allowing all emails. Then, whenever the user marks a mail as spam, the classifier is updated so that similar mails are caught in the future. These techniques are usually based on some machine learning algorithm, often a variant of the Bayes classifier. Their advantage is that they are more likely to discover spam mail of a completely new format. On the other hand, they may produce odd false negatives (spam that gets through) and even false positives (legitimate mail marked as spam).
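
To make the adaptive approach more concrete, here is a minimal sketch of a naive Bayes style filter in Python. It is only an illustration under my own assumptions: the class name `TinyBayesFilter`, the whitespace tokenisation and the add-one smoothing are all made up for this post and are not how any particular product works.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy adaptive filter (hypothetical): learns word counts from labelled mails."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Update the classifier with one mail the user labelled 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.mail_counts[label] += 1

    def spam_score(self, text):
        """Log-odds that the text is spam; a positive value means 'more likely spam'."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        # Prior, smoothed so that an untrained filter starts at exactly 0
        score = math.log((self.mail_counts["spam"] + 1) / (self.mail_counts["ham"] + 1))
        for word in text.lower().split():
            for label, sign in (("spam", 1), ("ham", -1)):
                counts = self.word_counts[label]
                total = sum(counts.values())
                # Add-one smoothing so unseen words do not zero out the estimate
                p = (counts[word] + 1) / (total + len(vocab) + 1)
                score += sign * math.log(p)
        return score

    def is_spam(self, text):
        return self.spam_score(text) > 0
```

An untrained instance scores every mail at 0 and therefore lets everything through, which matches the "allow all emails" starting point. Each call to `train` then shifts the word statistics a little, which is also why a handful of unlucky samples can produce the odd misclassifications mentioned above.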

The limitations of these methods are:

  1. The first one depends on how well your provider keeps up with new types of spam mail.
  2. The second one requires, for each type of new spam, a set of samples it can learn from. So you will inevitably need to mark some mails as spam until your classifier is correctly tuned.

What I'm proposing is that whenever you manually mark a mail as spam, the whole mail is sent to the provider. This would give the provider a huge number of recent spam mails, making it possible to update the filters in a matter of minutes. Then, each client that reads mail thereafter would already have a filter that knows how to catch these new types of spam (assuming the clients update frequently, such as once a day).
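
The flow could be sketched roughly like this. Everything here is hypothetical: the `Provider` and `Client` classes, the `min_reporters` threshold and the idea of matching on a hash of the normalised body are my own simplifications. A real provider would more likely retrain a proper classifier, since real spam rarely repeats word for word.

```python
import hashlib
from collections import defaultdict


def fingerprint(body):
    """Hash of the lower-cased, whitespace-normalised body (a deliberate simplification)."""
    normalised = " ".join(body.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class Provider:
    """Hypothetical provider: collects spam reports and publishes a blocklist."""

    def __init__(self, min_reporters=2):
        self.min_reporters = min_reporters
        self.reporters_by_print = defaultdict(set)

    def report_spam(self, user_id, body):
        # Remember which distinct users reported this message
        self.reporters_by_print[fingerprint(body)].add(user_id)

    def current_blocklist(self):
        # Only publish fingerprints reported by enough distinct users
        return {
            fp
            for fp, users in self.reporters_by_print.items()
            if len(users) >= self.min_reporters
        }


class Client:
    """Hypothetical mail client that reports spam and pulls filter updates."""

    def __init__(self, user_id, provider):
        self.user_id = user_id
        self.provider = provider
        self.blocklist = set()

    def mark_as_spam(self, body):
        self.provider.report_spam(self.user_id, body)

    def update_filter(self):
        self.blocklist = self.provider.current_blocklist()

    def is_spam(self, body):
        return fingerprint(body) in self.blocklist
```

The point of the sketch is the third user: once two other clients have reported a message, a client that never saw it before already filters it after its next `update_filter()` call.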

Problems:

  1. This could, in itself, generate a huge amount of traffic, just from the constant stream of data sent to the provider.
    1. Solution: Data could be sent to the provider in batches (once every few hours), and not necessarily in email format.
    2. Solution: If the clients are frequently updated, this shouldn't be a problem, because all clients will quickly have an updated filter and stop sending data to the provider.
  2. A spammer could buy several licenses and start flooding the provider with bogus reports, so that it generates completely useless filters.
    1. Solution: Data sent to the provider will be encrypted and include identifying information about the user (based on the license information).
    2. Solution: It should be fairly easy to detect such spamming clients and rule them out.
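
Two of these solutions can be sketched in a few lines of Python. Again, this is only an illustration: `ReportBatcher`, `suspicious_licenses`, the `send` callable and all the thresholds are names and numbers I made up. The detection rule (flag a license when few of its reports are corroborated by any other license) is just one simple heuristic, and it would not catch several colluding licenses that corroborate each other.

```python
from collections import defaultdict


class ReportBatcher:
    """Client side (hypothetical): queue reports and upload them in one batch per interval."""

    def __init__(self, send, interval_seconds=3 * 60 * 60):
        self.send = send  # assumed callable that uploads a list of reports
        self.interval_seconds = interval_seconds
        self.pending = []
        self.last_flush = None

    def add(self, report, now):
        if self.last_flush is None:
            self.last_flush = now  # the interval starts at the first report
        self.pending.append(report)
        if now - self.last_flush >= self.interval_seconds:
            self.flush(now)

    def flush(self, now):
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()
        self.last_flush = now


def suspicious_licenses(reports, min_reports=3, min_corroborated=0.5):
    """Flag licenses whose reports are rarely confirmed by anyone else.

    reports: iterable of (license_id, fingerprint) pairs.
    """
    licenses_by_print = defaultdict(set)
    prints_by_license = defaultdict(set)
    for license_id, fp in reports:
        licenses_by_print[fp].add(license_id)
        prints_by_license[license_id].add(fp)

    flagged = set()
    for license_id, prints in prints_by_license.items():
        if len(prints) < min_reports:
            continue  # too little evidence either way
        corroborated = sum(1 for fp in prints if len(licenses_by_print[fp]) > 1)
        if corroborated / len(prints) < min_corroborated:
            flagged.add(license_id)
    return flagged
```

Because every report carries the license it came from, the provider can simply drop the reports of flagged licenses before building the next filter update.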

The bottom line is that by pooling the spam reports of all users, it should be possible to create much more accurate spam filters, much faster.
