SpamAssassin Concepts

Machine Learning for Hackers

by Drew Conway and John Myles White · 10 Feb 2012 · 451pp · 103,606 words

’re going to build a system for deciding whether an email is spam or ham. Our raw data comes from the SpamAssassin public corpus, available for free download at http://spamassassin.apache.org/publiccorpus/. Portions of this corpus are included in the code/data/ folder for this chapter and will be used

…

” and “table” and count how often they occur in one type of document versus the other. To show how this approach would work with the SpamAssassin public corpus, we’ve gone ahead and counted the number of times the terms “html” and “table” occurred. Table 3-1 shows the results. Table
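The counting idea in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `count_terms` and the toy messages are hypothetical and are not taken from the book or from the corpus.

```python
import re
from collections import Counter


def count_terms(documents, terms):
    """Return the total number of occurrences of each term across documents."""
    totals = Counter()
    for text in documents:
        # Lowercase and keep alphabetic runs only, so "<html>" yields "html".
        token_counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for term in terms:
            totals[term] += token_counts[term]
    return {term: totals[term] for term in terms}


# Hypothetical toy messages standing in for the two document types.
spam_docs = ["<html><table>buy now</table></html>", "cheap html offer"]
ham_docs = ["meeting at noon", "the table in room 4 is booked"]

terms = ["html", "table"]
spam_counts = count_terms(spam_docs, terms)
ham_counts = count_terms(ham_docs, terms)
```

Comparing `spam_counts` with `ham_counts` gives the kind of per-class tally the excerpt describes.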

…

make our model a Naive Bayes classifier. Writing Our First Bayesian Spam Classifier As we mentioned earlier in this chapter, we will be using the SpamAssassin public corpus to both train and test our classifier. This data consists of labeled emails from three categories: “spam,” “easy ham,” and “hard ham.” As

…

.11] helo=mail.uptime.at) by usw-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17hsoM-0000Ge-00 for <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 07:19:47 -0700 Received: from [192.168.0.4] (chello062178142216.4.14.vie.surfer.at [62

…

.178.142.216]) (authenticated bits=0) by mail.uptime.at (8.12.5/8.12.5) with ESMTP id g7MEI7Vp022036 for <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 16:18:07 +0200 From: David H=?ISO-8859-1?B?9g==?=hn <dh@uptime.at> To

…

: <spamassassin-devel@example.sourceforge.net> Message-Id: <B98ABFA4.1F87%dh@uptime.at> MIME-Version: 1.0 X-Trusted: YES X-From-Laptop: YES Content-Type: text/

…

="US-ASCII” Content-Transfer-Encoding: 7bit X-Mailscanner: Nothing found, baby Subject: [SAdev] Interesting approach to Spam handling.. Sender: spamassassin-devel-admin@example.sourceforge.net Errors-To: spamassassin-devel-admin@example.sourceforge.net X-Beenthere: spamassassin-devel@example.sourceforge.net X-Mailman-Version: 2.0.9-sf.net Precedence: bulk List-Help: <mailto

…

:spamassassin-devel-request@example.sourceforge.net?subject=help> List-Post: <mailto:spamassassin-devel@example.sourceforge.net> List-Subscribe: <https://example.sourceforge.net

…

/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net

…

?subject=subscribe> List-Id: SpamAssassin Developers <spamassassin-devel.example.sourceforge.net> List-Unsubscribe: <https://example

…

.sourceforge.net/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net?subject=unsubscribe> List-Archive: <http://www.geocrawler.com

…

/redir-sf.php3?list=spamassassin-devel> X-Original-Date: Thu, 22 Aug 2002 16:19

…

cell phone? Get a new here for FREE! https://www.inphonic.com/r.asp?r=sourceforge1&refcode1=vs3390 _______________________________________________ Spamassassin-devel mailing list Spamassassin-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/spamassassin-devel Note The “null line” separating the header from the body of an email is part of the protocol
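The note about the "null line" suggests a simple way to separate header from body: split at the first empty line. The following is a minimal sketch, assuming the message is already available as a string; the function name `split_header_body` and the sample message are hypothetical.

```python
def split_header_body(raw_message):
    """Split a raw email into (header, body) at the first empty line."""
    # Normalize CRLF line endings so the separator is always "\n\n".
    normalized = raw_message.replace("\r\n", "\n")
    header, _, body = normalized.partition("\n\n")
    # If there is no empty line, partition leaves the body empty.
    return header, body


# Hypothetical sample message, not taken from the corpus.
sample = "From: a@example.com\nSubject: Hello\n\nFirst line.\n\nSecond paragraph.\n"
header, body = split_header_body(sample)
```

Only the first empty line is treated as the separator, so blank lines inside the body are preserved.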

…

and when. Unfortunately, such detailed email logs are not available to us in this exercise. Instead, we will again use the SpamAssassin public corpus, available for free download at http://spamassassin.apache.org/publiccorpus/. Though this data set was distributed as a means of testing spam classification algorithms, it also contains a

…

extract from the data. Clearly this is not ideal. Recall, however, that for this exercise we will be using only the ham messages from the SpamAssassin public corpus. If one receives a large volume of ham email messages from a certain address, then it may be that the user has a

…

hackers, and getting dirty with data is what we like! For this exercise, we will be focusing on only the ham email messages from the SpamAssassin public corpus. Unlike the spam classification exercise, here we are not concerned with the type of email, but rather with how each should be ranked

…

this exercise are tm, for extracting common terms from the email subjects and bodies, and ggplot2, for visualizing the results. Also, because the SpamAssassin public corpus is a relatively large text data set, we will not duplicate it in the data/ folder for this chapter. Instead, we will set

…

, we need a common character representation of the dates, which leads directly to the second reason for our suffering: there is considerable variation within the SpamAssassin public corpus in how the receival dates and times of messages are represented. Example 4-2 illustrates a few examples of this variation. Example 4
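One common way to cope with varying date representations is to try a list of candidate formats in turn. The sketch below assumes three formats; the first matches the dates visible in the header excerpt above, while the other two are hypothetical variations chosen for illustration. The book's own examples are elided here, so this is not a reconstruction of them.

```python
from datetime import datetime, timedelta

# Candidate formats, tried in order. Only the first is attested in the
# header excerpt; the others are assumed variants.
CANDIDATE_FORMATS = [
    "%a, %d %b %Y %H:%M:%S %z",
    "%d %b %Y %H:%M:%S %z",
    "%a, %d %b %Y %H:%M:%S",
]


def parse_received_date(text):
    """Return a datetime for the first matching format, or None."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


with_weekday = parse_received_date("Thu, 22 Aug 2002 16:18:07 +0200")
without_weekday = parse_received_date("22 Aug 2002 07:19:47 -0700")
unparseable = parse_received_date("not a date")
```

Returning `None` for unmatched strings makes it easy to find the formats that still need handling.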

…

: Mercedes-Benz G55 31.97963566 11/25/02 19:34 deafbox@hotmail.com Re: Men et Toil 34.7967621 10/10/02 13:14 yyyy@spamassassin.taint.org Re: [SAdev] fully-public corpus of mail available 53.94872021 10/9/02 21:47 quinlan@pathname.com Re: [SAdev] fully-public corpus

…

of mail available 29.48898756 10/9/02 18:23 yyyy@spamassassin.taint.org Re: [SAtalk] Re: fully-public corpus of mail available 44.17153847 10/9/02 13:30 haldevore@acm.org Re: From 25.02939914

…

the bishop 24.18504926 10/7/02 21:39 geege@barrera.org RE: The absurdities of life. 34.44120977 10/7/02 20:18 yyyy@spamassassin.taint.org Re: [SAtalk] Re: AWL bug in 2.42? 46.70665631 10/7/02 16:45 jamesr@best.com Re: erratum [Re: no matter

…

to be giving appropriately high ranks to emails from frequent senders, as is the case for outlier senders such as tomwhore@slack.net and yyyy@spamassassin.taint.org. Finally, and perhaps most encouraging, the ranker is prioritizing messages that were not present in the training data. In fact, only 12 out

…

a single data set. Comparing Algorithms Since we know how to use SVMs, logistic regression, and kNN, let’s compare their performance on the SpamAssassin data set we worked with in Chapters 3 and 4. Experimenting with multiple algorithms is a good habit to develop when working with real-world

…

Spam Classifier, Writing Our First Bayesian Spam Classifier writing classifier, Writing Our First Bayesian Spam Classifier, Defining the Classifier and Testing It with Hard Ham SpamAssassin public corpus, This or That: Binary Classification, Priority Features of Email spread, Standard Deviations and Variances, Standard Deviations and Variances squared error, The Baseline Model

…

and Installing R Packages R Project for Statistical Computing, R for Machine Learning roll call data repository for US Congress, How Do US Senators Cluster? SpamAssassin public corpus, This or That: Binary Classification, Priority Features of Email Twitter API, Hacking Twitter Social Graph Data which function, Converting date strings and dealing

Ubuntu 15.04 Server with systemd: Administration and Reference

by Richard Petersen · 15 May 2015

User and Host Access Header and Body Checks Controlling Client, Senders, and Recipients POP and IMAP Server: Dovecot Dovecot Other POP and IMAP Servers Spam: SpamAssassin Mail Filtering: Amavisd-new Mailing Lists: Mailman 7. FTP FTP Servers Available Servers FTP Users Anonymous FTP: vsftpd The FTP User Account: anonymous Anonymous FTP

…

-fs.target systemd-journald-dev-log.socket nss-lookup.target network-online.target time-sync.target postgresql.service mysql.service clamav-daemon.service postgrey.service spamassassin.service saslauthd.service dovecot.service Wants=mail-transport-agent.target network-online.target Conflicts=shutdown.target [Service] Type=forking Restart=no TimeoutSec=5min IgnoreSIGPIPE=no

…

-IMAP server (http://courier-mta.org) is a small, fast IMAP server that provides extensive authentication support including LDAP and PAM (Universe repository). Spam: SpamAssassin With SpamAssassin, you can filter sent and received e-mail for spam. The filter examines both headers and content, drawing on rules designed to detect common spam

…

messages. When they are detected, it then tags the message as spam, so that a mail client can then discard it. SpamAssassin will also report spam messages to spam detection databases. The version of SpamAssassin distributed for Linux is the open source version developed by the Apache project, located at http

…

://spamassassin.apache.org. There you can find detailed documentation, FAQs, mailing lists, and even a listing of the tests that SpamAssassin performs. Note: For dovecot IMAP server you can use dovecot-antispam plugin to implement spam detection

…

. SpamAssassin rule files are located at /usr/share/spamassassin. The files contain rules for running tests such as detecting the fake hello in

…

the header. Configuration files for SpamAssassin are located at /etc/spamassassin. The local.cf file lists system-wide SpamAssassin options such as how to rewrite headers

…

. The init.pre file holds spam system configurations. Server options, such as enabling SpamAssassin, are listed in the /etc/default/spamassassin file. Users can set their own SpamAssassin options in their .spamassassin/user_prefs file. Common options include required_score, which sets a threshold for classifying a message as

…

messages from certain users and domains, and tagging options that either rewrite or just add SPAM labels. Check the Mail::SpamAssassin::Conf man page for details. Configuring Postfix for use with SpamAssassin can be complicated. A helpful tool for this task is amavisd-new, an interface between a mail transport agent like

…

Exim or Postfix and content checkers like SpamAssassin and virus checkers. Check http://www.ijs.si/software/amavisd/ for more details. Mail Filtering: Amavisd-new See the Ubuntu Server Guide for information on

…

.com/stable/serverguide/mail-filtering.html On Ubuntu you can set up mail filtering using Amavisd-new, which invokes the ClamAV virus protection utility and SpamAssassin to filter mail. You can also use external filters such as opendkim for Sendmail and python-policy-spf for Postfix. Amavisd-new which calls filtering

…

opendkim or python-policy-spf (Postfix will use both), then Amavisd-new has the message scanned by ClamAV for viruses, followed by an analysis by SpamAssassin to see if it is spam. Only then does Amavisd-new allow the message to be placed in the inbox. To implement mail filtering

…

, be sure you have installed amavisd-new, spamassassin, and clamav, along with the external filters. sudo apt-get install amavisd-new spamassassin clamav-daemon sudo apt-get install opendkim postfix-policyd-spf-python Ubuntu also recommends that you install supporting applications

…

amavis to use clamav to scan files. sudo adduser clamav amavis sudo adduser amavis clamav Enable spamassassin by editing the spamassassin configuration file, /etc/default/spamassassin, and setting the ENABLED entry to 1. ENABLED=1 Then start spamassassin. sudo service spamassassin start You can then configure Amavisd-new using files in the /etc/amavis

…

/conf.d directory. To activate virus detection and spamassassin, edit the /etc/amavis/conf.d/15_content_filter_mode file and uncomment the lines for

…

virus detection and spamassassin as indicated by the comments. Ubuntu also recommends that you disable the bounce response for spam emails by setting the final_spam_destiny option in

…

, link1 lynx, link2, link1 M Mail, link1 Amavisd-new, link1 Dovecot, link1 IMAP, link1 lists, link1 mail filtering, link1 Mailman, link1 POP, link1 spam, link1 SpamAssassin, link1 Mail servers, link1 Postfix, link1 mailing lists Mailman, link1 Mailman, link1 main repository, link1 Man pages, link1 masquerading, link1 Metal as a Service (MAAS

…

Ubuntu Software Center, link1 unattended-upgrades, link1 Software & Updates, link1 Software Package Types, link1 Software updater, link1 sources.list, link2, link1 spam Amavisd-new, link1 SpamAssassin, link1 split DNS, link1 Squid, link1 cache, link1 security, link1 squid.conf, link1 ssh, link1 SSH authentication, link1 configuration, link1 OpenSSH, link1 Port Forwarding, link1

Machine Learning for Email

by Drew Conway and John Myles White · 25 Oct 2011 · 163pp · 42,402 words

, we’re going to build a system for deciding whether an email is spam or ham. Our raw data come from the SpamAssassin Public Corpus, available for free download at: http://spamassassin.apache.org/publiccorpus/. Portions of this corpus are included in the code/data/ folder for this chapter and will be used

…

” and “table” and count how often they occur in one type of document versus the other. To show how this approach would work with the SpamAssassin Public Corpus, we’ve gone ahead and counted the number of times the terms “html” and “table” occurred: Table 3-1 shows the results. Table

…

make our model a Naive Bayes classifier. Writing Our First Bayesian Spam Classifier As we mentioned earlier in this chapter, we will be using the SpamAssassin Public Corpus to both train and test our classifier. These data consist of labelled emails from three categories: “spam,” “easy ham,” and “hard ham.” As

…

.11] helo=mail.uptime.at) by usw-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17hsoM-0000Ge-00 for <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 07:19:47 -0700 Received: from [192.168.0.4] (chello062178142216.4.14.vie.surfer.at [62

…

.178.142.216]) (authenticated bits=0) by mail.uptime.at (8.12.5/8.12.5) with ESMTP id g7MEI7Vp022036 for <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 16:18:07 +0200 From: David H=?ISO-8859-1?B?9g==?=hn <dh@uptime.at> To

…

: <spamassassin-devel@example.sourceforge.net> Message-Id: <B98ABFA4.1F87%dh@uptime.at> MIME-Version: 1.0 X-Trusted: YES X-From-Laptop: YES Content-Type: text/

…

="US-ASCII” Content-Transfer-Encoding: 7bit X-Mailscanner: Nothing found, baby Subject: [SAdev] Interesting approach to Spam handling.. Sender: spamassassin-devel-admin@example.sourceforge.net Errors-To: spamassassin-devel-admin@example.sourceforge.net X-Beenthere: spamassassin-devel@example.sourceforge.net X-Mailman-Version: 2.0.9-sf.net Precedence: bulk List-Help: <mailto

…

:spamassassin-devel-request@example.sourceforge.net?subject=help> List-Post: <mailto:spamassassin-devel@example.sourceforge.net> List-Subscribe: <https://example.sourceforge.net

…

/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net

…

?subject=subscribe> List-Id: SpamAssassin Developers <spamassassin-devel.example.sourceforge.net> List-Unsubscribe: <https://example

…

.sourceforge.net/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net?subject=unsubscribe> List-Archive: <http://www.geocrawler.com

…

/redir-sf.php3?list=spamassassin-devel> X-Original-Date: Thu, 22 Aug 2002 16:19

…

cell phone? Get a new here for FREE! https://www.inphonic.com/r.asp?r=sourceforge1&refcode1=vs3390 _______________________________________________ Spamassassin-devel mailing list Spamassassin-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/spamassassin-devel Since we are focusing on only the email message body, we need to extract this text from the

…

when. Unfortunately, such detailed email logs are not available to us in this exercise. Instead, we will again be using the SpamAssassin Public Corpus, available for free download at: http://spamassassin.apache.org/publiccorpus/. Though this data set was distributed as a means of testing spam classification algorithms, it also contains a

…

extract from the data. Clearly this is not ideal. Recall, however, that for this exercise we will be using only the ham messages from the SpamAssassin Public Corpus. If one receives a large volume of ham email messages from a certain address, then it may be that the user has a

…

hackers, and getting dirty with data is what we like! For this exercise, we will be focusing on only the ham email messages from the SpamAssassin Public Corpus. Unlike the spam classification exercise, here we are not concerned with the type of email but rather with how each should be ranked

…

using in this exercise are tm, for extracting common terms from the email subjects and bodies, and ggplot2, for visualizing the results. Also, since the SpamAssassin Public Corpus is a relatively large text data set, we will not duplicate it in the data/ folder for this chapter. Instead, we will set

…

, we need a common character representation of the dates, which leads directly to the second reason for our suffering: there is considerable variation within the SpamAssassin Public Corpus in how the dates and times at which messages were received are represented. Example 4-2 illustrates a few examples of this variation. Example 4-2

…

: Mercedes-Benz G55 31.97963566 11/25/02 19:34 deafbox@hotmail.com Re: Men et Toil 34.7967621 10/10/02 13:14 yyyy@spamassassin.taint.org Re: [SAdev] fully-public corpus of mail available 53.94872021 10/9/02 21:47 quinlan@pathname.com Re: [SAdev] fully-public corpus

…

of mail available 29.48898756 10/9/02 18:23 yyyy@spamassassin.taint.org Re: [SAtalk] Re: fully-public corpus of mail available 44.17153847 10/9/02 13:30 haldevore@acm.org Re: From 25.02939914

…

to be giving appropriately high ranks to emails from frequent senders, as is the case for outlier senders such as tomwhore@slack.net and yyyy@spamassassin.taint.org. Finally, and perhaps most encouraging, the ranker is giving priority to messages that were not present in the training data. In fact, only 12

The Debian Administrator's Handbook, Debian Wheezy From Discovery to Mastery

by Raphaël Hertzog and Roland Mas · 24 Dec 2013 · 678pp · 159,840 words

which have been introduced by subsequent updates. Depending on the urgency, it can also contain updates for packages that have to evolve over time… like spamassassin's spam detection rules, clamav's virus database, or the daylight-saving rules of all timezones (tzdata). In practice, this repository is a subset of

…

come to the rescue (see sidebar GOING FURTHER Old package versions: snapshot.debian.org). Example 6.3. Installation of the unstable version of spamassassin # apt-get install spamassassin/unstable GOING FURTHER The cache of .deb files APT keeps a copy of each downloaded .deb file in the directory /var/cache/apt/archives

…

external to the email servers. Milters were initially introduced by Sendmail, but Postfix soon followed suit. QUICK LOOK A milter for Spamassassin The spamass-milter package provides a milter based on SpamAssassin, the famous unsolicited email detector. It can be used to flag messages as probable spams (by adding an extra header

Webbots, Spiders, and Screen Scrapers

by Michael Schrenk · 19 Aug 2009 · 371pp · 78,103 words

schrenk <me@server.com> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="0-349883719-1140382526=:7581" Content-Transfer-Encoding: 8bit X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail2.server.com X-Spam-Level: X-Spam-Status: No, score=0.9 required=17.0 tests=HTML

…

the text returned by the mail server consists of headers, which tell the mail client the path the message took, which services touched it (like SpamAssassin), how to display or handle the message, to whom to send replies, and so forth. These headers include some familiar information such as the subject
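Headers such as X-Spam-Status are easy to read programmatically. The following is a minimal sketch, assuming the "Yes/No, score=… required=…" layout seen in the excerpt above; the function name `parse_spam_status` is hypothetical, and the sample value is the truncated one shown in the excerpt.

```python
import re

# Matches the leading verdict and the two numeric fields of the header value.
STATUS_PATTERN = re.compile(
    r"\s*(Yes|No),\s*score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)"
)


def parse_spam_status(value):
    """Parse an X-Spam-Status value into a dict, or return None if it does not match."""
    match = STATUS_PATTERN.match(value)
    if match is None:
        return None
    return {
        "is_spam": match.group(1) == "Yes",
        "score": float(match.group(2)),
        "required": float(match.group(3)),
    }


status = parse_spam_status("No, score=0.9 required=17.0 tests=HTML")
```

Anything after the `required=` field, such as the list of tests, is ignored by this sketch.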

Practical OCaml

by Joshua B. Smith · 30 Sep 2006

for such a useful utility). Running It You need a very large corpus of email to train a Bayesian classifier like this one. Luckily, the “spamassassin” developers have released a public corpus of email that can be used to develop antispam utilities. (A command-line client for this module is presented

…

at the end of this chapter.) This corpus, broken up into ham and spam of varying stripes, can be downloaded from http://spamassassin.apache.org/publiccorpus/. Running the code on these corpuses gives results not nearly as good as Paul Graham says he got, but they are still

Producing Open Source Software: How to Run a Successful Free Software Project

by Karl Fogel · 13 Oct 2005

often comes with some built-in spam prevention features, but you may want to add some third-party filters. I've had good experiences with SpamAssassin (spamassassin.apache.org) and SpamProbe (spamprobe.sourceforge.net), but this is not a comment on the many other open source spam filters out there, some of

Data Science from Scratch: First Principles with Python

by Joel Grus · 13 Apr 2015 · 579pp · 76,657 words

, self.k) def classify(self, message): return spam_probability(self.word_probs, message) Testing Our Model A good (if somewhat old) data set is the SpamAssassin public corpus. We’ll look at the files prefixed with 20021010. (On Windows, you might need a program like 7-Zip to decompress and extract
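For orientation, a smoothed Naive Bayes spam score can be sketched as follows. This is not the book's implementation: the names `tokenize`, `train`, and `spam_score`, the smoothing constant `k=0.5`, the equal-prior assumption, and the toy messages are all assumptions made for illustration.

```python
import math
import re
from collections import defaultdict


def tokenize(message):
    """Return the set of distinct lowercase words in a message."""
    return set(re.findall(r"[a-z0-9']+", message.lower()))


def train(labeled_messages, k=0.5):
    """Map each word to (P(word | spam), P(word | ham)) with additive smoothing k."""
    counts = defaultdict(lambda: [0, 0])  # word -> [spam count, ham count]
    n_spam = n_ham = 0
    for text, is_spam in labeled_messages:
        if is_spam:
            n_spam += 1
        else:
            n_ham += 1
        for word in tokenize(text):
            counts[word][0 if is_spam else 1] += 1
    return {
        word: ((s + k) / (n_spam + 2 * k), (h + k) / (n_ham + 2 * k))
        for word, (s, h) in counts.items()
    }


def spam_score(word_probs, message):
    """Return P(spam | message), assuming equal priors and independent words."""
    words = tokenize(message)
    log_spam = log_ham = 0.0
    for word, (p_spam, p_ham) in word_probs.items():
        if word in words:
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
        else:
            log_spam += math.log(1.0 - p_spam)
            log_ham += math.log(1.0 - p_ham)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Hypothetical toy training set, not drawn from the corpus.
training = [
    ("cheap pills online", True),
    ("cheap watches", True),
    ("lunch meeting today", False),
    ("project meeting notes", False),
]
word_probs = train(training)
spammy = spam_score(word_probs, "cheap pills")
hammy = spam_score(word_probs, "lunch meeting")
```

Working in log space avoids underflow when many word probabilities are multiplied together.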

Pulling Strings With Puppet: Configuration Management Made Easy

by James Turnbull · 1 Jan 2007

-repo": baseurl => "http://repos.testing.com/fedora/$lsbdistrelease/", descr => "Testing.com's YUM repository", enabled => 1, gpgcheck => 0, } } class debian { $disableservices = ["hplip", "avahi-daemon", "rsync", "spamassassin"] service { $disableservices: enable => false, ensure => stopped, } } In Listing 4-7, we’ve created two classes; the first is fedora, which loads whenever a node returns

The Boy Who Could Change the World: The Writings of Aaron Swartz

by Aaron Swartz and Lawrence Lessig · 5 Jan 2016 · 377pp · 110,427 words

your local computer and then “upstreams” them to your website. Finally, while researching Webmake, the Perl CMS that generates pages like Jmason’s Weblog and SpamAssassin, I found a good bit of terminology for this. Some websites, the documentation explains, are fried up for the user every time. But others are

Higher-Order Perl: A Guide to Program Transformation

by Mark Jason Dominus · 14 Mar 2005 · 525pp · 149,886 words

Hands-On Machine Learning With Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

by Aurélien Géron · 13 Mar 2017 · 1,331pp · 163,200 words

Puppet 3 Beginner's Guide

by John Arundel · 16 Apr 2013 · 241pp · 43,073 words

Hands-On Machine Learning With Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

by Aurelien Geron · 14 Aug 2019