MaxFilter
I had a class assignment that required some kind of AI, web crawling or something similar. I wanted to try neural networks, so after some research I decided to build a spam filter.
The main source of inspiration I found was Lelia's postgraduate report (link), which took the same approach I wanted to try.
Intended Result
The main behavior I wanted from the program was to:
- Read my gmail mailbox
- Parse my email
- Flag which emails are spam
- Send results report to my email
I also wanted to compare the results of a simple perceptron against those of an MLP (multilayer perceptron).
Implementation
At the time I knew some C/C++, 8086 assembly and HTML/CSS/JS, but none of them offered simple support for neural networks, email handling and HTML parsing at the same time. I had heard that Python supported all three, so this became my first Python project.
Email Reading
To connect to the email server and fetch my emails I used Python's poplib module. After gathering the emails I used the BeautifulSoup module to strip the HTML markup and keep only the raw text. The text still needs some pre-processing to give better results, so the following operations were applied:
- Split the text into words using the word_tokenize method from the nltk module.
- Removed leftover escape characters and other special characters from the tokens.
- Deleted words longer than 20 characters, to filter out impossible words.
- Applied Porter stemming to each token and added the resulting base words to the token list.
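The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration and not the original code: it swaps in the standard library's html.parser for BeautifulSoup and a regular expression for nltk's word_tokenize so that it runs without extra packages. The names html_to_text, preprocess and MAX_WORD_LENGTH are hypothetical, and the stemmer is passed in as an optional callable (for example nltk's PorterStemmer().stem).

```python
import re
from html.parser import HTMLParser

MAX_WORD_LENGTH = 20  # words longer than this are discarded


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string (stand-in for BeautifulSoup)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.parts)


def preprocess(html, stem=None):
    """Turn an HTML email body into a list of cleaned tokens."""
    text = html_to_text(html)
    # Keep only alphanumeric runs, which drops punctuation and special characters.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    tokens = [t for t in tokens if len(t) <= MAX_WORD_LENGTH]
    if stem is not None:
        # Append the stemmed form of every token to the token list.
        tokens = tokens + [stem(t) for t in tokens]
    return tokens
```

Calling preprocess(body) returns the plain tokens; passing stem=PorterStemmer().stem from nltk would also append the stemmed forms.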
Feature Selection and Handling
For the neural network to decide whether an email is spam, it needs a set of features to use as input. They were taken from the Lelia report. The features are:
- 48 features, each the percentage of the email's words that match a specific word associated with spam.
- 6 features, each the percentage of a specific character in the whole text.
- 1 feature is the average frequency of upper-case characters in the email words.
- 1 feature is the length of the longest uninterrupted sequence of upper-case characters.
- 1 feature is the total number of upper-case characters in the email.
The output of the neural network is a binary value: 0 if the evaluated email is not spam, 1 if it is.
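As a rough sketch of how such a feature vector could be computed from an email's text: the word list below is a small illustrative subset rather than the full 48 words, and the three upper-case features follow the SpamBase run-length definitions (average, longest and total length of uninterrupted upper-case runs), which is an assumption about how the original code computed them. The function name extract_features is hypothetical.

```python
import re

# Illustrative subset only; the real feature set uses 48 words.
SPAM_WORDS = ("free", "money", "credit")
# The six characters tracked by the SpamBase dataset.
SPECIAL_CHARS = (";", "(", "[", "!", "$", "#")


def extract_features(text):
    """Return word percentages, character percentages and upper-case run stats."""
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9']+", text)]
    n_words = max(len(words), 1)  # avoid dividing by zero on empty emails
    n_chars = max(len(text), 1)

    word_pct = [100.0 * words.count(w) / n_words for w in SPAM_WORDS]
    char_pct = [100.0 * text.count(c) / n_chars for c in SPECIAL_CHARS]

    # Lengths of every uninterrupted run of upper-case ASCII letters.
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    average_run = sum(runs) / len(runs) if runs else 0.0
    longest_run = max(runs) if runs else 0
    total_upper = sum(runs)

    return word_pct + char_pct + [average_run, longest_run, total_upper]
```

With the full 48-word list this produces the 57 inputs described above (48 + 6 + 3).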
Neural Networks Training and Classification
To train and test the neural networks I used the SpamBase dataset.
First I split the dataset into training and testing data. Then I implemented the perceptron using the Perceptron class from sklearn. It was configured to shuffle the dataset it receives and was given the learning parameters.
After that the network is trained and the resulting accuracy is checked. After some training time the neural network is saved to a file using Python's pickle module, which serializes Python objects.
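To make the split, train, check and save cycle concrete, here is a small from-scratch sketch. It deliberately swaps sklearn's Perceptron for a hand-written perceptron learning rule so that it runs with the standard library alone; the class SimplePerceptron, the helpers split_dataset, save_model and load_model, and all parameter values are hypothetical and are not the settings used in the project.

```python
import pickle
import random


def split_dataset(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle the sample indices and split them into training and testing parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_ratio))
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


class SimplePerceptron:
    """Single-layer perceptron with a step activation (labels are 0 or 1)."""

    def __init__(self, n_features, learning_rate=1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        activation = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1 if activation > 0 else 0

    def fit(self, rows, labels, epochs=50, seed=0):
        rng = random.Random(seed)
        order = list(range(len(rows)))
        for _ in range(epochs):
            rng.shuffle(order)  # present the samples in a new order each epoch
            for i in order:
                error = labels[i] - self.predict(rows[i])
                if error:  # update only on misclassified samples
                    step = self.learning_rate * error
                    self.bias += step
                    self.weights = [w + step * v
                                    for w, v in zip(self.weights, rows[i])]
        return self

    def score(self, rows, labels):
        """Fraction of samples classified correctly."""
        hits = sum(self.predict(x) == y for x, y in zip(rows, labels))
        return hits / len(rows)


def save_model(model, path):
    """Serialize the learned parameters with pickle."""
    with open(path, "wb") as handle:
        pickle.dump({"weights": model.weights, "bias": model.bias}, handle)


def load_model(path):
    """Rebuild a perceptron from parameters saved by save_model."""
    with open(path, "rb") as handle:
        state = pickle.load(handle)
    model = SimplePerceptron(len(state["weights"]))
    model.weights, model.bias = state["weights"], state["bias"]
    return model
```

Saving only the weights and bias, rather than the whole object, keeps the pickle file loadable even if the class definition later moves to another module.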
The process for the MLP is similar, with some differences:
- It uses the MLP from the pylearn2 module.
- The network topology has two layers: the first has 57 nodes and uses a sigmoid function, and the second (the output layer) has 2 nodes and uses a softmax function.
- An SGD (stochastic gradient descent) method is used to train the network.
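The topology described here (57 inputs, a 57-node sigmoid layer and a 2-node softmax output) can be illustrated with a plain-Python forward pass. This is only a sketch of the network's shape and not pylearn2 code: the weights are random placeholders, SGD training is left out, and all function names are hypothetical.

```python
import math
import random

N_INPUTS = 57   # one input per feature
N_HIDDEN = 57   # sigmoid layer
N_OUTPUTS = 2   # softmax layer: index 0 = not spam, index 1 = spam


def init_layer(n_in, n_out, rng):
    """Create small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(weights, biases, x):
    """Compute W x + b for one layer."""
    return [b + sum(w * v for w, v in zip(row, x))
            for row, b in zip(weights, biases)]


def sigmoid(values):
    """Element-wise logistic function, written to avoid overflow."""
    out = []
    for v in values:
        if v >= 0:
            out.append(1.0 / (1.0 + math.exp(-v)))
        else:
            e = math.exp(v)
            out.append(e / (1.0 + e))
    return out


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(params, x):
    """Run one feature vector through both layers."""
    (w1, b1), (w2, b2) = params
    hidden = sigmoid(affine(w1, b1, x))
    return softmax(affine(w2, b2, hidden))


def classify(params, x):
    """Return 1 (spam) or 0 (not spam): the index of the larger probability."""
    probs = forward(params, x)
    return probs.index(max(probs))
```

Training would adjust the weights in params with stochastic gradient descent; with the random placeholders above the predictions are meaningless.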
Email Report
With the emails read from the mailbox and the neural network evaluating each one, the results need to be reported somehow.
I chose to send them by email: once the results are available, the program sends the report using the smtplib module.
Example of the report email (Assunto is Portuguese for "Subject" and Resultado for "Result"; 1 means spam, 0 means not spam):
Assunto : Dicas para usar o Gmail : Resultado -> 0
Assunto : Bem-vindo ao Gmail : Resultado -> 0
Assunto : Primeiros passos no Google+ : Resultado -> 0
Assunto : Bem-vindo(a) ao YouTube! : Resultado -> 0
Assunto : Thank you for downloading RapidMiner : Resultado -> 0
Assunto : Plano de Desenvolvimento de Software G13 : Resultado -> 0
Assunto : Your 14-Day Trial of RapidMiner : Resultado -> 0
Assunto : League of Legends: "Summoner's Rift Gameplay" : Resultado -> 0
Assunto : Risk List G13 - Convite para editar : Resultado -> 0
Assunto : How likely are you to recommend RapidMiner? : Resultado -> 0
Assunto : Your RapidMiner License Has Expired : Resultado -> 0
Assunto : League of Legends: "The Pledge - Kalista" : Resultado -> 0
Assunto : Conta do Google: tentativa de login bloqueada : Resultado -> 0
Assunto : League of Legends: "The Terror Beneath" : Resultado -> 0
Assunto : Conta do Google: tentativa de login bloqueada : Resultado -> 0
Assunto : Conta do Google: o acesso a aplicativos menos seguros foi ativado : Resultado -> 0
Assunto : Teste Pop : Resultado -> 1
Assunto : YOOOOO : Resultado -> 0
Assunto : Happy Holidays from RapidMiner : Resultado -> 0
Assunto : Meeting Minutes #8 - Convite para editar : Resultado -> 0
Assunto : Relatorio de SPAM : Resultado -> 0
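A report in this format could be assembled and sent roughly like this. The sketch is written for Python 3 and uses the standard email and smtplib modules; the function names, the "Relatorio de SPAM" subject line, and the smtp.gmail.com host with port 465 are assumptions for illustration, not details confirmed by the original code.

```python
import smtplib
from email.message import EmailMessage


def format_report(results):
    """Build one report line per email from (subject, flag) pairs."""
    return "\n".join(
        "Assunto : {} : Resultado -> {}".format(subject, flag)
        for subject, flag in results
    )


def build_report_message(address, results, subject="Relatorio de SPAM"):
    """Wrap the report text in an email addressed to the account itself."""
    message = EmailMessage()
    message["From"] = address
    message["To"] = address
    message["Subject"] = subject  # assumed subject line
    message.set_content(format_report(results))
    return message


def send_report(address, password, results,
                host="smtp.gmail.com", port=465):
    """Send the report over SMTP with SSL (host and port are assumptions)."""
    message = build_report_message(address, results)
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(address, password)
        server.send_message(message)
```

Keeping the formatting separate from the sending makes it possible to check the report text without a network connection.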
CLI Interface
With all this in place, the program still needs to know which algorithm to use, which email account to read, the account's password, and whether to train new networks.
So I implemented a simple CLI with print statements and Python's raw_input function.
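A prompt-based CLI of this kind might look like the sketch below. It is written for Python 3, where input replaces Python 2's raw_input, and it reads the password with getpass so it is not echoed; the prompt texts, the menu numbering and the ask_settings name are all hypothetical.

```python
import getpass

# Menu options offered to the user (numbering is an assumption).
ALGORITHMS = {"1": "perceptron", "2": "mlp"}


def ask_settings(ask=input, ask_secret=getpass.getpass):
    """Ask for the algorithm, account, password and whether to retrain.

    The two callables are parameters so the prompts can be replaced in tests.
    """
    choice = ""
    while choice not in ALGORITHMS:  # repeat until a valid option is entered
        choice = ask("Algorithm (1 = Perceptron, 2 = MLP): ").strip()
    address = ask("Email account: ").strip()
    password = ask_secret("Password: ")
    retrain = ask("Train a new network? [y/N]: ").strip().lower() == "y"
    return {
        "algorithm": ALGORITHMS[choice],
        "address": address,
        "password": password,
        "retrain": retrain,
    }
```

Calling ask_settings() with no arguments prompts on the terminal and returns a dictionary the rest of the program can use.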
Results
- CLI to use the program.
- Configuration of which email account to connect to.
- Menu to select between the Perceptron and the MLP, and whether to train new networks.
- Report Email.
Assunto : Dicas para usar o Gmail : Resultado -> 0
Assunto : Bem-vindo ao Gmail : Resultado -> 0
Assunto : Primeiros passos no Google+ : Resultado -> 0
Assunto : Bem-vindo(a) ao YouTube! : Resultado -> 0
Assunto : Thank you for downloading RapidMiner : Resultado -> 0
Assunto : Plano de Desenvolvimento de Software G13 : Resultado -> 0
Assunto : Your 14-Day Trial of RapidMiner : Resultado -> 0
Assunto : League of Legends: "Summoner's Rift Gameplay" : Resultado -> 0
Assunto : Risk List G13 - Convite para editar : Resultado -> 0
Assunto : How likely are you to recommend RapidMiner? : Resultado -> 0
Assunto : Your RapidMiner License Has Expired : Resultado -> 0
Assunto : League of Legends: "The Pledge - Kalista" : Resultado -> 0
Assunto : Conta do Google: tentativa de login bloqueada : Resultado -> 0
Assunto : League of Legends: "The Terror Beneath" : Resultado -> 0
Assunto : Conta do Google: tentativa de login bloqueada : Resultado -> 0
Assunto : Conta do Google: o acesso a aplicativos menos seguros foi ativado : Resultado -> 0
Assunto : Teste Pop : Resultado -> 1
Assunto : YOOOOO : Resultado -> 0
Assunto : Happy Holidays from RapidMiner : Resultado -> 0
Assunto : Meeting Minutes #8 - Convite para editar : Resultado -> 0
Assunto : Relatorio de SPAM : Resultado -> 0
- Identified Spam email.
Hello,
FREE STUFF WITH US!!!!
SUPER EASY MONEY!!!
Just to easy steps, check our site - www.test.com
- Algorithms Comparison.
Algorithm | Accuracy (%) | False positives (%) | False negatives (%) |
---|---|---|---|
Perceptron | 90.11 | 4.32 | 5.55 |
MLP | 93.02 | 3.96 | 2.99 |
I was only trying to learn about neural networks, but I still got better accuracy than the postgraduate report.
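The three columns of the comparison table add up to roughly 100 in each row, which suggests that the false positive and false negative figures are shares of all test emails rather than per-class rates. Under that assumption, the metrics could be computed as in this sketch (the evaluate function is hypothetical):

```python
def evaluate(y_true, y_pred):
    """Return accuracy, false positives and false negatives as percentages
    of all evaluated emails (1 = spam, 0 = not spam)."""
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that got through
    correct = total - false_pos - false_neg
    return {
        "accuracy": 100.0 * correct / total,
        "fp": 100.0 * false_pos / total,
        "fn": 100.0 * false_neg / total,
    }
```

For a spam filter the false positive share matters most, since a legitimate email flagged as spam is usually costlier than a spam email that gets through.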
Repository
If you want to check the source code, go to the Bitbucket repository.
NOTE: The code was written in Python 2 around 2015, while I was still doing my bachelor's degree, and I have not maintained it, so it probably will not work anymore.