How to catch Russian spam using MSWregexp
Blocking spam that is written in a foreign language can be hard, and I have
been asked a number of times how to do it. That is why I have made this
tutorial, so it is easier for anybody to configure MSWregexp to catch such
mails.
The first task is to find a copy of a mail that contains the foreign language
you would like to block. It must be in the standard RFC mail format, as the
Mailsweeper files normally are.
Here you can see the mail opened in Notepad to show how it looks in the raw
source. Note that Notepad does not show the text correctly, because it does not
decode it with the charset "Windows-1251" as the header tells the mail client
to do; it simply shows the standard ANSI text version.
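The difference between the two views can be reproduced in a few lines of Python. This is only a sketch: the sample word is an assumption and is not taken from the mail used in this tutorial, and Latin-1 is used here as a stand-in for Notepad's Western ANSI view.

```python
# Sample text is made up for illustration; it is not from the tutorial's mail.
raw_bytes = "Привет".encode("cp1251")   # the bytes as they are stored in the file

# What a mail client shows when it follows the "Windows-1251" header.
as_mail_client = raw_bytes.decode("cp1251")

# Roughly what Notepad shows: the same bytes read as Western text.
as_notepad = raw_bytes.decode("latin-1")
```

The bytes are identical in both cases; only the charset used to interpret them changes.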

If you rename the .MSG file to a .EML file and double-click on it, it will
open in Outlook Express, and you can see how the mail looks when the receiver
sees it inside a mail client. Remember to scan the mail for viruses before you
do this, because there can very easily be some bad content in the mail, and you
do not want that to execute on your computer.
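If you would rather not open a suspicious mail in a mail client at all, the decoded subject and body can also be read with a script. The following is a minimal sketch using Python's standard email package. It is not part of MSWregexp, and the sample mail built in the code (address, subject, body) is hypothetical; with a real file you would read the bytes of the .MSG file instead.

```python
import base64
from email import message_from_bytes, policy

# Hypothetical sample mail; address and text are made up for illustration.
russian = "Привет".encode("cp1251")
encoded_subject = base64.b64encode(russian)   # RFC 2047 "B" encoding payload

raw_mail = (
    b"From: sender@example.com\r\n"
    b"Subject: =?windows-1251?B?" + encoded_subject + b"?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=windows-1251\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n" + russian + b"\r\n"
)

# Parse the raw RFC-format mail; policy.default decodes headers for us.
msg = message_from_bytes(raw_mail, policy=policy.default)

subject = str(msg["Subject"])                                 # decoded subject
body = msg.get_body(preferencelist=("plain",)).get_content()  # decoded text body
```

Nothing in the mail is executed this way; the script only decodes text using the charset named in the headers.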

The next task is to copy the .MSG file into the folder where you have the
MSWregexp application. When you have done that, open a DOS prompt and go to the
mswregexp folder.
Try running MSWregexp on the .MSG file to see if it detects the mail as spam.

As you can see, it says that it has not detected this mail as spam. That is
because I use an empty mswregexp.ini for this tutorial,
as you can see in the screen dump below.

As you can see in the INI file, it uses ISO-8859-1 to read the whole file, so
MSWregexp sees the whole mail text the same way as when it is opened in
Notepad, and not the way the mail client does, which uses a different encoding
for the body parts.
The next task is to see how MSWregexp actually sees the content of the body
parts that contain the Russian text. To do that, you can run MSWregexp with the
/showpart command to output how it reads the content.
First look at the subject, using /showpart:msg_subject

Note that the text is not shown as Russian text.
Now look at the mail body. To show both the text body and the decoded HTML
body as one string, use /showpart:msg_textdecodehtml

As with the subject, the content is not shown as Russian.
Now simply copy and paste the letters you would like to catch into the INI
file and build an expression that can catch the mail.
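To get a feeling for which characters end up in the expressions, the following Python sketch works out how a few Cyrillic letters look when Windows-1251 bytes are read as ISO-8859-1, and counts how often each one occurs in a sample string. The chosen letters and the sample word are assumptions for illustration only; in practice you copy the characters from the /showpart output, and the INI syntax itself is not shown here.

```python
import re

# Hypothetical letters to catch; normally copied from the /showpart output.
cyrillic_letters = "аеио"

# How each letter looks when its Windows-1251 byte is read as ISO-8859-1.
latin1_view = [ch.encode("cp1251").decode("latin-1") for ch in cyrillic_letters]

# Sample text as it would appear in the ISO-8859-1 view (made up for this sketch).
sample = "Привет".encode("cp1251").decode("latin-1")

# One small expression per letter, which keeps the list easy to read and extend.
hits = {ch: len(re.findall(re.escape(ch), sample)) for ch in latin1_view}
```

Keeping one expression per letter means a new letter can be added or removed without touching the others.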

As you can see, I have split each letter into its own string, so it is easier
to read and edit when there are new things you would like to add.
The basic idea here is that every time it finds one of the letters, it adds 25
to the mail score. What value to use is a bit of a guess: it should catch all
spam mails containing the strings without producing any false positives, so you
may have to fine-tune the score a number of times when you see it detect some
wrong mails.
The expressions are placed under the [subjecttextdecodehtml] section, which
tells MSWregexp to look for the expressions in the subject, the text body and
the decoded version of the HTML body.
Next, test the mail with the new INI file and see what it detects and what the
score is. To do that, use /showtest:* to show on the console what it has found,
where, and how many times it has matched each of the expressions in the INI
file.
Note: because of the long output I had to split it into two pictures and cut
some of it away, but it still shows the important things.


As you can see, the mail now scores 17000, which is more than the 10000 it must
have to be detected as spam. The Logtext string also tells in which section
each string was found.
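The scoring idea can be sketched in Python as follows. This is an illustration of the arithmetic only and not how MSWregexp is implemented: the function names are hypothetical, and treating a score equal to the limit as spam is an assumption.

```python
import re

SCORE_PER_HIT = 25       # value added per match, as used in this tutorial
SPAM_THRESHOLD = 10000   # score a mail must reach to be detected as spam

def score_text(text, letters, per_hit=SCORE_PER_HIT):
    """Add per_hit to the score for every occurrence of any listed letter."""
    pattern = re.compile("[" + re.escape(letters) + "]")
    return per_hit * len(pattern.findall(text))

def is_spam(score, threshold=SPAM_THRESHOLD):
    # Assumption: reaching the threshold exactly also counts as spam.
    return score >= threshold

# A short made-up text in the ISO-8859-1 view: 2 hits, so the score is 50.
example_score = score_text("Ïðèâåò", "åè")
```

If every hit is worth 25, a score of 17000 corresponds to 680 hits, which is why a long Russian mail passes the limit easily while a mail with only a few accented letters stays far below it.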
The task shown in this tutorial is basically the same for each spam mail that you configure mswregexp to detect. To make it easier to build more complex regular expressions, I have included the GUI app, where you can paste in the text and build the expression much faster. When it is ready, you can copy it to the INI file.
Forum post about this tutorial
http://www.tooms.dk/forum/topic.asp?TOPIC_ID=43