Web Hosting Talk

View Full Version : Regexp used to filter junk email (question)


sydneyshan
12-26-2005, 10:16 PM
Hi there!

Of late, I've been receiving many junk mail messages to my two website domains - dreamscapemedia.com.au and shannonmurdoch.com. These junk mail messages are addressed to non-existent email boxes, e.g. ekjdlkd@shannonmurdoch.com, and are unfortunately being caught in my 'catch-all' email box (which is irreversibly set to my main 'personal' email inbox, which I don't give out to anyone except on paper...).

I'd like to divert any mail not addressed specifically to one of my three email boxes to my generic 'contact@dreamscapemedia.com.au' inbox. As the catch-all option is not available, I've written a regular expression to catch any mail in the same fashion as a catch-all would:

^(?=personal)|^(?=subscriptions)|^(?=contact)|^(.+?)@(.+?)$

This I match against the following example emails:

personal@dreamscapemedia.com.au
contact@shannonmurdoch.com
contact@dreamscapemeida.com.au
mysubscriptions@dreamscapemedia.com.au
sdifjlksd@dreamscapemedia.com.au
fdsklf@dreamscapemedia.com.au

When matching, the regular expression is supposed to ignore any emails addressed to 'personal@*', 'mysubscriptions@*' and 'contact@*', whilst matching any other email address used (and therefore forwarding it on to the contact@dreamscapemedia.com.au box).

At present, this regexp matches the full line for the last two email addresses (good), and 'something' in all of the good email addresses which it is supposed to ignore. I don't know what that 'something' is - it's not a character, as nothing is returned apart from the fact that the regexp has matched 'something' in the good email addresses that it should not be. This is causing all emails to be caught by the regexp filter above.

Does anything stand out to you as being wrong...?

Thanks,
-Shan
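
The 'something' described above is most likely a zero-length match. Each of the first three alternatives, such as ^(?=personal), is a lookahead that consumes no characters, so it succeeds at the start of the line and returns an empty string. A minimal sketch, using Python's re module purely for illustration (the filter engine in the post is not named, so its exact behaviour is an assumption; matched_text is a hypothetical helper):

```python
import re

# The pattern from the post, unchanged.
pattern = re.compile(
    r"^(?=personal)|^(?=subscriptions)|^(?=contact)|^(.+?)@(.+?)$"
)


def matched_text(address):
    """Return the text the pattern matched, or None if it did not match."""
    m = pattern.search(address)
    return None if m is None else m.group(0)


for address in (
    "personal@dreamscapemedia.com.au",
    "contact@shannonmurdoch.com",
    "sdifjlksd@dreamscapemedia.com.au",
):
    # An empty string here still counts as a successful match,
    # which is why a filter built on this pattern fires on every message.
    print(repr(address), "->", repr(matched_text(address)))
```

In this sketch the two 'good' addresses report a match whose text is the empty string, while the junk address matches in full. A filter that only asks "did it match?" cannot tell the two cases apart.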

sydneyshan
12-26-2005, 11:01 PM
Perhaps I've misunderstood the concept of the lookahead assertions... I've discovered they don't actually match (return) or ignore characters as I had assumed - the 'negative/positive' aspect in the term 'negative lookahead assertion' has to do with the direction the regexp construct searches from the current point in the string - not performing a positive/negative switch as I had assumed (like the PHP construct below):

$myvar = true;
if( !$myvar ){ echo('$myvar is false.'); }

The ! switches the value from true to false or false to true depending on the variable.

I am at a complete loss to find any such construct within regular expressions, except the [^abc] negated character class, which has very limited and specific use in my experience.

Has anyone had any luck with exclusive regexp matching (as in, you state a number of phrases you DON'T want to be matched, followed by the phrases you DO want to be matched)?

Any help would be greatly appreciated!

-Shan
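
For reference, in PCRE-style regular expressions the direction of the search is what separates lookahead from lookbehind, while 'positive' and 'negative' do act as the kind of switch asked about: (?=X) succeeds only if X follows, and (?!X) succeeds only if X does not follow. A small sketch in Python's re module, chosen only for illustration (the function names are hypothetical):

```python
import re


def starts_with_contact(s):
    # Positive lookahead: succeeds only if "contact" follows the start.
    return re.search(r"^(?=contact)", s) is not None


def does_not_start_with_contact(s):
    # Negative lookahead: succeeds only if "contact" does NOT follow the start.
    return re.search(r"^(?!contact)", s) is not None


def shannon_preceded_by_at(s):
    # Lookbehind: inspects the text before the current position instead.
    return re.search(r"(?<=@)shannon", s) is not None
```

Both lookahead forms consume no characters, which matches the observation that they do not 'return' anything; they only decide whether the match may proceed at that position.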

Burhan
12-27-2005, 03:31 AM
I must ask, why don't you just disable the catch-all? Wouldn't it be easier? Just set up forwards as you prefer.

sydneyshan
12-27-2005, 04:36 AM
My hosting account was set up in such a way that the catch-all was 'hard-wired' to the main email account. It unfortunately can't be turned off, and making it point to another email address means my personal email address doesn't receive mail...

Burhan
12-27-2005, 07:49 AM
What kind of web hosting do you have? This seems a bit stupid. The first thing I do is turn off the catch-all email addresses, or point them to /dev/null or equivalent. I am really surprised that it's 'hard-wired'. Ask your support service to turn it off.

Anyway, I am really not sure at what level you want to do the filtering. Are you writing a separate script for this, or do you want to implement a filter in your MTA?

sydneyshan
12-27-2005, 09:50 AM
It's a cPanel hosting account.

I'm using the primary username as my personal email address. The catch-all is actually turned off at present, but when it is turned on I don't receive any mail to my personal email box. It's as if the main email address/account is not a real or 'proper' email address, and therefore messages sent to it are caught by the catch-all. The cPanel interface allows you to set up email filters on the server based on simple strings or regular expressions, but unfortunately there is no hierarchy, if you know what I mean, so multiple rules are applied to the same incoming message. You can't tell it to ignore other filters if it matches a particular one, like you can in Outlook etc.

I guess regular expressions are geared only toward MATCHING strings and don't have the ability to do what I'm looking for...?
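
One way to express "match everything except these names" in a single rule is a negative lookahead placed at the start of the pattern. The sketch below assumes the filter engine supports PCRE-style lookaheads, which the thread does not confirm; it is written in Python's re module for illustration, and catch_all and is_caught are hypothetical names:

```python
import re

# Match any address whose local part is NOT exactly one of the three
# real mailboxes. The "@" inside the lookahead stops "contactus@..."
# from being treated as the "contact" mailbox.
catch_all = re.compile(
    r"^(?!(?:personal|mysubscriptions|contact)@).+@.+$"
)


def is_caught(address):
    """Return True if the address should be diverted to the generic inbox."""
    return catch_all.search(address) is not None


for address in (
    "personal@dreamscapemedia.com.au",
    "contact@shannonmurdoch.com",
    "mysubscriptions@dreamscapemedia.com.au",
    "sdifjlksd@dreamscapemedia.com.au",
    "fdsklf@dreamscapemedia.com.au",
):
    print(address, "->", "divert" if is_caught(address) else "leave alone")
```

Because the exclusion lives inside the one pattern, no ordering between separate filter rules is needed. Whether the header being tested contains only the bare address is a further assumption; a display name or angle brackets around the address would need the anchors adjusted.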