Lazarus

Title: Clean an email address?
Post by: KarenT on June 11, 2018, 08:05:23 pm
Hello,

I need to clean incoming email addresses. Can someone please point me at some code snippets?
All I want is the actual address. e.g.

something@somewhere.com

From reading the RFC and searching online, it seems an email address can be quite a mess.
See here for starters, page-down x 2:
https://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx/

But I have seen
DesignSpark <designspark@news.rs-online.com> Using nlserver, Build 6.1.1.8705
"B&H Photo" <ord-status@bhphotovideo.com> Using MIME::Lite 3.01 (E2.72; F4.60; Q2.21; G4.21)

It gets complicated fast.
Title: Re: Clean an email address?
Post by: dsiders on June 11, 2018, 09:24:04 pm
I know Indy10 has the TIdEMailAddressItem and TIdEMailAddressList classes to help with the process (lib/Protocols/IdEmailAddress.pas). If nothing else, look at the logic for TIdEMailAddressItem.SetText for the particulars.

Hope that helps.

Don
Title: Re: Clean an email address?
Post by: Zvoni on June 12, 2018, 09:34:06 am
I have always found it helpful to think through the logic/algorithm I would use to implement it myself before using someone else's code.

Get the incoming string containing the email address.
Parse the string looking for the "<" character.
From that position + 1, parse the string further looking for the ">" character.
Now you have found the token containing the address.
As a further check, make sure the token contains one (and only one) "@" symbol by splitting the token at the "@" character.
The part before the "@" is the name, the part after it is the domain.
Check that the domain contains at least one "." character (two or more if it includes subdomains, e.g. myname@sub.domain.com).
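
The steps above could be sketched roughly as follows in Free Pascal. This is not Zvoni's actual function: the name ExtractAddress is hypothetical, and treating a string without "<" as a bare address is an added assumption. It deliberately ignores quoted local parts and other RFC corner cases.

```pascal
program extractaddr;
{$mode objfpc}{$H+}
uses
  SysUtils, StrUtils;

// Hypothetical helper: returns the bare address, or '' when the
// input does not pass the simple checks described in the post.
function ExtractAddress(const Raw: string): string;
var
  OpenPos, ClosePos, AtPos: SizeInt;
  Token, Domain: string;
begin
  Result := '';
  Token := Trim(Raw);
  // Take the text between '<' and the next '>' when a '<' is present.
  // Assumption: without '<' the whole string is treated as the token.
  OpenPos := Pos('<', Token);
  if OpenPos > 0 then
  begin
    ClosePos := PosEx('>', Token, OpenPos + 1);
    if ClosePos = 0 then
      Exit;
    Token := Trim(Copy(Token, OpenPos + 1, ClosePos - OpenPos - 1));
  end;
  // Require exactly one '@' with a non-empty name in front of it.
  AtPos := Pos('@', Token);
  if (AtPos <= 1) or (PosEx('@', Token, AtPos + 1) > 0) then
    Exit;
  // The domain must contain at least one dot.
  Domain := Copy(Token, AtPos + 1, MaxInt);
  if Pos('.', Domain) = 0 then
    Exit;
  Result := Token;
end;

begin
  WriteLn(ExtractAddress('DesignSpark <designspark@news.rs-online.com> Using nlserver, Build 6.1.1.8705'));
  WriteLn(ExtractAddress('"B&H Photo" <ord-status@bhphotovideo.com> Using MIME::Lite 3.01'));
  WriteLn(ExtractAddress('something@somewhere.com'));
end.
```

An empty result marks input the simple rules cannot handle, which makes it easy to log those strings and inspect them later.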

And what do you know?
I've written such a split function as an exercise to get familiar with Pascal.

EDIT: I've just read your link.
I didn't know that you could use the @-symbol in the local part if you escape or quote it!!  :o :o :o
Title: Re: Clean an email address?
Post by: ASBzone on June 12, 2018, 03:03:49 pm
Quote from: Zvoni
EDIT: I've just read your link.
I didn't know that you could use the @-symbol in the local part if you escape or quote it!!  :o :o :o

To be fair, I have never seen this used anywhere.

It seems to me that the logic you used would address the vast majority of the use-cases that are likely to occur on any regular basis.

And once you have extracted the contents in between "<" and ">" you can still choose to handle the occurrence of a second "@" if necessary.
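
One possible way to cope with a second "@" is to split the extracted token at its last "@". This is only a sketch: SplitAtLastAt is a hypothetical name, and it assumes the domain is a plain host name that never contains "@" itself. RPos comes from the StrUtils unit.

```pascal
program splitlastat;
{$mode objfpc}{$H+}
uses
  StrUtils;

// Hypothetical helper: splits Token at its last '@'.
// Returns False when there is no '@' or when either side is empty.
function SplitAtLastAt(const Token: string; out LocalPart, Domain: string): Boolean;
var
  AtPos: SizeInt;
begin
  AtPos := RPos('@', Token);  // position of the last '@', 0 if none
  Result := (AtPos > 1) and (AtPos < Length(Token));
  if Result then
  begin
    LocalPart := Copy(Token, 1, AtPos - 1);
    Domain := Copy(Token, AtPos + 1, MaxInt);
  end
  else
  begin
    LocalPart := '';
    Domain := '';
  end;
end;

var
  L, D: string;
begin
  if SplitAtLastAt('"odd@local"@example.com', L, D) then
    WriteLn(L, ' | ', D);
end.
```

Everything before the last "@" is kept as the local part, so a quoted local part with embedded "@" characters stays intact.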
Title: Re: Clean an email address?
Post by: KarenT on June 12, 2018, 03:24:26 pm
Quote from: Zvoni
EDIT: I've just read your link.

:D As I was reading your post a grin began to form as I was pretty sure you had not checked what is allowable -- and -- not only a second "@", but as many as you like.

I am already doing as you suggested and a lot more, but still occasionally get a blank "To" or "From" address meaning my attempts have failed to clean it up.

Seems to my simple mind that something as basic as an email address should be defined in the RFC way more rigidly. But what do I know. :)
Title: Re: Clean an email address?
Post by: KarenT on June 12, 2018, 03:32:39 pm
Quote from: dsiders
I know Indy10

Thanks, I had already checked, but their checking is not as comprehensive as what I have already developed. And, as mentioned in my other reply, my function still occasionally cannot resolve to a clean address.

In my Windows/Delphi days many years back, I remember seeing something on cleaning email addresses as part of a package like Synautils etc. But I have spent two hours going through old backup HDDs and cannot find it. It was a weirdly named thing like "u18263address(..."; the number was probably naming an RFC or something like that.
Title: Re: Clean an email address?
Post by: ASBzone on June 12, 2018, 03:47:57 pm
Quote from: KarenT
Seems to my simple mind that something as basic as an email address should be defined in the RFC way more rigidly. But what do I know. :)

Two words:  Backwards Compatibility

Some of these earlier standards were developed when there was quite a bit of flexibility across a variety of proprietary systems.  Today, there is a greater tendency to be a bit more structured in the RFCs...
Title: Re: Clean an email address?
Post by: Zvoni on June 12, 2018, 06:00:17 pm
Quote from: KarenT
I am already doing as you suggested and a lot more, but still occasionally get a blank "To" or "From" address meaning my attempts have failed to clean it up.

What are you struggling with?
Examples which fail?
Title: Re: Clean an email address?
Post by: KarenT on June 12, 2018, 08:13:48 pm
Quote from: Zvoni
What are you struggling with?

I don't have anything at the moment and I am not so much struggling as in "don't know how to do it," but more along the lines of "can't keep up with the moving window." :)

No sooner do I handle something weird in my "Clean" routine than I get another, even weirder one, albeit usually weeks apart. I was hoping someone had been down this path before and had an all-singing, all-dancing version that coped with everything.
Title: Re: Clean an email address?
Post by: Zvoni on June 13, 2018, 08:02:47 am
Quote from: KarenT
I was hoping someone had been down this path before and had an all singing, all dancing version that coped with everything.

Ah no! Even the world's most famous physicists haven't found the Formula of Everything.
The Unified Field Theory of Programming still has to be discovered!  :P :P :P
Title: Re: Clean an email address?
Post by: Thaddy on June 13, 2018, 11:52:30 am
This may help:
Code: Pascal
program findemail;
{$mode objfpc}
// Finds most common email addresses in a string (roughly 99% of cases)
uses regexpr;
var
  Expr: TRegExpr;
begin
  // The TLD part is limited to 2-6 letters; longer TLDs get cut off in the match.
  Expr := TRegExpr.Create('[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}');
  try
    if Expr.Exec('vincent@vangogh.museum designspark@news.rs-online.com    <ord-status@bhphotovideo.com> some stuff "myemail@mail.info" mail@mail.mail.com') then
      repeat
        writeln(Expr.Match[0]);  // Match[0] is the whole matched address
      until not Expr.ExecNext;
  finally
    Expr.Free;
  end;
end.
There is also a widely circulated regex that aims to conform to RFC 5322. You can use it as well, but:
a) it is rather complex.
b) it will fail in cases such as the ones you mentioned.
c) I could not get it to work with TRegExpr.. :o :( Maybe someone else can.
RFC 5322 regex:
Code:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Probably needs some extra escaping to make that work in TRegExpr.

Title: Re: Clean an email address?
Post by: Thaddy on June 13, 2018, 02:05:28 pm
@KarenT
Here is the same code as a command-line utility:
Code: Pascal
program findemails;
{$mode objfpc}
// Simple command-line utility: prints the email addresses found in a
// text (including HTML) file; catches roughly 99% of common addresses.
uses sysutils, classes, regexpr;
var
  MyFile: TStringList;
  Expr: TRegExpr;
begin
  if ParamCount <> 1 then
  begin
    writeln('Use: findemails <filename>');
    Halt(1);
  end;
  if not FileExists(ParamStr(1)) then
  begin
    writeln('File not found: ', ParamStr(1));
    Halt(1);
  end;
  MyFile := TStringList.Create;
  try
    MyFile.LoadFromFile(ParamStr(1));  // inside try so MyFile is freed on error
    Expr := TRegExpr.Create('[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}');
    try
      if Expr.Exec(MyFile.Text) then
        repeat
          writeln(Expr.Match[0]);  // one address per line
        until not Expr.ExecNext;
    finally
      Expr.Free;
    end;
  finally
    MyFile.Free;
  end;
end.
Title: Re: Clean an email address?
Post by: RayoGlauco on June 13, 2018, 02:18:43 pm
Just to mess things up a little more, what about IDN domains, which include international characters? https://en.wikipedia.org/wiki/Internationalized_domain_name
Title: Re: Clean an email address?
Post by: RayoGlauco on June 13, 2018, 02:44:48 pm
I found some examples of internationalised email addresses here: https://en.wikipedia.org/wiki/International_email
  用户@例子.广告   
  अजय@डाटा.भारत   
  квіточка@пошта.укр   
  θσερ@εχαμπλε.ψομ   
  Dörte@Sörensen.example.com
  аджай@экзампл.рус   