Feature/improve email extract #373

Merged
merged 6 commits into from
Oct 12, 2018

Conversation

klaxon1
Contributor

@klaxon1 klaxon1 commented Oct 2, 2018

@edent

edent commented Oct 3, 2018

Does this work with I18n domain names?

For example test@莎士比亚.org

(from https://shkspr.mobi/blog/2016/09/why-cant-you-send-email-to-a-chinese-address/)

@GCHQ77703
Member

GCHQ77703 commented Oct 4, 2018

It does not appear to. Nor does it appear to work with non-Latin characters anywhere in the email address.

That behaviour is actually permitted by RFC 5322, though, which specifically allows only Latin characters in email addresses. Non-Latin characters are usually converted into an ASCII-only form behind the scenes by whatever client you are using.

Although some MTAs do not allow such characters, it might still be reasonable to support them ourselves, as they are ubiquitous.
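
For illustration, the conversion described above is typically Punycode, the `xn--` form used for internationalised domain names. Here is a minimal sketch using Python's built-in `idna` codec (which implements the older IDNA 2003 rules). The helper name `to_ascii_domain` is hypothetical, and this is not how CyberChef itself handles addresses:

```python
def to_ascii_domain(address: str) -> str:
    """Convert only the domain part of an address to its ASCII (Punycode) form.

    Hypothetical helper for illustration; the local part is left untouched.
    """
    local, _, domain = address.rpartition("@")
    # The built-in "idna" codec encodes each non-ASCII label as "xn--...".
    ascii_domain = domain.encode("idna").decode("ascii")
    return local + "@" + ascii_domain


encoded = to_ascii_domain("test@莎士比亚.org")
print(encoded)  # the domain now starts with "xn--" and still ends in ".org"
```

Decoding the domain with `.encode("ascii").decode("idna")` should give back the original Unicode form.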

@klaxon1
Contributor Author

klaxon1 commented Oct 5, 2018

Good points all, I didn't consider those cases. Let's leave this pull request open and I'll make some additional commits.

@klaxon1
Contributor Author

klaxon1 commented Oct 11, 2018

So, I've learnt this week that email addresses are more complicated than I originally thought.
This updated regex (from https://www.regextester.com/98066) seems to match 99% of valid email addresses.
The following email addresses are all extracted successfully:

伊昭傑@郵件.商務
म@मोहन.ईन्फो
юзер@екзампл.ком
θσερ@εχαμπλε.ψομ 
JosễSilvễ@googlễ.com
JosễSilvễ@google.com 
Josễ[email protected]
[email protected]
[email protected][email protected]
[email protected].
user+mailbox/[email protected]
用户@例子.广告
उपयोगकर्ता@उदाहरण.कॉम
юзер@екзампл.ком
θσερ@εχαμπλε.ψομ
Dörte@Sörensen.example.com
аджай@экзампл.рус
[email protected][email protected]
test@莎士比亚.org
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
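
As a rough illustration of Unicode-aware extraction, here is a minimal Python sketch. The pattern `EMAIL_RE`, the function `extract_emails` and the sample address `user+tag@example.com` are all hypothetical. This is a deliberately simplified pattern, not the regex merged in this pull request, and it will miss some valid addresses (for example, ones containing combining marks, which `\w` does not match):

```python
import re

# Hypothetical, simplified pattern: NOT the regex from this pull request.
# In Python 3, \w is Unicode-aware for str patterns, so letters from
# non-Latin scripts match. '+' and '/' are allowed in the local part.
# Each domain label must be followed by at least one ".label", so a
# trailing full stop after the address is not captured.
EMAIL_RE = re.compile(r"[\w.+/-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(text: str) -> list[str]:
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_RE.findall(text)


sample = "Contact test@莎士比亚.org, юзер@екзампл.ком or user+tag@example.com today"
print(extract_emails(sample))
# ['test@莎士比亚.org', 'юзер@екзампл.ком', 'user+tag@example.com']
```

Keeping the pattern loose trades accuracy for readability; a stricter pattern would be needed to approach full coverage of valid addresses.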

@n1474335 n1474335 merged commit e638fb6 into gchq:master Oct 12, 2018
@n1474335
Member

Excellent, thanks very much for this. I haven't updated the built-in email regex in the 'Regular Expression' operation, as I feel that version should be a bit more quick and dirty. It's useful to have a fairly accurate one for 'Extract email addresses' though.

@klaxon1 klaxon1 deleted the feature/improve-email-extract branch October 13, 2018 04:53
Successfully merging this pull request may close these issues.

Email search regex does not include +
4 participants