This repository has been archived by the owner on Aug 24, 2022. It is now read-only.

Documentation on e-mail-address processing #35

Open
3 tasks
clhunsen opened this issue Nov 17, 2015 · 1 comment

Comments

@clhunsen
Contributor

Referring to issue #34, the behavior and abilities of Codeface need to be documented. What kinds of From-line formats are supported when supplying mbox files to the mailing-list analysis of Codeface?

With the patch from issue #34, the following "abominations" are supported in addition to the standard format Hans Huber <[email protected]> (according to @wolfgangmauerer on the mailing list):

Hans Huber [email protected]
Hans Huber huber at hubercorp.com
Hans Huber ("AT" instead of "at" also works)
[email protected] Hans Huber
hans huber @ hubercorp.com Hans Huber
hans huber @ hubercorp.com (Hans Huber)

Furthermore, we have the via pattern (such as Hans Huber via corp-dev <[email protected]>) and likely others. Documentation on the treatment would help users (e.g., "The via pattern gets treated as follows: Remove the 'via ...' part and use the mail address as is." [I am not sure that this is actually the way it is handled, hence, this ticket...]).
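
To make the question concrete, here is a minimal Python sketch of what such a cleanup routine could look like. It is not Codeface's actual implementation: the function name `parse_from_line` and the regular expressions are hypothetical, the `corp-dev@hubercorp.com` address in the usage note is made up, and the via handling simply follows the example sentence above (drop the "via ..." part, keep the address as is). It covers the bracketed form, the " at " / " AT " obfuscation, the address-first forms, and the via pattern; it does not handle addresses written with spaces around "@".

```python
import re

# Hypothetical helpers for illustration; not taken from Codeface.
_AT_RE = re.compile(r"\s+at\s+", re.IGNORECASE)   # "huber at hubercorp.com"
_VIA_RE = re.compile(r"\s+via\s+.*$")             # "Hans Huber via corp-dev"
_ADDR_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def parse_from_line(line):
    """Return (name, address); address is None if no address is found."""
    # Undo the " at " / " AT " obfuscation before searching for an address.
    text = _AT_RE.sub("@", line.strip())
    match = _ADDR_RE.search(text)
    if match is None:
        return text, None
    address = match.group(0)
    # Everything around the address is treated as the name part.
    name = text[:match.start()] + " " + text[match.end():]
    # Drop brackets, parentheses and quotes that wrapped name or address.
    name = re.sub(r'[<>()"]', " ", name)
    # Assumed via treatment: remove "via ..." and keep the address as is.
    name = _VIA_RE.sub("", name)
    # Collapse the whitespace left over from the steps above.
    name = " ".join(name.split())
    return name, address
```

With this sketch, `parse_from_line("Hans Huber via corp-dev <corp-dev@hubercorp.com>")` yields the name `Hans Huber` together with the list address, which shows why the via treatment needs documenting: the address no longer identifies the person. Note also that the " at " replacement is applied to the whole line, so a name that itself contains the word "at" would be mangled.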


Things to do

  • Document the various formats (abominations or not) that are supported by Codeface.
  • Factor out the processing routines and make them independent of document processing.
  • Implement a unit test case for all possibilities.
@wolfgangmauerer
Collaborator

On 17/11/2015 at 17:01, Andreas Ringlstetter wrote:

> Which of these edge cases are specific to transforming incompatible mbox
> formats,
> which are specific to the ML analysis,
> and which are possibly also affecting the parsing of Sign-Off patterns
> in the VCS analysis?

none of them is specific to anything -- it's just that the amount of
creativity that goes into coming up with bogus formats for email
addresses in mails considerably exceeds the amount found in
tags.

As I suggested in the corresponding thread, it is surely useful to
separate the cleanup operations from document processing and make
the routines generically available.

> There is also the |Huber, Hans| variation of names for all patterns.
> This is already handled in idManager.py, but not in the ML analysis.

thanks for catching this -- I was discussing this with Mitchell in this
thread, and he's currently looking into what the majority of bogus
use-cases for this pattern is.
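
For reference, the |Huber, Hans| variation quoted above could be normalized along the following lines. This is only a sketch under stated assumptions: `normalize_name` is a hypothetical name, and the logic is not copied from idManager.py. It swaps the two parts only when there is exactly one comma with non-empty text on both sides, and leaves everything else alone.

```python
def normalize_name(name):
    """Turn 'Last, First' into 'First Last'; leave other names untouched."""
    parts = [part.strip() for part in name.split(",")]
    # Swap only the unambiguous two-part case.
    if len(parts) == 2 and all(parts):
        last, first = parts
        return first + " " + last
    # Otherwise just collapse repeated whitespace.
    return " ".join(name.split())
```

A rule this simple would also reorder strings where the comma does not separate last and first name (a trailing title or company suffix, for example), which is exactly the kind of bogus use-case that needs to be surveyed first.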

