Documentation on e-mail-address processing #35

clhunsen · 2015-11-17T15:13:08Z

Referring to issue #34, the behavior and abilities of Codeface need to be documented. What kinds of From-line formats are supported when supplying mbox files to the mailing-list analysis of Codeface?

With the patch from issue #34, the following "abominations" are supported in addition to the standard format Hans Huber <[email protected]> (according to @wolfgangmauerer on the mailing list):

Hans Huber [email protected]
Hans Huber huber at hubercorp.com
Hans Huber ("AT" instead of "at" also works)
[email protected] Hans Huber
hans huber @ hubercorp.com Hans Huber
hans huber @ hubercorp.com (Hans Huber)
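
To make the list above concrete, here is a minimal sketch of how such From-line variants could be reduced to a (name, address) pair. It is not Codeface's actual implementation: `parse_from_line` and the `_ADDR` pattern are hypothetical, and the sample address `huber@hubercorp.com` is only illustrative.

```python
import re

# Hypothetical pattern (an assumption, not taken from Codeface): a local part,
# then "@" or a whitespace-delimited "at"/"AT", then a dotted domain.
_ADDR = r"[\w.+-]+\s*(?:@|\s[aA][tT]\s)\s*[\w-]+(?:\.[\w-]+)+"


def parse_from_line(line):
    """Return (name, email) for a few of the From-line variants listed above.

    Returns (line, None) when no address-like token is found.
    """
    line = line.strip()
    match = re.search(_ADDR, line)
    if match is None:
        return (line, None)

    # Collapse " at ", " AT " and " @ " into a plain "@".
    email = re.sub(r"\s*(?:@|\s[aA][tT]\s)\s*", "@", match.group(0)).lower()

    # Whatever surrounds the address is treated as the display name;
    # angle brackets, parentheses and quotes are dropped.
    name = line[:match.start()] + " " + line[match.end():]
    name = re.sub(r'[<>()"]', " ", name)
    name = " ".join(name.split())
    return (name, email)


if __name__ == "__main__":
    for sample in (
        "Hans Huber <huber@hubercorp.com>",
        "Hans Huber huber at hubercorp.com",
        "huber @ hubercorp.com (Hans Huber)",
    ):
        print(parse_from_line(sample))
```

A sketch like this also doubles as a starting point for the unit tests requested below, since each supported format becomes one input/output pair.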

Furthermore, we have the via pattern (such as Hans Huber via corp-dev <[email protected]>) and likely others. Documentation on the treatment would help users (e.g., "The via pattern gets treated as follows: Remove the 'via ...' part and use the mail address as is." [I am not sure that this is actually the way it is handled, hence, this ticket...]).
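
For illustration only, the treatment guessed at above (drop the "via ..." part, keep the address as is) could look like the following sketch. Whether Codeface actually does this is exactly what this ticket asks to have documented; `strip_via` and the address `corp-dev@hubercorp.com` are hypothetical.

```python
import re
from email.utils import parseaddr


def strip_via(from_line):
    """Split a From line into (name, address) and drop a trailing 'via <list>'.

    Hypothetical helper: it implements the treatment guessed at in this issue,
    not a confirmed Codeface behavior.
    """
    name, addr = parseaddr(from_line)
    # Remove " via corp-dev" and anything after it from the display name.
    name = re.sub(r"\s+via\s+.*$", "", name)
    return (name, addr)


if __name__ == "__main__":
    print(strip_via("Hans Huber via corp-dev <corp-dev@hubercorp.com>"))
```

Note that with this treatment the address that remains is the one in the From line, which for a "via" rewrite is typically the list's address rather than the author's own.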

Things to do

Document the various formats (abominations or not) that are supported by Codeface.
Factor out the processing routines and make them independent of document processing.
Implement a unit test case for each of the supported formats.


wolfgangmauerer · 2015-11-17T18:02:54Z

On 17/11/2015 at 17:01, Andreas Ringlstetter wrote:

> Which of these edge cases are specific to transforming incompatible mbox
> formats,
> which are specific to the ML analysis,
> and which are possibly also affecting the parsing of Sign-Off patterns
> in the VCS analysis?

none of them is specific to anything -- it's just that the amount of
creativity that goes into coming up with bogus formats for email
addresses in mails considerably exceeds the amount found in
tags.

As I suggested in the corresponding thread, it is surely useful to
separate the cleanup operations from document processing and make
the routines generically available.

> There is also the |Huber, Hans| variation of names for all patterns.
> This is already handled in the idManager.py, but not in the ML analysis.

thanks for catching this -- I was discussing this with Mitchell in this
thread, and he's currently looking into what the majority of bogus
use-cases for this pattern is.
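
As a rough illustration of the |Huber, Hans| variation discussed above, a name in "Last, First" order could be flipped as in the sketch below. `normalize_name` is a hypothetical helper and does not reproduce what idManager.py does; it also ignores the bogus uses of the comma pattern that are still being investigated.

```python
def normalize_name(name):
    """Turn 'Huber, Hans' into 'Hans Huber'; leave other names untouched.

    Hypothetical helper: only a single comma with text on both sides is
    treated as 'Last, First'. Anything else is returned with whitespace
    collapsed, because the comma may mean something different there.
    """
    if name.count(",") == 1:
        last, first = (part.strip() for part in name.split(","))
        if first and last:
            return f"{first} {last}"
    return " ".join(name.split())


if __name__ == "__main__":
    print(normalize_name("Huber, Hans"))
    print(normalize_name("Hans Huber"))
```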
