Deduplication of e-mail is a touchy subject.
What are you going to deduplicate on?
In my experience, deduplication across multiple mailboxes using to, from, subject, date & time, and sometimes a unique ID works, but it is still fraught with issues.
For example, date & time - which one? What if the client software makes automagic timezone adjustments? To - is it the verified source, the SMTP "to" field? What about aliases, or "sent on behalf of"?
I experimented with using a percentage of the content as part of the deduplication, but a simple version change or an automatic conversion from HTML to rich text to plain text would mess the whole thing up. The process requires normalizing all messages to a single format, then deduplicating, then marking the matching originals.
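To make the normalize, deduplicate, mark sequence concrete, here is a rough Python sketch. It is only an illustration, not a vetted e-discovery workflow: the message layout (a dict with "from", "to", "subject", "date" and "body" keys) and the function names are my own assumptions, and it assumes the bodies have already been converted to plain text. Timestamps are converted to UTC so that client-side timezone adjustments do not split otherwise identical messages.

```python
import hashlib
import re
from collections import defaultdict
from datetime import timezone
from email.utils import parsedate_to_datetime


def normalize_text(text):
    # Collapse all runs of whitespace (including CR/LF) and lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_key(msg):
    # msg is a hypothetical dict: from, to (list), subject, date, body.
    sender = msg["from"].strip().lower()
    recipients = ",".join(sorted(a.strip().lower() for a in msg["to"]))
    when = parsedate_to_datetime(msg["date"])
    if when.tzinfo is None:
        # Assumption: treat a date without a usable offset as UTC.
        when = when.replace(tzinfo=timezone.utc)
    when = when.astimezone(timezone.utc).isoformat()
    fields = [sender, recipients, normalize_text(msg["subject"]),
              when, normalize_text(msg["body"])]
    # Join with a separator that should not occur in the fields.
    raw = "\x1f".join(fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def mark_duplicates(messages):
    # Return lists of indexes into `messages` that share a key,
    # so the matching originals can be marked afterwards.
    groups = defaultdict(list)
    for index, msg in enumerate(messages):
        groups[dedup_key(msg)].append(index)
    return [idxs for idxs in groups.values() if len(idxs) > 1]
```

Which fields go into the key, and how aliases are resolved before hashing, is exactly the kind of thing that still has to be agreed between the parties; the sketch just hashes whatever it is given.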
All deduplication methods should be agreed at the meet & confer - and you had better be there, or you will end up with a mess on your hands - like an agreement to deduplicate a single mailbox . . .