
IMiS ARChive advanced email parser configuration

Product: IMiS/ARChive
Release: Since 9.10.2010
Date: 10/12/2022

Case: IMiS ARChive uses Tika parsers to automatically parse archived emails and extract information from them. This article describes how to configure the IMiS ARChive email parsing plugin to achieve different parsing results.

Description:

The plugin configuration is an XML document built from the following supported XML tags:

Supported email parsers:

The "Debug" functionality dumps the raw metadata property values returned by Tika. The following properties are not supported directly by Tika and are therefore injected into the result set by the plugin implementation. To avoid property name collisions, GUID values are used as property names:

All examples in this article were made with Tika 1.28.4. The RFC822 parsing examples use emails from the TIKA-2478 ticket. The Microsoft Outlook item file format (msg) parsing examples use sample emails from the following Tika tickets:

Example 1: minimal configuration for the RFC822 parser implementation.

<Arguments>
    <Class>com.imis.imisarc.server.parser.impl.RFC822Parser</Class>
</Arguments>

Parsing "mixed-simple.eml" results in the following content types:

A manual inspection of "mixed-simple.eml" shows that the email has no actual content. It also reveals an additional "inline" reference to the "Mary with cooler.jpeg" attachment, which has no content either. The same can be seen in the server log when the "debug" functionality is enabled.
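
The manual inspection described above can be approximated outside the archive. The following is a minimal sketch that uses only the Python standard library "email" package; it is not part of IMiS ARChive or Tika. The function name "list_parts" and the demo message (including the file name "picture.jpeg") are hypothetical and only illustrate the idea of listing each body part with its content type, disposition and decoded size.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def list_parts(raw: bytes):
    """Return (content type, disposition, file name, payload size) per leaf part."""
    msg = message_from_bytes(raw, policy=policy.default)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # multipart containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        rows.append((
            part.get_content_type(),
            part.get_content_disposition(),  # "inline", "attachment" or None
            part.get_filename(),
            len(payload),
        ))
    return rows


# Hypothetical demo message: a text body plus an empty "inline" image part.
demo = EmailMessage()
demo["Subject"] = "demo"
demo.set_content("plain body\n")
demo.add_attachment(b"", maintype="image", subtype="jpeg",
                    filename="picture.jpeg", disposition="inline")

for row in list_parts(demo.as_bytes()):
    print(row)
```

For a real email, read the file in binary mode and pass its bytes to "list_parts". A part reported with size 0 corresponds to the "missing content" situation described above.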

Example 2: debug functionality enabled, together with the "inline" embedded resource type.

<Arguments>
    <Class>com.imis.imisarc.server.parser.impl.RFC822Parser</Class>
    <EnabledEmbeddedResourceType type="inline">true</EnabledEmbeddedResourceType>
    <Debug>true</Debug>
</Arguments>

Parsing "mixed-simple.eml" results in the following content types:

Parsing "mixed-with-pdf-inline.eml" results in the following content types:

Parsing "UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml" results in two "message/rfc822" content types. Inspecting the server log reveals that "UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml" is not a valid "message/rfc822" email. The log also shows that Tika recognizes the "message/rfc822" content as an embedded attachment, using the "org.apache.tika.parser.mbox.MboxParser" parser.

10/10/22 11:26:09.306 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: EXTRACTCONTENT RAW METADATA DUMP BEGIN
10/10/22 11:26:09.306 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: X-Parsed-By: 'org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.mbox.MboxParser'
10/10/22 11:26:09.306 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: Content-Encoding: 'windows-1252'
10/10/22 11:26:09.306 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: 0bb730be-9027-11ea-93b3-005056ab19ce: 'false'
10/10/22 11:26:09.306 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: Content-Type: 'application/mbox'
10/10/22 11:26:09.306 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: EXTRACTCONTENT RAW METADATA DUMP END
10/10/22 11:26:09.307 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: GENERICEMAILEXTRACTCONTENTRESULT METADATA DUMP BEGIN
10/10/22 11:26:09.307 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: EML_ATTACHMENTS: 'Content-Type: message/rfc822, filename: , B64 data length: 10792'
10/10/22 11:26:09.307 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: EML_SIGNED: 'false'
10/10/22 11:26:09.307 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: GENERICEMAILEXTRACTCONTENTRESULT METADATA DUMP END
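
When many emails are checked, reading such dumps by eye becomes tedious. The following Python sketch collects one dump section into a dictionary. It assumes the line layout seen in the excerpt above (a "<stdout>: " prefix followed by "key: 'value', 'value'") and that values contain no apostrophes; the function name "parse_dump" is hypothetical, and the real log format may differ in other server versions.

```python
import re

# Tail of a dump line as seen in the excerpt: <stdout>: key: 'v1', 'v2'
LINE = re.compile(r"<stdout>: (?P<key>[^:]+): (?P<values>'.*')\s*$")


def parse_dump(lines, section="EXTRACTCONTENT RAW METADATA DUMP"):
    """Collect {property name: [values]} between the section BEGIN and END markers."""
    result, inside = {}, False
    for line in lines:
        text = line.rstrip()
        if text.endswith(section + " BEGIN"):
            inside = True
        elif text.endswith(section + " END"):
            inside = False
        elif inside:
            match = LINE.search(text)
            if match:
                values = re.findall(r"'([^']*)'", match.group("values"))
                result[match.group("key")] = values
    return result


prefix = "10/10/22 11:26:09.306 [iarcd:87845:7f86aad62640] INFO[6] <stdout>: "
sample = [
    prefix + "EXTRACTCONTENT RAW METADATA DUMP BEGIN",
    prefix + "X-Parsed-By: 'org.apache.tika.parser.DefaultParser', "
             "'org.apache.tika.parser.mbox.MboxParser'",
    prefix + "Content-Type: 'application/mbox'",
    prefix + "EXTRACTCONTENT RAW METADATA DUMP END",
]
dump = parse_dump(sample)
print(dump["Content-Type"])  # the detected type that explains the failed parsing
```

Passing "GENERICEMAILEXTRACTCONTENTRESULT METADATA DUMP" as the "section" argument reads the second dump in the same way.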

Because "MboxParser" is not supported for parsing "message/rfc822", automatic email parsing fails. Since MboxParser recognized the "message/rfc822" content as an attachment, the attachment can be downloaded manually (in this and the following examples it is named "UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT-rfc822.eml") and uploaded again for automatic parsing.

Parsing "UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT-rfc822.eml" results in the following content types:

Example 3: the Tika RFC822Parser is configured to handle all body parts as embedded objects (see TIKA-2478).

<Arguments>
    <Class>com.imis.imisarc.server.parser.impl.RFC822Parser</Class>
    <EnabledEmbeddedResourceType type="inline">true</EnabledEmbeddedResourceType>
    <Debug>true</Debug>
    <AutodetectParserTikaConfig>
        <properties>
            <parsers>
                <parser class="org.apache.tika.parser.DefaultParser">
                    <parser-exclude class="org.apache.tika.parser.mail.RFC822Parser"/>
                </parser>
                <parser class="org.apache.tika.parser.mail.RFC822Parser">
                    <params>
                        <param name="extractAllAlternatives" type="bool">true</param>
                    </params>
                </parser>
            </parsers>
        </properties>
    </AutodetectParserTikaConfig>
</Arguments>
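
Nested configurations like this one are easy to mistype. The following Python sketch checks that a configuration is well-formed XML and summarizes which Tika parsers are configured, with which parameters and exclusions. It relies only on the tag names visible in the example above; the function name "tika_parser_params" is hypothetical, and the sketch does not validate the configuration against the plugin or Tika itself.

```python
import xml.etree.ElementTree as ET

PARSERS_PATH = "./AutodetectParserTikaConfig/properties/parsers/parser"


def tika_parser_params(config_xml: str):
    """Map each configured Tika parser class to its exclusions and parameters."""
    root = ET.fromstring(config_xml)  # raises ET.ParseError if not well-formed
    summary = {}
    for parser in root.iterfind(PARSERS_PATH):
        summary[parser.get("class")] = {
            "excludes": [e.get("class") for e in parser.findall("parser-exclude")],
            "params": {p.get("name"): p.text for p in parser.findall("params/param")},
        }
    return summary


config = """
<Arguments>
    <Class>com.imis.imisarc.server.parser.impl.RFC822Parser</Class>
    <AutodetectParserTikaConfig>
        <properties>
            <parsers>
                <parser class="org.apache.tika.parser.DefaultParser">
                    <parser-exclude class="org.apache.tika.parser.mail.RFC822Parser"/>
                </parser>
                <parser class="org.apache.tika.parser.mail.RFC822Parser">
                    <params>
                        <param name="extractAllAlternatives" type="bool">true</param>
                    </params>
                </parser>
            </parsers>
        </properties>
    </AutodetectParserTikaConfig>
</Arguments>
"""

for parser_class, details in tika_parser_params(config).items():
    print(parser_class, details)
```

In the examples of this article, a parser that is configured explicitly is also excluded from "DefaultParser"; the "excludes" list in the summary makes it easy to confirm that both entries are present.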

Parsing "mixed-simple.eml" results in the following content types:

Parsing "mixed-with-pdf-inline.eml" results in the following content types:

Parsing "UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT-rfc822.eml" results in the following content types:

Example 4: the Example 3 configuration is extended with an embedded metadata filter that allows extraction of "text/html" content only.

<Arguments>
    <Class>com.imis.imisarc.server.parser.impl.RFC822Parser</Class>
    <EnabledEmbeddedResourceType type="inline">true</EnabledEmbeddedResourceType>
    <Debug>true</Debug>
    <AutodetectParserTikaConfig>
        <properties>
            <parsers>
                <parser class="org.apache.tika.parser.DefaultParser">
                    <parser-exclude class="org.apache.tika.parser.mail.RFC822Parser"/>
                </parser>
                <parser class="org.apache.tika.parser.mail.RFC822Parser">
                    <params>
                        <param name="extractAllAlternatives" type="bool">true</param>
                    </params>
                </parser>
            </parsers>
        </properties>
    </AutodetectParserTikaConfig>
    <EmbeddedMetadataFilter key="Content-Type">text/html</EmbeddedMetadataFilter>
</Arguments>

Parsing "mixed-simple.eml" results in the following content types:

Parsing "mixed-with-pdf-inline.eml" results in the following content types:

Parsing "UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT-rfc822.eml" results in the original email content (content type "message/rfc822") and no additional content.

Example 5: minimal OutlookMsgEmailParser configuration with "debug" logging enabled.

<Arguments>
    <Class>com.imis.imisarc.server.parser.impl.OutlookMsgEmailParser</Class>
    <Debug>true</Debug>
</Arguments>

Parsing "test-outlook.msg" and "MIME.msg" results in the original email content with content type "application/vnd.ms-outlook". Parsing "Fw Anime User Analysis.msg" results in the following content types:

Parsing "CFIGD37LXW6YJLE66KCQUV55SDU5YO5S.msg" results in the following content types:

Example 6: extraction of all non-null body content, with all supported embedded resource types enabled.

<Arguments>
    <Class>com.imis.imisarc.server.parser.impl.OutlookMsgEmailParser</Class>
    <EnabledEmbeddedResourceType type="inline">true</EnabledEmbeddedResourceType>
    <EnabledEmbeddedResourceType type="attachment">true</EnabledEmbeddedResourceType>
    <EnabledEmbeddedResourceType type="macro">true</EnabledEmbeddedResourceType>
    <Debug>true</Debug>
    <AutodetectParserTikaConfig>
        <properties>
            <parsers>
                <parser class="org.apache.tika.parser.DefaultParser">
                    <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
                </parser>
                <parser class="org.apache.tika.parser.microsoft.OfficeParser">
                    <params>
                        <param name="extractAllAlternativesFromMSG" type="bool">true</param>
                    </params>
                </parser>
            </parsers>
        </properties>
    </AutodetectParserTikaConfig>
</Arguments>

Parsing "test-outlook.msg" and "MIME.msg" results in the following content types:

Parsing "Fw Anime User Analysis.msg" results in the following content types:

Parsing "CFIGD37LXW6YJLE66KCQUV55SDU5YO5S.msg" results in the following content types:

Related Documents:

https://tika.apache.org/1.28.4/configuring.html
https://tika.apache.org/1.28.4/api/org/apache/tika/parser/mail/RFC822Parser.html
https://issues.apache.org/jira/browse/TIKA-2478
https://tika.apache.org/1.28.4/api/org/apache/tika/parser/microsoft/OfficeParser.html
https://tika.apache.org/1.28.4/api/org/apache/tika/parser/microsoft/AbstractOfficeParser.html
https://datatracker.ietf.org/doc/html/rfc2156
https://datatracker.ietf.org/doc/html/rfc2822
https://datatracker.ietf.org/doc/html/rfc6758
https://docs.microsoft.com/en-us/openspecs/exchange_server_protocols/ms-oxmsg/b046868c-9fbf-41ae-9ffb-8de2bd4eec82
https://issues.apache.org/jira/browse/TIKA-54
https://issues.apache.org/jira/browse/TIKA-2694
https://issues.apache.org/jira/browse/TIKA-197
https://issues.apache.org/jira/browse/TIKA-2101
