Index email attachment error

While trying to index an email attachment, the following error is thrown from Solr:

Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 1235904, but 1000000 is the maximum for this record type.
If the file is not corrupt, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
        at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:568)
        at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:175)
        at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:547)
        at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:49)
        at org.apache.tika.parser.microsoft.OutlookExtractor.handleBodyChunks(OutlookExtractor.java:328)
        at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:247)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 14 common frames omitted

The email attachment is 2.4 MB. Is there any way to configure the limit mentioned in the error?

Which version of XWiki is that? It’s possible that the bundled Tika or POI version has a bug, as the message suggests, and that an upgrade will fix it.

I guess you cannot share that file so we can debug why it fails (or, more likely, test it with the latest version and send it to Tika if it still does not work)?

The latest XWiki zip version, 11.9. Unfortunately, I can’t provide the email. I tried a 9.5 MB email and that worked. I tried uploading the 2.4 MB attachment again and got the same error.
Is there any way to get a list of attachments that have NOT been indexed successfully?

Not sure what Tika is, but let me know if there is any more info I can provide.

OK, so there is no newer version of Tika or POI to try then.

Tika (https://tika.apache.org/) is the library we use to parse attachments and, according to the stack trace, it is using POI (https://poi.apache.org/) to parse emails.
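
As for the limit itself, the error message points at IOUtils.setByteArrayMaxOverride() as a temporary workaround. Below is a minimal sketch of what calling it could look like, not a confirmed XWiki configuration option. It assumes the POI version bundled with your XWiki actually has that method, and that you have some way to run code inside the same JVM before the attachment is indexed. The class name PoiLimitOverride, the helper raiseByteArrayLimit and the 10,000,000 value are made up for illustration; the only real requirement from the error is a value above 1235904. The call goes through reflection so the snippet still compiles and runs (returning false) when POI is not on the classpath.

```java
public class PoiLimitOverride {

    /**
     * Tries to raise POI's global byte array allocation limit.
     * Equivalent to calling org.apache.poi.util.IOUtils.setByteArrayMaxOverride(maxBytes)
     * directly when POI is on the compile classpath.
     *
     * @param maxBytes the new limit in bytes (hypothetical value, pick one above the failing size)
     * @return true if the override was applied, false if POI or the method is not available
     */
    static boolean raiseByteArrayLimit(int maxBytes) {
        try {
            // Look up the POI class at runtime instead of importing it.
            Class<?> ioUtils = Class.forName("org.apache.poi.util.IOUtils");
            // Static method, so the target object passed to invoke() is null.
            ioUtils.getMethod("setByteArrayMaxOverride", int.class).invoke(null, maxBytes);
            return true;
        } catch (ReflectiveOperationException e) {
            // POI missing, or this POI version does not have the method.
            return false;
        }
    }

    public static void main(String[] args) {
        // The failing record needed 1235904 bytes; 10,000,000 is an arbitrary higher value.
        boolean applied = raiseByteArrayLimit(10_000_000);
        System.out.println("Override applied: " + applied);
    }
}
```

Keep in mind that this override is a static, JVM-wide setting, so it affects every POI record type rather than just this one, and it weakens the protection against corrupt files allocating large arrays. If the file really is valid, reporting it to POI as the message asks is the proper long-term fix.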

Do both emails have a similar format?

Not exactly. The one that is failing has a huge table in it. I was trying to narrow it down.

Anyway, one other thing I noticed is that the search results don’t always show the context. For example, an email attachment contains the text “please check and confirm”. If I search for “confirm”, the match shows up highlighted. If I search for “check and confirm”, it shows the page match without any context from the attachment.