Apache POI: java.lang.IndexOutOfBoundsException thrown after closing the FileSystem

Discussion in 'Apache' started by Hieu Tran, Jul 9, 2017.

    Issue
    - I created a subclass of [org.apache.tika.parser.microsoft.ExcelExtractor], shown below, to extract the embedded documents of an Excel (*.xls) file.

    
        public class CustomExcelExtractor extends ExcelExtractor {
            private CustomAbstractPOIFSExtractor poi;
        
            public CustomExcelExtractor(ParseContext context, Metadata metadata, Path outputDir) {
                super(context, metadata);
                poi = new CustomAbstractPOIFSExtractor();
            }
        
            @Override
            public void parse(
                    DirectoryNode root, XHTMLContentHandler xhtml,
                    Locale locale) throws IOException, SAXException, TikaException {
        
                // Extract embedded documents
                for (Entry entry : root) {
                    if (entry.getName().startsWith("MBD")
                            && entry instanceof DirectoryEntry) {
                        try {
                            poi.extractEmbeddedOfficeDoc((DirectoryEntry) entry, null,
                                    xhtml, embeddedCnt);
                        } catch (TikaException e) {
                            // ignore parse errors from embedded documents
                        }
                    }
                }
            }
        
            private class CustomAbstractPOIFSExtractor {
                private TikaConfig config = TikaConfig.getDefaultConfig();
        
                /**
                 * Handle an office document that's embedded at the POIFS level
                 */
                protected void extractEmbeddedOfficeDoc(
                        DirectoryEntry dir, String resourceName,
                        XHTMLContentHandler xhtml, int embeddedCnt)
                        throws IOException, SAXException, TikaException {
                    if (dir.hasEntry("Package")) {
                        return;
                    }
        
                    // It's regular OLE2:
                    POIFSDocumentType type = POIFSDocumentType.detectType(dir);
        
                    try {
                        if (type == POIFSDocumentType.WORDDOCUMENT) {
                            FileOutputStream fos = new FileOutputStream(new File("test.doc"));
                            HWPFDocument document = new HWPFDocument((DirectoryNode) dir);
                            document.write(fos);
                            document.close();
                        } else if (type == POIFSDocumentType.POWERPOINT) {
                            FileOutputStream fos = new FileOutputStream(new File("test.ppt"));
                            HSLFSlideShowImpl document = new HSLFSlideShowImpl((DirectoryNode) dir);
                            document.write(fos);
                            document.close(); // After calling this method, I cannot continue extracting embedded documents
                        }
                    } catch (Exception ex) {
                        ex.printStackTrace();
                    }
                }
            }
        }
    
    
    - The [extractEmbeddedOfficeDoc] method writes each embedded document to a file.
    When I call [HSLFSlideShowImpl.close()] to close the "test.ppt" document, I can no longer continue the loop to extract the other embedded documents. The following exception is thrown:

    
    java.lang.IndexOutOfBoundsException: Block 1079 not found
    	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:486)
    	at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:169)
    	at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:142)
    	at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully(NDocumentInputStream.java:257)
    	at org.apache.poi.poifs.filesystem.NDocumentInputStream.readUShort(NDocumentInputStream.java:305)
    	at org.apache.poi.poifs.filesystem.DocumentInputStream.readUShort(DocumentInputStream.java:182)
    	at org.apache.poi.hssf.record.RecordInputStream$SimpleHeaderInput.readRecordSID(RecordInputStream.java:115)
    	at org.apache.poi.hssf.record.RecordInputStream.readNextSid(RecordInputStream.java:198)
    	at org.apache.poi.hssf.record.RecordInputStream.<init>(RecordInputStream.java:132)
    	at org.apache.poi.hssf.record.RecordInputStream.<init>(RecordInputStream.java:120)
    	at org.apache.poi.hssf.record.RecordFactoryInputStream.<init>(RecordFactoryInputStream.java:184)
    	at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:491)
    	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:348)
    	at extractor.CustomExcelExtractor$CustomAbstractPOIFSExtractor.extractEmbeddedOfficeDoc(CustomExcelExtractor.java:324)
    	at extractor.CustomExcelExtractor.parse(CustomExcelExtractor.java:119)
    	at extractor.CustomOfficeParser.parse(CustomOfficeParser.java:76)
    	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
    	at extractor.ExtractEmbeddedByTika.extract(ExtractEmbeddedByTika.java:37)
    	at main.ExcelEmbeddedtExtractor.main(ExcelEmbeddedtExtractor.java:61)
    Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 552960 in stream of length -1
    	at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read(ByteArrayBackedDataSource.java:42)
    	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:484)
    	... 18 more
    
    Additional Information
    - The exception does NOT occur after I call [HWPFDocument.close()].
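
    One possible reading of the trace, not confirmed here: the "stream of length -1" message would be consistent with the shared in-memory file system having been released when one embedded document was closed, so that later reads from sibling entries fail. The sketch below models only that pattern with plain JDK classes. SharedStore, DocumentView and SharedCloseDemo are hypothetical names, not POI or Tika APIs, and whether POI actually behaves this way (or why [HWPFDocument.close()] would differ) is an assumption.

```java
import java.io.Closeable;

/** Hypothetical stand-in for a shared in-memory container; NOT a POI class. */
class SharedStore implements Closeable {
    private byte[] buffer;
    private long size;

    SharedStore(byte[] data) {
        this.buffer = data;
        this.size = data.length;
    }

    /** Reads one byte, failing if the store was closed or the position is past the end. */
    byte read(int position) {
        if (buffer == null || position >= size) {
            throw new IndexOutOfBoundsException(
                    "Unable to read 1 byte from " + position + " in stream of length " + size);
        }
        return buffer[position];
    }

    /** Releases the data; every view built on this store becomes unreadable. */
    @Override
    public void close() {
        buffer = null;
        size = -1;
    }
}

/** Hypothetical document view whose close() also closes the store it was built on. */
class DocumentView implements Closeable {
    private final SharedStore store;
    private final int offset;

    DocumentView(SharedStore store, int offset) {
        this.store = store;
        this.offset = offset;
    }

    byte firstByte() {
        return store.read(offset);
    }

    @Override
    public void close() {
        store.close(); // assumption being illustrated: closing a view closes the shared store
    }
}

class SharedCloseDemo {
    /** Returns true if a sibling view can still be read after the first view was handled. */
    static boolean siblingReadableAfterClose(boolean closeFirstView) {
        SharedStore store = new SharedStore(new byte[] {10, 20, 30});
        DocumentView first = new DocumentView(store, 0);
        DocumentView second = new DocumentView(store, 1);
        first.firstByte();
        if (closeFirstView) {
            first.close();
        }
        try {
            second.firstByte();
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("first view closed:    " + siblingReadableAfterClose(true));
        System.out.println("first view left open: " + siblingReadableAfterClose(false));
    }
}
```

    If this is what happens, one thing to try would be closing only the FileOutputStream inside the loop and leaving the embedded document objects open until every entry has been processed.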

    I would like to fix this issue, but I don't know its root cause. Please help me!

    Thanks in advance.
     
    Hieu Tran, Jul 9, 2017