how read excel file having more than 100000 row in java?

前端未结

关注

 2  1565

日久生厌

I am trying read excel file having more than 100000 rows in java using apache poi. but I am encountering few problems.

1-) It is taking 10 to 15 min in fetching dat

相关标签:

2条回答

长情又很酷

2021-01-15 17:16

Apache poi's default opening a workbook using WorkbookFactory.create or new XSSFWorkbook will always parsing the whole workbook inclusive all sheets. If the workbook contains much data this leads to high memory usage. Opening the workbook using a File instead of a InputStream decreases the memory usage. But this leads to other problems since the used file then cannot be overwritten, at least not when *.xlsx files.

There is XSSF and SAX (Event API) which get at the underlying XML data, and process using SAX.

But if we are already at this level where we get at the underlying XML data, and process it, then we could go one more step back too.

A *.xlsx file is a ZIP archive containing the data in XML files within a directory structure. So we can unzip the *.xlsx file and get the data from the XML files then.

There is /xl/sharedStrings.xml having all the string cell values in it. And there is /xl/workbook.xml describing the workbook structure. And there are /xl/worksheets/sheet1.xml, /xl/worksheets/sheet2.xml, ... which are storing the sheets' data. And there is /xl/styles.xml having the style settings for all cells in the sheets.

So all we need is working with ZIP file system using Java. This is supported using java.nio.file.FileSystems.

And we need a possibility for parsing XML. There Package javax.xml.stream is my favorite.

The following shows a working draft. It parses the /xl/sharedStrings.xml. Also it parses the /xl/styles.xml. But it gets only the number formats and the cell number format settings. The number format settings are essential for detecting date / time values. It then parses the /xl/worksheets/sheet1.xml which contains the data of the first sheet. For detecting whether a number format is a date format, and so the formatted cell contains a date / time value, one single apache poi class org.apache.poi.ss.usermodel.DateUtil is used. This is done to simplify the code. Of course even this class we could have coded ourself.

import java.nio.file.Paths;
import java.nio.file.Path;
import java.nio.file.Files;
import java.nio.file.FileSystems;
import java.nio.file.FileSystem;

import javax.xml.stream.*;
import javax.xml.stream.events.*;
import javax.xml.namespace.QName;

import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.util.HashMap;
import java.util.Date;

import org.apache.poi.ss.usermodel.DateUtil;

public class UnZipAndReadXLSXFileSystem {

 public static void main (String args[]) throws Exception {

  XMLEventReader reader = null;
  XMLEvent event = null;
  Attribute attribute = null;
  StartElement startElement = null; 
  EndElement endElement = null; 

  String characters = null;
  StringBuilder stringValue = new StringBuilder(); //for collecting the characters to complete values 

  List<String> sharedStrings = new ArrayList<String>(); //list of shared strings

  Map<String, String> numberFormats = new HashMap<String, String>(); //map of number formats
  List<String> cellNumberFormats = new ArrayList<String>(); //list of cell number formats

  Path source = Paths.get("ExcelExample.xlsx"); //path to the Excel file

  FileSystem fs = FileSystems.newFileSystem(source, null); //get filesystem of Excel file

  //get shared strings ==============================================================================
  Path sharedStringsTable = fs.getPath("/xl/sharedStrings.xml");
  reader = XMLInputFactory.newInstance().createXMLEventReader(Files.newInputStream(sharedStringsTable));
  boolean siFound = false;
  while (reader.hasNext()) {
   event = (XMLEvent)reader.next();
   if (event.isStartElement()){
    startElement = (StartElement)event;
    if (startElement.getName().getLocalPart().equalsIgnoreCase("si")) {
     //start element of shared string item
     siFound = true;
     stringValue = new StringBuilder();
    } 
   } else if (event.isCharacters() && siFound) {
    //chars of the shared string item
    characters = event.asCharacters().getData();
    stringValue.append(characters);
   } else if (event.isEndElement() ) {
    endElement = (EndElement)event;
    if (endElement.getName().getLocalPart().equalsIgnoreCase("si")) {
     //end element of shared string item
     siFound = false;
     sharedStrings.add(stringValue.toString());
    }
   }
  }
  reader.close();
System.out.println(sharedStrings);
  //shared strings ==================================================================================

  //get styles, number formats are essential for detecting date / time values =======================
  Path styles = fs.getPath("/xl/styles.xml");
  reader = XMLInputFactory.newInstance().createXMLEventReader(Files.newInputStream(styles));
  boolean cellXfsFound = false;
  while (reader.hasNext()) {
   event = (XMLEvent)reader.next();
   if (event.isStartElement()){
    startElement = (StartElement)event;
    if (startElement.getName().getLocalPart().equalsIgnoreCase("numFmt")) {
     //start element of number format
     attribute = startElement.getAttributeByName(new QName("numFmtId"));
     String numFmtId = attribute.getValue();
     attribute = startElement.getAttributeByName(new QName("formatCode"));
     numberFormats.put(numFmtId, ((attribute != null)?attribute.getValue():"null"));
    } else if (startElement.getName().getLocalPart().equalsIgnoreCase("cellXfs")) {
     //start element of cell format setting
     cellXfsFound = true;

    } else if (startElement.getName().getLocalPart().equalsIgnoreCase("xf") && cellXfsFound ) {
     //start element of format setting in cell format setting
     attribute = startElement.getAttributeByName(new QName("numFmtId"));
     cellNumberFormats.add(((attribute != null)?attribute.getValue():"null"));
    }
   } else if (event.isEndElement() ) {
    endElement = (EndElement)event;
    if (endElement.getName().getLocalPart().equalsIgnoreCase("cellXfs")) {
     //end element of cell format setting
     cellXfsFound = false;
    }
   }
  }
  reader.close();
System.out.println(numberFormats);
System.out.println(cellNumberFormats);
  //styles ==========================================================================================

  //get sheet data of first sheet ===================================================================
  Path sheet1 = fs.getPath("/xl/worksheets/sheet1.xml");
  reader = XMLInputFactory.newInstance().createXMLEventReader(Files.newInputStream(sheet1));
  boolean rowFound = false;
  boolean cellFound = false;
  boolean cellValueFound = false;
  boolean inlineStringFound = false; 
  String cellStyle = null;
  String cellType = null;
  while (reader.hasNext()) {
   event = (XMLEvent)reader.next();
   if (event.isStartElement()){
    startElement = (StartElement)event;
    if (startElement.getName().getLocalPart().equalsIgnoreCase("row")) {
     //start element of row
     rowFound = true;
System.out.print("<Row");

     attribute = startElement.getAttributeByName(new QName("r"));
System.out.print(" r=" + ((attribute != null)?attribute.getValue():"null"));
System.out.println(">");

    } else if (startElement.getName().getLocalPart().equalsIgnoreCase("c") && rowFound) {
     //start element of cell in row
     cellFound = true;
System.out.print("<Cell");

     attribute = startElement.getAttributeByName(new QName("r"));
System.out.print(" r=" + ((attribute != null)?attribute.getValue():"null"));

     attribute = startElement.getAttributeByName(new QName("t"));
System.out.print(" t=" + ((attribute != null)?attribute.getValue():"null"));
     cellType = ((attribute != null)?attribute.getValue():null);

     attribute = startElement.getAttributeByName(new QName("s"));
System.out.print(" s=" + ((attribute != null)?attribute.getValue():"null"));
     cellStyle = ((attribute != null)?attribute.getValue():null);

System.out.print(">");

    } else if (startElement.getName().getLocalPart().equalsIgnoreCase("v") && cellFound) {
     //start element of value in cell
     cellValueFound = true;
System.out.print("<V>");
     stringValue = new StringBuilder();

    } else if (startElement.getName().getLocalPart().equalsIgnoreCase("is") && cellFound) {
     //start element of inline string in cell
     inlineStringFound = true;
System.out.print("<Is>");
     stringValue = new StringBuilder();

    }
   } else if (event.isCharacters() && cellFound && (cellValueFound || inlineStringFound)) {
    //chars of the cell value or the inline string
    characters = event.asCharacters().getData();
    stringValue.append(characters);

   } else if (event.isEndElement()) {
    endElement = (EndElement)event;
    if (endElement.getName().getLocalPart().equalsIgnoreCase("row")) {
     //end element of row
     rowFound = false;
System.out.println("</Row>");

    } else if (endElement.getName().getLocalPart().equalsIgnoreCase("c")) {
     //end element of cell
     cellFound = false;
System.out.println("</Cell>");

    } else if (endElement.getName().getLocalPart().equalsIgnoreCase("v")) {
     //end element of value
     cellValueFound = false;

     String cellValue = stringValue.toString();

     if ("s".equals(cellType)) {
      cellValue = sharedStrings.get(Integer.valueOf(cellValue));
     }

     if (cellStyle != null) {
      int s = Integer.valueOf(cellStyle);
      String formatIndex = cellNumberFormats.get(s);
      String formatString = numberFormats.get(formatIndex);
      if (DateUtil.isADateFormat(Integer.valueOf(formatIndex), formatString)) {
       double dDate = Double.parseDouble(cellValue); 
       Date date = DateUtil.getJavaDate(dDate);
       cellValue = date.toString();
      }
     }

System.out.print(cellValue);
System.out.print("</V>");

    } else if (endElement.getName().getLocalPart().equalsIgnoreCase("is")) {
     //end element of inline string
     inlineStringFound = false;

     String cellValue = stringValue.toString();
System.out.print(cellValue);
System.out.print("</Is>");

    }
   }
  }
  reader.close();
  //sheet data ======================================================================================

  fs.close();

 }
}

0 讨论(0)

天命终不由人

2021-01-15 17:23
Apache POI is your friend - this is correct. But I faced with OutOfMemory when I read very big Excel with formulas.

My solution. If you want to only read data from XLSX file, and do not worry about formulas, then do read it as simple xml file and extract data from it (it us pretty easy).
1. Read your *.xlsx file as zip archive
2. Got to xl\worksheets folder and you find one xml file per one sheet
3. Read this file using any xml reader and retrieve data you need
0 讨论(0)
发布评论:

提交评论
- 加载中...