问题
I have an application where the user will be uploading an excel file(.xlsx or .csv) with more than 10,000 rows with a single column "partId" containing the values to look for in database
I will be reading the excel values and store it in list object and pass the list as parameter to the Spring Boot JPA repository find method that builds IN clause query internally:
// Read excel file
stream = new ByteArrayInputStream(file.getBytes());
wb = WorkbookFactory.create(stream);
org.apache.poi.ss.usermodel.Sheet sheet = wb.getSheetAt(wb.getActiveSheetIndex());
Iterator<Row> rowIterator = sheet.rowIterator();
while(rowIterator.hasNext()) {
Row row = rowIterator.next();
Cell cell = row.getCell(0);
System.out.println(cell.getStringCellValue());
vinList.add(cell.getStringCellValue());
}
//JPA repository method that I used
findByPartIdInAndSecondaryId(List<String> partIds);
I read in many articles and experienced the same in above case that using IN query is inefficient for huge list of data.
How can I optimize the above scenario or write a new optimized query?
Also, please let me know if there is optimized way of reading an excel file than the above mentioned code snippet
It would be much helpful!! Thanks in advance!
回答1:
If the list is truly huge, you will never be lightning fast.
I see several options:
Send a query with a large
IN
list, as you mention in your question.Construct a statement that is a join with a large
VALUES
clause:SELECT ... FROM mytable JOIN (VALUES (42), (101), (43), ...) AS tmp(col) ON mytable.id = tmp.col;
Create a temporary table with the values and join with that:
BEGIN; CREATE TEMP TABLE tmp(col bigint) ON COMMIT DROP;
Then either
COPY tmp FROM STDIN; -- if Spring supports COPY
or
INSERT INTO tmp VALUES (42), (101), (43), ...; -- if not
Then
ANALYZE tmp; -- for good statistics SELECT ... FROM mytable JOIN tmp ON mytable.id = tmp.col; COMMIT; -- drops the temporary table
Which of these is fastest is best determined by trial and error for your case; I don't think that it can be said that one of the methods will always beat the others.
Some considerations:
Solutions 1. and 2. may result in very large statements, while solution 3. can be split in smaller chunks.
Solution 3. will very likely be slower unless the list is truly large.
来源:https://stackoverflow.com/questions/64785993/postgresql-in-clause-optimization-for-more-than-3000-values