How do i extract specific lines of data from a huge Excel sheet using Python?

前端 未结 4 1343
忘掉有多难
忘掉有多难 2021-01-17 08:26

I need to get specific lines of data that have certain key words in them (names) and write them to another file. The starting file is a 1.5 GB Excel file. I can\'t just open

相关标签:
4条回答
  • 2021-01-17 09:10

    I haven't used it, but xlrd looks like it does a good job reading Excel data.

    0 讨论(0)
  • 2021-01-17 09:16

    I'm the author and maintainer of xlrd. Please edit your question to provide answers to the following questions. [Such stuff in SO comments is VERY hard to read]

    1. How big is the file in MB? ["Huge" is not a useful answer]

    2. What software created the file?

    3. How much memory do you have on your computer?

    4. Exactly what happens when you try to open the file using Excel? Please explain "I can open it partially".

    5. Exactly what is the error message that you get when you try to open "C:\bigfile.xls" with your script using xlrd.open_workbook? Include the script that you ran, the full traceback, and the error message

    6. What operating system, what version of Python, what version of xlrd?

    7. Do you know how many worksheets there are in the file?

    0 讨论(0)
  • 2021-01-17 09:25

    Your problem is that you are using Excel 2003 .. You need to use a more recent version to be able to read this file. 2003 will not open files bigger than 1M rows.

    0 讨论(0)
  • 2021-01-17 09:26

    It sounds to me like you have a spreadsheet that was created using Excel 2007 and you have only Excel 2003.

    Excel 2007 can create worksheets with 1,048,576 rows by 16,384 columns while Excel 2003 can only work with 65,536 rows by 256 columns. Hence the reason you can't open the entire worksheet in Excel.

    If the workbook is just bigger in dimension then xlrd should work for reading the file, but if the file is actually bigger than the amount of memory you have in your computer (which I don't think is the case here since you can open the file with EditPad lite) then you would have to find an alternate method because xlrd reads the entire workbook into memory.

    Assuming the first case:

    import xlrd
    
    wb_path = r'c:\bigfile.xls'
    output_path = r'c:\output.txt'
    
    wb = xlrd.open(wb_path)
    ws = wb.sheets()[0]  # assuming you want to work with the first sheet in the workbook
    
    with open(output_path, 'w') as output_file:
        for i in xrange(ws.nrows):
            row = [cell.value for cell in ws.row(i)]
    
            # ... replace the following if statement with your own conditions ...
            if row[0] == u'interesting':
                output_file.write('\t'.join(row) + '\r\n')
    

    This will give you a tab-delimited output file that should open in Excel.

    Edit:

    Based on your answer to John Machin's question 5, make sure there is a file called 'bigfile.xls' located in the root of your C drive. If the file isn't there, change the wb_path to the correct location of the file you want to open.

    0 讨论(0)
提交回复
热议问题