Question
I am trying to scrape a website that contains many pages. With Selenium, I open each page in a second tab and run my function to get the data. After that I close the tab, open the next one, and continue extracting until the last page. My problem is that when I save the data to the Excel file, only the information extracted from the last page (tab) is saved. Can you help me find my error?
def scrap_client_infos(linksss):
    tds = []  # tds is the list that contains the data
    reader = pd.read_excel(r'C:\python projects\mada\db.xlsx')
    writer = pd.ExcelWriter(r'C:\python projects\mada\db.xlsx', engine='openpyxl')
    html = urlopen(linksss)
    soup = BeautifulSoup.BeautifulSoup(html, 'html.parser')
    table = soup.find('table', attrs={'class': 'r2'})
    # scrape every tr that contains text
    for tr in table.find_all('tr'):
        elem = tr.find('td').get_text()
        elem = elem.replace('\t', '')
        elem = elem.replace('\n', '')
        elem = elem.replace('\r', '')
        tds.append(elem)
    print(tds)
    # select the data that I need to save in Excel
    raw_data = {'sub_num': [tds[1]], 'id': [tds[0]], 'nationality': [tds[2]], 'country': [tds[3]], 'city': [tds[3]], 'age': [tds[7]], 'marital_status': [tds[6]], 'wayy': [tds[5]]}
    df = pd.DataFrame(raw_data, columns=['sub_num', 'id', 'nationality', 'country', 'city', 'age', 'marital_status', 'wayy'])
    # save the data in the Excel file
    df.to_excel(writer, sheet_name='Sheet1', startrow=len(reader), header=False)
    writer.save()
    return soup
P.S.: I always want to append to the Excel file after the last filled row.
Answer 1:
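To append data to an existing Excel file with pandas, load the existing workbook with openpyxl and attach it, together with its worksheets, to the writer object. Otherwise the writer starts from an empty workbook and overwrites the file on save.
Update the last section of your code: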
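    # save the data in the Excel file
    from openpyxl import load_workbook
    path = r'C:\python projects\mada\db.xlsx'  # same file the writer was opened on
    book = load_workbook(path)
    # startrow is 0-based while max_row is 1-based, so max_row is the first empty row
    startrw = book['Sheet1'].max_row
    writer.book = book
    writer.sheets = dict((ws.title, ws) for ws in book.worksheets)  # prevent overwrite
    df.to_excel(writer, sheet_name='Sheet1', startrow=startrw, header=False)
    writer.save()
    return soup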
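As far as I know, newer pandas releases (2.0 and later) have removed `writer.save()` and no longer allow assigning `writer.book`, so the snippet above may fail there. A minimal sketch of the same append done with openpyxl alone is below. The helper name `append_row` and the `COLUMNS` constant are hypothetical, introduced only for illustration. The header row is written only when the file does not exist yet, which is an assumption about the desired layout.

```python
from pathlib import Path

from openpyxl import Workbook, load_workbook

# Column names taken from the question; COLUMNS itself is a hypothetical name.
COLUMNS = ['sub_num', 'id', 'nationality', 'country',
           'city', 'age', 'marital_status', 'wayy']


def append_row(path, row, sheet_name='Sheet1'):
    """Append one row below the last used row of `sheet_name` and save.

    Hypothetical helper: creates the workbook (with a header row) when
    `path` does not exist yet, otherwise reuses the existing file.
    Returns the number of the last used row after the append.
    """
    path = Path(path)
    if path.exists():
        book = load_workbook(path)
        if sheet_name in book.sheetnames:
            sheet = book[sheet_name]
        else:
            sheet = book.create_sheet(sheet_name)
    else:
        book = Workbook()
        sheet = book.active
        sheet.title = sheet_name
        sheet.append(COLUMNS)  # header row for a brand-new file
    # Worksheet.append writes to the first row after the last used one.
    sheet.append(list(row))
    book.save(path)
    return sheet.max_row
```

In the scraping function this could replace the DataFrame and writer code, for example `append_row(path, [tds[1], tds[0], tds[2], tds[3], tds[3], tds[7], tds[6], tds[5]])`, called once per page. Because the workbook is reloaded on every call, each page lands on a new row instead of replacing the previous one.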
Source: https://stackoverflow.com/questions/63927426/how-to-fill-excel-file-from-selenium-scraping-in-loop-with-python