Question
This website
https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html
seems to have a poorly organized HTML table. The only thing that identifies a table cell is its width attribute inside each tr tag. I want to scrape the information from all 60 pages. How can I scrape each row of the tables correctly? I know the header has 10 columns, but some tr tags contain 5 td tags and others contain more or fewer, so it is not easy to assign each value to the right column.
Here is the part of my code that extracts the data for a single row, but it does not keep null values for the empty cells.
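import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html')
soup = BeautifulSoup(page.content, 'lxml')  # Parse the HTML
table = soup.find_all('table')[0]  # Grab the first table
new_table = pd.DataFrame(columns=range(0, 10), index=[0])  # The header is known to have 10 columns
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        # Cells are written left to right, so empty columns are not preserved
        new_table.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1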
This is the output I get from this code (all values are packed into the row from the left, with no gaps for the empty cells):
0 1 2 3 4 5 6 7 8 9
0 62.00 PACL Palaeocene Claystones SWAP NaN NaN NaN NaN NaN
But the expected output should look something like this:
0 1 2 3 4 5 6 7 8 9
0 62.00 NaN NaN PACL Palaeocene Claystones NaN NaN NaN NaN SWAP
Answer 1:
I've used the method I mentioned in the comments (using width) to determine the null values in the data. Here's the Python code:
import requests
import bs4
URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'
response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text, 'lxml')
tables = soup.find_all('table')
count = 0
for table in tables:
    count += 1
    if count > 2:  # skip the first two tables; use the first row of each remaining table
        row = table.tr
        cells = row.find_all('td')
        print()  # start a new output line for this row
        x = 0  # index of the column the current cell is expected to fill
        width_diff = 0  # extra width carried over from the previous cell
        cell_text = []
        for cell in cells:
            width = cell.get('width')
            if int(width) < 10:  # ignore very narrow cells
                continue
            if width_diff > 0:
                # the previous cell was wider than its column, so it covered empty columns
                cell_text.append('NaN ')
                if width_diff > 50:
                    x += 2
                    # spelled 'Nan' here, which is why the sample output mixes 'Nan' and 'NaN'
                    cell_text.append('Nan ')
                else:
                    x += 1
                width_diff = 0
            # widths observed for each column position
            if x in (0, 1, 2, 3, 4, 6):
                width_range = [35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
            elif x == 5:
                width_range = [220,221,222,223,224,225,226,227,228,229,230]
            elif x == 7:
                width_range = [136]
            if cell.text:
                cell_text.append(cell.text.strip() + ' ')
            else:
                cell_text.append('NaN ')
            if int(width) not in width_range:
                width_diff = int(width) - width_range[-1]
            x += 1
        length = len(cell_text)
        for i in range(0, length):
            print(cell_text[i], end=' ')
        # pad short rows with NaN up to 9 values
        diff = 9 - length
        if diff > 0:
            for j in range(0, diff):
                print('NaN ', end=' ')
As you can see, I've noticed that each column uses a certain range of widths. By comparing each cell's width to the width expected for its column, we can determine how many columns the cell spans. If the difference in width is too large (more than 50 in the code), the cell also takes up the space of the next two columns.
It might need some refining. You'll need to test the script against all of the URLs to make sure that the data is clean.
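The same idea can be expressed without the per-column branches. The sketch below is a variation on my script, not the same logic: it tracks the running horizontal offset of each cell and assigns non-empty cells to the column whose left edge is nearest. The function name `place_by_offset`, the `COLUMN_WIDTHS` values, and the sample cell widths are all made up for illustration; the real widths would have to be read from the page.

```python
from itertools import accumulate

# Hypothetical column widths, for illustration only; measure the real ones on the page.
COLUMN_WIDTHS = [40, 40, 40, 40, 225, 40, 40, 136, 40, 40]

def place_by_offset(cells, column_widths=COLUMN_WIDTHS):
    """Map (width, text) pairs onto fixed columns using their horizontal offset."""
    # Left edge of each column: 0, w0, w0 + w1, ...
    starts = [0] + list(accumulate(column_widths))[:-1]
    row = [None] * len(column_widths)
    offset = 0
    for width, text in cells:
        text = text.strip()
        if text:
            # Pick the column whose left edge is closest to where this cell begins,
            # which tolerates widths that are slightly off.
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - offset))
            row[col] = text
        offset += width  # empty cells still take up horizontal space
    return row

# Made-up widths for a row shaped like the 62.00 row in the question
cells = [(40, '62.00'), (80, ''), (40, 'PACL'),
         (225, 'Palaeocene Claystones'), (256, ''), (40, 'SWAP')]
print(place_by_offset(cells))
```

With these assumed widths, the row comes out with 62.00 in column 0, PACL in column 3, Palaeocene Claystones in column 4, SWAP in column 9, and None everywhere else, which is the layout asked for in the question. A list with None placeholders like this can be passed to pandas as one row.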
Here's a sample output from running my width-range script on the URL above:
61.00 SED TERT WBDS NaN Woolwich Beds GP NaN WLDB
62.00 NaN NaN PACL NaN Palaeocene Claystones NaN Nan SWAP
63.00 NaN NaN SMFC NaN Shallow Marine Facies NaN Nan SONS
64.00 NaN NaN DMFC NaN Deep Marine Facies NaN NaN NaN
65.00 NaN NaN SLSY NaN Selsey Member GN NaN WSXB
66.00 NaN NaN MFM NaN Marsh Farm Member NaN NaN NaN
67.00 NaN NaN ERNM NaN Earnley Member NaN NaN NaN
68.00 NaN NaN WITT NaN Wittering Member NaN NaN NaN
69.00 NaN NaN WHI NaN Whitecliff Beds GZ NaN NaN
70.00 NaN NaN Nan WFSM NaN Whitecliff Sand Member NaN Nan GN
71.00 NaN WESQ NaN Nan Westray Group Equivalent NL GW WESH
72.00 NaN WESR NaN Nan Westray Group NM GO CNSB
73.00 NaN NaN THEF NaN Thet Formation NaN Nan MOFI
74.00 NaN NaN SKAD NaN Skade Formation NB NaN NONS
75.00 NaN NORD NaN Nan Nordland NP Q CNSB
75.50 NaN NaN SWCH NaN Swatchway Formation Q NaN MOFI
75.60 NaN NaN CLPT NaN Coal Pit Formation NaN NaN NaN
75.70 NaN NaN LNGB NaN Ling Bank Formation NaN NaN NaN
76.00 NaN NaN SHKL NaN Shackleton Formation GO QP ROCK
77.00 NaN NaN UGNS NaN Upper Tertiary sands NaN NM NONS
78.00 NaN NaN CLSD NaN Claret Sand NP NaN SVIG
79.00 NaN NaN BLUE NaN Blue Sand NaN NaN NaN
80.00 NaN NaN ABGF NaN Aberdeen Ground Formation QH NaN CNSB
81.00 NaN NaN NUGU NaN Upper Glauconitic Unit NB NA MOFI
82.00 NaN NaN POWD NaN Powder Sand GN NaN SVIG
83.00 NaN NaN BASD NaN Basin Sand NaN Nan CNSB
84.00 NaN NaN CRND NaN Crenulate Sand NaN NaN NaN
85.00 NaN NaN NORS NaN Nordland Sand QP NaN SONS
86.00 NaN NaN MIOS NaN Miocene Sand NM NaN ESHB
87.00 NaN NaN MIOL NaN Miocene Limestone NaN Nan CNSB
88.00 NaN NaN FLSF NaN Fladen Sand Formation GP GO WYGG
Note: I don't know how the 0 in the first cell of your example is created, so I left it out of the answer. I don't know if it's supposed to be scraped as well, because I didn't find it anywhere.
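For checking the script against all 60 pages mentioned in the question, one possible way to build the list of URLs is sketched below. It assumes that every page follows the same pattern as the Page2 URL, with only the number changing, and that the pages are numbered 1 to 60. I have only looked at page 2, so please verify this first. `BASE_URL` and `page_urls` are names I made up.

```python
# Assumption: every page follows the Page2 URL pattern, with only the number changing.
BASE_URL = ('https://itportal.ogauthority.co.uk/information/well_data/'
            'lithostratigraphy_hierarchy/rptLithoStrat_1Page{}.html')

def page_urls(first=1, last=60):
    """Return the URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(n) for n in range(first, last + 1)]

# Preview the first few URLs before fetching anything
for url in page_urls(1, 3):
    print(url)
```

Each URL can then be fetched with requests.get and parsed in the same way as the single page above.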
Answer 2:
@samy Thank you very much for your cool method to scrape this website:
Source: https://stackoverflow.com/questions/55404147/how-to-scrape-not-well-structured-html-tables-with-beautifulsoup-in-python