How to scrape not well structured html tables with Beautifulsoup in Python?

问题

This website https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html seems have a not well organized html table. the only identifier of table cells are width inside each tr tag. I want to scrape the information of all 60 pages. How I can find a way to scrape each row of tables appropriately? I know the size of header is 10 columns but since for some tr tags, I have 5 td tags and for some other I have more or less td tags, it's not easy to exactly scrape the data according to its column.

Here you can see a part of code which is extracting just data related to one row but not with keeping the null values for empty cells.

soup = BeautifulSoup(page.content, 'lxml') # Parse the HTML as a string
table = soup.find_all('table')[0] # Grab the first table
new_table = pd.DataFrame(columns=range(0,10), index = [0]) # I know the size
row_marker = 0
for row in table.find_all('tr'):
     column_marker = 0
     columns = row.find_all('td')
     for column in columns:
           new_table.iat[row_marker,column_marker] = column.get_text()
           column_marker += 1

It's the output which I have from this code (putting all values in a row without any gaps between them):

     0     1     2                  3     4   5    6    7    8    9  

0  62.00    PACL  Palaeocene Claystones  SWAP  NaN  NaN  NaN  NaN  NaN

but the real output should be something like this:

   0        1    2   3                        4   5    6    7    8    9  

0  62.00   NaN NaN  PACL  Palaeocene Claystones  NaN  NaN  NaN  NaN  SWAP

回答1:

I've used the method I mentioned in the comments (using width) to determine the null values in the data. Here's the Python code:

import requests                                                                                                                                                                                                                  
import bs4                                                                                                                                                                                                                       

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'                                                                                                           

response = requests.get(URL)                                                                                                                                                                                                     
soup = bs4.BeautifulSoup(response.text, 'lxml')                                                                                                                                                                                  

tables = soup.find_all('table')                                                                                                                                                                                                  
count = 0                                                                                                                                                                                                                        
cells_count = 0                                                                                                                                                                                                                  

for table in tables:                                                                                                                                                                                                             
        count +=1                                                                                                                                                                                                                
        if count >2:                                                                                                                                                                                                             
                row = table.tr                                                                                                                                                                                                   
                cells = row.find_all('td')                                                                                                                                                                                       
                print ''                                                                                                                                                                                                         
                x = 0                                                                                                                                                                                                            
                width_diff = 0                                                                                                                                                                                                   
                cell_text = []                                                                                                                                                                                                   
                for cell in cells:                                                                                                                                                                                               
                        width = cell.get('width')                                                                                                                                                                                
                        if int(width) < 10:                                                                                                                                                                                      
                                continue                                                                                                                                                                                         

                        if width_diff > 0:                                                                                                                                                                                       
                                cell_text.append('NaN ')                                                                                                                                                                         
                                if width_diff > 50:                                                                                                                                                                              
                                        x += 2                                                                                                                                                                                   
                                        cell_text.append('Nan ')                                                                                                                                                                 
                                else:
                                        x += 1
                                width_diff = 0

                        if x == 0 or x == 1 or x == 2 or x == 3 or x == 4 or x == 6:
                                width_range = [35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
                        elif x == 5:
                                width_range = [220,221,222,223,224,225,226,227,228,229,230]
                        elif x == 7:
                                width_range = [136]


                        if cell.text:
                                cell_text.append(cell.text.strip() + ' ')
                        else:
                                cell_text.append('NaN ')

                        if int(width) not in width_range:
                                width_diff = int(width) - width_range[-1]
                        x += 1
                        #print x,
                length = len(cell_text)
                for i in range(0, length):
                        print cell_text[i],
                diff = 9 - length
                if diff > 0:
                        for j in range(0, diff):
                                print 'NaN ',

As you can see, I've noticed that a certain width range is used in each column. By comparing each cell to its supposed width, we can determine how many spaces it takes. If the difference in width is too great, that means it takes the space of the next two cells.

It might need some refining, you'll need to test the script against all URLs to ensure that the data is absolutely clean.

Here's a sample output from running this code:

61.00  SED  TERT  WBDS  NaN  Woolwich Beds  GP  NaN  WLDB                                                                                                                                                                        
62.00  NaN  NaN  PACL  NaN  Palaeocene Claystones  NaN  Nan  SWAP                                                                                                                                                                
63.00  NaN  NaN  SMFC  NaN  Shallow Marine Facies  NaN  Nan  SONS                                                                                                                                                                
64.00  NaN  NaN  DMFC  NaN  Deep Marine Facies  NaN  NaN  NaN                                                                                                                                                                    
65.00  NaN  NaN  SLSY  NaN  Selsey Member  GN  NaN  WSXB                                                                                                                                                                         
66.00  NaN  NaN  MFM  NaN  Marsh Farm Member  NaN  NaN  NaN                                                                                                                                                                      
67.00  NaN  NaN  ERNM  NaN  Earnley Member  NaN  NaN  NaN                                                                                                                                                                        
68.00  NaN  NaN  WITT  NaN  Wittering Member  NaN  NaN  NaN                                                                                                                                                                      
69.00  NaN  NaN  WHI  NaN  Whitecliff Beds  GZ  NaN  NaN                                                                                                                                                                         
70.00  NaN  NaN  Nan  WFSM  NaN  Whitecliff Sand Member  NaN  Nan  GN                                                                                                                                                            
71.00  NaN  WESQ  NaN  Nan  Westray Group Equivalent  NL  GW  WESH                                                                                                                                                               
72.00  NaN  WESR  NaN  Nan  Westray Group  NM  GO  CNSB                                                                                                                                                                          
73.00  NaN  NaN  THEF  NaN  Thet Formation  NaN  Nan  MOFI                                                                                                                                                                       
74.00  NaN  NaN  SKAD  NaN  Skade Formation  NB  NaN  NONS                                                                                                                                                                       
75.00  NaN  NORD  NaN  Nan  Nordland  NP  Q  CNSB                                                                                                                                                                                
75.50  NaN  NaN  SWCH  NaN  Swatchway Formation  Q  NaN  MOFI                                                                                                                                                                    
75.60  NaN  NaN  CLPT  NaN  Coal Pit Formation  NaN  NaN  NaN                                                                                                                                                                    
75.70  NaN  NaN  LNGB  NaN  Ling Bank Formation  NaN  NaN  NaN                                                                                                                                                                   
76.00  NaN  NaN  SHKL  NaN  Shackleton Formation  GO  QP  ROCK                                                                                                                                                                   
77.00  NaN  NaN  UGNS  NaN  Upper Tertiary sands  NaN  NM  NONS                                                                                                                                                                  
78.00  NaN  NaN  CLSD  NaN  Claret Sand  NP  NaN  SVIG                                                                                                                                                                           
79.00  NaN  NaN  BLUE  NaN  Blue Sand  NaN  NaN  NaN                                                                                                                                                                             
80.00  NaN  NaN  ABGF  NaN  Aberdeen Ground Formation  QH  NaN  CNSB                                                                                                                                                             
81.00  NaN  NaN  NUGU  NaN  Upper Glauconitic Unit  NB  NA  MOFI                                                                                                                                                                 
82.00  NaN  NaN  POWD  NaN  Powder Sand  GN  NaN  SVIG                                                                                                                                                                           
83.00  NaN  NaN  BASD  NaN  Basin Sand  NaN  Nan  CNSB                                                                                                                                                                           
84.00  NaN  NaN  CRND  NaN  Crenulate Sand  NaN  NaN  NaN                                                                                                                                                                        
85.00  NaN  NaN  NORS  NaN  Nordland Sand  QP  NaN  SONS                                                                                                                                                                         
86.00  NaN  NaN  MIOS  NaN  Miocene Sand  NM  NaN  ESHB                                                                                                                                                                          
87.00  NaN  NaN  MIOL  NaN  Miocene Limestone  NaN  Nan  CNSB                                                                                                                                                                    
88.00  NaN  NaN  FLSF  NaN  Fladen Sand Formation  GP  GO  WYGG

Note: I don't know how the 0 in the first cell of your example is created, so I left it out of the answer. I don't know if it's supposed to be scraped as well, because I didn't find it anywhere.

回答2:

@samy Thank you very much for your cool method to scrape this website:

来源：https://stackoverflow.com/questions/55404147/how-to-scrape-not-well-structured-html-tables-with-beautifulsoup-in-python

标签

python

html

web-scraping

html-table

beautifulsoup