Extract string from HTML String

后端未结

关注

 5  683

i want to extract a number from a html string (i usually do not know the number).

The crucial part looks like this:


                      
              相关标签:


      
      
        
          5条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  生来不讨喜        
                
              
                            
                2021-01-26 05:02
              
            
            
                                                                       
in your view.py document you can try this:

import re
my_string="TOTAL : 286"
int(re.search(r'\d+', my_string).group())



  286

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  暗喜        
                
              
                            
                2021-01-26 05:06
              
            
            
                                                                       
You can use string partitioning to extract a "number" string from the whole HTML string like this (assuming HTML code is in html_string variable): 

num_string=html_string.partition("TOTAL:")[2].partition("<")[0]

there you get num_string with the number as a string, then simply convert it to an integer or whatever you want. 
Keep in mind that this will process the first occurence of anything that looks like "TOTAL: anything_goes_here <", so you want to make sure that this pattern is unique.  
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  醉酒成梦        
                
              
                            
                2021-01-26 05:17
              
            
            
                                                                       
If the string "TOTAL : number" is unique then use a regular expression to first search this substring and then extract the number from it.

import re

string = 'test test="3" test="search_summary_figure WHR WVM">TOTAL : 286</test>'

reg__expr = r'TOTAL\s:\s\d+'  # TOTAL<whitespace>:<whitespace><number>
# find the substring
result = re.findall(reg__expr, string)
if result:

   substring = result[0]

   reg__expr = r'\d+'  # <number>
   result = re.findall(reg__expr, substring)
   number = int(result[0])

   print(number)


You can test your own regular expressions here https://regex101.com/
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情书的邮戳        
                
              
                            
                2021-01-26 05:21
              
            
            
                                                                       
You can try the following like this below:

    line = "TOTAL : 286"
    if line.startswith('TOTAL : '):
        print(line[8:len(line)])


Output :

    286

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  眼角桃花        
                
              
                            
                2021-01-26 05:29
              
            
            
                                                                       
If your HTML String is this: 

html_string = """<test test="3" test="search_summary_figure WHR WVM">TOTAL : 286</test>
<tagend>"""


Try this:

int(html_string.split("</test>")[0].split(":")[-1].replace(" ", ""))

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复