How to extract only numbers from an input file (numbers can be floats or ints)


I want to extract all numbers (integers and floats) from a file, excluding all special symbols and letters. The numbers can appear at any position in the text.

import re
file = open('input_f


                      


      
      
        
3 Answers

        
                         				            
            
           
            
                              
                
              
              
                
滥情空心 (2021-01-29 14:26)
              
            
            
                                                                       
Without further clarification of the input format, you can try the following pattern on each line of the file.

re.findall(r'[+-]?\d+(?:\.\d+)?', line)
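As a minimal sketch of how that pattern might be applied line by line and converted to numeric values: the helper name extract_numbers and the sample lines below are assumptions for illustration, not part of the original answer.

```python
import re

# Optional sign, digits, optional fractional part (same pattern as above).
NUMBER_RE = re.compile(r'[+-]?\d+(?:\.\d+)?')


def extract_numbers(lines):
    """Return every number found in an iterable of lines as int or float."""
    numbers = []
    for line in lines:
        for token in NUMBER_RE.findall(line):
            # Tokens containing a dot become floats, everything else ints.
            numbers.append(float(token) if '.' in token else int(token))
    return numbers


# Hypothetical sample input; an open file object can be passed the same way,
# because iterating over a file yields its lines.
sample = ["abc 12 def -3.5", "x=7, y=+0.25%"]
print(extract_numbers(sample))  # [12, -3.5, 7, 0.25]
```

Note that a hyphen directly in front of a digit is always read as a minus sign, so text such as "a-3" would yield -3 with this pattern.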

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
醉话见心 (2021-01-29 14:30)
              
            
            
                                                                       
re.findall(r"[+-]?\d+\.?\d*", some_text)


I think this is the minimum you need:

[+-]? zero or one of either + or - (i.e. an optional sign)

\d+ one or more digits

\.? an optional decimal point

\d* zero or more additional digits
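The breakdown above can also be written as a commented pattern using re.VERBOSE. This is only a sketch, and the sample string is an assumption. Because the decimal point and the trailing digits are each optional on their own, a number followed by a sentence-ending period (such as "5.") is matched with the period included.

```python
import re

pattern = re.compile(r"""
    [+-]?   # optional sign
    \d+     # one or more digits
    \.?     # optional decimal point
    \d*     # zero or more digits after the point
""", re.VERBOSE)

# Hypothetical sample text.
text = "a -2 b 3.14 c 5. d"
print(pattern.findall(text))  # ['-2', '3.14', '5.']
```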
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
渐次进展 (2021-01-29 14:37)
              
            
            
                                                                       
Maybe this will help.

The string here stands in for a line of your file. I just put in dummy text.

import re

string = "He is 100, I am 18.5 and we are 0.67. Maybe we should 100, 200, and 200b 200, 67.88"

s = re.findall(r"[-+]?\d*\.\d+|\d+", string)

print(s)


This prints the following when executed:

['100', '18.5', '0.67', '100', '200', '200', '200', '67.88']
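One caveat worth checking: in the pattern above, the optional sign [-+]? belongs to the first alternative only, so a signed integer such as -5 comes back without its sign. A possible adjustment, offered as a sketch rather than as the pattern this answer uses, is to wrap both alternatives in a non-capturing group. The sample string is an assumption.

```python
import re

# Sign applies to the decimal branch only.
original = r"[-+]?\d*\.\d+|\d+"
# Sign applies to both branches.
grouped = r"[-+]?(?:\d*\.\d+|\d+)"

# Hypothetical sample text.
text = "-5 and -0.5 and +3"
print(re.findall(original, text))  # ['5', '-0.5', '3']
print(re.findall(grouped, text))   # ['-5', '-0.5', '+3']
```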


Experiment

I ran a small experiment on part of the text of Frankenstein.

Note that I use .read() to read the entire file at once instead of processing it line by line.

import re

# Read the whole file at once; the with block closes it afterwards.
with open('frank.txt', 'r') as file:
    text = file.read()

numbers = re.findall(r"[-+]?\d*\.\d+|\d+", text)

print(numbers)


This was the result:

['17', '2008', '84', '1', '11', '17', '2', '28', '17', '3', '7', '17', '4', '5', '17', '31', '13', '17', '19', '17', '1', '2', '3', '4', '5', '6', '18', '17', '7', '7', '12', '17', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '27', '20', '21', '22', '18', '17', '23', '24', '26', '17', '2', '5', '7', '12', '9', '11', '84', '84', '8', '84', '1', '1', '1', '.8', '1', '1', '1', '1', '1', '1', '1', '.1', '1', '.2', '1', '.1', '1', '.7', '1', '.8', '1', '.9', '1', '.3', '1', '.1', '1', '.7', '1', '.4', '1', '.5', '1', '.1', '1', '.6', '1', '.1', '1', '.7', '1', '.8', '1', '.9', '1', '.8', '20', '60', '4', '30', '1', '.3', '90', '1', '.9', '3', '1', '1', '.1', '1', '.2', '1', '.3', '3', '1', '.3', '90', '1', '.4', '1', '.3', '1', '.5', '1', '.6', '2', '2001', '3', '4', '3', '501', '3', '64', '6221541', '501', '3', '4557', '99712', '809', '1500', '84116', '801', '596', '1887', '4', '1', '5', '000', '50', '5']
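Fragments such as '.8' and '000' in that list are what this pattern would produce for dotted section labels and comma-grouped numbers. The two sample strings below are assumptions chosen to illustrate the effect; they are not quoted from the file.

```python
import re

pattern = r"[-+]?\d*\.\d+|\d+"

# A dotted label is split into a bare integer and a leading-dot fraction.
print(re.findall(pattern, "1.E.8"))  # ['1', '.8']

# A thousands separator splits one number into two tokens.
print(re.findall(pattern, "5,000"))  # ['5', '000']
```

If the input contains such forms, the results may need post-filtering or a stricter pattern, depending on what should count as a number.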


Unit Testing

I wrote a lighter version that works with the string you supplied.

import unittest
import re


def extract_numbers_improved(x):
    """Return all integer and decimal substrings found in x."""
    numbers = re.findall(r"[-+]?\d*\.\d+|\d+", x)
    return numbers


class Test(unittest.TestCase):
    def testcase(self):
        teststr = "12asdasdsa 33asdsad 44 aidsasdd 2231%#@ qqq55 2222ww ww qq 1asdasd 33##$11 42.09 12$"
        self.assertEqual(
            extract_numbers_improved(teststr),
            ['12', '33', '44', '2231', '55', '2222', '1', '33', '11', '42.09', '12'],
        )


if __name__ == "__main__":
    unittest.main()


When the test passes, unittest reports success, as shown below:

Ran 1 test in 0.000s

OK
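To cover a few more input shapes, the test could be extended with subTest, which reports each failing case separately. This is a sketch only: the class name TestExtractNumbers and the input strings are assumptions, and the function is repeated so the snippet runs on its own.

```python
import re
import unittest


def extract_numbers_improved(x):
    # Same pattern as in the answer above.
    return re.findall(r"[-+]?\d*\.\d+|\d+", x)


class TestExtractNumbers(unittest.TestCase):
    def test_cases(self):
        # (input, expected) pairs; these inputs are illustrative assumptions.
        cases = [
            ("abc", []),
            ("x1y22z", ["1", "22"]),
            ("pi is 3.14", ["3.14"]),
            ("half is .5", [".5"]),
        ]
        for text, expected in cases:
            with self.subTest(text=text):
                self.assertEqual(extract_numbers_improved(text), expected)


# Run the suite programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExtractNumbers)
result = unittest.TextTestRunner().run(suite)
```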

                                                                        
                                                        
            
            
              
                