How to remove special characters from txt files using Python

后端未结

关注

 3  1776

from glob import glob
pattern = \"D:\\\\report\\\\shakeall\\\\*.txt\"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.rea


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  一向        
                
              
                            
                2021-01-06 05:02
              
            
            
                                                                       
I'm pretty new and I doubt this is very elegant at all, but one option would be to take your string(s) after reading them in and running them through string.translate() to strip out the punctuation. Here is the Python documentation for it for version 2.7 (which i think you're using).

As far as the actual code, it might be something like this (but maybe someone better than me can confirm/improve on it):

fileString.translate(None, string.punctuation)


where "fileString" is the string that your open(fp) read in. "None" is provided in place of a translation table (which would normally be used to actually change some characters into others), and the second parameter, string.punctuation (a Python string constant containing all the punctuation symbols) is a set of characters that will be deleted from your string.

In the event that the above doesn't work, you could modify it as follows:

inChars = string.punctuation
outChars = ['']*32
tranlateTable = maketrans(inChars, outChars)
fileString.translate(tranlateTable)


There are a couple of other answers to similar questions i found via a quick search. I'll link them here, too, in case you can get more from them.

Removing Punctuation From Python List Items

Remove all special characters, punctuation and spaces from string

Strip Specific Punctuation in Python 2.x



Finally, if what I've said is completely wrong please comment and i'll remove it so that others don't try what I've said and become frustrated.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  半阙折子戏        
                
              
                            
                2021-01-06 05:06
              
            
            
                                                                       
import re
string = open('a.txt').read()
new_str = re.sub('[^a-zA-Z0-9\n\.]', ' ', string)
open('b.txt', 'w').write(new_str)


It will change every non alphanumeric char to white space.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情书的邮戳        
                
              
                            
                2021-01-06 05:12
              
            
            
                                                                       
import re


Then replace

[uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]


By

[uniquewords.add(re.sub('[^a-zA-Z0-9]*$', '', x) for x in open(os.path.join(root,name)).read().split()]


This will strip all trailing non-alphanumeric characters from each word before adding it to the set.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复