Read csv file with hidden or invisible character ^M

前端未结

I am attempting unsuccessfully to read a *.csv file containing hidden or invisible characters. The file contents are shown here:

my.data2 <- read.table(text


                      
3 answers

既然无缘, 2021-01-25 09:26

Here is code that can handle white space (i.e., multiple words) within fields:

nfields <- 4

bb <- readLines('c:/users/mmiller21/simple R programs/invisible.delimiter4.csv')
bb

pattern <- "(?<=\\,)(?=)"                  # zero-width split just after each comma
cc <- strsplit(bb, pattern, perl=TRUE)
dd <- unlist(cc)
ee <- dd[dd != ' ' & dd != '' & dd != ','] # remove empty elements
ff <- gsub(",", "", ee)                    # remove commas

m = matrix(ff, ncol=nfields, byrow=TRUE)   # store data in matrix

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
nn <- trim(m)
nn


Here are the contents of the original data set:

Common.name, Scientific.name, Stuff1, Stuff2
Greylag Goose, Anser anser, AAC aa, rr bb
Snow Goose, Anser caerulescens, AAC aa aa, rr bb bb
Greater Canada Goose, Branta canadensis, AAC, rr bb
Barnacle Goose, Branta leucopsis, AAC aa, rr
Brent Goose, Branta bernicla, AAC, rr bb bb bb


I simply removed the dots from the common names and scientific names and added extra text to the third and fourth columns.

Here is the output:

     [,1]                   [,2]                 [,3]        [,4]         
[1,] "Common.name"          "Scientific.name"    "Stuff1"    "Stuff2"     
[2,] "Greylag Goose"        "Anser anser"        "AAC aa"    "rr bb"      
[3,] "Snow Goose"           "Anser caerulescens" "AAC aa aa" "rr bb bb"   
[4,] "Greater Canada Goose" "Branta canadensis"  "AAC"       "rr bb"      
[5,] "Barnacle Goose"       "Branta leucopsis"   "AAC aa"    "rr"         
[6,] "Brent Goose"          "Branta bernicla"    "AAC"       "rr bb bb bb"
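
As a hedged variation on the code above, the same idea (flatten every line into one vector of fields, then refill a matrix row by row) can be sketched on in-memory data and carried one step further to a data frame. The sample lines, the position of the stray line break, and the names `fields` and `dat` are assumptions made for illustration; they are not taken from the original file.

```r
# Hypothetical lines, as readLines() might return them when a stray ^M
# breaks each record in two (where the break falls is an assumption).
bb <- c("Common.name, Scientific.name, Stuff1, Stuff2",
        "Greylag Goose, Anser anser,", " AAC aa, rr bb",
        "Snow Goose, Anser caerulescens,", " AAC aa aa, rr bb bb")
nfields <- 4

# Split every line on literal commas, flatten, and trim surrounding blanks.
fields <- trimws(unlist(strsplit(bb, ",", fixed = TRUE)))
fields <- fields[fields != ""]           # drop any empty leftovers

# Refill row by row, then promote the first row to column names.
m <- matrix(fields, ncol = nfields, byrow = TRUE)
dat <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(dat) <- m[1, ]
print(dat)
```

Because empty strings are discarded, a sketch like this would misalign the rows if the data contained genuinely empty fields.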

                                                                        
                                                        
            
            
              
                
无人共我, 2021-01-25 09:43

In gVim you should be able to remove the ^M characters by typing the following:

:%s/<ctrl>V<ctrl>M//g<return>


If you've typed it correctly, it will look like ':%s/^M//g' in gVim.  When you press return, gVim substitutes (the 's') whatever is between the first and second slash with whatever is between the second and third slash, on every line of the file (the '%') and for every match on each line (the 'g').

NOTE: If you are on a Windows box and <ctrl>V seems to be pasting text, then gVim may be configured with 'windows behavior'.  In that case, use <ctrl>Q<ctrl>M instead of <ctrl>V<ctrl>M.

When I load your sample file into gVim 7.3, it looks like this:

[screenshot]

After typing the characters

:%s/<ctrl>V<ctrl>M//g


but BEFORE hitting return I see this:

[screenshot]

After hitting return I see this:

[screenshot]

You can then do File->Save or File->Save As, which do what you would expect.
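
If an editor is not at hand, roughly the same substitution could be sketched in R itself: read the file as raw bytes, drop every carriage-return byte, and parse what is left. This is only a sketch under assumptions. The sample text, the position of the ^M after the second comma, and the names `sample_text`, `clean_text` and `fixed` are hypothetical, and the approach only helps if the ^M characters are not the file's only line ends.

```r
# Hypothetical sample: a lone carriage return ("\r", displayed as ^M)
# sits after the second comma of each record. This layout is assumed.
sample_text <- paste0(
  "Common.name, Scientific.name, Stuff1, Stuff2\n",
  "Greylag Goose, Anser anser,\r AAC aa, rr bb\n",
  "Snow Goose, Anser caerulescens,\r AAC aa aa, rr bb bb\n")

path <- tempfile(fileext = ".csv")
writeBin(charToRaw(sample_text), path)   # write the bytes verbatim

# Read the bytes back and remove every carriage return (byte 0x0d).
raw_bytes  <- readBin(path, what = "raw", n = file.size(path))
clean_text <- rawToChar(raw_bytes[raw_bytes != as.raw(0x0d)])

# Parse the cleaned text; strip.white trims the blanks after each comma.
fixed <- read.csv(text = clean_text, strip.white = TRUE,
                  stringsAsFactors = FALSE)
print(fixed)
```

Working on raw bytes avoids any line-end translation while reading, so the only change made to the content is the removal of the carriage returns.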
                                                                        
                                                        
            
            
              
                
离开以前, 2021-01-25 09:43

Here's a solution using scan to read the data, matrix to structure it, and data.frame to make it into a data frame:

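readF <- function(path, nfields=4){
    # scan() with its default separator splits the file on white space
    # (spaces, tabs, carriage returns, line feeds), so ^M just ends a token
    tokens <- scan(path, what="")
    # strip the commas still attached to the tokens, then fill row by row
    m <- matrix(gsub(",", "", tokens), ncol=nfields, byrow=TRUE)
    d <- data.frame(m[-1,])    # every row except the header
    names(d) <- m[1,]          # the first row supplies the column names
    d
}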


First, check that the file reproduces your problem:

> read.csv("./invisible.delimiter2.csv")
            Common.name    Scientific.name Stuff1 Stuff2
1         Greylag.Goose        Anser.anser              
2                   AAC                 rr              
3            Snow.Goose                                 
4    Anser.caerulescens                                 
5                   AAC                 rr              
6  Greater.Canada.Goose  Branta.canadensis    AAC     rr
7        Barnacle.Goose   Branta.leucopsis              
8                   AAC                 rr              
9           Brent.Goose    Branta.bernicla              
10                  AAC                 rr        


and then see if my function solves it:

> readF("./invisible.delimiter2.csv")
Read 24 items
           Common.name    Scientific.name Stuff1 Stuff2
1        Greylag.Goose        Anser.anser    AAC     rr
2           Snow.Goose Anser.caerulescens    AAC     rr
3 Greater.Canada.Goose  Branta.canadensis    AAC     rr
4       Barnacle.Goose   Branta.leucopsis    AAC     rr
5          Brent.Goose    Branta.bernicla    AAC     rr


Feel free to pick the function apart to see how it works.
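
To help with picking it apart, here is a small sketch of the tokenising step on its own. It suggests why the function depends on field values that contain no blanks: scan() with its default separator splits on every run of white space, so a multi-word field would be cut into several tokens. The one-record files and the names `dotted` and `spaced` are hypothetical, and the position of the carriage return is an assumption.

```r
path <- tempfile(fileext = ".csv")

# Hypothetical record with dotted names and a stray carriage return ("\r").
writeBin(charToRaw("Greylag.Goose, Anser.anser,\rAAC, rr\n"), path)
dotted <- scan(path, what = "", quiet = TRUE)
print(dotted)           # four tokens, commas still attached

# The same record with blanks inside the names.
writeBin(charToRaw("Greylag Goose, Anser anser,\rAAC, rr\n"), path)
spaced <- scan(path, what = "", quiet = TRUE)
print(length(spaced))   # six tokens, so a 4-column matrix no longer lines up
```

With dotted names each record yields exactly nfields tokens; with blanks inside the fields the token count changes and the matrix() call would no longer line up with the records.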

I suspect the source of the problem is that the ^M is in the field data and, because your fields aren't quoted, R can't tell whether it is a real line end or one inside a field. There are some notes about embedded newlines in quoted fields in the documentation for read.csv etc.
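
A small experiment can illustrate the suspicion above. It assumes a hypothetical two-record file in which a lone carriage return sits in the middle of each record; the file contents and the name `lines` are made up for the demonstration. When reading a file, R accepts LF, CRLF, or a lone CR as a line end, so each record comes back as two lines.

```r
path <- tempfile(fileext = ".csv")

# Hypothetical file: two records, each with "\r" (shown as ^M) in the middle.
writeBin(charToRaw("a, b,\rc, d\ne, f,\rg, h\n"), path)

# readLines() treats the lone carriage return as a line end.
lines <- readLines(path)
print(lines)
print(length(lines))   # 4 lines come back for the 2 records written
```

The file holds two records, yet four lines come back, which matches the pattern of broken rows in the read.csv output shown earlier.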
                                                                        
                                                        
            
            
              
                