Why am I getting X. in my column names when reading a data frame?

迷失自我 2020-11-28 06:03

I asked a question about this a few months back, and I thought the answer had solved my problem, but I ran into the problem again and the solution didn't work for me.

5 answers
  • 2020-11-28 06:11

    I just came across this problem, and it had a simple cause. My labels began with a number, and R was adding an X in front of each of them. A name that starts with a digit is not a syntactically valid R name, so R prepends a letter to make it one.

    So "3_in" became "X3_in", etc. I solved it by switching the label to "in_3", and the issue was resolved.

    I hope this helps someone.
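
    A minimal sketch of that behaviour, using a small made-up CSV passed through the text argument of read.csv() (the column labels and values are only illustrative). It also shows that check.names = FALSE keeps the labels as written, at the cost of needing [[ ]] or backticks to reach the columns:

    ```r
    # Hypothetical two-column CSV whose labels start with a digit
    csv_text <- "3_in,5_in\n1,2\n3,4"

    # Default: check.names = TRUE, so the labels are passed through make.names()
    default_names <- names(read.csv(text = csv_text))
    print(default_names)

    # Keep the labels exactly as they appear in the file
    raw <- read.csv(text = csv_text, check.names = FALSE)
    print(names(raw))

    # Non-syntactic names need [[ ]] or backticks
    first_column <- raw[["3_in"]]
    same_column <- raw$`3_in`
    ```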

  • 2020-11-28 06:16

    I ran into a similar problem and wanted to share the following lines of code to correct the column names. They are certainly not perfect, since clean programming up front would be better, but they may be a helpful starting point for someone who needs a quick and dirty approach. (I would have liked to add them as a comment on Ryan's question or Gavin's answer, but my reputation is not high enough, so I had to post an additional answer. Sorry.)

    In my case, several rounds of writing and reading data produced one or more columns named "X", "X.1", ..., with the content in the X column and row numbers in the X.1, ... columns. The content of the X column should be used as row names, and the other X.1, ... columns should be deleted.

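    Correct_Colnames <- function(df) {

      # columns named exactly "X", or "X.1", "X.2", ...
      delete.columns <- grep("(^X$)|(^X\\.)(\\d+)($)", colnames(df), perl = TRUE)

      if (length(delete.columns) > 0) {

        # use the content of the "X" column as row names, if that column exists
        x.column <- which(colnames(df) == "X")
        if (length(x.column) == 1) {
          row.names(df) <- as.character(df[[x.column]])
        }
        # other data types might apply than character, or
        # introduction of a new separate column might be suitable

        # drop = FALSE keeps a data frame even if only one column remains
        df <- df[, -delete.columns, drop = FALSE]

        colnames(df) <- sub("^X", "", colnames(df))
        # X might be replaced by different characters instead of being deleted;
        # note that this strips a leading X from every remaining column name
      }

      return(df)
    }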
    
  • 2020-11-28 06:30

    read.csv() is a wrapper around the more general read.table() function. The latter has an argument check.names, which is documented as:

    check.names: logical.  If ‘TRUE’ then the names of the variables in the
             data frame are checked to ensure that they are syntactically
             valid variable names.  If necessary they are adjusted (by
             ‘make.names’) so that they are, and also to ensure that there
             are no duplicates.
    

    If your header contains labels that are not syntactically valid, make.names() will replace each with a valid name based on the invalid one, translating invalid characters to "." and prepending X if necessary:

    R> make.names("$Foo")
    [1] "X.Foo"
    

    This is documented in ?make.names:

    Details:
    
        A syntactically valid name consists of letters, numbers and the
        dot or underline characters and starts with a letter or the dot
        not followed by a number.  Names such as ‘".2way"’ are not valid,
        and neither are the reserved words.
    
        The definition of a _letter_ depends on the current locale, but
        only ASCII digits are considered to be digits.
    
        The character ‘"X"’ is prepended if necessary.  All invalid
        characters are translated to ‘"."’.  A missing value is translated
        to ‘"NA"’.  Names which match R keywords have a dot appended to
        them.  Duplicated values are altered by ‘make.unique’.
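
    As a small illustration of the rules quoted above, each call below exercises one of them. The input labels are made up for the example:

    ```r
    # Each input illustrates one rule from ?make.names
    make.names("3_in")    # starts with a digit, so "X" is prepended: "X3_in"
    make.names(".2way")   # dot followed by a number, "X" is prepended: "X.2way"
    make.names("a b")     # the invalid character becomes a dot: "a.b"
    make.names("if")      # a reserved word gets a dot appended: "if."

    # With unique = TRUE, duplicates are altered by make.unique(): "x" "x.1"
    make.names(c("x", "x"), unique = TRUE)
    ```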
    

    The behaviour you are seeing is entirely consistent with the documented way read.table() loads your data. That suggests you have syntactically invalid labels in the header row of your CSV file. Note the point above from ?make.names that what counts as a letter depends on the locale of your system. For example, the CSV file might include a character that your text editor displays without complaint, but if R is not running in the same locale, that character may not be valid there.

    I would look at the CSV file and identify any non-ASCII characters in the header line; there are possibly non-visible characters (or escape sequences; \t?) in the header row also. A lot may be going on between reading in the file with the non-valid names and displaying it in the console which might be masking the non-valid characters, so don't take the fact that it doesn't show anything wrong without check.names as indicating that the file is OK.

    Posting the output of sessionInfo() would also be useful.
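
    One way to do the header check suggested above is to compare each label with what make.names() would turn it into, and to flag anything outside printable ASCII. This is only a sketch: the header line is hypothetical, and in practice it would come from something like readLines() on your own file with n = 1.

    ```r
    # Hypothetical header line; "\u00e9" is a non-ASCII letter (e with acute accent)
    header_line <- "id,3_in,$Foo,my col,caf\u00e9"
    labels <- strsplit(header_line, ",", fixed = TRUE)[[1]]

    # Labels that make.names() would change are the ones read.table() rewrites.
    # The result for the non-ASCII label depends on the current locale.
    changed <- labels != make.names(labels)

    # Flag labels containing anything other than printable ASCII
    # (this catches non-ASCII characters as well as tabs and other control characters)
    suspicious <- grepl("[^\\x20-\\x7E]", labels, perl = TRUE)

    print(data.frame(label = labels, changed = changed, suspicious = suspicious))
    ```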

  • 2020-11-28 06:30

    When column names are not syntactically valid, R puts an "X" at the start of the name during import. This usually happens when a column name starts with a number or a special character. Passing check.names = FALSE prevents this, so no "X" is added. However, some functions may not work if column names start with numbers or other special characters; rbind.fill is one example.

    So after applying that function (with the "corrected" column names), I use this simple helper to get rid of the "X".

    destroyX = function(es) {
      f = es
      for (col in seq_len(ncol(f))) { # for each column in the data frame
        if (startsWith(colnames(f)[col], "X")) { # if the name starts with 'X' ...
          colnames(f)[col] <- sub("^X", "", colnames(f)[col]) # ... drop that X
        }
      }
      # note: this also strips the X from names that legitimately start with X
      # assign the corrected data to the name of the variable that was passed in
      assign(deparse(substitute(es)), f, inherits = TRUE)
    }
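
    A narrower variant, sketched under the assumption that the X was only added in front of labels starting with a digit: it removes the X only when a digit follows, so names that legitimately start with X are left alone. The function name strip_added_x and the sample names are hypothetical, and the function returns a new data frame instead of reassigning the original.

    ```r
    # Hypothetical helper: drop a leading "X" only when it is followed by a digit
    strip_added_x <- function(df) {
      names(df) <- sub("^X(?=[0-9])", "", names(df), perl = TRUE)
      df
    }

    # Sample data frame with syntactic names, as read.csv() would produce them
    example <- data.frame(X3_in = 1, Xray = 2, X.Foo = 3)
    cleaned <- strip_added_x(example)
    print(names(cleaned))
    ```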
    
  • 2020-11-28 06:32

    I solved a similar problem by including row.names=FALSE as an argument in the write.csv function. write.csv was including the row names as an unnamed column in the CSV file and read.csv was naming that column 'X' when it read the CSV file.
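
    A minimal round trip that reproduces this, using temporary files and a made-up data frame. It also shows one alternative, assuming you want to keep the row names: reading the first column back in as row names with row.names = 1.

    ```r
    df <- data.frame(a = 1:2, b = c("u", "v"))

    # Default write.csv() writes the row names as an unnamed first column
    with_rn <- tempfile(fileext = ".csv")
    write.csv(df, with_rn)
    names_with_rn <- names(read.csv(with_rn))

    # row.names = FALSE leaves that column out of the file
    without_rn <- tempfile(fileext = ".csv")
    write.csv(df, without_rn, row.names = FALSE)
    names_without_rn <- names(read.csv(without_rn))

    # Alternative: treat the first column of the file as row names when reading
    names_as_rownames <- names(read.csv(with_rn, row.names = 1))

    print(names_with_rn)
    print(names_without_rn)
    print(names_as_rownames)
    ```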
