Importing fread vs read.table and errors

Question

When I import a .csv file with read.table, using the call df <- read.table("ModelSugar(new) real_thesis_experiment-table_1.csv", skip = 6, sep = ",", head = TRUE), and then check the summary of the data, I get the following (only the first 3 of 45 columns are shown):

 X.run.number. scenario        configuration   
 Min.   :   1 "pessimistic":999994   "central":999994  
 1st Qu.: 650                                            
 Median :1299                                            
 Mean   :1299                                            
 3rd Qu.:1949                                            
 Max.   :2600

With this dataframe I can make nice graphics. However, I have 80 .csv files with a total size of 40 GB, so I want to import only specific columns.

I figured this would be easier with fread (from the data.table package). So I imported 5 columns from each file and combined them with rbind into one dataframe, using the call

my.files <- list.files(pattern=".csv")
my.data <- lapply(my.files,fread, header = FALSE, select = c(1,2,3,25,29), sep=",") 
df <- do.call("rbind", my.data)

The summary of that dataframe looks like this (4 of 5 columns shown):

[run number]         scenario         configuration         [step]         
 Length:999994      Length:999994      Length:999994      Length:999994     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character

With this dataframe I cannot make the graphics that I could with read.table. I guess that this has to do with the class of the columns' values.

How can I make sure that the dataframe created with fread has the same characteristics as the one with read.table, so that I can make the graphics I want?

EDIT

I found out that when I first split the .csv into columns in Excel and then use the fread call with sep = ";" instead of sep = ",", it does work. Strange... And I don't want to split the .csv files into columns in Excel manually.

Answer 1:

First 5 columns (of 45) of dfshort look like this:

   X.run.number.      scenario configuration biobased.chemical.industry
1              3 "pessimistic"     "central"    "modification-dominant"
2              2 "pessimistic"     "central"    "modification-dominant"
3              3 "pessimistic"     "central"    "modification-dominant"
4              4 "pessimistic"     "central"    "modification-dominant"
5              2 "pessimistic"     "central"    "modification-dominant"
6              1 "pessimistic"     "central"    "modification-dominant"
7              3 "pessimistic"     "central"    "modification-dominant"
8              3 "pessimistic"     "central"    "modification-dominant"
9              2 "pessimistic"     "central"    "modification-dominant"
10             4 "pessimistic"     "central"    "modification-dominant"
   distributed.sugar.factory.investment.costs
1                                    70000000
2                                    70000000
3                                    70000000
4                                    70000000
5                                    70000000
6                                    70000000
7                                    70000000

Template looks like this:

 run_number      scenario configuration tick financial_balance_SU
1          3 "pessimistic"     "central"    0                    0
2          2 "pessimistic"     "central"    0                    0
3          3 "pessimistic"     "central"    1                    0
4          4 "pessimistic"     "central"    0                    0
5          2 "pessimistic"     "central"    1                    0
6          1 "pessimistic"     "central"    0                    0

df looks like this:

   run_number        scenario configuration tick financial_balance_SU
1:      23377 ""pessimistic""     ""mixed""  200  6.079728695488823E9
2:      23377 ""pessimistic""     ""mixed""  201  6.079728695488823E9
3:      23378 ""pessimistic""     ""mixed""  192   9.10006561818864E9
4:      23377 ""pessimistic""     ""mixed""  202  6.079728695488823E9
5:      23377 ""pessimistic""     ""mixed""  203  6.079728695488823E9
6:      23378 ""pessimistic""     ""mixed""  193   9.10006561818864E9

EDIT

str(dfshort)

'data.frame':   10 obs. of  45 variables:
 $ X.run.number.                                        : int  3 2 3 4 2 1 3 3 2 4
 $ scenario                                             : Factor w/ 1 level "\"pessimistic\"": 1 1 1 1 1 1 1 1 1 1
 $ configuration                                        : Factor w/ 1 level "\"central\"": 1 1 1 1 1 1 1 1 1 1
 $ biobased.chemical.industry                           : Factor w/ 1 level "\"modification-dominant\"": 1 1 1 1 1 1 1 1 1 1
 $ distributed.sugar.factory.investment.costs           : int  70000000 70000000 70000000 70000000 70000000 70000000 70000000 70000000 70000000 70000000
 $ beet.syrups.factory.investment.costs                 : int  1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000
 $ ethanol.factory.investment.costs                     : int  1500000 1500000 1500000 1500000 1500000 1500000 1500000 1500000 1500000 1500000
 $ market.share.beet.syrups.increase                    : num  0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002
 $ demand.beets.for.chemical.EU.increase                : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
 $ transport.costs                                      : int  1 1 1 1 1 1 1 1 1 1
 $ washing.at.farmer                                    : Factor w/ 1 level "\"no\"": 1 1 1 1 1 1 1 1 1 1
 $ beet.syrups.price.percentage.of.sugar.price          : num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
 $ CO2.tax.                                             : Factor w/ 1 level "\"yes\"": 1 1 1 1 1 1 1 1 1 1
 $ sugar.tax                                            : num  0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
 $ CO2.tax                                              : int  13 13 13 13 13 13 13 13 13 13
 $ market.share.increase.period                         : int  10 10 10 10 10 10 10 10 10 10
 $ electricity.source                                   : Factor w/ 1 level "\"conventional-mix\"": 1 1 1 1 1 1 1 1 1 1
 $ white.sugar.price.EU.maximum                         : int  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
 $ white.sugar.price.EU.minimum                         : int  200 200 200 200 200 200 200 200 200 200
 $ beet.syrups.price.EU.maximum                         : int  500 500 500 500 500 500 500 500 500 500
 $ beet.syrups.price.EU.minimum                         : int  100 100 100 100 100 100 100 100 100 100
 $ ethanol.price.EU.maximum                             : int  2 2 2 2 2 2 2 2 2 2
 $ ethanol.price.EU.minimum                             : num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
 $ years.taken.into.account                             : int  5 5 5 5 5 5 5 5 5 5
 $ X.step.                                              : int  0 0 1 0 1 0 2 3 2 1
 $ financial.balance.farmers                            : num  0 0 0 0 0 ...
 $ diesel.use.farmers                                   : int  0 0 0 0 0 0 0 0 0 0
 $ N.use.farmers                                        : int  0 0 0 0 0 0 0 0 0 0
 $ financial.balance.SU                                 : num  0 0 0 0 0 ...
 $ electricity.use.SU                                   : int  0 0 0 0 0 0 0 0 0 0
 $ financial.balance.central.sugar.factories            : num  0 0 0 0 0 ...
 $ electricity.use.central.sugar.factories              : int  0 0 0 0 0 0 0 0 0 0
 $ financial.balance.distributed.sugar.factories        : num  0 0 0 0 0 ...
 $ electricity.use.distributed.sugar.factories          : int  0 0 0 0 0 0 0 0 0 0
 $ financial.balance.beet.syrups.factories              : int  0 0 0 0 0 0 0 0 0 0
 $ electricity.use.beet.syrups.factories                : int  0 0 0 0 0 0 0 0 0 0
 $ financial.balance.ethanol.factories                  : int  0 0 0 0 0 0 0 0 0 0
 $ electricity.use.ethanol.factories                    : int  0 0 0 0 0 0 0 0 0 0
 $ transport.costs.yearly                               : num  0 0 0 0 0 ...
 $ diesel.use.total.transport                           : num  0 0 0 0 0 ...
 $ profit.per.tonne.sugar.beet.central.sugar.factory    : num  0 0 0 0 0 ...
 $ profit.per.tonne.sugar.beet.distributed.sugar.factory: num  0 0 0 0 0 ...
 $ profit.per.tonne.sugar.beet.sugar.from.beet.syrups   : int  0 0 0 0 0 0 0 0 0 0
 $ profit.per.tonne.sugar.beet.beet.syrups.factory      : int  0 0 0 0 0 0 0 0 0 0
 $ profit.per.tonne.sugar.beet.ethanol.factory          : num  0 0 0 0 0 ...

str(df)

Classes ‘data.table’ and 'data.frame':  19000000 obs. of  5 variables:
 $ run_number          : chr  "23377" "23377" "23378" "23377" ...
 $ scenario            : chr  "\"\"pessimistic\"\"" "\"\"pessimistic\"\"" "\"\"pessimistic\"\"" "\"\"pessimistic\"\"" ...
 $ configuration       : chr  "\"\"mixed\"\"" "\"\"mixed\"\"" "\"\"mixed\"\"" "\"\"mixed\"\"" ...
 $ tick                : chr  "200" "201" "192" "202" ...
 $ financial_balance_SU: chr  "6.079728695488823E9" "6.079728695488823E9" "9.10006561818864E9" "6.079728695488823E9" ...
 - attr(*, ".internal.selfref")=<externalptr>

str(template)

'data.frame':   10 obs. of  5 variables:
 $ run_number          : int  3 2 3 4 2 1 3 3 2 4
 $ scenario            : Factor w/ 1 level "\"pessimistic\"": 1 1 1 1 1 1 1 1 1 1
 $ configuration       : Factor w/ 1 level "\"central\"": 1 1 1 1 1 1 1 1 1 1
 $ tick                : int  0 0 1 0 1 0 2 3 2 1
 $ financial_balance_SU: num  0 0 0 0 0 ...
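
The str() listings above show that every column of df is character, while template holds integer, factor and numeric columns. A minimal sketch of how that mismatch could be tabulated is given below. The two small data frames are toy stand-ins built from values printed above (not the real data), and class_cmp and needs_fix are names made up for this illustration.

```r
# Toy stand-ins for the objects shown above (values copied from the printed rows).
template <- data.frame(
  run_number = c(3L, 2L),
  scenario = factor(c("\"pessimistic\"", "\"pessimistic\"")),
  tick = c(0L, 0L),
  financial_balance_SU = c(0, 0)
)
df <- data.frame(
  run_number = c("23377", "23378"),
  scenario = c("\"\"pessimistic\"\"", "\"\"pessimistic\"\""),
  tick = c("200", "192"),
  financial_balance_SU = c("6.079728695488823E9", "9.10006561818864E9"),
  stringsAsFactors = FALSE
)

# One row per column, with the (first) class each object assigns to it.
class_cmp <- data.frame(
  column = names(template),
  template = vapply(template, function(x) class(x)[1], character(1)),
  df = vapply(df, function(x) class(x)[1], character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(class_cmp)

# Columns whose class differs between the two objects and so need converting.
needs_fix <- class_cmp$column[class_cmp$template != class_cmp$df]
```

With the toy data, all four columns end up in needs_fix, which matches the picture given by the str() output.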

Answer 2:

What you can do is read the first 10 rows of one file with read.table, keep them as a template, and then do the following:

## Build a small template with read.table (first 10 rows only)
dfshort <- read.table("ModelSugar(new) real_thesis_experiment-table_1.csv", skip = 6, sep = ",", nrows = 10, head = TRUE)
df_needed <- dfshort[1:10, ]  ## keep the 10 rows; dfshort[1:10] would keep the first 10 columns instead
template <- subset(df_needed, select = c(columns_required)) ## select whatever cols you need

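## Read your large files using fread
library(data.table)
my.files <- list.files(pattern = "\\.csv$")
my.data <- lapply(my.files, fread, header = FALSE, select = c(1,2,3,25,29), sep = ",")
df <- do.call("rbind", my.data)
## with header = FALSE the columns are named V1, V2, ...; rename them to match
## the template (assumes the template holds the same 5 columns in the same order)
setnames(df, names(template))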

## changing cols types as per your template
result = data.frame(
  lapply(setNames(,names(template)), function(x) 
    if (x %in% names(df)) as(df[[x]], class(template[[x]])) 
    else template[[x]][NA_integer_]
  ), stringsAsFactors = FALSE)

You can then use result for plotting, because its columns will have the same classes that you get with read.table.
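
To make the conversion step more concrete, here is a small sketch of the same as()-based coercion on toy data. The values are taken from the rows printed earlier, and the template is limited to integer and numeric columns for simplicity. It is only an illustration of the idea, not a tested recipe for the full 45-column files.

```r
library(methods)  # provides as(); attached by default in most R sessions

# Toy template: one row is enough, only the column classes matter.
template <- data.frame(run_number = 3L, tick = 0L, financial_balance_SU = 0)

# Toy stand-in for the character-only data read by fread.
df <- data.frame(
  run_number = c("23377", "23378"),
  tick = c("200", "192"),
  financial_balance_SU = c("6.079728695488823E9", "9.10006561818864E9"),
  stringsAsFactors = FALSE
)

# For every template column, coerce the matching df column to the template's
# class; a column missing from df becomes NA of the template's type.
result <- data.frame(
  lapply(setNames(, names(template)), function(x)
    if (x %in% names(df)) as(df[[x]], class(template[[x]]))
    else template[[x]][NA_integer_]
  ), stringsAsFactors = FALSE)

str(result)
```

Note that this only changes the class of each column. Characters that are part of the values themselves, such as the doubled quotes visible in str(df), are not removed by the coercion.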

Try this

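library(data.table)

dfshort <- read.table("ModelSugar(new) real_thesis_experiment-table_1.csv", skip = 6, sep = ",", nrows = 10, head = TRUE)
template <- copy(dfshort)
my.files <- list.files(pattern = "\\.csv$")
## select (not colClasses) is the argument that takes the column positions to read
my.data <- lapply(my.files, fread, header = FALSE, select = c(1,2,3,25,29), sep = ",")
df <- do.call("rbind", my.data)
## give df the matching template names so that the lookup below finds its columns
setnames(df, names(template)[c(1,2,3,25,29)])

## template columns that are not in df are filled with NA
result = data.frame(
  lapply(setNames(,names(template)), function(x) 
    if (x %in% names(df)) as(df[[x]], class(template[[x]])) 
    else template[[x]][NA_integer_]
  ), stringsAsFactors = FALSE)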
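
As a variation on the template idea, the character columns could also be cleaned directly in base R. The sketch below is not part of the original answer. It assumes that the stray double quotes are unwanted in the final values, clean_column is a hypothetical helper name, and the toy data frame only mimics the str(df) output shown earlier.

```r
# Toy stand-in for the character-only result of fread shown in str(df).
df <- data.frame(
  run_number = c("23377", "23378"),
  scenario = c("\"\"pessimistic\"\"", "\"\"pessimistic\"\""),
  tick = c("200", "192"),
  financial_balance_SU = c("6.079728695488823E9", "9.10006561818864E9"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop every double-quote character, then let
# type.convert() choose integer, numeric, or character for the column.
clean_column <- function(x) {
  type.convert(gsub("\"", "", x, fixed = TRUE), as.is = TRUE)
}

df[] <- lapply(df, clean_column)

# Turn the remaining text columns into factors, the class read.table
# gave them in the question.
is_text <- vapply(df, is.character, logical(1))
df[is_text] <- lapply(df[is_text], factor)

str(df)
```

Unlike the read.table result in the question, the factor levels here no longer contain quote characters. Whether that is desirable depends on how the plotting code refers to the levels.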

Source: https://stackoverflow.com/questions/48566897/importing-fread-vs-read-table-and-errors

Tags

import

read.table