Question
When I import a .csv file with read.table, using the call df <- read.table("ModelSugar(new) real_thesis_experiment-table_1.csv", skip = 6, sep = ",", head = TRUE)
and check the summary of the data, I get the following (only the first 3 of 45 columns are shown):
X.run.number. scenario configuration
Min. : 1 "pessimistic":999994 "central":999994
1st Qu.: 650
Median :1299
Mean :1299
3rd Qu.:1949
Max. :2600
With this data frame I can make nice graphics. However, I have 80 .csv files with a total size of 40 GB, so I want to import only specific columns.
I figured this would be easier with fread (from the data.table package). So I imported 5 columns and combined them with rbind into one data frame with the call
my.files <- list.files(pattern=".csv")
my.data <- lapply(my.files,fread, header = FALSE, select = c(1,2,3,25,29), sep=",")
df <- do.call("rbind", my.data)
The summary of that data frame looks like this (4 of 5 columns shown):
[run number] scenario configuration [step]
Length:999994 Length:999994 Length:999994 Length:999994
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
With this dataframe I cannot make the graphics that I could with read.table. I guess that this has to do with the class of the columns' values.
How can I make sure that the dataframe created with fread has the same characteristics as the one with read.table, so that I can make the graphics I want?
EDIT
I found out that when I first split the .csv into columns in Excel and then use the fread call with sep = ";" instead of sep = ",", it does work. Strange... And I don't want to split the .csv files into columns in Excel manually.
Answer 1:
First 5 columns (of 45) of dfshort look like this:
X.run.number. scenario configuration biobased.chemical.industry
1 3 "pessimistic" "central" "modification-dominant"
2 2 "pessimistic" "central" "modification-dominant"
3 3 "pessimistic" "central" "modification-dominant"
4 4 "pessimistic" "central" "modification-dominant"
5 2 "pessimistic" "central" "modification-dominant"
6 1 "pessimistic" "central" "modification-dominant"
7 3 "pessimistic" "central" "modification-dominant"
8 3 "pessimistic" "central" "modification-dominant"
9 2 "pessimistic" "central" "modification-dominant"
10 4 "pessimistic" "central" "modification-dominant"
distributed.sugar.factory.investment.costs
1 70000000
2 70000000
3 70000000
4 70000000
5 70000000
6 70000000
7 70000000
Template looks like this:
run_number scenario configuration tick financial_balance_SU
1 3 "pessimistic" "central" 0 0
2 2 "pessimistic" "central" 0 0
3 3 "pessimistic" "central" 1 0
4 4 "pessimistic" "central" 0 0
5 2 "pessimistic" "central" 1 0
6 1 "pessimistic" "central" 0 0
df looks like this:
run_number scenario configuration tick financial_balance_SU
1: 23377 ""pessimistic"" ""mixed"" 200 6.079728695488823E9
2: 23377 ""pessimistic"" ""mixed"" 201 6.079728695488823E9
3: 23378 ""pessimistic"" ""mixed"" 192 9.10006561818864E9
4: 23377 ""pessimistic"" ""mixed"" 202 6.079728695488823E9
5: 23377 ""pessimistic"" ""mixed"" 203 6.079728695488823E9
6: 23378 ""pessimistic"" ""mixed"" 193 9.10006561818864E9
EDIT
str(dfshort)
'data.frame': 10 obs. of 45 variables:
$ X.run.number. : int 3 2 3 4 2 1 3 3 2 4
$ scenario : Factor w/ 1 level "\"pessimistic\"": 1 1 1 1 1 1 1 1 1 1
$ configuration : Factor w/ 1 level "\"central\"": 1 1 1 1 1 1 1 1 1 1
$ biobased.chemical.industry : Factor w/ 1 level "\"modification-dominant\"": 1 1 1 1 1 1 1 1 1 1
$ distributed.sugar.factory.investment.costs : int 70000000 70000000 70000000 70000000 70000000 70000000 70000000 70000000 70000000 70000000
$ beet.syrups.factory.investment.costs : int 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000
$ ethanol.factory.investment.costs : int 1500000 1500000 1500000 1500000 1500000 1500000 1500000 1500000 1500000 1500000
$ market.share.beet.syrups.increase : num 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002
$ demand.beets.for.chemical.EU.increase : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
$ transport.costs : int 1 1 1 1 1 1 1 1 1 1
$ washing.at.farmer : Factor w/ 1 level "\"no\"": 1 1 1 1 1 1 1 1 1 1
$ beet.syrups.price.percentage.of.sugar.price : num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
$ CO2.tax. : Factor w/ 1 level "\"yes\"": 1 1 1 1 1 1 1 1 1 1
$ sugar.tax : num 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
$ CO2.tax : int 13 13 13 13 13 13 13 13 13 13
$ market.share.increase.period : int 10 10 10 10 10 10 10 10 10 10
$ electricity.source : Factor w/ 1 level "\"conventional-mix\"": 1 1 1 1 1 1 1 1 1 1
$ white.sugar.price.EU.maximum : int 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
$ white.sugar.price.EU.minimum : int 200 200 200 200 200 200 200 200 200 200
$ beet.syrups.price.EU.maximum : int 500 500 500 500 500 500 500 500 500 500
$ beet.syrups.price.EU.minimum : int 100 100 100 100 100 100 100 100 100 100
$ ethanol.price.EU.maximum : int 2 2 2 2 2 2 2 2 2 2
$ ethanol.price.EU.minimum : num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
$ years.taken.into.account : int 5 5 5 5 5 5 5 5 5 5
$ X.step. : int 0 0 1 0 1 0 2 3 2 1
$ financial.balance.farmers : num 0 0 0 0 0 ...
$ diesel.use.farmers : int 0 0 0 0 0 0 0 0 0 0
$ N.use.farmers : int 0 0 0 0 0 0 0 0 0 0
$ financial.balance.SU : num 0 0 0 0 0 ...
$ electricity.use.SU : int 0 0 0 0 0 0 0 0 0 0
$ financial.balance.central.sugar.factories : num 0 0 0 0 0 ...
$ electricity.use.central.sugar.factories : int 0 0 0 0 0 0 0 0 0 0
$ financial.balance.distributed.sugar.factories : num 0 0 0 0 0 ...
$ electricity.use.distributed.sugar.factories : int 0 0 0 0 0 0 0 0 0 0
$ financial.balance.beet.syrups.factories : int 0 0 0 0 0 0 0 0 0 0
$ electricity.use.beet.syrups.factories : int 0 0 0 0 0 0 0 0 0 0
$ financial.balance.ethanol.factories : int 0 0 0 0 0 0 0 0 0 0
$ electricity.use.ethanol.factories : int 0 0 0 0 0 0 0 0 0 0
$ transport.costs.yearly : num 0 0 0 0 0 ...
$ diesel.use.total.transport : num 0 0 0 0 0 ...
$ profit.per.tonne.sugar.beet.central.sugar.factory : num 0 0 0 0 0 ...
$ profit.per.tonne.sugar.beet.distributed.sugar.factory: num 0 0 0 0 0 ...
$ profit.per.tonne.sugar.beet.sugar.from.beet.syrups : int 0 0 0 0 0 0 0 0 0 0
$ profit.per.tonne.sugar.beet.beet.syrups.factory : int 0 0 0 0 0 0 0 0 0 0
$ profit.per.tonne.sugar.beet.ethanol.factory : num 0 0 0 0 0 ...
str(df)
Classes ‘data.table’ and 'data.frame': 19000000 obs. of 5 variables:
$ run_number : chr "23377" "23377" "23378" "23377" ...
$ scenario : chr "\"\"pessimistic\"\"" "\"\"pessimistic\"\"" "\"\"pessimistic\"\"" "\"\"pessimistic\"\"" ...
$ configuration : chr "\"\"mixed\"\"" "\"\"mixed\"\"" "\"\"mixed\"\"" "\"\"mixed\"\"" ...
$ tick : chr "200" "201" "192" "202" ...
$ financial_balance_SU: chr "6.079728695488823E9" "6.079728695488823E9" "9.10006561818864E9" "6.079728695488823E9" ...
- attr(*, ".internal.selfref")=<externalptr>
str(template)
'data.frame': 10 obs. of 5 variables:
$ run_number : int 3 2 3 4 2 1 3 3 2 4
$ scenario : Factor w/ 1 level "\"pessimistic\"": 1 1 1 1 1 1 1 1 1 1
$ configuration : Factor w/ 1 level "\"central\"": 1 1 1 1 1 1 1 1 1 1
$ tick : int 0 0 1 0 1 0 2 3 2 1
$ financial_balance_SU: num 0 0 0 0 0 ...
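The str(df) output above shows every column as character, with the text values wrapped in literal doubled quotes. Here is a minimal sketch in base R of how such columns could be cleaned. It assumes a small made-up data frame (df_chr, not the real data) that mimics that output:

```r
# Toy data mimicking str(df) above (assumed values): every column is
# character, and the text values carry literal doubled quote characters.
df_chr <- data.frame(
  run_number = c("23377", "23377", "23378"),
  scenario = c("\"\"pessimistic\"\"", "\"\"pessimistic\"\"", "\"\"pessimistic\"\""),
  tick = c("200", "201", "192"),
  financial_balance_SU = c("6.079728695488823E9", "6.079728695488823E9", "9.10006561818864E9"),
  stringsAsFactors = FALSE
)

# Strip the literal quote characters, then let type.convert() pick a type
# for each column (as.is = FALSE turns non-numeric text into factors).
df_clean <- as.data.frame(
  lapply(df_chr, function(col) {
    type.convert(gsub("\"", "", col, fixed = TRUE), as.is = FALSE)
  }),
  stringsAsFactors = FALSE
)

str(df_clean)
```

On the real data the same lapply() call would be applied to df. Whether type.convert() guesses the intended type for every column would still need to be checked with str().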
Answer 2:
What you can do is read one file with read.table, save 10 rows of that file as a template, and then do the following:
## Read the first 10 rows with read.table to build a template
library(data.table)
dfshort <- read.table("ModelSugar(new) real_thesis_experiment-table_1.csv", skip = 6, sep = ",", nrows = 10, head = TRUE)
cols_required <- c(1, 2, 3, 25, 29) ## select whatever columns you need
template <- dfshort[, cols_required]
## Read your large files using fread
my.files <- list.files(pattern = "\\.csv$")
my.data <- lapply(my.files, fread, header = FALSE, select = cols_required, sep = ",")
df <- do.call("rbind", my.data)
## With header = FALSE the columns are named V1, V2, ...; rename them to match the template
setnames(df, names(template))
## Change the column types as per your template
result = data.frame(
  lapply(setNames(, names(template)), function(x)
    if (x %in% names(df)) as(df[[x]], class(template[[x]]))
    else template[[x]][NA_integer_]
  ), stringsAsFactors = FALSE)
Then you can use it to plot, because it will have the same column classes that you get using read.table.
Alternatively, try this, which uses all columns of the first rows as the template:
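library(data.table)
dfshort <- read.table("ModelSugar(new) real_thesis_experiment-table_1.csv", skip = 6, sep = ",", nrows = 10, head = TRUE)
template <- copy(dfshort)
cols_required <- c(1, 2, 3, 25, 29)
my.files <- list.files(pattern = "\\.csv$")
## select (not colClasses) is the fread argument that takes column positions
my.data <- lapply(my.files, fread, header = FALSE, select = cols_required, sep = ",")
df <- do.call("rbind", my.data)
## Name the imported columns after the matching template columns
setnames(df, names(template)[cols_required])
## Template columns that were not imported are filled with NA
result = data.frame(
  lapply(setNames(, names(template)), function(x)
    if (x %in% names(df)) as(df[[x]], class(template[[x]]))
    else template[[x]][NA_integer_]
  ), stringsAsFactors = FALSE)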
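The template idea can also be tried on toy data, without the large files. Here is a minimal sketch in base R. It assumes made-up values and a hypothetical helper coerce_like() (not part of the code above) that uses explicit base conversions instead of as():

```r
# Template: a few rows with the desired column types (assumed values).
template <- data.frame(
  run_number = c(3L, 2L),
  scenario = factor(c("pessimistic", "pessimistic")),
  tick = c(0L, 1L),
  financial_balance_SU = c(0, 0)
)

# Data as it arrives from the bulk import: everything is character.
df <- data.frame(
  run_number = c("23377", "23378"),
  scenario = c("pessimistic", "pessimistic"),
  tick = c("200", "192"),
  financial_balance_SU = c("6.079728695488823E9", "9.10006561818864E9"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert x to the same basic type as the template column.
coerce_like <- function(x, target) {
  if (is.factor(target)) {
    factor(x)
  } else if (is.integer(target)) {
    as.integer(x)
  } else if (is.numeric(target)) {
    as.numeric(x)
  } else {
    as.character(x)
  }
}

# Convert every column that exists in both the data and the template.
shared <- intersect(names(template), names(df))
result <- df
result[shared] <- Map(coerce_like, df[shared], template[shared])

str(result)
```

Columns that appear only in df are left untouched here. That differs from the code above, which builds the result from the template's column list and fills missing columns with NA.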
Source: https://stackoverflow.com/questions/48566897/importing-fread-vs-read-table-and-errors