Importing from CSV from a specified range of values

Question

I am trying to read in a CSV file and I am running into the following error.

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1097 did not have 5 elements

After further inspection of the CSV file, I found that around line 1097 the monthly rows end and a new header begins, followed by annualised data (I am only interested in the monthly data for now).

temp <- tempfile()
download.file("http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip",temp, mode="wb")
unzip(temp, "F-F_Research_Data_Factors.CSV")
French <- read.table("F-F_Research_Data_Factors.CSV", sep=",", skip = 3, header=TRUE, nrows = 100)

The above code downloads the zip file and imports the first 100 rows of the CSV file into R, which works perfectly. However, the first 100 rows (used here for illustrative purposes) are data points from the 1920s and 1930s, which is not what I am particularly interested in.

My question is: how can I import only the rows whose value in the first column of the CSV file falls in a given range, e.g. from 192607 (1926-07) until, say, 195007 (1950-07)? I am able to import everything up to the most recent monthly values by setting nrows = 1095, but this is not exactly what I am trying to achieve.

Snapshot of the data:

,Mkt-RF,SMB,HML,RF
192607,    2.96,   -2.30,   -2.87,    0.22
192608,    2.64,   -1.40,    4.19,    0.25
192609,    0.36,   -1.32,    0.01,    0.23

... Line 1100

 Annual Factors: January-December 
,Mkt-RF,SMB,HML,RF
  1927,   29.47,   -2.46,   -3.75,    3.12
  1928,   35.39,    4.20,   -6.15,    3.56

Answer 1:

The first table in the file lies between the first two zero-length lines, so the following reads it in without the junk before and after, and then subsets it to the indicated dates:

# read first table in file
Lines <- readLines("F-F_Research_Data_Factors.CSV")
ix <- which(Lines == "")
DF0 <- read.csv(text = Lines[ix[1]:ix[2]])  # all rows in first table

# subset it to indicated dates
DF <- subset(DF0, X >= 192607 & X <= 195007)
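To try the blank-line approach without downloading anything, here is a minimal sketch. The Lines vector is a made-up stand-in that only mimics the layout of the file (the two description lines are placeholders, not the real text), and the date range is shortened to fit the sample.

```r
# Stand-in for readLines("F-F_Research_Data_Factors.CSV"); the layout is
# assumed to match the real file, the description lines are placeholders.
Lines <- c(
  "placeholder description line 1",
  "placeholder description line 2",
  "",
  ",Mkt-RF,SMB,HML,RF",
  "192607,    2.96,   -2.30,   -2.87,    0.22",
  "192608,    2.64,   -1.40,    4.19,    0.25",
  "192609,    0.36,   -1.32,    0.01,    0.23",
  "",
  " Annual Factors: January-December ",
  ",Mkt-RF,SMB,HML,RF",
  "  1927,   29.47,   -2.46,   -3.75,    3.12",
  "  1928,   35.39,    4.20,   -6.15,    3.56"
)

ix <- which(Lines == "")                     # positions of the blank lines
DF0 <- read.csv(text = Lines[ix[1]:ix[2]])   # blank lines are skipped by default

# The unnamed first column is read as X, and Mkt-RF becomes Mkt.RF
names(DF0)

# Keep a range of months, here August to September 1926
DF <- subset(DF0, X >= 192608 & X <= 192609)

# Optional: turn the yyyymm key into a Date (first day of each month)
DF$Date <- as.Date(paste0(DF$X, "01"), format = "%Y%m%d")
DF
```

The Date column is an extra that the answer does not use; it can be handy if the range is easier to express as calendar dates than as yyyymm integers.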

Note: If we want all the tables, it appears that a line beginning with a comma starts each table and a blank line ends it (except that the first blank line comes before the tables). Using Lines from above, this gives a list L whose ith component is the ith table in the file.

st <- grep("^,", Lines)  # starting line numbers
en <- which(Lines == "")[-1]  # ending line numbers
L <- Map(function(st, en) read.csv(text = Lines[st:en]), st, en)
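As a rough illustration of the multi-table version, the sketch below runs the same three lines on a small made-up stand-in for the file. It assumes, as the note does, that every table is followed by a blank line (so the sample ends with one); the description line and the names "monthly" and "annual" are hypothetical labels chosen for this sample only.

```r
# Stand-in for the file contents; layout assumed, description is a placeholder
Lines <- c(
  "placeholder description line",
  "",
  ",Mkt-RF,SMB,HML,RF",
  "192607,    2.96,   -2.30,   -2.87,    0.22",
  "192608,    2.64,   -1.40,    4.19,    0.25",
  "",
  " Annual Factors: January-December ",
  ",Mkt-RF,SMB,HML,RF",
  "  1927,   29.47,   -2.46,   -3.75,    3.12",
  "  1928,   35.39,    4.20,   -6.15,    3.56",
  ""
)

st <- grep("^,", Lines)        # header line of each table
en <- which(Lines == "")[-1]   # blank line after each table (first blank dropped)
stopifnot(length(st) == length(en))  # guard: one end per start

L <- Map(function(st, en) read.csv(text = Lines[st:en]), st, en)
names(L) <- c("monthly", "annual")   # hypothetical labels for this sample

L$annual
```

The length check is only a guard: if the real file does not end a table with a blank line, the start and end vectors would not line up and the Map call would pair them incorrectly.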

Answer 2:

I used read.csv instead of read.table:

French <- read.csv("F-F_Research_Data_Factors.CSV", sep = ",", skip = 3,
                   header = TRUE)

and got 1188 observations. I think you can subset your dataset from here.
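If the whole file is read this way, the annual section and its repeated header presumably land in the same data frame, which tends to leave the columns as text rather than numbers. Here is a hedged sketch of the subsetting step under that assumption, run on a made-up stand-in for the file (txt, key, keep, Monthly and the placeholder description lines are not from the original):

```r
# Stand-in text with the same assumed layout as the file
txt <- c(
  "placeholder description line 1",
  "placeholder description line 2",
  "",
  ",Mkt-RF,SMB,HML,RF",
  "192607,    2.96,   -2.30,   -2.87,    0.22",
  "192608,    2.64,   -1.40,    4.19,    0.25",
  "192609,    0.36,   -1.32,    0.01,    0.23",
  "",
  " Annual Factors: January-December ",
  ",Mkt-RF,SMB,HML,RF",
  "  1927,   29.47,   -2.46,   -3.75,    3.12",
  "  1928,   35.39,    4.20,   -6.15,    3.56"
)

French <- read.csv(text = txt, skip = 3, header = TRUE)

# Coerce the key column; the title and repeated header rows become NA
key <- suppressWarnings(as.numeric(as.character(French$X)))

# Keep only rows inside the yyyymm range; annual keys such as 1927 fall outside
keep <- !is.na(key) & key >= 192607 & key <= 192608
Monthly <- French[keep, ]

# Convert every column of the kept rows back to numeric
Monthly[] <- lapply(Monthly, function(col) as.numeric(as.character(col)))
Monthly
```

The as.character step is there so the conversion behaves the same whether read.csv returned character columns or factors.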

Source: https://stackoverflow.com/questions/47141541/importing-from-csv-from-a-specified-range-of-values

Tags

csv

data-manipulation