Question
I'm trying to read CSV files from a GTFS zip archive with uniVocity-parsers and have run into an issue that I can't figure out. For some reason, the first column of some CSV files is not parsed correctly. For example, the "stops.txt" file looks like this:
stop_id,stop_name,stop_lat,stop_lon,location_type,parent_station
"de:3811:30215:0:6","Freiburg Stübeweg","48.0248455941735","7.85563688037231","","Parent30215"
"de:8311:30054:0:1","Freiburg Schutternstraße","48.0236251356332","7.72434519425597","","Parent30054"
"de:8311:30054:0:2","Freiburg Schutternstraße","48.0235446600679","7.72438739944883","","Parent30054"
The "stop_id" field is not parsed correctly and ends up with the value "null".
This is the method I'm using to read the file:
public <T> List<T> readCSV(String path, String file, BeanListProcessor<T> processor) {
    List<T> content = null;
    try {
        // Get zip file
        ZipFile zip = new ZipFile(path);
        // Get CSV file
        ZipEntry entry = zip.getEntry(file);
        InputStream in = zip.getInputStream(entry);
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setProcessor(processor);
        parserSettings.setHeaderExtractionEnabled(true);
        CsvParser parser = new CsvParser(parserSettings);
        parser.parse(new InputStreamReader(in));
        content = processor.getBeans();
        zip.close();
        return content;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return content;
}
And this is what my Stop class looks like:
public class Stop {
    @Parsed
    private String stop_id;
    @Parsed
    private String stop_name;
    @Parsed
    private String stop_lat;
    @Parsed
    private String stop_lon;
    @Parsed
    private String location_type;
    @Parsed
    private String parent_station;

    public Stop() {
    }

    public Stop(String stop_id, String stop_name, String stop_lat, String stop_lon, String location_type,
            String parent_station) {
        this.stop_id = stop_id;
        this.stop_name = stop_name;
        this.stop_lat = stop_lat;
        this.stop_lon = stop_lon;
        this.location_type = location_type;
        this.parent_station = parent_station;
    }

    // --------------------- Getter --------------------------------
    public String getStop_id() {
        return stop_id;
    }

    public String getStop_name() {
        return stop_name;
    }

    public String getStop_lat() {
        return stop_lat;
    }

    public String getStop_lon() {
        return stop_lon;
    }

    public String getLocation_type() {
        return location_type;
    }

    public String getParent_station() {
        return parent_station;
    }

    // --------------------- Setter --------------------------------
    public void setStop_id(String stop_id) {
        this.stop_id = stop_id;
    }

    public void setStop_name(String stop_name) {
        this.stop_name = stop_name;
    }

    public void setStop_lat(String stop_lat) {
        this.stop_lat = stop_lat;
    }

    public void setStop_lon(String stop_lon) {
        this.stop_lon = stop_lon;
    }

    public void setLocation_type(String location_type) {
        this.location_type = location_type;
    }

    public void setParent_station(String parent_station) {
        this.parent_station = parent_station;
    }

    @Override
    public String toString() {
        return "Stop [stop_id=" + stop_id + ", stop_name=" + stop_name + ", stop_lat=" + stop_lat + ", stop_lon="
                + stop_lon + ", location_type=" + location_type + ", parent_station=" + parent_station + "]";
    }
}
If I call the method, I get this output, which is not correct:
PartialReading pr = new PartialReading();
List<Stop> stops = pr.readCSV("VAGFR.zip", "stops.txt", new BeanListProcessor<Stop>(Stop.class));
for (int i = 0; i < 4; i++) {
    System.out.println(stops.get(i).toString());
}
Output:
Stop [stop_id=null, stop_name=Freiburg Stübeweg, stop_lat=48.0248455941735, stop_lon=7.85563688037231, location_type=null, parent_station=Parent30215]
Stop [stop_id=null, stop_name=Freiburg Schutternstraße, stop_lat=48.0236251356332, stop_lon=7.72434519425597, location_type=null, parent_station=Parent30054]
Stop [stop_id=null, stop_name=Freiburg Schutternstraße, stop_lat=48.0235446600679, stop_lon=7.72438739944883, location_type=null, parent_station=Parent30054]
Stop [stop_id=null, stop_name=Freiburg Waltershofen Ochsen, stop_lat=48.0220902613143, stop_lon=7.7205756507492, location_type=null, parent_station=Parent30055]
Does anyone know why this happens and how I can fix it? This also happens in the "routes.txt" and "trips.txt" files that I tested. This is the GTFS file: http://stadtplan.freiburg.de/sld/VAGFR.zip
Answer 1:
If you print the headers, you will notice that the first column doesn't look right. That's because you are parsing a file encoded in UTF-8 with a BOM (byte order mark).
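To see the effect in isolation, here is a minimal sketch in plain Java, without uniVocity. The class name `BomHeaderDemo` and the in-memory sample data are made up for illustration. It relies only on the fact that Java's UTF-8 decoder keeps a leading BOM as the character U+FEFF instead of dropping it, so the first header name no longer equals "stop_id". That is presumably why the `stop_id` field finds no matching column.

```java
import java.nio.charset.StandardCharsets;

public class BomHeaderDemo {

    // Decodes the bytes as UTF-8 and returns the first column name of the first line.
    public static String firstHeader(byte[] csvBytes) {
        String text = new String(csvBytes, StandardCharsets.UTF_8);
        String firstLine = text.split("\n", 2)[0];
        return firstLine.split(",")[0];
    }

    // Builds a tiny in-memory header line prefixed with the UTF-8 BOM bytes (EF BB BF).
    // This is hypothetical sample data, not the real stops.txt.
    public static byte[] sampleWithBom() {
        byte[] header = "stop_id,stop_name\n".getBytes(StandardCharsets.UTF_8);
        byte[] bytes = new byte[header.length + 3];
        bytes[0] = (byte) 0xEF;
        bytes[1] = (byte) 0xBB;
        bytes[2] = (byte) 0xBF;
        System.arraycopy(header, 0, bytes, 3, header.length);
        return bytes;
    }

    public static void main(String[] args) {
        String first = firstHeader(sampleWithBom());
        System.out.println(first.equals("stop_id"));   // prints false
        System.out.println(first.charAt(0) == 0xFEFF); // prints true
        System.out.println(first.length());            // prints 8 (7 letters plus the BOM character)
    }
}
```

The BOM character is invisible in most consoles, which is why the header can look almost right when printed and still fail a string comparison.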
Basically, the file starts with a few bytes that indicate the encoding. Until version 2.5.*, the parser didn't handle that internally, and you had to skip these bytes yourself to get the correct output:
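//... your code here
ZipEntry entry = zip.getEntry(file);
InputStream in = zip.getInputStream(entry);
// 239, 187, 191 are the UTF-8 BOM bytes EF BB BF.
// Note: all three bytes are consumed whether or not they form a BOM,
// so use this only on files known to start with one.
if (in.read() == 239 & in.read() == 187 & in.read() == 191) {
    System.out.println("UTF-8 with BOM, bytes discarded");
}
CsvParserSettings parserSettings = new CsvParserSettings();
//...rest of your code here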
The above hack will work on any version before 2.5.*. Alternatively, Commons-IO provides a BOMInputStream for convenience and cleaner handling of this sort of thing; it's just VERY slow.
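If you are stuck on an older version and don't want to add Commons-IO, the byte check can be made safe for files without a BOM by using the JDK's `PushbackInputStream`. The following is only a sketch: the class `BomSkipper` and its methods are hypothetical names, not part of uniVocity. It reads up to three bytes and pushes them back unless they are exactly EF BB BF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomSkipper {

    // Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if there is one.
    // Without a BOM, the bytes read during the check are pushed back, so nothing is lost.
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream
        while (read < 3) {
            int n = pushback.read(head, read, 3 - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        boolean hasBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && read > 0) {
            pushback.unread(head, 0, read);
        }
        return pushback;
    }

    // Convenience helper for trying it out: decodes in-memory bytes as UTF-8 after BOM removal.
    public static String decodeSkippingBom(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the question's readCSV method, this would mean passing `new InputStreamReader(BomSkipper.skipUtf8Bom(in), StandardCharsets.UTF_8)` to `parser.parse(...)`. Naming the charset explicitly is an extra assumption here; the original code relies on the platform default.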
Updating to a recent version should take care of it automatically.
Hope it helps.
Source: https://stackoverflow.com/questions/42415549/univocity-doesnt-parse-the-first-column-into-beans