Question
I have some data in TSV format compressed with LZO. Now I would like to use these data in a Java Spark program.
At the moment, I can decompress the files and then import them into Java as text files using
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .appName("MyName")
        .getOrCreate();

Dataset<Row> input = spark.read()
        .option("sep", "\t")   // tab-separated values
        .csv(args[0]);

input.show(5); // visually check whether the data were imported correctly
where I have passed the path to the decompressed file as the first argument. If I pass the LZO file as the argument instead, the output of show is illegible garbage.
Is there a way to make it work? I use IntelliJ as my IDE and the project is set up with Maven.
Answer 1:
I found a solution. It consists of two parts: installing the hadoop-lzo package and configuring it. After doing this, the code remains the same as in the question, provided one is OK with the LZO file being read into a single partition.
In the following I explain how to do this for a Maven project set up in IntelliJ.
Installing the hadoop-lzo package: you need to modify the pom.xml file in your Maven project folder so that it contains the following excerpt:

<repositories>
    <repository>
        <id>twitter-twttr</id>
        <url>http://maven.twttr.com</url>
    </repository>
</repositories>

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
    <dependency>
        <!-- Apache Spark main library -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <!-- Packages for datasets and dataframes -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.hadoop.gplcompression/hadoop-lzo -->
    <dependency>
        <groupId>com.hadoop.gplcompression</groupId>
        <artifactId>hadoop-lzo</artifactId>
        <version>0.4.20</version>
    </dependency>
</dependencies>
This activates Twitter's Maven repository, which hosts the hadoop-lzo package, and makes hadoop-lzo available to the project.
The second step is to create a core-site.xml file to tell Hadoop that you have installed a new codec. It should be placed somewhere in the program folders. I put it under src/main/resources/core-site.xml and marked the folder as a resources root (right-click on the folder in the IntelliJ Project panel -> Mark Directory as -> Resources Root). The core-site.xml file should contain:

<configuration>
    <property>
        <name>io.compression.codecs</name>
        <value>org.apache.hadoop.io.compress.DefaultCodec,
               com.hadoop.compression.lzo.LzoCodec,
               com.hadoop.compression.lzo.LzopCodec,
               org.apache.hadoop.io.compress.GzipCodec,
               org.apache.hadoop.io.compress.BZip2Codec</value>
    </property>
    <property>
        <name>io.compression.codec.lzo.class</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
</configuration>
And that's it! Run your program again and it should work!
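For reference, here is a sketch of the read applied directly to the compressed file, mirroring the code from the question (which, per the above, stays unchanged). The repartition call is an optional, hypothetical addition to spread the single input partition mentioned earlier across more tasks; the value 8 is arbitrary.

Dataset<Row> input = spark.read()
        .option("sep", "\t")
        .csv(args[0])      // args[0] now points to the .lzo file
        .repartition(8);   // optional: break up the single input partition

input.show(5); // the first rows should now display correctly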
Source: https://stackoverflow.com/questions/45376241/importing-a-lzo-file-into-java-spark-as-dataset