sequencefile | 易学教程

Get HDFS file path in PySpark for files in sequence file format

Question: My data on HDFS is in Sequence file format. I am using PySpark (Spark 1.6) and trying to achieve two things: (1) The data path contains a timestamp in yyyy/mm/dd/hh format that I would like to bring into the data itself. I tried SparkContext.wholeTextFiles, but I think that might not support the Sequence file format. (2) How do I deal with the point above if I want to crunch data for a whole day and bring the date into the data? In this case I would be loading data with a path in yyyy/mm/dd/* format. Appreciate any
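
One piece of this can be sketched independently of Spark: once a record's file path is available as a string, the yyyy/mm/dd/hh part can be parsed out of it. The function name `timestamp_from_path`, the regular expression, and the example path are assumptions made for illustration; how the path is attached to each record depends on the reader being used and is not shown here.

```python
import re
from datetime import datetime

# Hypothetical pattern: assumes the path contains four consecutive
# directory levels of the form yyyy/mm/dd/hh, as described in the question.
_TS_PATTERN = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/(\d{2})(?:/|$)")


def timestamp_from_path(path):
    """Return the datetime encoded as yyyy/mm/dd/hh in path, or None if absent."""
    match = _TS_PATTERN.search(path)
    if match is None:
        return None
    year, month, day, hour = (int(part) for part in match.groups())
    return datetime(year, month, day, hour)


# Example with a made-up path:
ts = timestamp_from_path("hdfs://namenode/logs/2016/05/10/17/part-00000")
```

Because the parsing is keyed on the path alone, the same helper would apply when a whole day is loaded with a yyyy/mm/dd/* glob, as long as each record still carries the full path of the file it came from.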

Why the SequenceFile is truncated?

Question: I am learning Hadoop and this problem has baffled me for a while. Basically I am writing a SequenceFile to disk and then reading it back. However, every time I get an EOFException when reading. A deeper look reveals that the sequence file is prematurely truncated while it is being written; this always happens after writing index 962, and the file always has a fixed size of 45056 bytes. I am using Java 8 and Hadoop 2.5.1 on a MacBook Pro. In fact, I tried the same code on another Linux machine under

Exporting sequence file to Oracle by Sqoop

Question: I have been trying to find documentation on how to export a sequence file to Oracle using Sqoop. Is that possible? Currently my files (in HDFS) are in a text-based format, and I am using Sqoop to export them to some Oracle tables, which works fine. Now I want to change the file format from text to sequence file or something else (Avro later). So what do I need to do if I want to export a different file format from HDFS to Oracle using Sqoop? Any information will be

Hadoop Mahout Clustering

Question: I am trying to apply canopy clustering in Mahout. I have already converted a text file into a sequence file, but I cannot view the sequence file. Anyway, I tried applying canopy clustering with the following command: hduser@ubuntu:/usr/local/mahout/trunk$ mahout canopy -i /user/Hadoop/mahout_seq/seqdata -o /user/Hadoop/clustered_data -t1 5 -t2 3 I got the following error: 16/05/10 17:02:03 INFO mapreduce.Job: Task Id : attempt_1462850486830_0008_m_000000_1, Status : FAILED Error: java.lang

Hadoop HDFS: Read sequence files that are being written

Question: I am using Hadoop 1.0.3. I write logs to a Hadoop sequence file in HDFS. I call syncFS() after each batch of logs, but I never close the file (except when performing the daily roll). What I want to guarantee is that the file is available to readers while it is still being written. I can read the bytes of the sequence file via FSDataInputStream, but if I try to use SequenceFile.Reader.next(key,val), it returns false on the first call. I know the data is in the file since I can read

How can I use Mahout's sequencefile API code?

Question: Mahout has a command for creating a sequence file: bin/mahout seqdirectory -c UTF-8 -i <input address> -o <output address> . I want to use this command from code, through the API. Answer 1: You can do something like this: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.SequenceFile; import org.apache.hadoop.io.Text; Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); Path

How to limit the size of a Hadoop Sequence file?

Question: I am writing a Hadoop sequence file using a text file as input. I know how to write a sequence file from a text file, but I want to limit the output sequence file to a specific size, say 256 MB. Is there any built-in method to do this? Answer 1: AFAIK you'll need to write your own custom output format to limit output file sizes; by default, FileOutputFormats create a single output file per reducer. Another option is to create your sequence files as normal, then a second job (map only), with identity mappers
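
The core of any size-capped writer is the rolling rule: start a new output once the next record would push the current one past the limit. The sketch below shows only that rule in plain Python and is not Hadoop's API. The function name `split_by_size` is hypothetical, and the sizes counted are raw record lengths, so the sequence file header, sync markers, and compression are ignored and real file sizes would differ.

```python
def split_by_size(records, max_bytes):
    """Group byte-string records into batches whose total size stays within max_bytes.

    A single record larger than max_bytes still gets a batch of its own.
    """
    batches = []
    current = []
    current_size = 0
    for record in records:
        size = len(record)
        # Roll over to a new batch when this record would exceed the limit.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(record)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Example: an 8-byte limit applied to four small records.
batches = split_by_size([b"aaaa", b"bbbb", b"cc", b"dddddd"], 8)
```

In a custom output format, each batch boundary would correspond to closing the current writer and opening the next output file.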

Running MapReduce on HBase Exported Table throws Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'

Question: I took a backup of an HBase table using the HBase Export utility: hbase org.apache.hadoop.hbase.mapreduce.Export "FinancialLineItem" "/project/fricadev/ESGTRF/EXPORT" This kicked off a MapReduce job and transferred all my table data into the output folder. According to the documentation, the output file is in sequence file format, so I ran the code below to extract my keys and values from the file. Now I want to run MapReduce to read the key/value pairs from the output file, but I am getting the exception below

How to create a Hadoop sequence file in the local file system without a Hadoop installation?

Question: Is it possible to create a Hadoop sequence file from Java alone, without installing Hadoop? I need a standalone Java program that creates a sequence file locally. My Java program will run in an environment that does not have Hadoop installed. Answer 1: You would need the libraries but not the installation. Use SequenceFile.Writer. Sample code: import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io

Convert a text file to sequence format in Spark Java

Question: In Spark Java, how do I convert a text file to a sequence file? The following is my code: SparkConf sparkConf = new SparkConf().setAppName("txt2seq"); sparkConf.setMaster("local").set("spark.executor.memory", "1g"); sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input_txt"); infile.saveAsNewAPIHadoopFile("outfile.seq", String.class, String.class,