snappy

Is it possible to load Avro files with Snappy compression into BigQuery?

Submitted by 。_饼干妹妹 on 2019-12-11 04:28:56
Question: I know that BigQuery supports Avro file uploads, and I have successfully loaded plain Avro files into BigQuery. Using the command below,

java -jar avro-tools-1.7.7.jar fromjson --codec snappy --schema-file SourceSchema.avsc Source.json > Output.snappy.avro

I generated an Avro file with Snappy compression, but when I try to load it into BigQuery the load job fails with the following errors:

Errors: file-00000000: The Apache Avro library failed to parse file file-00000000. (error code: invalid)

Is it possible to load
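For reference, a minimal sketch of how such a file is normally submitted with the google-cloud-bigquery Python client; the dataset and table names are hypothetical, and the file name simply reuses the output of the avro-tools command above.

# Hypothetical sketch: load a Snappy-compressed Avro file into BigQuery.
# "my_dataset.my_table" is a made-up destination.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

with open("Output.snappy.avro", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file, "my_dataset.my_table", job_config=job_config
    )

load_job.result()  # waits for the load job and raises on failure
print(client.get_table("my_dataset.my_table").num_rows, "rows loaded")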

mvn and the make package error

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-11 04:23:32
Question: OK. Here's the problem, and it's driving me crazy!!! I followed the instructions online, installed Hadoop, and when running the test it said the snappy native library couldn't be loaded. It said I have to install snappy first and then install hadoop-snappy. I downloaded snappy-1.0.4 from Google Code and did the following:

cd ../snappy-1.0.4
./configure
make
sudo make install

Then the problem appears when running:

mvn package -Dsnappy.prefix=/usr/local

The post online said by default the snappy should be installed
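As an aside, a quick hedged check from Python that the snappy shared library really did land under the prefix passed to -Dsnappy.prefix; the /usr/local path and the libsnappy.so.1 soname are assumptions based on the default install location of snappy-1.0.4.

# Hypothetical diagnostic: verify that libsnappy is present and loadable
# at the prefix handed to mvn via -Dsnappy.prefix (assumed /usr/local here).
import ctypes
import ctypes.util

print(ctypes.util.find_library("snappy"))           # None if the dynamic linker cannot see it
lib = ctypes.CDLL("/usr/local/lib/libsnappy.so.1")   # raises OSError if missing or unloadable
print("loaded:", lib)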

Decrypting Hadoop Snappy File

Submitted by 别说谁变了你拦得住时间么 on 2019-12-10 23:29:32
Question: So I'm having some issues decompressing a snappy file from HDFS. If I use hadoop fs -text, I am able to uncompress and output the file just fine. However, if I use hadoop fs -copyToLocal and try to uncompress the file with python-snappy, I get

snappy.UncompressError: Error while decompressing: invalid input

My Python program is very simple and looks like this:

import snappy

with open(snappy_file, "r") as input_file:
    data = input_file.read()
    uncompressed = snappy.uncompress(data)
    print uncompressed
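The usual explanation is that files written by Hadoop's SnappyCodec use Hadoop's own block framing rather than raw snappy, so calling snappy.uncompress on the whole file cannot parse it. A hedged sketch of the likely fix is below: open the file in binary mode and use the Hadoop-framing helper that recent python-snappy releases ship; the hadoop_snappy module name and the file paths are assumptions, so check the installed python-snappy version.

# Hypothetical sketch: decompress a Hadoop-framed snappy file copied out of HDFS.
# Assumes a python-snappy version that includes the hadoop_snappy streaming helper.
from snappy import hadoop_snappy

with open("part-00000.snappy", "rb") as src, open("part-00000.txt", "wb") as dst:
    hadoop_snappy.stream_decompress(src, dst)   # reads Hadoop block framing, writes raw bytes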

How to snappy compress a file using a python script

Submitted by 感情迁移 on 2019-12-10 11:55:35
Question: I am trying to compress a csv file into snappy format using a Python script and the python-snappy module. This is my code so far:

import snappy

d = snappy.compress("C:\\Users\\my_user\\Desktop\\Test\\Test_file.csv")

with open("compressed_file.snappy", 'w') as snappy_data:
    snappy_data.write(d)
    snappy_data.close()

This code actually creates a snappy file, but the file created only contains the string: "C:\Users\my_user\Desktop\Test\Test_file.csv" So I am a bit lost on getting my csv
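The snippet compresses the path string itself rather than the file's contents. A minimal corrected sketch, reusing the (hypothetical) paths from the question and reading and writing in binary mode:

# Hedged sketch: compress the *contents* of the CSV, not the path string.
import snappy

src_path = "C:\\Users\\my_user\\Desktop\\Test\\Test_file.csv"   # path from the question

with open(src_path, "rb") as in_file, open("compressed_file.snappy", "wb") as out_file:
    out_file.write(snappy.compress(in_file.read()))

# For large files, python-snappy's framed streaming API avoids holding everything in memory:
# with open(src_path, "rb") as in_file, open("compressed_file.snappy", "wb") as out_file:
#     snappy.stream_compress(in_file, out_file)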

How can I insert into a hive table with parquet fileformat and SNAPPY compression?

Submitted by 泄露秘密 on 2019-12-10 11:47:14
Question: Hive 2.1. I have the following table definition:

CREATE EXTERNAL TABLE table_snappy (
  a STRING,
  b INT)
PARTITIONED BY (c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');

Now, I would like to insert data into it:

INSERT INTO table_snappy
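For comparison, a hedged PySpark sketch (an alternative route, not Hive's own SET-based approach) that performs the same kind of insert while forcing the Parquet codec to Snappy; source_table and the partition value are made-up names.

# Hypothetical PySpark alternative: insert into the Parquet table with Snappy compression.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("snappy_parquet_insert")
         .config("spark.sql.parquet.compression.codec", "snappy")
         .enableHiveSupport()
         .getOrCreate())

# source_table is a placeholder; the partition value 'c1' is made up as well.
spark.sql("INSERT INTO table_snappy PARTITION (c='c1') SELECT a, b FROM source_table")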

Performance comparison of Java compression algorithms

Submitted by 戏子无情 on 2019-12-10 00:26:46
Preface: In game development, the necessary state is usually initialized when a player enters the game, and this initialization packet is relatively large, typically around 30-40 KB, so it is worth compressing it before sending the message. A while ago I read an article that listed several commonly used compression algorithms, as shown in the figure below. "Splittable" indicates whether you can seek to an arbitrary position in the data stream and continue reading from there, a capability that is especially useful in Hadoop's MapReduce. Below is a brief introduction to these compression formats, followed by a stress test and performance comparison.

DEFLATE

DEFLATE is a lossless data compression algorithm that combines the LZ77 algorithm with Huffman coding. The source code for DEFLATE compression and decompression can be found in the free, general-purpose compression library zlib (official site: http://www.zlib.net/). The JDK supports the zlib library through the compression class Deflater and the decompression class Inflater, both of which expose native methods:

private native int deflateBytes(long addr, byte[] b, int off, int len, int flush);
private native int inflateBytes(long addr, byte[] b, int off, int len) throws DataFormatException;
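As a rough illustration of the same DEFLATE round trip outside the JDK, here is a small sketch using Python's zlib module, which wraps the same zlib library mentioned above; the ~40 KB payload only echoes the packet size quoted in the preface and is an assumption.

# Hedged sketch: DEFLATE round trip via Python's zlib (the library referenced above).
import zlib

payload = b"player-init-data " * 2400          # roughly 40 KB stand-in for the init packet
compressed = zlib.compress(payload, level=6)   # level 6 is zlib's default speed/ratio trade-off
restored = zlib.decompress(compressed)

assert restored == payload
print(f"{len(payload)} -> {len(compressed)} bytes")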

spark returns error libsnappyjava.so: failed to map segment from shared object: Operation not permitted

Submitted by 空扰寡人 on 2019-12-09 12:52:54
Question: I have just extracted and set up Spark 1.6.0 in an environment that has a fresh install of Hadoop 2.6.0 and Hive 0.14. I have verified that hive, beeline and mapreduce work fine on the examples. However, as soon as I run sc.textFile() within spark-shell, it returns an error:

$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to
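One hedged diagnostic worth running before digging further (based on a common cause of this exact error message, not on anything stated in the truncated excerpt above): check whether the JVM temp directory is mounted noexec, since snappy-java unpacks its native library there before loading it.

# Hedged diagnostic: snappy-java extracts libsnappyjava.so into java.io.tmpdir
# (normally /tmp) and loads it from there; if that mount carries "noexec",
# dlopen fails with "failed to map segment from shared object: Operation not permitted".
with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if mountpoint == "/tmp":
            print("/tmp mount options:", options)   # look for "noexec" here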

Spark & file compression

Submitted by 核能气质少年 on 2019-12-08 19:24:43
Files stored in HDFS are generally kept with multiple replicas. Compressing them not only saves a great deal of space; a suitable storage format can also greatly improve read performance.

Text file compression

bzip2: highest compression ratio, relatively slow compression and decompression, supports splitting.

import org.apache.hadoop.io.compress.BZip2Codec
rdd.saveAsTextFile("codec/bzip2", classOf[BZip2Codec])

snappy: compression ratio of 38.2% on JSON text, with short compression and decompression times.

import org.apache.hadoop.io.compress.SnappyCodec
rdd.saveAsTextFile("codec/snappy", classOf[SnappyCodec])

gzip: high compression ratio, fairly fast compression and decompression, does not support splitting; if file sizes are not kept under control, later analysis jobs may become inefficient. Compression ratio of 23.5% on JSON text, suitable for rarely used files kept in long-term storage.

import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsTextFile("codec/gzip", classOf[GzipCodec])

Parquet file compression

Parquet provides columnar storage for files
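For readers working from PySpark rather than Scala, an equivalent hedged sketch follows; rdd is assumed to be an existing RDD of strings, and the output paths simply mirror the ones above.

# Hypothetical PySpark equivalents of the Scala snippets above; "rdd" is assumed
# to already exist, and the output directories are illustrative.
rdd.saveAsTextFile("codec/bzip2",
                   compressionCodecClass="org.apache.hadoop.io.compress.BZip2Codec")
rdd.saveAsTextFile("codec/snappy",
                   compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec")
rdd.saveAsTextFile("codec/gzip",
                   compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")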

Why is parquet slower for me against text file format in hive?

Submitted by 我们两清 on 2019-12-07 05:12:20
Question: OK! So I decided to use Parquet as the storage format for Hive tables, and before actually rolling it out on my cluster I decided to run some tests. Surprisingly, Parquet was slower in my tests, against the general notion that it is faster than plain text files. Please note that I am using Hive 0.13 on MapR. Here is the flow of my operations:

Table A - Format: Text, Table size: 2.5 Gb
Table B - Format: Parquet, Table size: 1.9 Gb [Create table B stored as parquet as select * from A]

Spark 1.1.0: snappy requires a newer gcc

Submitted by 我只是一个虾纸丫 on 2019-12-06 19:37:17
We recently upgraded Spark to version 1.1.0, and jobs started failing with:

Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5.3-6ceb7982-8940-431c-95a8-25b3684fa0be-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by /tmp/snappy-1.0.5.3

Our system is RHEL 5, whose libstdc++ only provides GLIBCXX up to 3.4.8, while snappy needs 3.4.9. Annoying. Spark 1.0.0 never had this problem, so to minimize system changes we rebuilt Spark with the snappy version in pom.xml manually changed from 1.0.5.3 back to the version used in Spark 1.0.0, but that did not solve the problem. In the end we had to build a newer gcc (I used gcc 4.7.3) and symlink the new gcc's libstdc++.so.6.x to /usr/lib64/libstdc++.so.6, which solved the problem.

Libraries the gcc build depends on / configure flags used:

../configure --prefix=/usr/local/gcc-4.7.3 --enable-threads=posix --disable-bootstrap --disable-multilib -
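A quick, hedged way to see which GLIBCXX symbol versions a given libstdc++ actually exports; the library path is the one from the error above, and the strings utility is assumed to be available on the machine.

# Hedged diagnostic: list the GLIBCXX symbol versions exported by libstdc++.
# Equivalent to `strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX` on the shell.
import subprocess

out = subprocess.run(
    ["strings", "/usr/lib64/libstdc++.so.6"],
    capture_output=True, text=True, check=True,
).stdout

print(sorted({line for line in out.splitlines() if line.startswith("GLIBCXX_")}))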