bzip2

Organizing files in a tar.bz2 file with Python

[亡魂溺海] Submitted on 2019-12-05 21:31:50
I have about 200,000 text files that are placed in a bz2 file. The issue I have is that when I scan the bz2 file to extract the data I need, it goes extremely slow. It has to look through the entire bz2 file to find the single file I am looking for. Is there any way to speed this up? Also, I thought about possibly organizing the files in the tar.bz2 so it instead knows where to look. Is there any way to organize files that are put into a bz2? More info/edit: I need to query the compressed file for each text file. Is there a better compression method that supports such a large number of …
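
bzip2 itself is a plain stream compressor with no index, so a .tar.bz2 has to be decompressed from the front for every lookup. One common workaround, sketched below with hypothetical names, is to repack into a zip archive: zip keeps a central directory, so a single member can be extracted without touching the rest (and ZIP_BZIP2 keeps bzip2 compression per member).

    import zipfile

    # Repack the text files into a zip: zip keeps a central directory,
    # so any single member can be read without scanning the archive.
    with zipfile.ZipFile('corpus.zip', 'w', zipfile.ZIP_BZIP2) as zf:
        zf.writestr('file_000001.txt', 'example contents')

    # Random access later: only the requested member is decompressed.
    with zipfile.ZipFile('corpus.zip') as zf:
        data = zf.read('file_000001.txt')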

GoLang: Decompress bz2 in one goroutine, consume in another goroutine

故事扮演 Submitted on 2019-12-05 20:30:19
I am a new-grad SWE learning Go (and loving it). I am building a parser for Wikipedia dump files: basically a huge bzip2-compressed XML file (~50 GB uncompressed). I want to do both streaming decompression and parsing, which sounds simple enough. For decompression, I do: inputFilePath := flag.Arg(0) inputReader := bzip2.NewReader(inputFile) and then pass the reader to the XML parser: decoder := xml.NewDecoder(inputReader) However, since both decompression and parsing are expensive operations, I would like to have them run in separate goroutines to make use of additional cores. How would I go …
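
In Go the usual answer is to connect the two goroutines with an io.Pipe or a channel of byte slices. A rough sketch of the same producer/consumer shape, written in Python to match the other sketches here (file name and chunk size are hypothetical): one thread streams decompressed chunks into a bounded queue while the main thread feeds a pull parser.

    import bz2
    import queue
    import threading
    import xml.etree.ElementTree as ET

    def decompress_worker(path, q, chunk_size=1 << 20):
        # Producer: stream decompressed bytes into a bounded queue.
        with bz2.open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                q.put(chunk)
        q.put(None)  # sentinel: end of stream

    q = queue.Queue(maxsize=8)  # bounded, so decompression cannot run far ahead
    t = threading.Thread(target=decompress_worker, args=('dump.xml.bz2', q))
    t.start()

    parser = ET.XMLPullParser(events=('end',))
    while True:
        chunk = q.get()
        if chunk is None:
            break
        parser.feed(chunk)
        for event, elem in parser.read_events():
            pass  # handle each parsed element here
    t.join()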

Why is seeking from the end of a file allowed for BZip2 files and not Gzip files?

痴心易碎 Submitted on 2019-12-05 16:34:34
The Question: I am parsing large compressed files in Python 2.7.6 and would like to know the uncompressed file size before starting. I am trying to use the second technique presented in this SO answer. It works for bzip2-formatted files but not gzip-formatted files. What is different about the two compression algorithms that causes this? Example Code: This code snippet demonstrates the behavior, assuming you have "test.bz2" and "test.gz" present in your current working directory: import os import bz2 import gzip bz = bz2.BZ2File('test.bz2', mode='r') bz.seek(0, os.SEEK_END) bz.close() gz = gzip …
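
For context: in Python 2.7, BZ2File implements SEEK_END (it decompresses through to the end of the stream to learn the size), whereas GzipFile raises ValueError for whence=os.SEEK_END. For gzip there is a cheaper route, sketched below on the assumption of a single-member gzip file: per RFC 1952, the last four bytes of the file store the uncompressed length modulo 2**32.

    import os
    import struct

    def gzip_uncompressed_size(path):
        # RFC 1952: the trailing ISIZE field (4 bytes, little-endian)
        # holds the uncompressed length modulo 2**32. Only reliable
        # for single-member gzip files under 4 GiB.
        with open(path, 'rb') as f:
            f.seek(-4, os.SEEK_END)
            return struct.unpack('<I', f.read(4))[0]

    print(gzip_uncompressed_size('test.gz'))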

BZ2 compression in C++ with bzlib.h

∥☆過路亽.° Submitted on 2019-12-05 14:54:50
I currently need some help learning how to use the bzlib.h header. I was wondering if anyone would be so kind as to help me figure out a compressToBZ2() function in C++ without using any Boost libraries? void compressBZ2(std::string file) { std::ifstream infile; int fileDestination = infile.open(file.c_str()); char bz2Filename[] = "file.bz2"; FILE *bz2File = fopen(bz2Filename, "wb"); int bzError; const int BLOCK_MULTIPLIER = 7; BZFILE *myBZ = BZ2_bzWriteOpen(&bzError, bz2File, BLOCK_MULTIPLIER, 0, 0); const int BUF_SIZE = 10000; char* buf = new char[BUF_SIZE]; ssize_t bytesRead; while ((bytesRead …
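
Not an answer for bzlib.h itself, but the intended read-compress-write loop is easy to see in Python's bz2 module; a minimal sketch, with hypothetical paths and compression level 7 to mirror the BLOCK_MULTIPLIER above:

    import bz2
    import shutil

    def compress_to_bz2(src_path, dst_path, level=7):
        # compresslevel corresponds to bzlib's blockSize100k parameter
        # (1-9, i.e. 100 kB to 900 kB blocks).
        with open(src_path, 'rb') as src, \
                bz2.open(dst_path, 'wb', compresslevel=level) as dst:
            shutil.copyfileobj(src, dst, length=10000)  # 10 kB reads, like BUF_SIZE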

How to use the bzip2 format in iOS? Apple tells me bzBuffToBuffDecompress is a private API

泄露秘密 Submitted on 2019-12-04 12:34:24
Question: Today I submitted my iOS app to the App Store, but soon I got a mail from Apple saying that it cannot be posted to the App Store because it is using private or undocumented APIs: Private Symbol References: BZ2_bzBuffToBuffDecompress. As you know, as outlined in the iPhone Developer Program License Agreement section 3.3.1, the use of non-public APIs is not permitted. Before your application can be reviewed by the App Review Team, please resolve this issue and upload a new binary to iTunes Connect. What …

bz2 in JavaScript

╄→гoц情女王★ Submitted on 2019-12-03 21:53:50
Are there any JavaScript libraries that can take a byte array and bz2-decompress it into another byte array? I know that many browsers have this capability for an entire stream, but this array is at an offset from the start of the stream. Yes. Here's one for byte arrays: https://github.com/antimatter15/bzip2.js And for binary strings: https://github.com/kirilloid/bzip2-js Source: https://stackoverflow.com/questions/9434613/bz2-in-javascript

How to use awk for a compressed file

有些话、适合烂在心里 Submitted on 2019-12-03 15:54:11
Question: How can I change the following command for a compressed file? awk 'FNR==NR { array[$1,$2]=$8; next } ($1,$2) in array { print $0 ";" array[$1,$2] }' input1.vcf input2.vcf The command works fine with normal files. I need to change the command for compressed files. Answer 1: You need to read the compressed files like this: awk '{ ... }' <(gzip -dc input1.vcf.gz) <(gzip -dc input2.vcf.gz) Try this: awk 'FNR==NR { sub(/AA=\.;/,""); array[$1,$2]=$8; next } ($1,$2) in array { print $0 ";" array[$1,$2] …
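
The <(...) process substitution decompresses each file on the fly, so awk still sees two ordinary streams. For comparison, here is the same two-pass join written directly in Python over gzipped inputs (file names hypothetical):

    import gzip

    # Pass 1: map (col1, col2) -> col8 from the first file.
    lookup = {}
    with gzip.open('input1.vcf.gz', 'rt') as f1:
        for line in f1:
            fields = line.split()
            if len(fields) >= 8:
                lookup[(fields[0], fields[1])] = fields[7]

    # Pass 2: print matching lines from the second file, annotated.
    with gzip.open('input2.vcf.gz', 'rt') as f2:
        for line in f2:
            fields = line.split()
            key = (fields[0], fields[1])
            if key in lookup:
                print(line.rstrip('\n') + ';' + lookup[key])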

ssh with multiple commands appends a question mark to the file name

笑着哭i Submitted on 2019-12-02 14:30:55
Question: I have a database transfer script, which uses bzip2 to minimise locking of large databases on a server. The first line is ssh root@server "mysqldump db | bzip2 >/root/db.sql.bz2" This works on many servers, but on a new Ubuntu 14.04 server the file created on the server has a question mark appended: ls -la gt* -rw-r--r-- 1 root root 2364190 Nov 21 00:25 db.sql.bz2? Any idea why this may be happening? Answer 1: Does your script have CR+LF line endings? Make sure to use Unix (LF) line endings: with DOS endings, the trailing carriage return travels through ssh as part of the redirect target, so the remote file is literally named db.sql.bz2 followed by a CR, which ls displays as ? because it is unprintable. Source: https …

What is the difference between incremental and one-shot compression?

爱⌒轻易说出口 Submitted on 2019-12-01 11:13:46
I am trying to use the bz2 and/or lzma packages in Python. I am trying to compress a database dump in CSV format and then put it into a zip file. I got it to work with one-shot compression with both packages. The code for that looks like this: with ZipFile('something.zip', 'w') as zf: content = bz2.compress(bytes(csv_string, 'UTF-8')) # also with lzma zf.writestr( 'something.csv' + '.bz2', content, compress_type=ZIP_DEFLATED ) When I try to use incremental compression, it creates a .zip file which, when I try to extract it, keeps yielding another archive file recursively. The code for that looks like …
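
For background: one-shot compression (bz2.compress) takes the whole payload at once, while incremental compression hands back a compressor object you feed chunk by chunk, which matters when the dump is too large for memory. A minimal sketch of correct incremental use; the essential step is the final flush(), which emits the buffered remainder and the stream trailer:

    import bz2

    def compress_chunks(chunks):
        # Incremental compression: feed data piece by piece.
        comp = bz2.BZ2Compressor()
        out = [comp.compress(chunk) for chunk in chunks]
        out.append(comp.flush())  # required: emits buffered data + stream end
        return b''.join(out)

    # One-shot equivalent, for comparison:
    # bz2.compress(b''.join(chunks))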

Bzip2 block header: 1AY&SY

丶灬走出姿态 Submitted on 2019-12-01 09:12:48
This is a question about the bzip2 archive format. A bzip2 archive consists of a file header, one or more blocks, and a tail structure. Every block starts with "1AY&SY", six bytes holding the BCD-encoded digits of pi, 0x314159265359. According to the source of bzip2: /*-- A 6-byte block header, the value chosen arbitrarily as 0x314159265359 :-). A 32 bit value does not really give a strong enough guarantee that the value will not appear by chance in the compressed datastream. Worst-case probability of this event, for a 900k block, is about 2.0e-3 for 32 bits, 1.0e-5 for 40 bits and 4.0e-8 …
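
A quick way to see this magic in practice: the 4-byte file header ("BZh" plus a block-size digit) is followed immediately by the first block header, so that first occurrence is byte-aligned; later block headers fall on arbitrary bit offsets, which is why tools that split or index bzip2 streams have to scan at the bit level. A small sketch, assuming a file named test.bz2:

    BLOCK_MAGIC = bytes.fromhex('314159265359')  # the "1AY&SY" / pi magic

    def check_bz2_header(path):
        with open(path, 'rb') as f:
            header = f.read(4)          # b'BZh' + block size digit '1'..'9'
            if header[:3] != b'BZh':
                raise ValueError('not a bzip2 file')
            first_block = f.read(6)     # first block header is byte-aligned
            return first_block == BLOCK_MAGIC

    print(check_bz2_header('test.bz2'))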