bigdata

Which is better: multiple small h5 files or one huge one?

假装没事ソ submitted on 2020-08-07 04:54:19
Question: I'm working with huge satellite data that I'm splitting into small tiles to feed a deep learning model. I'm using PyTorch, which means the data loader can work with multiple threads. [Settings: Python, Ubuntu 18.04] I can't find any answer on which is better in terms of data access and storage between: storing all the data in one huge HDF5 file (over 20 GB), or splitting it into multiple (over 16,000) small HDF5 files (approx. 1.4 MB each). Is there any problem with multiple accesses to one file by
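
For the single-large-file option, a common pattern is to open the HDF5 file lazily inside each DataLoader worker instead of sharing one handle across processes. The sketch below only illustrates that pattern; the file name "tiles.h5", the dataset name "tiles", and the loader settings are placeholders, not details from the question.

    import h5py
    import torch
    from torch.utils.data import Dataset, DataLoader

    class TileDataset(Dataset):
        """Serves tiles from one large HDF5 file, opening it lazily per worker."""

        def __init__(self, h5_path, dataset_name="tiles"):
            self.h5_path = h5_path
            self.dataset_name = dataset_name
            self._file = None  # opened on first __getitem__, once per worker process
            with h5py.File(h5_path, "r") as f:
                self._length = f[dataset_name].shape[0]

        def __len__(self):
            return self._length

        def __getitem__(self, idx):
            if self._file is None:
                self._file = h5py.File(self.h5_path, "r")
            tile = self._file[self.dataset_name][idx]  # reads only this tile from disk
            return torch.from_numpy(tile)

    loader = DataLoader(TileDataset("tiles.h5"), batch_size=32, num_workers=4)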

Hive: modifying an external table's location takes too long

送分小仙女□ submitted on 2020-07-07 05:38:09
Question: Hive has two kinds of tables, Managed and External; for the difference, you can check Managed vs. External Tables. Currently, to move an external database from HDFS to Alluxio, I need to modify each external table's location to alluxio://. The statement is something like:

    alter table catalog_page set location "alluxio://node1:19998/user/root/tpcds/1000/catalog_returns"

According to my understanding, this should be a simple metastore modification; however, for some tables the modification
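
If many tables have to be repointed, the ALTER TABLE statement can be scripted. A minimal Python sketch using PyHive follows; the HiveServer2 host/port, database name, and table list are assumptions for illustration, not details from the question.

    from pyhive import hive

    # Placeholder connection details; adjust host, port and database for your cluster.
    conn = hive.connect(host="node1", port=10000, database="default")
    cursor = conn.cursor()

    tables = ["catalog_page", "catalog_returns"]  # hypothetical list of external tables to move
    for table in tables:
        new_location = f"alluxio://node1:19998/user/root/tpcds/1000/{table}"
        cursor.execute(f"ALTER TABLE {table} SET LOCATION '{new_location}'")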

Is there a way to catch executor killed exception in Spark?

吃可爱长大的小学妹 submitted on 2020-06-26 06:14:13
Question: During the execution of my Spark program, sometimes (the reason for it is still a mystery to me) YARN kills containers (executors) with the message that the memory limit was exceeded. My program does recover, though, with Spark re-executing the task by spawning a new container. However, in my program a task also creates some intermediate files on disk. When a container is killed, those files are left behind. Is there a way I can catch the executor being killed as an exception, so that I can delete
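
A killed container usually receives SIGKILL, which user code cannot catch, so one workaround is to make leftover files identifiable and sweep them later rather than trying to intercept the kill. The PySpark sketch below only illustrates that idea; the scratch directory and the per-row work are placeholders.

    import os
    import shutil
    from pyspark import TaskContext
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cleanup-demo").getOrCreate()
    SCRATCH_ROOT = "/tmp/my_job_scratch"  # placeholder scratch directory

    def process_partition(rows):
        ctx = TaskContext.get()
        # Key the scratch directory by task attempt so files left behind by a
        # killed attempt never collide with the retried attempt's files.
        workdir = os.path.join(SCRATCH_ROOT, "attempt_{}".format(ctx.taskAttemptId()))
        os.makedirs(workdir, exist_ok=True)
        try:
            for row in rows:
                # ... write intermediate files under workdir, do the real work ...
                yield row
        finally:
            # Runs on normal completion and ordinary exceptions; a SIGKILLed
            # executor never reaches this, so stale attempt_* directories still
            # need a sweep at job start or from a periodic cleanup job.
            shutil.rmtree(workdir, ignore_errors=True)

    result = spark.sparkContext.parallelize(range(100), 4).mapPartitions(process_partition).collect()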

jq streaming - filter nested list and retain global structure

非 Y 不嫁゛ submitted on 2020-06-25 21:14:38
Question: In a large JSON file, I want to remove some elements from a nested list, but keep the overall structure of the document. My example input is this (but the real one is large enough to demand streaming):

    {
      "keep_untouched": { "keep_this": [ "this", "list" ] },
      "filter_this": [
        { "keep": "true" },
        { "keep": "true", "extra": "keeper" },
        { "keep": "false", "extra": "non-keeper" }
      ]
    }

The required output just has one element of the 'filter_this' block removed: { "keep_untouched": { "keep_this": [
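
This does not address the jq streaming part of the question, but to make the intended transformation concrete, here is a plain (non-streaming) Python sketch that drops the 'filter_this' entries whose "keep" field is not the string "true"; the input file name is a placeholder.

    import json

    with open("input.json") as f:  # placeholder file name
        doc = json.load(f)

    # Keep only the elements of "filter_this" whose "keep" field is "true";
    # everything else in the document stays untouched.
    doc["filter_this"] = [item for item in doc["filter_this"] if item.get("keep") == "true"]

    print(json.dumps(doc, indent=2))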

Converting hdf5 to csv or tsv files

寵の児 submitted on 2020-05-25 17:10:25
Question: I am looking for sample code which can convert .h5 files to CSV or TSV. I have to read .h5 and the output should be CSV or TSV. Sample code would be much appreciated; please help, as I have been stuck on this for the last few days. I followed the wrapper classes but don't know how to use them. I am not a good programmer, so I am facing a lot of problems. Thanks a lot in advance. Answer 1: You can also use

    h5dump -o dset.asci -y -w 400 dset.h5

-o dset.asci specifies the output file, and -y -w 400 specifies the dimension
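
If a command-line tool is not required, the same conversion can be done in a few lines of Python with h5py and NumPy. This is only a sketch: the file name "dset.h5" and the dataset path "dset" mirror the h5dump example above and must be adapted to the real file.

    import h5py
    import numpy as np

    # Read one dataset from the HDF5 file into a NumPy array.
    with h5py.File("dset.h5", "r") as f:
        data = f["dset"][()]

    # Write it out as CSV; use delimiter="\t" and a .tsv name for TSV instead.
    np.savetxt("dset.csv", data, delimiter=",", fmt="%s")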

Spark uses s3a: java.lang.NoSuchMethodError

喜你入骨 submitted on 2020-05-16 03:56:31
Question: First update: according to my current understanding, the issue is caused by the Spark build I used; it should be spark_without_hadoop. The version mismatch is the reason why my compile-time and run-time environments don't match. I'm working with the combination of spark_with_hadoop2.7 (2.4.3), Hadoop (3.2.0) and Ceph Luminous. However, when I try to use Spark to access Ceph (for example, starting spark-sql in the shell), an exception like the one below shows up: INFO impl.MetricsSystemImpl: s3a-file-system
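
The usual fix for this class of NoSuchMethodError is to make the hadoop-aws artifact match the Hadoop version that is actually on Spark's classpath. A PySpark sketch of that setup follows; the 3.2.0 version is taken from the question, while the Ceph RGW endpoint, credentials, and bucket path are placeholders.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-ceph-demo")
        # hadoop-aws must match the Hadoop version on Spark's classpath (3.2.0 here).
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
        .config("spark.hadoop.fs.s3a.endpoint", "http://ceph-rgw:7480")  # placeholder RGW endpoint
        .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")          # placeholder credentials
        .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )

    df = spark.read.text("s3a://mybucket/some/path")  # hypothetical bucket and path
    df.show(5, truncate=False)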

How to scale operations with a massive dictionary of lists in Python?

寵の児 submitted on 2020-05-15 08:12:10
Question: I'm dealing with a "big data" problem in Python, and I am really struggling to find scalable solutions. The data structure I currently have is a massive dictionary of lists, with millions of keys and lists with millions of items. I need to do an operation on the items in each list. The problem is two-fold: (1) how to do scalable operations on a data structure this size, and (2) how to do this under memory constraints? For some code, here's a very basic example of a dictionary of lists: example_dict1
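
One low-tech way to stay within memory is to keep the dictionary of lists on disk and stream one key's list at a time. A minimal sketch with the standard-library shelve module follows; the transform() operation and the store name are placeholders, since the question does not say what the per-item operation is.

    import shelve

    def transform(item):
        return item * 2  # placeholder per-item operation

    # The dict-of-lists lives on disk; only one key's list is in memory at a time.
    with shelve.open("example_dict1_store") as store:
        for key in store:
            store[key] = [transform(item) for item in store[key]]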

Spark: Reading a Compressed File with a Special Format

生来就可爱ヽ(ⅴ<●) submitted on 2020-05-15 04:20:06
Question: I have a .gz file. I need to read this file and add the time and the file name to it. I have some problems and need your help to recommend a way to handle these points. Because the file is compressed, the first line is read in the wrong format, I think due to an encoding problem. I tried the code below, but it is not working:

    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

The file has a special format and I need to
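
If the goal is just to attach the source file name and a load timestamp to every record, Spark can do that directly while reading (it decompresses .gz transparently, though each .gz file becomes a single non-splittable partition). A PySpark sketch follows; the input path and column names are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import current_timestamp, input_file_name

    spark = SparkSession.builder.appName("gz-with-filename").getOrCreate()

    df = (
        spark.read.text("/data/input/*.gz")            # placeholder input path
        .withColumn("source_file", input_file_name())  # full path of the originating file
        .withColumn("load_time", current_timestamp())
    )
    df.show(5, truncate=False)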

Do stages in an application run in parallel in Spark?

僤鯓⒐⒋嵵緔 submitted on 2020-05-14 18:25:06
Question: I have a doubt about how stages execute in a Spark application. Is there any consistency in the execution of stages that can be defined by the programmer, or is it derived by the Spark engine? Answer 1: Check the entities (stages, partitions) in this pic (pic credits). Do stages in a job (Spark application?) run in parallel in Spark? Yes, they can be executed in parallel if there is no sequential dependency. Here the Stage 1 and Stage 2 partitions can be executed in parallel, but not the Stage 0 partitions, because of

Regex: grep lines containing only one occurrence of a char

被刻印的时光 ゝ submitted on 2020-05-08 04:50:42
Question: I am looking for an efficient regular expression (preferably possessive) which I can use to grep lines containing only one delimiter (',') from a big file (5 GB). E.g.:

    X,Y
    X1,Y1,Y2
    X3,Y3
    X4,Y4
    X5,Y5,Z6

    >>> grep "???" big_file
    X,Y
    X3,Y3
    X4,Y4

Answer 1: Shouldn't a simple ^[^,]*,[^,]*$ avoid backtracking, because of the start/end-of-string markers? Answer 2: Although @Rawling (one of the answers here) is right and his regular expression is correct, it is still not possessive and therefore not optimized,
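
Not a grep answer, but for comparison here is a short Python sketch that applies the anchored pattern from Answer 1 line by line; the file name "big_file" is taken from the question's example.

    import re

    # Anchored character classes cannot backtrack catastrophically, so this
    # stays linear in the line length even without possessive quantifiers.
    one_comma = re.compile(r"^[^,]*,[^,]*$")

    with open("big_file") as f:
        for line in f:
            if one_comma.match(line.rstrip("\n")):
                print(line, end="")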