large-data-volumes

Getting random results from large tables

Submitted by 不问归期 on 2019-12-05 14:19:58
I'm trying to get 4 random results from a table that holds approx. 7 million records. Additionally, I also want to get 4 random records from the same table filtered by category. As you would imagine, random sorting on a table this large makes the queries take a few seconds, which is not ideal. One other method I thought of for the non-filtered result set is to have PHP pick some random numbers between 1 and 7,000,000 or so and then do an IN (...) in the query to grab only those rows - and yes, I know this method has a caveat in that you may get fewer rows back than you asked for if some of those IDs no longer exist.
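
A minimal sketch of that "random IDs + IN (...)" idea, written in Python for illustration; the table name, columns, connection details, and the over-sampling factor are all assumptions, not details from the question:

    import random
    import mysql.connector

    # Connection details and table/column names are hypothetical.
    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="shop")
    cur = conn.cursor()

    cur.execute("SELECT MAX(id) FROM items")
    max_id = cur.fetchone()[0]

    # Over-sample: pick more IDs than needed, since some may have been deleted.
    candidate_ids = random.sample(range(1, max_id + 1), 20)
    placeholders = ", ".join(["%s"] * len(candidate_ids))
    query = "SELECT id, name FROM items WHERE id IN ({}) LIMIT 4".format(placeholders)
    cur.execute(query, candidate_ids)
    rows = cur.fetchall()

Because the lookup is by primary key, this avoids sorting 7 million rows; the trade-off is exactly the caveat above, which the over-sampling is meant to paper over.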

Alternatives to huge drop down lists (24,000+ items)

Submitted by 佐手、 on 2019-12-05 05:31:22
In my admin section, when I edit items, I have to attach each item to a parent item. I have a list of over 24,000 parent items, listed alphabetically in a drop-down list (a list of music artists). The edit page that lists all these items in a drop-down menu is 2 MB, and it lags badly for people with old machines, especially in Internet Explorer. What's a good alternative that replicates the same function, where I would need to select 1 of these 24,000 artists without actually having them all pre-loaded into a drop-down menu? Instead of filling a drop-down list with so many names, you could let the user type a few characters and search for matching artists, as sketched below.
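
The usual replacement (a sketch, not the thread's actual answer) is a type-ahead search: the page holds only a text box, and a small endpoint returns the handful of artists matching what the user has typed. The Flask app, database file, and table/column names below are assumptions for illustration:

    from flask import Flask, jsonify, request
    import sqlite3

    app = Flask(__name__)

    @app.route("/artists/search")
    def search_artists():
        prefix = request.args.get("q", "")
        if len(prefix) < 2:            # wait until the user has typed a bit
            return jsonify([])
        conn = sqlite3.connect("music.db")
        rows = conn.execute(
            "SELECT id, name FROM artists WHERE name LIKE ? ORDER BY name LIMIT 20",
            (prefix + "%",),
        ).fetchall()
        conn.close()
        return jsonify([{"id": r[0], "name": r[1]} for r in rows])

The form then submits only the chosen artist's ID, so nothing close to 24,000 options is ever sent to the browser.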

STXXL equivalent in Java

Submitted by 大兔子大兔子 on 2019-12-05 02:25:48
Question: I'm searching for a collection framework designed for huge datasets in Java that behaves transparently, like STXXL does for C++. It should transparently swap to disk, but in a much more efficient manner than plain OS-based VM swapping. A StringBuffer/String drop-in replacement would be a big plus.

Answer 1: These only partially fill the need:
- Oracle Berkeley DB Java Edition database-backed collections: http://www.oracle.com/technology/documentation/berkeley-db/je/java/index.html
- Joafip persistent

what changes when your input is giga/terabyte sized?

Submitted by 对着背影说爱祢 on 2019-12-04 07:42:10
Question: I just took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48,000 fields by 1,600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny. I write Python, so I've spent the last few hours reading about HDF5, NumPy, and PyTables, but I still feel like I'm not really grokking what a terabyte-sized data set actually means for me as a programmer. For example, someone pointed out that with larger
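
One concrete habit that changes is processing in chunks rather than loading whole arrays. Below is a minimal sketch with h5py and NumPy; the file name, dataset name, and shape are assumptions for illustration, not from the question:

    import h5py
    import numpy as np

    chunk_rows = 256            # work on a slice of rows at a time, never the whole array
    total = 0
    count = 0

    with h5py.File("haplotypes.h5", "r") as f:
        dset = f["haplotypes"]                         # e.g. shape (1600, 48000)
        for start in range(0, dset.shape[0], chunk_rows):
            block = dset[start:start + chunk_rows, :]  # only this slice is read into RAM
            total += block.sum(dtype=np.int64)
            count += block.size

    print("mean value:", total / count)

The point is that only one slice is ever resident in memory, which is the basic shift once data stops fitting in RAM.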

Best way to store/retrieve millions of files when their meta-data is in a SQL Database

Submitted by 不问归期 on 2019-12-04 03:14:52
I have a process that's going to initially generate 3-4 million PDF files, and then continue at a rate of 80K/day. They'll be pretty small (about 50 KB each), but what I'm worried about is how to manage the total mass of files I'm generating for easy lookup. Some details: I'll have some other steps to run once a file has been generated, and there will be a few servers participating, so I'll need to watch for files as they're generated. Once generated, the files will be available through a lookup process I've written. Essentially, I'll need to pull them based on an order number, which is unique per file.
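
A common layout for this (a sketch only; the hashing scheme, depth, and base path are assumptions, not the author's design) is to shard files into nested directories derived from the order number, so no single directory ever holds millions of entries:

    import hashlib
    import os

    BASE_DIR = "/srv/pdf-store"          # hypothetical storage root

    def path_for_order(order_number: str) -> str:
        digest = hashlib.sha1(order_number.encode("utf-8")).hexdigest()
        # Two levels of 2-hex-character directories gives 256 * 256 buckets,
        # which keeps each directory small enough for fast filesystem lookups.
        return os.path.join(BASE_DIR, digest[:2], digest[2:4], order_number + ".pdf")

    target = path_for_order("ORD-0012345")
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # ... write the generated PDF to `target` ...

Because the path is derived deterministically from the order number, the SQL metadata row only needs the order number, and retrieval never scans a huge directory.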

psycopg2 COPY using cursor.copy_from() freezes with large inputs

Submitted by 岁酱吖の on 2019-12-03 21:58:05
Consider the following Python code, using a psycopg2 cursor object (some column names were changed or omitted for clarity):

    filename = 'data.csv'
    file_columns = ('id', 'node_id', 'segment_id', 'elevated',
                    'approximation', 'the_geom', 'azimuth')
    self._cur.copy_from(file=open(filename),
                        table=self.new_table_name,
                        columns=file_columns)

The database is located on a remote machine on a fast LAN. Using \COPY from bash works very fast, even for large (~1,000,000-line) files. This code is ultra-fast for 5,000 lines, but when data.csv grows beyond 10,000 lines, the program freezes completely. Any ideas?
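
The excerpt does not include a resolution; one hedged workaround worth sketching (not necessarily the original fix) is to feed COPY in bounded batches, so each copy_from call handles a limited number of lines and progress is committed incrementally:

    import io
    from itertools import islice

    def copy_in_batches(conn, filename, table, columns, batch_lines=5000):
        # `conn` is an open psycopg2 connection; table/column names come from the caller.
        cur = conn.cursor()
        with open(filename) as src:
            while True:
                batch = list(islice(src, batch_lines))
                if not batch:
                    break
                cur.copy_from(io.StringIO("".join(batch)), table, columns=columns)
                conn.commit()          # keep each transaction small

Batching also narrows down where a hang occurs, since each committed chunk is visible in the database.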

Large MySQL tables

Submitted by 匆匆过客 on 2019-12-03 17:13:59
For a web application I'm developing, I need to store a large number of records. Each record will consist of a primary key and a single (short-ish) string value. I expect to have about 100GB storage available and would like to be able to use it all. The records will be inserted, deleted and read frequently, and I must use a MySQL database. Data integrity is not crucial, but performance is. What issues and pitfalls am I likely to encounter, and which storage engine would be best suited to the task? Many thanks, J

Whatever solution you use, since you say your database will be write-heavy you need
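
For concreteness, the record shape being described boils down to something like the table below; this is only an illustration, and the engine shown is a placeholder rather than an answer to the question:

    import mysql.connector

    # Hypothetical schema matching the description: integer primary key plus one
    # short string. Engine, column sizes and connection details are assumptions.
    ddl = """
    CREATE TABLE records (
        id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        value VARCHAR(255) NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB
    """

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="bigdata")
    conn.cursor().execute(ddl)

At very roughly a few hundred bytes per row, 100 GB corresponds to hundreds of millions of rows, which is the scale the question is really about.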

Docker Data Volume Container - Can I share across swarm

Submitted by 眉间皱痕 on 2019-12-03 05:47:20
Question: I know how to create and mount a data volume container to multiple other containers using --volumes-from, but I do have a few questions regarding its usage and limitations.
Situation: I am looking to use a data volume container to store user-uploaded images for my web application. This data volume container will be used/mounted by many other containers running the web frontend.
Questions: Can data volume containers be used/mounted in containers residing on other hosts within a Docker Swarm cluster?
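
For the single-host part of that setup, a named volume shared by several frontend containers looks like the sketch below (using the docker Python SDK; the image, container names, and mount path are placeholders). A plain local volume like this is not replicated across swarm nodes, so cross-host sharing generally needs a volume plugin backed by shared storage:

    import docker

    client = docker.from_env()
    client.volumes.create(name="user-uploads")

    # Two frontend containers on the SAME host mounting the same named volume.
    for name in ("web-1", "web-2"):
        client.containers.run(
            "nginx:alpine",
            name=name,
            detach=True,
            volumes={"user-uploads": {"bind": "/usr/share/nginx/html/uploads",
                                      "mode": "rw"}},
        )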

Plotting of very large data sets in R

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-03 04:41:13
Question: How can I plot a very large data set in R? I'd like to use a boxplot, or violin plot, or similar. All the data cannot fit in memory. Can I incrementally read it in and calculate the summaries needed to make these plots? If so, how?

Answer 1: To supplement my comment on Dmitri's answer, here is a function to calculate quantiles using the ff big-data handling package:

    ffquantile <- function(ffv, qs = c(0, 0.25, 0.5, 0.75, 1), ...) {
        stopifnot(all(qs <= 1 & qs >= 0))
        ffsort(ffv, ...) -> ffvs
        j <- (qs * (length(ffv) - 1)) + 1
        jf <- floor(j)