large-data-volumes

Statistics on a large table presented on the web

Posted by 末鹿安然 on 2019-12-25 11:57:15
Question: We have a large table of data with about 30 000 0000 rows, currently growing by 100 000 rows a day, and that number will increase over time. Today we generate different reports directly from the database (MS-SQL 2012) and do a lot of calculations. The problem is that this takes time. We have indexes and so on, but people today want blazingly fast reports. We also want to be able to change time periods, different ways of looking at the data, and so on. We only need to look at data that is
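
A common way to get fast reports over a table this size is to pre-aggregate: a scheduled job rolls the raw rows up into a small summary table, and the reports query that instead of the full table. Below is a minimal sketch of such a job in Python with pandas and SQLAlchemy; the table and column names (transactions, daily_summary, created_at, category, amount) and the connection string are placeholders, since the question does not show its schema.

    # Nightly pre-aggregation sketch: roll yesterday's raw rows into a summary table.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mssql+pyodbc://user:pass@reporting_dsn")  # placeholder DSN

    # Pull only yesterday's raw rows, not the whole 30M-row table.
    raw = pd.read_sql(
        "SELECT created_at, category, amount FROM transactions "
        "WHERE created_at >= DATEADD(day, -1, CAST(GETDATE() AS date))",
        engine,
        parse_dates=["created_at"],
    )

    # Roll up to one row per day and category.
    summary = (
        raw.assign(day=raw["created_at"].dt.date)
           .groupby(["day", "category"])["amount"]
           .agg(total="sum", rows="count", average="mean")
           .reset_index()
    )

    # Append to the small summary table that the reports query instead.
    summary.to_sql("daily_summary", engine, if_exists="append", index=False)

The same idea can be kept entirely inside SQL Server (indexed views, a nightly stored procedure, or Analysis Services); the sketch only illustrates the pre-aggregation pattern.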

Tips for working with a large quantity of .txt files (and overall large size) - Python?

Posted by 穿精又带淫゛_ on 2019-12-24 11:22:39
Question: I'm working on a script to parse .txt files and store them in a pandas DataFrame that I can export to a CSV. My script worked easily when I was using <100 of my files, but now, trying to run it on the full sample, I'm running into a lot of issues. I'm dealing with ~8000 .txt files with an average size of 300 KB, so about 2.5 GB in total. I was wondering if I could get tips on how to make my code more efficient. For opening and reading files, I use:

    filenames = os.listdir('.') dict
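
One frequent cause of slowness in this pattern is building the DataFrame incrementally, file by file. A rough sketch of the alternative, collecting plain dicts first and constructing the frame once at the end; the record layout inside parse_file is a placeholder, since the question's parsing code is cut off:

    # Parse many small .txt files into one DataFrame without growing it row by row.
    import os
    import pandas as pd

    def parse_file(path):
        """Yield one dict per record in a file. The fields here are made up;
        the real parsing logic would come from the original script."""
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    yield {"source": os.path.basename(path), "text": line}

    def build_frame(directory="."):
        records = []
        for name in sorted(os.listdir(directory)):
            if name.endswith(".txt"):
                records.extend(parse_file(os.path.join(directory, name)))
        # One DataFrame construction at the end instead of ~8000 appends.
        return pd.DataFrame.from_records(records)

    if __name__ == "__main__":
        build_frame().to_csv("combined.csv", index=False)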

Is there a way to maintain a 200MB immutable data structure in memory and access it from a script?

Posted by 帅比萌擦擦* on 2019-12-24 04:07:52
Question: I have a list of 9 million IPs and, with a set of hash tables, I can make a constant-time function that returns whether a particular IP is in that list. Can I do it in PHP? If so, how? Answer 1: The interesting thing about this question is the number of directions you can go. I'm not sure caching is your best option, simply because of the large set of data and the relatively low number of queries against it. Here are a few ideas. 1) Build a RAM disk. Link your MySQL database table to use the ramdisk
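
The question asks about PHP, but the underlying memory math is language-agnostic: 9 million IPv4 addresses stored as 32-bit integers take roughly 36 MB, well under the 200 MB budget, and a sorted array with binary search gives fast lookups without hash-table overhead. Purely to illustrate that idea, a sketch in Python; the input file name ips.txt is an assumption:

    # Membership test for ~9 million IPv4 addresses using a sorted 32-bit array.
    import bisect
    import ipaddress
    from array import array

    def load_ips(path="ips.txt"):
        ips = array("I")  # unsigned ints, typically 4 bytes each (~36 MB for 9M)
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line:
                    ips.append(int(ipaddress.IPv4Address(line)))
        return array("I", sorted(ips))

    def contains(ips, ip_text):
        value = int(ipaddress.IPv4Address(ip_text))
        i = bisect.bisect_left(ips, value)
        return i < len(ips) and ips[i] == value

    ips = load_ips()
    print(contains(ips, "192.0.2.1"))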

Select Count(*) over a large amount of data

Posted by 雨燕双飞 on 2019-12-23 03:47:06
Question: I want to do this for a report, but I have 20,000,000 records in my table and it causes a timeout in my application.

    SELECT T.transactionStatusID, TS.shortName AS TransactionStatusDefShortName, count(*) AS qtyTransactions
    FROM Transactions T
    INNER JOIN TransactionTypesCurrencies TTC
        ON T.id_Ent = TTC.id_Ent AND T.trnTypeCurrencyID = TTC.trnTypeCurrencyID
    INNER JOIN TransactionStatusDef TS
        ON T.id_Ent = TS.ent_Ent AND T.transactionStatusID = TS.ID
    WHERE T.id_Ent = @id_Ent
    GROUP BY T

dumping a mysql table to CSV (stdout) and then tunneling the output to another server

Posted by 二次信任 on 2019-12-22 10:29:24
Question: I'm trying to move a database table to another server; the complication is that the machine currently hosting the table has little to no disk space left, so I'm looking for a solution that can work over the net. I have tried mysqldump-ing the database on the source machine and piping it into mysql at the destination, but my database has 48m rows and, even after turning auto_commit off and setting trx_commit to 2, I am getting some dog-slow times.

    mysqldump -uuser -ppass --opt dbname dbtable | mysql -h remove

Best way to store/retrieve millions of files when their meta-data is in a SQL Database

Posted by 别等时光非礼了梦想. on 2019-12-21 09:29:04
Question: I have a process that's going to initially generate 3-4 million PDF files, and then continue at a rate of 80K/day. They'll be pretty small (50K each), but what I'm worried about is how to manage the total mass of files I'm generating for easy lookup. Some details: I'll have some other steps to run once a file has been generated, and there will be a few servers participating, so I'll need to watch for files as they're generated. Once generated, the files will be available through a lookup process
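
A common layout for this kind of store is to keep the authoritative metadata in the SQL database and derive each file's on-disk path deterministically from its database key, so no single directory ever grows unbounded and any server can reconstruct the path from the ID alone. A small sketch of that idea in Python; the bucket count and the use of the row ID are assumptions, not something stated in the question:

    # Derive a stable, shallow directory path for a PDF from its database ID.
    import os

    def pdf_path(base_dir, file_id, buckets=1000):
        """Spread files over two levels of `buckets` directories so each
        leaf directory stays small even at tens of millions of files."""
        level1 = f"{file_id % buckets:03d}"
        level2 = f"{(file_id // buckets) % buckets:03d}"
        return os.path.join(base_dir, level1, level2, f"{file_id}.pdf")

    def store_pdf(base_dir, file_id, data):
        path = pdf_path(base_dir, file_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(data)
        return path  # record this (or just the ID) in the metadata row

    print(pdf_path("/var/pdfstore", 3_456_789))  # /var/pdfstore/789/456/3456789.pdf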

How to plot large data vectors accurately at all zoom levels in real time?

Posted by 拜拜、爱过 on 2019-12-20 10:47:06
Question: I have large data sets (10 Hz data, so 864k points per 24 hours) which I need to plot in real time. The idea is that the user can zoom and pan into highly detailed scatter plots. The data is not very continuous and there are spikes. Since the data set is so large, I can't plot every point each time the plot refreshes. But I also can't just plot every nth point, or else I will miss major features like large but short spikes. Matlab does it right. You can give it an 864k vector full of zeros and just
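
The usual answer to "every nth point misses short spikes" is min/max decimation: for each pixel-wide bucket of samples, plot the bucket's minimum and maximum instead of a single sample, so brief spikes survive the downsampling. A sketch with NumPy, assuming the data is a plain 1-D array (the plotting library is left out):

    # Min/max decimation: reduce n samples to ~2*buckets points without losing spikes.
    import numpy as np

    def minmax_decimate(y, buckets):
        n = len(y)
        edges = np.linspace(0, n, buckets + 1, dtype=int)
        xs, ys = [], []
        for lo, hi in zip(edges[:-1], edges[1:]):
            if hi <= lo:
                continue
            chunk = y[lo:hi]
            i_min = lo + int(np.argmin(chunk))
            i_max = lo + int(np.argmax(chunk))
            # Keep the two extremes in their original order.
            for i in sorted((i_min, i_max)):
                xs.append(i)
                ys.append(y[i])
        return np.array(xs), np.array(ys)

    y = np.zeros(864_000)
    y[123_456] = 5.0                      # one short spike
    xs, ys = minmax_decimate(y, buckets=2000)
    print(len(xs), ys.max())              # ~4000 points, spike preserved (5.0)

Recompute the decimation over the visible range on every zoom or pan, so deeper zooms show progressively more of the raw points.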

Avoid an “out of memory error” in Java (Eclipse) when using a large data structure?

Posted by 不羁岁月 on 2019-12-20 03:24:38
Question: OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with an "out of memory error" during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and I don't know any other way to store it. The program first indexes a large corpus of text files that I provide. This works fine. Then it uses this index to
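
The question is about Java (where raising -Xmx is the quick fix), but the general escape hatch when an index is too big for the heap is the same in any language: back the structure with disk so only the entries being touched live in memory. Purely as an illustration of that idea, a sketch with Python's standard-library shelve module; the index contents shown are made up:

    # Disk-backed dictionary: the index lives in a file, not on the heap.
    import shelve

    # Build phase: write postings as they are produced instead of holding them all.
    with shelve.open("corpus_index.db") as index:
        index["example"] = [("doc1.txt", 3), ("doc7.txt", 1)]
        index["memory"] = [("doc2.txt", 5)]

    # Query phase: only the requested entry is loaded into memory.
    with shelve.open("corpus_index.db", flag="r") as index:
        print(index.get("example", []))

In Java the equivalent move is an embedded key-value store or database (or a memory-mapped file) in place of an in-memory map.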

NTFS directory has 100K entries. How much performance boost if spread over 100 subdirectories?

Posted by 本小妞迷上赌 on 2019-12-19 19:37:26
Question: Context: We have a homegrown filesystem-backed caching library. We currently have performance problems with one installation due to a large number of entries (e.g. up to 100,000). The problem: we store all fs entries in one "cache directory", and very large directories perform poorly. We're looking at spreading those entries over subdirectories, as git does: e.g. 100 subdirectories with ~1,000 entries each. The question: I understand that smaller directory sizes will help with filesystem access.
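
What git actually does is shard by a short prefix of a hash of the key: a two-hex-character prefix gives 256 subdirectories, so 100,000 entries average under 400 files per directory. A minimal sketch of that layout in Python; the cache key and base directory are placeholders:

    # git-style sharding: the first two hex characters of the key's hash pick the subdirectory.
    import hashlib
    import os

    def shard_path(cache_dir, key):
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        # 256 possible subdirectories ("00".."ff"); ~390 entries each at 100K entries.
        return os.path.join(cache_dir, digest[:2], digest[2:])

    path = shard_path("/var/cache/myapp", "https://example.com/some/resource")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    print(path)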

Need to compare very large files (around 1.5GB) in Python

Posted by 空扰寡人 on 2019-12-18 13:38:00
问题 "DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2" "Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025" "DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792" "Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800" "Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595" "Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957" "Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011",