How can I analyse ~13GB of data?

前端未结

关注

 4  1437

I have ~300 text files that contain data on trackers, torrents and peers. Each file is organised like this:

tracker.txt

time torrent


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  我在风中等你        
                
              
                            
                2021-02-07 07:09
              
            
            
                                                                       
You state that your MySQL queries took too long. Have you ensured that proper indices are in place to support the kind of request you submitted? In your example, that would be an index for Peer.ip (or even a nested index (Peer.ip,Peer.id)) and an index for TorrentAtPeer.peer.

As I understand you Java results, you have much data but not that many different strings. So you could perhaps save some time by assigning a unique number to each tracker, torrent and peer. Using one table for each, with some indexed value holding the string and a numeric primary key as the id. That way, all tables relating these entities would only have to deal with those numbers, which could save a lot of space and make your operations a lot faster.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  再見小時候        
                
              
                            
                2021-02-07 07:15
              
            
            
                                                                       
I would give MySQL another try but with a different schema:


do not use id-columns here
use natural primary keys here: 



Peer: ip, port

Torrent: infohash

Tracker: url

TorrentPeer: peer_ip, torrent_infohash, peer_port, time

TorrentTracker: tracker_url, torrent_infohash, time

use innoDB engine for all tables


This has several advantages:


InnoDB uses clustered indexes for primary key. Means that all data can be retrieved directly from index without additional lookup when you only request data from primary key columns. So InnoDB tables are somewhat index-organized tables.
Smaller size since you do not have to store the surrogate keys. -> Speed, because lesser IO for the same results.
You may be able to do some queries now without using (expensive) joins, because you use natural primary and foreign keys. For example the linking table TorrentAtPeer directly contains the peer ip as foreign key to the peer table. If you need to query the torrents used by peers in a subnetwork you can now do this without using a join, because all relevant data is in the linking table.


If you want the torrent count per peer and you want the peer's ip in the results too then we again have an advantage when using natural primary/foreign keys here.

With your schema you have to join to retrieve the ip:

SELECT Peer.ip, COUNT(DISTINCT torrent) 
    FROM TorrentAtPeer, Peer 
    WHERE TorrentAtPeer.peer = Peer.id 
    GROUP BY Peer.ip;


With natural primary/foreign keys:

SELECT peer_ip, COUNT(DISTINCT torrent) 
    FROM TorrentAtPeer 
    GROUP BY peer_ip;


EDIT
Well, original posted schema was not the real one. Now the Peer table has a port field. I would suggest to use primary key (ip, port) here and still drop the id column. This also means that the linking table needs to have multicolumn foreign keys. Adjusted the answer ...
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  醉话见心        
                
              
                            
                2021-02-07 07:18
              
            
            
                                                                       
If you could use C++, you should take a look at Boost flyweight.

Using flyweight, you can write your code as if you had strings, but each instance of a string (your tracker name, etc.) uses only the size of a pointer.

Regardless of the language, you should convert the IP address to an int (take a look at this question) to save some more memory.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  一个人的身影        
                
              
                            
                2021-02-07 07:18
              
            
            
                                                                       
You most likely have a problem that can be solved by NOSQL  and distributed technologies.

i) I would write a distributed system using Hadoop/HBase. 

ii) Rent several tens / hundred AWS machines, but only for a few seconds (It'll still cost you less than a $0.50)

iii) Profit!!!
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复