How can I analyse ~13GB of data?

梦谈多话 2021-02-07 06:26

I have ~300 text files that contain data on trackers, torrents and peers. Each file is organised like this:

tracker.txt

time torrent
            


        
4 Answers
  •  再見小時候
    2021-02-07 07:15

    I would give MySQL another try but with a different schema:

    • do not use id-columns here
    • use natural primary keys here:

      Peer: ip, port
      Torrent: infohash
      Tracker: url
      TorrentPeer: peer_ip, torrent_infohash, peer_port, time
      TorrentTracker: tracker_url, torrent_infohash, time

    • use the InnoDB engine for all tables
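
    The schema above could be declared roughly as follows. This is only a sketch: the column types are assumptions (IPv4 addresses stored as integers via INET_ATON(), infohashes as raw 20-byte SHA-1 values, ASCII tracker URLs of at most 255 characters), and the key columns of TorrentPeer are ordered so that the foreign key to Peer is a prefix of the primary key.

    ```sql
    -- Sketch only: all column types are assumptions, not taken from the question.
    CREATE TABLE Peer (
        ip   INT UNSIGNED      NOT NULL,  -- IPv4 stored via INET_ATON() (assumed)
        port SMALLINT UNSIGNED NOT NULL,
        PRIMARY KEY (ip, port)
    ) ENGINE=InnoDB;

    CREATE TABLE Torrent (
        infohash BINARY(20) NOT NULL,     -- raw 20-byte SHA-1 (assumed)
        PRIMARY KEY (infohash)
    ) ENGINE=InnoDB;

    CREATE TABLE Tracker (
        url VARCHAR(255) CHARACTER SET ascii NOT NULL,  -- length is an assumption
        PRIMARY KEY (url)
    ) ENGINE=InnoDB;

    -- Linking table: the natural keys double as foreign keys.
    CREATE TABLE TorrentPeer (
        peer_ip          INT UNSIGNED      NOT NULL,
        peer_port        SMALLINT UNSIGNED NOT NULL,
        torrent_infohash BINARY(20)        NOT NULL,
        `time`           DATETIME          NOT NULL,
        PRIMARY KEY (peer_ip, peer_port, torrent_infohash, `time`),
        FOREIGN KEY (peer_ip, peer_port) REFERENCES Peer (ip, port),
        FOREIGN KEY (torrent_infohash)   REFERENCES Torrent (infohash)
    ) ENGINE=InnoDB;

    CREATE TABLE TorrentTracker (
        tracker_url      VARCHAR(255) CHARACTER SET ascii NOT NULL,
        torrent_infohash BINARY(20)   NOT NULL,
        `time`           DATETIME     NOT NULL,
        PRIMARY KEY (tracker_url, torrent_infohash, `time`),
        FOREIGN KEY (tracker_url)      REFERENCES Tracker (url),
        FOREIGN KEY (torrent_infohash) REFERENCES Torrent (infohash)
    ) ENGINE=InnoDB;
    ```

    If IPv6 peers can occur, the integer ip columns would not be sufficient and a wider type would be needed.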

    This has several advantages:

    • InnoDB uses a clustered index for the primary key. This means that data can be read directly from the index, without an additional lookup, when you request only primary key columns. InnoDB tables are therefore essentially index-organized tables.
    • Smaller size, since you do not have to store the surrogate keys. That means more speed, because less I/O is needed for the same results.
    • You may be able to run some queries without (expensive) joins, because the natural primary keys double as foreign keys. For example, the linking table (TorrentPeer above, TorrentAtPeer in your schema) directly contains the peer IP as a foreign key to the Peer table. If you need to query the torrents used by peers in a subnetwork, you can do so without a join, because all the relevant data is in the linking table.
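
    As an illustration of the subnetwork case, here is a minimal sketch. It assumes the linking table is named TorrentPeer, that peer_ip holds IPv4 addresses as integers produced by INET_ATON(), and that the subnet of interest is 192.168.1.0/24; all three are assumptions made for the example.

    ```sql
    -- Torrents seen on peers in 192.168.1.0/24 (example subnet).
    -- Only the linking table is read; no join to Peer is needed.
    SELECT DISTINCT torrent_infohash
        FROM TorrentPeer
        WHERE peer_ip BETWEEN INET_ATON('192.168.1.0')
                          AND INET_ATON('192.168.1.255');
    ```

    If peer_ip leads the primary key, the range condition can be answered from the clustered index.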

    If you want the torrent count per peer, with the peer's IP in the results, natural primary/foreign keys are again an advantage.

    With your schema you have to join to retrieve the ip:

    SELECT Peer.ip, COUNT(DISTINCT TorrentAtPeer.torrent)
        FROM TorrentAtPeer
        JOIN Peer ON TorrentAtPeer.peer = Peer.id
        GROUP BY Peer.ip;
    

    With natural primary/foreign keys:

    SELECT peer_ip, COUNT(DISTINCT torrent_infohash)
        FROM TorrentPeer
        GROUP BY peer_ip;
    

    EDIT: The originally posted schema was not the real one; the Peer table now has a port field. I would suggest using (ip, port) as the primary key and still dropping the id column. This also means that the linking table needs multi-column foreign keys. I have adjusted the answer accordingly.
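
    With (ip, port) identifying a peer, a per-peer count would probably need to group by both columns. A hedged sketch, assuming the linking table is named TorrentPeer with columns peer_ip, peer_port and torrent_infohash:

    ```sql
    -- One row per (ip, port) peer; still no join to the Peer table.
    SELECT peer_ip, peer_port, COUNT(DISTINCT torrent_infohash) AS torrent_count
        FROM TorrentPeer
        GROUP BY peer_ip, peer_port;
    ```

    Grouping by peer_ip alone would instead merge all ports on the same address into one row.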
