How Should I Implement a Huge but Simple Indexed StringList in Delphi?

攒了一身酷 2021-01-06 09:19

I am using Delphi 2009. I have a very simple data structure, with 2 fields:

  1. A string that is the key field I need to retrieve by and is usually 4 to 15 characters
7 Answers
  • 2021-01-06 09:54

    If you need large datasets like this more often, and have some money to spare, simply put 16 GB of RAM (500-750 EUR) into a machine, build the data store as a separate process compiled with some 64-bit compiler (*), and query it over e.g. shared memory or another IPC method.

    That way you can use the in-memory approach until a 64-bit Delphi finally comes out. Since your data seems to be simple (a map from array of char to array of char), it is easy to expose over IPC.
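
    As a rough sketch of the in-memory side (assuming Delphi 2009's Generics.Collections; a 64-bit FPC build would use its own container types, and the key/value literals below are only placeholders):

    program InMemoryMapSketch;
    {$APPTYPE CONSOLE}

    uses
      Generics.Collections;

    var
      Map   : TDictionary<string, string>;
      Value : string;
    begin
      // the whole key -> value map lives in RAM inside the helper process;
      // the IPC layer only has to marshal a key in and the found value back out
      Map := TDictionary<string, string>.Create;
      try
        Map.Add('ABCD', 'payload for ABCD');        // placeholder data
        if Map.TryGetValue('ABCD', Value) then
          Writeln(Value)
        else
          Writeln('key not found');
      finally
        Map.Free;
      end;
    end.

    On a 32-bit process this would run out of address space for a dataset of that size, which is exactly why the separate 64-bit process is suggested.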

    This is of course only relevant if the approach has any merit for your case (e.g. it is a cache or similar), which I can't determine from your question.

    (*) I recommend FPC of course :-)

    I did this once, up to about 5 million objects and 5 GB of data.

    I got permission to open source the container types I made for it, they are at:

    http://www.stack.nl/~marcov/lightcontainers.zip (warning: very dirty code)

    mghie: to answer with another cliché: there is no silver bullet.

    Databases make a lot of other assumptions too:

    • their generalized approach makes relatively inefficient use of memory. Most notably, your dataset stored with normal in-memory techniques falls inside the affordable memory range, which is of course typically bigger for a server (my bad assumption here, apparently) than for a client.
    • databases assume that their result sets can be reduced to small sets within the database server by relatively straightforward processing, assisted by indexing.
    • they have relatively high latency.
    • they are relatively bad at some kinds of processing (like multidimensional analysis/OLAP, which is why databases have to be extended for it).

    This makes databases relatively poor for use in e.g. caches, load balancers etc. Of course all of that only matters if you need the speed, but the initial question felt a bit speed-sensitive to me.

    In a past job, my role at a database-oriented firm was to do everything but that, in other words to fix the problems when the standard approach couldn't hack it (or would have required 4-socket Oracle servers for jobs whose budget didn't warrant such expenses). The solution/hack described above was a bit OLAP-like and was connected to hardware (an RFID chip-programming device) that needed a guaranteed response time. Two months of programming time, it still runs, and for the cost you couldn't even have bought a Windows server + Oracle license.

  • 2021-01-06 09:59

    You should analyse your data. If

    1. a sizeable part of the data values is larger than the default file system block size,
    2. you don't want to search in the data values using SQL (so it doesn't matter what format they are stored in), and
    3. you really need random access over the whole database,

    then you should test whether compressing your data values increases performance. The decompression of data values (especially on a modern machine with multiple cores, performed in background threads) should incur only a small performance hit, but the gains from having to read fewer blocks from the hard disc (especially if they are not in the cache) could be much larger.

    But you need to measure; maybe the database engine compresses its data anyway.
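
    The same experiment can be done outside the database engine. A rough sketch, assuming the ZLib unit that ships with Delphi 2009 (it provides TZCompressionStream/TZDecompressionStream; older Delphi versions only have TCompressionStream/TDecompressionStream with a different constructor), with purely illustrative function names:

    uses
      SysUtils, Classes, ZLib;

    // compress one data value into its own memory stream before storing it
    function ZCompressValue(const Value: UTF8String): TMemoryStream;
    var
      Z: TZCompressionStream;
    begin
      Result := TMemoryStream.Create;
      Z := TZCompressionStream.Create(Result);
      try
        if Value <> '' then
          Z.Write(Value[1], Length(Value));
      finally
        Z.Free;  // freeing flushes the remaining compressed bytes into Result
      end;
    end;

    // decompress a previously stored value back into a string
    function ZDecompressValue(Source: TStream): UTF8String;
    var
      Z     : TZDecompressionStream;
      Plain : TMemoryStream;
      Buf   : array[0..4095] of Byte;
      N     : Integer;
    begin
      Source.Position := 0;
      Plain := TMemoryStream.Create;
      Z := TZDecompressionStream.Create(Source);
      try
        repeat
          N := Z.Read(Buf, SizeOf(Buf));
          if N > 0 then
            Plain.Write(Buf, N);
        until N = 0;
        SetLength(Result, Integer(Plain.Size));
        if Plain.Size > 0 then
          Move(Plain.Memory^, Result[1], Integer(Plain.Size));
      finally
        Z.Free;
        Plain.Free;
      end;
    end;

    Whether this pays off depends on the value sizes: for values much smaller than a disc block the deflate overhead can outweigh the savings, so measure both ways.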

  • 2021-01-06 10:04

    Berkeley DB is exactly that: an embedded key/value store built for this kind of keyed lookup.

  • 2021-01-06 10:06

    For more than 10 GB of data, a database is exactly what you need. It will handle the indexing for rapidly locating data (your random retrieval), the functionality for adding, modifying, and deleting data, and the actual storage, as well as much more if you so choose.

    There are dozens of posts here about which databases are available for use with Delphi, including built-in ones and free, open-source ones like Firebird.

  • 2021-01-06 10:07

    Synopse Big Table by A. Bouchez. See his answer to my other question about SQLite/DISQLite.

    It hadn't even been developed when I first asked this question, but it is now quite a mature and fully functional unit.

  • 2021-01-06 10:15

    Since your data is more than 3 GB, you will need to make sure whatever database engine you select either handles tables that large or lets you split things up into multiple tables, which I would suggest doing regardless of the maximum size of a single table. If you perform the split, do it as evenly as possible on a logical key break, so that it is easy to determine which table to use from the first one or two characters of the key. This greatly reduces search times by eliminating records that could never match your query in the first place.
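
    A sketch of such a helper (the segment names and the grouping are only illustrative):

    uses
      SysUtils;

    // pick the table (or file) that holds a key from its first character,
    // so a lookup only ever touches one segment
    function SegmentNameForKey(const Key: string): string;
    var
      C: Char;
    begin
      Assert(Key <> '', 'key must not be empty');
      C := UpCase(Key[1]);
      if CharInSet(C, ['A'..'Z']) then
        Result := 'DATA_' + C            // DATA_A .. DATA_Z
      else if CharInSet(C, ['0'..'9']) then
        Result := 'DATA_NUM'             // all keys starting with a digit
      else
        Result := 'DATA_OTHER';          // anything else
    end;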

    If you just want raw performance, and will only be performing read-only lookups into the data, then you're better served by one or more ordered index files using a fixed-size record for your keys that points into your data file. You can then perform a straightforward binary search on this index and avoid any database overhead (a lookup sketch follows the record definition below). For an even bigger performance gain, you can pre-load/cache the midpoints into memory to reduce repeated reads.

    A simple fixed size record for your specs might look like:

    type
      rIndexRec = packed record  // packed so the record has a fixed on-disk size
        KeyStr  : String[15];    // short string, 15 chars max
        DataLoc : Integer;       // switch to Int64 if you're using gpHugeFile
      end;
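
    A lookup against the sorted index is then a plain binary search over these fixed-size entries. A minimal sketch (assuming the index stream holds rIndexRec entries sorted by KeyStr with an ordinary case-sensitive comparison):

    uses
      SysUtils, Classes;

    // find aKey in the sorted index stream; returns True and the data
    // position on a hit, False otherwise
    function FindDataLoc(const aKey: string; aIndex: TStream;
      out aDataLoc: Integer): Boolean;
    var
      Lo, Hi, Mid, Cmp : Integer;
      Rec              : rIndexRec;
    begin
      Result := False;
      Lo := 0;
      Hi := Integer(aIndex.Size div SizeOf(rIndexRec)) - 1;
      while Lo <= Hi do
      begin
        Mid := (Lo + Hi) div 2;
        aIndex.Position := Int64(Mid) * SizeOf(rIndexRec);
        aIndex.ReadBuffer(Rec, SizeOf(Rec));
        Cmp := CompareStr(aKey, string(Rec.KeyStr));  // KeyStr is a ShortString
        if Cmp = 0 then
        begin
          aDataLoc := Rec.DataLoc;
          Result := True;
          Exit;
        end
        else if Cmp < 0 then
          Hi := Mid - 1   // key sorts before the midpoint
        else
          Lo := Mid + 1;  // key sorts after the midpoint
      end;
    end;

    Caching the first few levels of midpoints in memory, as mentioned above, removes most of the disc seeks.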
    

    For the initial loading, use the TurboPower sort found in SysTools; the latest version for Delphi 2009/2010 can be downloaded from the SongBeamer website. DataLoc would be the stream position of your data-string record, and writing/reading it might look like the following:

    function WriteDataString(const aDataString: String; aStream: TStream): Integer;
    var
      aLen : Integer;
    begin
      // returns the stream position the value starts at; store it in DataLoc
      Result := aStream.Position;
      aLen := Length(aDataString);
      aStream.Write(aLen, SizeOf(aLen));                     // character count first
      if aLen > 0 then
        aStream.Write(aDataString[1], aLen*SizeOf(Char));    // then the characters
    end;
    
    function ReadDataString(aPos: Integer; aStream: TStream): String;
    var
      aLen : Integer;
    begin
      if aStream.Position <> aPos then
        aStream.Seek(aPos, soFromBeginning);
      Result := '';
      aStream.Read(aLen, SizeOf(aLen));
      SetLength(Result, aLen);
      if aLen > 0 then
        if aStream.Read(Result[1], aLen*SizeOf(Char)) <> aLen*SizeOf(Char) then
          raise Exception.Create('Unable to read entire data string');
    end;
    

    When you create your index records, DataLoc is set to the position of the data-string record. The order in which records are loaded doesn't matter, as long as the index records are sorted. I used exactly this technique to keep a 6-billion-record database up to date with monthly updates, so it easily scales to the extreme.
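
    Put together, a round trip might look something like this (the file name and key are placeholders; it reuses rIndexRec, WriteDataString and ReadDataString from above):

    var
      Data   : TFileStream;
      IdxRec : rIndexRec;
    begin
      Data := TFileStream.Create('data.bin', fmCreate);
      try
        IdxRec.KeyStr  := 'SOMEKEY';
        IdxRec.DataLoc := WriteDataString('the value stored under SOMEKEY', Data);
        // ...collect all index records, sort them by KeyStr (e.g. with the
        // TurboPower sort), then write them out as the index file...
        Writeln(ReadDataString(IdxRec.DataLoc, Data));
      finally
        Data.Free;
      end;
    end.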

    EDIT: Yes, the code above is limited to around 2 GB per data file, but you can extend it by using gpHugeFile or by segmenting. I prefer segmenting into multiple logical files of < 2 GB each, which also takes up slightly less disk space.
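
    The segmenting itself is only a little arithmetic. A sketch (the 1.5 GB cap and the file-name scheme are just examples; Format needs SysUtils):

    const
      cSegSize = 1500 * 1024 * 1024;  // keep each physical file comfortably under 2 GB

    // translate a large logical position into (file number, offset within that file)
    procedure SplitPos(aLogicalPos: Int64; out aFileIndex, aLocalPos: Integer);
    begin
      aFileIndex := Integer(aLogicalPos div cSegSize);
      aLocalPos  := Integer(aLogicalPos mod cSegSize);
    end;

    function SegmentFileName(aFileIndex: Integer): string;
    begin
      Result := Format('data_%.3d.bin', [aFileIndex]);  // data_000.bin, data_001.bin, ...
    end;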
