Parallel processing strings Delphi full available CPU usage

Question

The goal is to achieve full usage of the available cores when converting floats to strings in a single Delphi application. I think this problem applies to string processing in general, but in my example I am specifically using the FloatToStr method.

What I am doing (I've kept this very simple so there is little ambiguity around the implementation):

Using Delphi XE6
Create thread objects which inherit from TThread, and start them.
In the thread's Execute procedure, a large number of doubles are converted to strings via the FloatToStr method.
To simplify, these doubles are just the same constant, so there is no shared or global memory resource required by the threads.

Although multiple cores are used, total CPU usage never exceeds the equivalent of a single core. I understand this is a known issue, so I have some specific questions.

The same work could simply be split across multiple instances of the application, which would make fuller use of the available CPU. Is it possible to do this effectively within the same executable? I.e. assign threads different process IDs at the OS level, or some equivalent division recognised by the OS? Or is this simply not possible in out-of-the-box Delphi?

On scope: I know there are different memory managers available, and other groups have tried changing some of the lower-level asm lock usage (http://synopse.info/forum/viewtopic.php?id=57). But I am asking this question within the scope of not doing things at such a low level.

Thanks

Hi J. My code is deliberately very simple :

TTaskThread = class(TThread)
public
  procedure Execute; override;
end;

procedure TTaskThread.Execute;
var
  i: integer;
begin
  Self.FreeOnTerminate := True;
  for i := 0 to 1000000000 do
    FloatToStr(i*1.31234);
end;

procedure TfrmMain.Button1Click(Sender: TObject);
var
  t1, t2, t3: TTaskThread;
begin
  t1 := TTaskThread.Create(True);
  t2 := TTaskThread.Create(True);
  t3 := TTaskThread.Create(True);
  t1.Start;
  t2.Start;
  t3.Start;
end;

This is test code, where the CPU usage (via performance monitor) maxes out at 25% (I have 4 cores). If the FloatToStr line is swapped for a non-string operation, e.g. Power(i, 2), then the performance monitor shows the expected 75% usage. (Yes, there are better ways to measure this, but I think this is sufficient for the scope of this question.)
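The measurement described above can also be done with a timer instead of the performance monitor. The following is only a minimal sketch, assuming a console application; the names TWorker, RunThreads and Iterations are hypothetical and not part of the original test code. Each thread does the same amount of work, so if the conversion scaled across cores, the elapsed time would stay roughly constant as threads are added.

```delphi
program ThreadScalingTest;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.Diagnostics;

const
  // Hypothetical workload size, much smaller than the loop in the question.
  Iterations = 2000000;

type
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
var
  i: Integer;
begin
  // Same heap-allocating conversion as in the question.
  for i := 0 to Iterations do
    FloatToStr(i * 1.31234);
end;

// Starts Count workers, waits for all of them and returns the elapsed
// wall-clock time in milliseconds.
function RunThreads(Count: Integer): Int64;
var
  Workers: array of TWorker;
  sw: TStopwatch;
  i: Integer;
begin
  SetLength(Workers, Count);
  sw := TStopwatch.StartNew;
  for i := 0 to Count - 1 do
    Workers[i] := TWorker.Create(False); // FreeOnTerminate stays False so WaitFor is safe
  for i := 0 to Count - 1 do
  begin
    Workers[i].WaitFor;
    Workers[i].Free;
  end;
  Result := sw.ElapsedMilliseconds;
end;

procedure Main;
var
  n: Integer;
begin
  for n := 1 to 4 do
    Writeln(n, ' thread(s): ', RunThreads(n), ' ms');
end;

begin
  Main;
end.
```

If the time grows roughly in proportion to the thread count, the threads are effectively being serialized, which matches the 25% reading described above.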

I have explored this issue fairly thoroughly. The purpose of the question was to put forth the crux of the issue in a very simple form.

I am asking about limitations when using the FloatToStr method, and whether there is some way of implementing this that permits better use of the available cores.

Thanks.

Answer 1:

I second what everyone else has said in the comments. It is one of the dirty little secrets of Delphi that the FastMM memory manager is not scalable.

Since memory managers can be replaced you can simply replace FastMM with a scalable memory manager. This is a rapidly changing field. New scalable memory managers pop up every few months. The problem is that it is hard to write a correct scalable memory manager. What are you prepared to trust? One thing that can be said in FastMM's favour is that it is robust.

Rather than replacing the memory manager, it is better to remove the need to replace the memory manager. Simply avoid heap allocation. Find a way to do your work without the need for repeated calls to allocate dynamic memory. Even if you had a scalable heap manager, heap allocation would still cost.

Once you decide to avoid heap allocation the next decision is what to use instead of FloatToStr. In my experience the Delphi runtime library does not offer much support. For example, I recently discovered that there is no good way to convert an integer to text using a caller supplied buffer. So, you may need to roll your own conversion functions. As a simple first step to prove the point, try calling sprintf from msvcrt.dll. This will provide a proof of concept.
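As a rough illustration of the sprintf suggestion, the sketch below imports sprintf from msvcrt.dll and formats a double into a stack buffer. The import declaration, the buffer size of 64 and the '%.15g' format are assumptions chosen for this example, not something the answer specifies.

```delphi
program SprintfDemo;

{$APPTYPE CONSOLE}

// C runtime sprintf, imported from the msvcrt.dll that ships with Windows.
// "varargs" lets Delphi pass the variable argument list the C way.
function sprintf(Buffer: PAnsiChar; Format: PAnsiChar): Integer;
  cdecl; varargs; external 'msvcrt.dll';

var
  Buf: array[0..63] of AnsiChar; // caller-supplied buffer, no heap allocation
  Value: Double;                 // must be Double: C's %g expects an 8-byte double
  Len: Integer;
begin
  Value := 3 * 1.31234;
  // sprintf returns the number of characters written (excluding the #0).
  Len := sprintf(Buf, '%.15g', Value);
  Writeln(PAnsiChar(@Buf[0]), ' (', Len, ' chars)');
end.
```

Note that the argument is declared as Double rather than Extended, since a 10-byte Extended on 32-bit Delphi would not match what the C format string expects.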

Answer 2:

If you can't change the memory manager (MM) the only thing to do is to avoid using it where MM could be a bottleneck.

As for float to string conversion (Disclaimer: I tested the code below with Delphi XE), instead of

procedure Test1;
var
  i: integer;
  S: string;

begin
  for i := 0 to 10 do begin
    S:= FloatToStr(i*1.31234);
    Writeln(S);
  end;
end;

you can use

procedure Test2;
var
  i: integer;
  S: string;
  Value: Extended;

begin
  SetLength(S, 64);
  for i := 0 to 10 do begin
    Value:= i*1.31234;
    // FillChar counts bytes, and Char is 2 bytes in Unicode Delphi
    FillChar(PChar(S)^, 64 * SizeOf(Char), 0);
    FloatToText(PChar(S), Value, fvExtended, ffGeneral, 15, 0);
    Writeln(S);
  end;
end;

which produces the same result but does not allocate memory inside the loop.
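A possible variation on Test2, offered only as a sketch: FloatToText returns the number of characters it wrote, so the buffer can live on the stack and be terminated at that length instead of being zero-filled on every pass. The procedure name PrintValues and the buffer size are assumptions made for this example; the overload that takes a TFormatSettings is used so that no global locale variables are read.

```delphi
program FloatToTextDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

procedure PrintValues;
var
  Buf: array[0..63] of Char; // stack buffer, reused for every value
  fs: TFormatSettings;       // local copy of the locale settings
  Value: Extended;
  Len, i: Integer;
begin
  fs := TFormatSettings.Create;
  for i := 0 to 10 do
  begin
    Value := i * 1.31234;
    // The return value is the number of characters written to Buf.
    Len := FloatToText(PChar(@Buf[0]), Value, fvExtended, ffGeneral, 15, 0, fs);
    Buf[Len] := #0; // terminate so the buffer can be used as a PChar
    Writeln(PChar(@Buf[0]));
  end;
end;

begin
  PrintValues;
end.
```

The Writeln call is only there to show the result; in a real worker thread the characters would be consumed straight from the buffer.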

Answer 3:

Also pay attention to the two overloads:

function FloatToStr(Value: Extended): string; overload;
function FloatToStr(Value: Extended; const FormatSettings: TFormatSettings): string; overload;

The first form of FloatToStr is not thread-safe, because it uses localization information contained in global variables. The second form of FloatToStr, which is thread-safe, refers to localization information contained in the FormatSettings parameter. Before calling the thread-safe form of FloatToStr, you must populate FormatSettings with localization information. To populate FormatSettings with a set of default locale values, call GetLocaleFormatSettings.
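A minimal sketch of the thread-safe form, assuming each thread builds its own TFormatSettings. It uses TFormatSettings.Create, which newer Delphi versions offer in place of GetLocaleFormatSettings; the class name TConvertThread and its Last property are hypothetical. This addresses only thread safety: the result string is still allocated on the heap, so it does not by itself remove the memory manager contention discussed in the other answers.

```delphi
program ThreadSafeFloatToStr;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

type
  TConvertThread = class(TThread)
  private
    FLast: string;
  protected
    procedure Execute; override;
  public
    property Last: string read FLast;
  end;

procedure TConvertThread.Execute;
var
  fs: TFormatSettings; // private to this thread, no shared globals
  i: Integer;
begin
  fs := TFormatSettings.Create;  // defaults for the current locale
  fs.DecimalSeparator := '.';    // optional: pin the separator explicitly
  for i := 0 to 9 do
    FLast := FloatToStr(i * 1.31234, fs); // thread-safe overload
end;

var
  t: TConvertThread;
begin
  t := TConvertThread.Create(False);
  try
    t.WaitFor;        // wait before reading the result from the main thread
    Writeln(t.Last);
  finally
    t.Free;
  end;
end.
```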

Answer 4:

Many thanks for your knowledge and help so far. Following your suggestions, I've attempted to write an equivalent of FloatToStr that avoids heap allocation, with some success. This is by no means a solid, foolproof implementation, just a nice and simple proof of concept that could be extended into a more satisfying solution.

(I should also note that I am using XE6, 64-bit.)

Experiment result/observations:

the CPU usage % was proportional to the number of threads started (i.e. each thread = 1 core maxed out via performance monitor).
as expected, with more threads started, performance degraded somewhat for each individual one (i.e. time measured to perform task - see code).

times are just rough averages

8 cores 3.3GHz - 1 thread took 4200ms. 6 threads took 5200ms each.
8 cores 2.5GHz - 1 thread took 4800ms. 2=>4800ms, 4=>5000ms, 6=>6300ms.

I did not calculate the overall time for a total multi thread run. Just observed CPU usage % and measured individual thread times.

Personally I find it a little hilarious that this actually works :) Or perhaps I have done something horribly wrong ?

Surely there are library units out there which resolve these things ?

The code:

unit Main;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls,
  Generics.Collections,
  DateUtils;

type
  TfrmParallel = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

  TTaskThread = class(TThread)
  private
    Fl: TList<double>;
  public
    procedure Add(l: TList<double>);
    procedure Execute; override;
  end;

var
  frmParallel: TfrmParallel;

implementation

{$R *.dfm}

{  TTaskThread  }

procedure TTaskThread.Add(l: TList<double>);
begin
  Fl := l;
end;

procedure TTaskThread.Execute;
var
  i, j: integer;
  s, xs: shortstring;

  FR: TFloatRec;
  V: double;
  Precision, D: integer;

  ZeroCount: integer;

  Start, Finish: TDateTime;

  procedure AppendByteToString(var Result: shortstring; const B: Byte);
  begin
    //B is an ASCII digit code; '1'..'9' are appended as-is,
    //anything else (including the literal 0 passed below) becomes '0'
    if (B >= Ord('1')) and (B <= Ord('9')) then
      Result := Result + AnsiChar(B)
    else
      Result := Result + '0';
  end;

  procedure AppendDP(var Result: shortstring);
  begin
    Result := Result + '.';
  end;

begin
  Precision := 9;
  D := 1000;
  Self.FreeOnTerminate := True;
  //
  Start := Now;
  for i := 0 to Fl.Count - 1 do
  begin
    V := Fl[i];   

//    //original way - just for testing
//    xs := shortstring(FloatToStrF(V, TFloatFormat.ffGeneral, Precision, D));

    //1. get float rec     
    FloatToDecimal(FR, V, TFloatValue.fvExtended, Precision, D);
    //2. check sign
    if FR.Negative then
      s := '-'
    else
      s := '';
    //2. handle negative exponent
    if FR.Exponent < 1 then
    begin
      AppendByteToString(s, 0);
      AppendDP(s);
      for j := 1 to Abs(FR.Exponent) do
        AppendByteToString(s, 0);
    end;      
    //3. count consecutive zeroes
    ZeroCount := 0;
    for j := Precision - 1 downto 0 do
    begin
      if (FR.Digits[j] > 48) and (FR.Digits[j] < 58) then
        Break;
      Inc(ZeroCount);
    end;
    //4. build string
    for j := 0 to Length(FR.Digits) - 1 do
    begin
      if j = Precision then
        Break;
      //cut off where there are only zeroes left up to precision
      if (j + ZeroCount) = Precision then
        Break;
      //insert decimal point - for positive exponent
      if (FR.Exponent > 0) and (j = FR.Exponent) then
        AppendDP(s);
      //append next digit
      AppendByteToString(s, FR.Digits[j]);
    end;      

//    //use just to test agreement with FloatToStrF
//    if s <> xs then
//      frmParallel.Memo1.Lines.Add(string(s + '|' + xs));

  end;
  Fl.Free;

  Finish := Now;
  //
  frmParallel.Memo1.Lines.Add(IntToStr(MillisecondsBetween(Start, Finish))); 
  //NOTE: the line above is NOT thread-safe (it touches a VCL control from a worker thread)
end;

procedure TfrmParallel.Button1Click(Sender: TObject);
var
  i: integer;
  t: TTaskThread;
  l: TList<double>;
begin
  //pre generating the doubles is not required, is just a more useful test for me
  l := TList<double>.Create;
  for i := 0 to 10000000 do
    l.Add(Now/(-i-1)); //some double generation
  //
  t := TTaskThread.Create(True);
  t.Add(l);
  t.Start;
end;

end.

Answer 5:

By default, on thread contention (when one thread cannot acquire access to data locked by another thread), FastMM4 calls the Windows API function Sleep(0) and then, if the lock is still not available, enters a loop that calls Sleep(1) after each check of the lock.

Each call to Sleep(0) incurs the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles. As for Sleep(1), besides the costs associated with Sleep(0), it also delays execution by at least 1 millisecond, ceding control to other threads, and, if there are no threads waiting to be executed by a physical CPU core, puts the core to sleep, effectively reducing CPU usage and power consumption.

That’s why, in your case, CPU use never reached 100% - because of the Sleep(1) issued by FastMM4.

This way of acquiring locks is not optimal.

A better way would have been a spin-lock of about 5000 pause instructions and, if the lock was still busy, a call to the SwitchToThread() API. If pause is not available (on very old processors without SSE2 support) or SwitchToThread() is not available (on very old Windows versions, prior to Windows 2000), the best solution would be to use EnterCriticalSection/LeaveCriticalSection, which do not have the latency associated with Sleep(1) and which also cede control of the CPU core to other threads very effectively.

I have modified FastMM4 to use a new approach to waiting for a lock: critical sections instead of Sleep(). I have implemented compile-time options that remove the original FastMM4 approach of calling Sleep(InitialSleepTime) and then Sleep(AdditionalSleepTime) (or Sleep(0) and Sleep(1)) and replace it with EnterCriticalSection/LeaveCriticalSection. With these options enabled, Sleep() is never used. This saves the CPU cycles wasted by Sleep(0) and removes the latency of at least 1 millisecond added by each Sleep(1), because critical sections are much more CPU-friendly and have definitely lower latency than Sleep(1).

Testing has shown that using critical sections instead of Sleep() (the previous FastMM4 default) provides a significant gain when the number of threads working with the memory manager is the same as or higher than the number of physical cores. The gain is even more evident on computers with multiple physical CPUs and Non-Uniform Memory Access (NUMA).

When these options are enabled, FastMM4-AVX checks:

whether the CPU supports SSE2 and thus the "pause" instruction, and
whether the operating system has the SwitchToThread() API call.

If both are available, it uses a "pause" spin-loop of 5000 iterations followed by SwitchToThread() instead of critical sections. If the CPU doesn't have the "pause" instruction or Windows doesn't have the SwitchToThread() API function, it uses EnterCriticalSection/LeaveCriticalSection. I have made the fork, called FastMM4-AVX, available at https://github.com/maximmasiutin/FastMM4
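To make the spin-then-yield idea concrete, here is a simplified sketch of a lock that spins with the pause instruction up to 5000 times and then falls back to SwitchToThread(). It is not the FastMM4-AVX code; the names AcquireLock, ReleaseLock, LockFlag and TIncThread are hypothetical, and it relies on YieldProcessor (which issues pause) and TInterlocked from the Delphi RTL.

```delphi
program SpinThenYieldLock;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils, System.Classes, System.SyncObjs;

const
  SpinLimit = 5000; // spin count mentioned in the answer

var
  LockFlag: Integer = 0; // 0 = free, 1 = held
  Counter: Integer = 0;  // shared data protected by the lock

procedure AcquireLock;
var
  Spins: Integer;
begin
  Spins := 0;
  // CompareExchange returns the previous value; 0 means we took the lock.
  while TInterlocked.CompareExchange(LockFlag, 1, 0) <> 0 do
  begin
    if Spins < SpinLimit then
    begin
      YieldProcessor; // "pause": cheap hint that we are in a spin loop
      Inc(Spins);
    end
    else
      SwitchToThread; // still busy: give up the rest of the time slice
  end;
end;

procedure ReleaseLock;
begin
  TInterlocked.Exchange(LockFlag, 0);
end;

type
  TIncThread = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TIncThread.Execute;
var
  i: Integer;
begin
  for i := 1 to 100000 do
  begin
    AcquireLock;
    Inc(Counter); // only safe because the lock is held
    ReleaseLock;
  end;
end;

procedure Main;
var
  Threads: array[0..3] of TIncThread;
  i: Integer;
begin
  for i := Low(Threads) to High(Threads) do
    Threads[i] := TIncThread.Create(False);
  for i := Low(Threads) to High(Threads) do
  begin
    Threads[i].WaitFor;
    Threads[i].Free;
  end;
  Writeln('Counter = ', Counter);
end;

begin
  Main;
end.
```

With four threads of 100000 increments each, a working lock should leave Counter at 400000. Unlike a Sleep(1) loop, a waiting thread here never gives up a full millisecond when the lock is released quickly.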

Here is a comparison of the original FastMM4 version 4.992, with default options, compiled for Win64 by Delphi 10.2 Tokyo (Release with Optimization), and the current FastMM4-AVX branch. In some scenarios, the FastMM4-AVX branch is more than twice as fast as the original FastMM4. The tests were run on two different computers: one with two Xeon E6-2543v2 CPU sockets, each with 6 physical cores (12 logical threads), with only 5 physical cores per socket enabled for the test application; the other with an i7-7700K CPU.

The tests used the "Multi-threaded allocate, use and free" and "NexusDB" test cases from the FastCode Challenge Memory Manager test suite, modified to run under 64-bit.

                     Xeon E6-2543v2 2*CPU     i7-7700K CPU
                    (allocated 20 logical  (allocated 8 logical
                     threads, 10 physical   threads, 4 physical
                     cores, NUMA)           cores)

                    Orig.  AVX-br.  Ratio   Orig.  AVX-br. Ratio
                    ------  -----  ------   -----  -----  ------
02-threads realloc   96552  59951  62.09%   65213  49471  75.86%
04-threads realloc   97998  39494  40.30%   64402  47714  74.09%
08-threads realloc   98325  33743  34.32%   64796  58754  90.68%
16-threads realloc  116708  45855  39.29%   71457  60173  84.21%
16-threads realloc  116273  45161  38.84%   70722  60293  85.25%
31-threads realloc  122528  53616  43.76%   70939  62962  88.76%
64-threads realloc  137661  54330  39.47%   73696  64824  87.96%
NexusDB 02 threads  122846  90380  73.72%   79479  66153  83.23%
NexusDB 04 threads  122131  53103  43.77%   69183  43001  62.16%
NexusDB 08 threads  124419  40914  32.88%   64977  33609  51.72%
NexusDB 12 threads  181239  55818  30.80%   83983  44658  53.18%
NexusDB 16 threads  135211  62044  43.61%   59917  32463  54.18%
NexusDB 31 threads  134815  48132  33.46%   54686  31184  57.02%
NexusDB 64 threads  187094  57672  30.25%   63089  41955  66.50%

Your code that calls FloatToStr is OK as a memory manager test, since it allocates a result string using the memory manager, then reallocates it, and so on. An even better idea would have been to deallocate it explicitly, for example:

procedure TTaskThread.Execute;
var
  i: integer;
  s: string;
begin
  for i := 0 to 1000000000 do
  begin
    s := FloatToStr(i*1.31234);
    Finalize(s);
  end;
end;

You can find better tests of the memory manager in the FastCode challenge test suite at http://fastcode.sourceforge.net/

Source: https://stackoverflow.com/questions/28079372/parallel-processing-strings-delphi-full-available-cpu-usage

Tags

multithreading

Delphi

parallel-processing

delphi-xe6