Multi-threaded C program much slower in OS X than Linux

佛祖请我去吃肉 2021-02-19 19:15

I wrote this for an OS class assignment that I've already completed and handed in. I posted this question yesterday, but due to "Academic Honesty" regulations I took it off u

3 Answers
  • 2021-02-19 19:53

    I've duplicated your result to a goodly extent (without the sweeper):

    #include <stdlib.h>
    #include <stdio.h>
    
    #include <pthread.h>
    
    pthread_mutex_t Lock;
    pthread_t       LastThread;
    int             Array[100];
    
    void *foo(void *arg)
    {
      pthread_t self  = pthread_self();
      int num_in_row  = 1;
      int num_streaks = 0;
      double avg_strk = 0.0;
      int i;
    
      for (i = 0; i < 1000000; ++i)
      {
        /* Divide by RAND_MAX + 1.0 so the index always stays in 0..99. */
        int p1 = (int) (100.0 * rand() / (RAND_MAX + 1.0));
        int p2 = (int) (100.0 * rand() / (RAND_MAX + 1.0));
    
        pthread_mutex_lock(&Lock);
        {
          int tmp   = Array[p1];
          Array[p1] = Array[p2];
          Array[p2] = tmp;
    
          if (pthread_equal(LastThread, self))
            ++num_in_row;
    
          else
          {
            ++num_streaks;
            avg_strk += (num_in_row - avg_strk) / num_streaks;
            num_in_row = 1;
            LastThread = self;
          }
        }
        pthread_mutex_unlock(&Lock);
      }
    
      fprintf(stdout, "Thread exiting with avg streak length %lf\n", avg_strk);
    
      return NULL;
    }
    
    int main(int argc, char **argv)
    {
      int       num_threads = (argc > 1 ? atoi(argv[1]) : 40);
      pthread_t thrs[num_threads];
      void     *ret;
      int       i;
    
      if (pthread_mutex_init(&Lock, NULL))
      {
        /* pthread functions return an error code and do not set errno. */
        fprintf(stderr, "pthread_mutex_init failed!\n");
        return 1;
      }
    
      for (i = 0; i < 100; ++i)
        Array[i] = i;
    
      for (i = 0; i < num_threads; ++i)
        if (pthread_create(&thrs[i], NULL, foo, NULL))
        {
          /* pthread functions return an error code and do not set errno. */
          fprintf(stderr, "pthread create failed!\n");
          return 1;
        }
    
      for (i = 0; i < num_threads; ++i)
        if (pthread_join(thrs[i], &ret))
        {
          /* pthread functions return an error code and do not set errno. */
          fprintf(stderr, "pthread join failed!\n");
          return 1;
        }
    
      /*
      for (i = 0; i < 100; ++i)
        printf("%d\n", Array[i]);
    
      printf("Goodbye!\n");
      */
    
      return 0;
    }
    

    On a Linux (2.6.18-308.24.1.el5) server with an Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz:

    [ltn@svg-dc60-t1 ~]$ time ./a.out 1
    
    real    0m0.068s
    user    0m0.068s
    sys     0m0.001s
    [ltn@svg-dc60-t1 ~]$ time ./a.out 2
    
    real    0m0.378s
    user    0m0.443s
    sys     0m0.135s
    [ltn@svg-dc60-t1 ~]$ time ./a.out 3
    
    real    0m0.899s
    user    0m0.956s
    sys     0m0.941s
    [ltn@svg-dc60-t1 ~]$ time ./a.out 4
    
    real    0m1.472s
    user    0m1.472s
    sys     0m2.686s
    [ltn@svg-dc60-t1 ~]$ time ./a.out 5
    
    real    0m1.720s
    user    0m1.660s
    sys     0m4.591s
    
    [ltn@svg-dc60-t1 ~]$ time ./a.out 40
    
    real    0m11.245s
    user    0m13.716s
    sys     1m14.896s
    

    On my MacBook Pro (Yosemite 10.10.2) with a 2.6 GHz i7 and 16 GB of memory:

    john-schultzs-macbook-pro:~ jschultz$ time ./a.out 1
    
    real    0m0.057s
    user    0m0.054s
    sys     0m0.002s
    john-schultzs-macbook-pro:~ jschultz$ time ./a.out 2
    
    real    0m5.684s
    user    0m1.148s
    sys     0m5.353s
    john-schultzs-macbook-pro:~ jschultz$ time ./a.out 3
    
    real    0m8.946s
    user    0m1.967s
    sys     0m8.034s
    john-schultzs-macbook-pro:~ jschultz$ time ./a.out 4
    
    real    0m11.980s
    user    0m2.274s
    sys     0m10.801s
    john-schultzs-macbook-pro:~ jschultz$ time ./a.out 5
    
    real    0m15.680s
    user    0m3.307s
    sys     0m14.158s
    john-schultzs-macbook-pro:~ jschultz$ time ./a.out 40
    
    real    2m7.377s
    user    0m23.926s
    sys     2m2.434s
    

    It took my Mac roughly 11x as much wall-clock time to complete with 40 threads (2m7.4s versus 11.2s), and that's against a very old version of Linux + gcc.

    NOTE: I changed my code to do 1M swaps per thread.

    It looks like OS X does a lot more work than Linux under contention. Maybe it interleaves the threads much more finely than Linux does?

    EDIT: Updated the code to record the average number of times in a row a thread re-acquires the lock (its streak length):

    Linux

    [ltn@svg-dc60-t1 ~]$ time ./a.out 10
    Thread exiting with avg streak length 2.103567
    Thread exiting with avg streak length 2.156641
    Thread exiting with avg streak length 2.101194
    Thread exiting with avg streak length 2.068383
    Thread exiting with avg streak length 2.110132
    Thread exiting with avg streak length 2.046878
    Thread exiting with avg streak length 2.087338
    Thread exiting with avg streak length 2.049701
    Thread exiting with avg streak length 2.041052
    Thread exiting with avg streak length 2.048456
    
    real    0m2.837s
    user    0m3.012s
    sys     0m16.040s
    

    Mac OS X

    john-schultzs-macbook-pro:~ jschultz$ time ./a.out 10
    Thread exiting with avg streak length 1.000000
    Thread exiting with avg streak length 1.000000
    Thread exiting with avg streak length 1.000000
    Thread exiting with avg streak length 1.000000
    Thread exiting with avg streak length 1.000000
    Thread exiting with avg streak length 1.000000
    Thread exiting with avg streak length 1.000000
    Thread exiting with avg streak length 1.000000
    Thread exiting with avg streak length 1.000000
    Thread exiting with avg streak length 1.000000
    
    real    0m34.163s
    user    0m5.902s
    sys     0m30.329s
    

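    So OS X hands the lock to a different thread on every single acquisition (average streak length exactly 1.0, versus about 2.1 on Linux). It shares the lock much more evenly and therefore incurs many more thread suspensions and resumptions.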

  • 2021-02-19 20:01
    The OP does not show any code in which the threads sleep, wait, or otherwise give up execution, and all the threads run at the same 'nice' level.

    So, on Linux, an individual thread may well get the CPU and not release it until it has completed all 2 million iterations, which keeps the time spent on context switches to a minimum.

    On Mac OS, however, a thread is only given a 'time slice' in which to execute before another 'ready to execute' thread or process is allowed to run.

    This means many, many more context switches.

    Context switches are accounted for in 'sys' time.

    The result is that Mac OS takes much longer to run the program.

    To even the playing field, you could force context switches by inserting a nanosleep() call, or by releasing the CPU explicitly. For the latter, include

    #include <sched.h>

    and then call

    int sched_yield(void);
    
  • 2021-02-19 20:05

    OS X and Linux implement pthreads differently, and that causes this slowdown. Specifically, OS X does not use spinlocks (pthread spinlocks come from POSIX, which specified them as an optional feature; ISO C does not define them). This can lead to very, very slow code in examples like this one.
