close() is not closing socket properly

既然无缘 2020-11-28 03:06

I have a multi-threaded server (thread pool) that is handling a large number of requests (up to 500/sec for one node), using 20 threads. There's a listener thread that acc

3 Answers
  • 2020-11-28 03:45

    This sounds to me like a bug in your Linux distribution.

    The GNU C library documentation says:

    When you have finished using a socket, you can simply close its file descriptor with close

    Nothing about clearing any error flags or waiting for the data to be flushed or any such thing.

    Your code is fine; your O/S has a bug.

  • 2020-11-28 03:54

    Great answer from Joseph Quinsey. I have comments on the haveInput function. Wondering how likely it is that select returns an fd you did not include in your set. This would be a major OS bug IMHO. That's the kind of thing I would check if I wrote unit tests for the select function, not in an ordinary app.

    if (!(status = select(fd + 1, &fds, 0, 0, &tv)))
       return FALSE;
    else if (status > 0 && FD_ISSET(fd, &fds))
       return TRUE;
    else if (status > 0)
       FatalError("I am confused"); // <--- fd unknown to function
    

    My other comment pertains to the handling of EINTR. In theory, you could get stuck in an infinite loop if select kept returning EINTR, as this error lets the loop start over. Given the very short timeout (0.01), it appears highly unlikely to happen. However, I think the appropriate way of dealing with this would be to return errors to the caller (flushSocketBeforeClose). The caller can keep calling haveInput as long as its timeout hasn't expired, and declare failure for other errors.
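
    A minimal sketch of that idea, assuming a three-way return value is acceptable. The name waitForInput() is hypothetical and not part of the original answer; it reports every select() failure, including EINTR, to the caller rather than looping or aborting.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>

/* Hypothetical variant of haveInput(): returns 1 if fd is readable,
 * 0 on timeout, and -1 on any select() error. errno is left set
 * (EINTR included) so the caller decides whether to retry. */
int waitForInput(int fd, double timeout) {
   fd_set fds;
   struct timeval tv;
   int status;

   assert(fd >= 0 && fd < FD_SETSIZE);   /* FD_SET is undefined otherwise */
   FD_ZERO(&fds);
   FD_SET(fd, &fds);
   tv.tv_sec  = (long)timeout;
   tv.tv_usec = (long)((timeout - tv.tv_sec) * 1000000);

   status = select(fd + 1, &fds, 0, 0, &tv);
   if (status < 0)
      return -1;                          /* caller inspects errno */
   return (status > 0 && FD_ISSET(fd, &fds)) ? 1 : 0;
}
```

    The caller can then treat -1 with errno == EINTR as "try again if my own deadline allows" and any other -1 as a failure.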

    ADDITION #1

    flushSocketBeforeClose will not exit quickly in case of read returning an error. It will keep looping until the timeout expires. You can't rely on the select inside haveInput to anticipate all errors. read has errors of its own (ex: EIO).

         while (haveInput(fd, 0.01))
            if (!read(fd, discard, sizeof discard)) // <-- -1 does not end loop
               return TRUE;
    
  • 2020-11-28 04:00

    Here is some code I've used on many Unix-like systems (e.g. SunOS 4, SGI IRIX, HPUX 10.20, CentOS 5, Cygwin) to close a socket:

    int getSO_ERROR(int fd) {
       int err = 1;
       socklen_t len = sizeof err;
       if (-1 == getsockopt(fd, SOL_SOCKET, SO_ERROR, (char *)&err, &len))
          FatalError("getSO_ERROR");
       if (err)
          errno = err;              // set errno to the socket SO_ERROR
       return err;
    }
    
    void closeSocket(int fd) {      // *not* the Windows closesocket()
       if (fd >= 0) {
          getSO_ERROR(fd); // first clear any errors, which can cause close to fail
          if (shutdown(fd, SHUT_RDWR) < 0) // secondly, terminate the 'reliable' delivery
             if (errno != ENOTCONN && errno != EINVAL) // SGI causes EINVAL
                Perror("shutdown");
          if (close(fd) < 0) // finally call close()
             Perror("close");
       }
    }
    

    But the above does not guarantee that any buffered writes are sent.

    Graceful close: It took me about 10 years to figure out how to close a socket. But for another 10 years I just lazily called usleep(20000) for a slight delay to 'ensure' that the write buffer was flushed before the close. This obviously is not very clever, because:

    • The delay was too long most of the time.
    • The delay was too short some of the time--maybe!
    • A signal such as SIGCHLD could occur to end usleep() (but I usually called usleep() twice to handle this case--a hack).
    • There was no indication whether this works. But this is perhaps not important if a) hard resets are perfectly ok, and/or b) you have control over both sides of the link.

    But doing a proper flush is surprisingly hard. Using SO_LINGER is apparently not the way to go; see for example:

    • http://msdn.microsoft.com/en-us/library/ms740481%28v=vs.85%29.aspx
    • https://www.google.ca/#q=the-ultimate-so_linger-page

    And SIOCOUTQ appears to be Linux-specific.

    Note shutdown(fd, SHUT_WR) doesn't stop writing, contrary to its name, and maybe contrary to man 2 shutdown.

    The function flushSocketBeforeClose() below waits until read() returns zero bytes (the peer has closed its end), or until the timer expires. The function haveInput() is a simple wrapper for select(2), and is set to block for up to 1/100th of a second.

    bool haveInput(int fd, double timeout) {
       int status;
       fd_set fds;
       struct timeval tv;
       FD_ZERO(&fds);
       FD_SET(fd, &fds);
    
       while (1) {
          // reset each pass: Linux leaves the timeout undefined after an error such as EINTR
          tv.tv_sec  = (long)timeout; // cast needed for C++
          tv.tv_usec = (long)((timeout - tv.tv_sec) * 1000000); // 'suseconds_t'
          if (!(status = select(fd + 1, &fds, 0, 0, &tv)))
             return FALSE;
          else if (status > 0 && FD_ISSET(fd, &fds))
             return TRUE;
          else if (status > 0)
             FatalError("I am confused");
          else if (errno != EINTR)
             FatalError("select"); // tbd EBADF: man page "an error has occurred"
       }
    }
    
    bool flushSocketBeforeClose(int fd, double timeout) {
       const double start = getWallTimeEpoch();
       char discard[99];
       ASSERT(SHUT_WR == 1);
       if (shutdown(fd, 1) != -1)
          while (getWallTimeEpoch() < start + timeout)
             while (haveInput(fd, 0.01)) // can block for 0.01 secs
                if (!read(fd, discard, sizeof discard))
                   return TRUE; // success!
       return FALSE;
    }
    

    Example of use:

       if (!flushSocketBeforeClose(fd, 2.0)) // can block for 2s
           printf("Warning: Cannot gracefully close socket\n");
       closeSocket(fd);
    

    In the above, my getWallTimeEpoch() is similar to time(), and Perror() is a wrapper for perror().
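
    The helpers themselves are not shown in this answer. A minimal sketch of how they might look follows; the bodies are assumptions for illustration, in particular the use of gettimeofday() for sub-second resolution and the choice to exit in FatalError().

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Seconds since the Unix epoch, with sub-second resolution. */
double getWallTimeEpoch(void) {
   struct timeval tv;
   gettimeofday(&tv, NULL);
   return (double)tv.tv_sec + tv.tv_usec / 1e6;
}

/* Report the current errno on stderr and carry on. */
void Perror(const char *what) {
   perror(what);
}

/* Report the current errno on stderr and terminate the process. */
void FatalError(const char *what) {
   perror(what);
   exit(EXIT_FAILURE);
}
```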

    Edit: Some comments:

    • My first admission is a bit embarrassing. The OP and Nemo challenged the need to clear the internal so_error before close, but I cannot now find any reference for this. The system in question was HPUX 10.20. After a failed connect(), just calling close() did not release the file descriptor, because the system wished to deliver an outstanding error to me. But I, like most people, never bothered to check the return value of close. So I eventually ran out of file descriptors (ulimit -n), which finally got my attention.

    • (very minor point) One commentator objected to the hard-coded numerical arguments to shutdown(), rather than e.g. SHUT_WR for 1. The simplest answer is that Windows uses different #defines/enums e.g. SD_SEND. And many other writers (e.g. Beej) use constants, as do many legacy systems.

    • Also, I always, always, set FD_CLOEXEC on all my sockets, since in my applications I never want them passed to a child and, more importantly, I don't want a hung child to impact me.

    Sample code to set CLOEXEC:

       static void setFD_CLOEXEC(int fd) {
          int status = fcntl(fd, F_GETFD, 0);
          if (status >= 0)
             status = fcntl(fd, F_SETFD, status | FD_CLOEXEC);
          if (status < 0)
             Perror("Error getting/setting socket FD_CLOEXEC flags");
       }
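
    On the first point above (descriptors leaking because nobody looked at the return value of close()), a small wrapper makes the failure visible. This is a sketch; the name closeChecked() is hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Close fd and report a failure instead of silently ignoring it.
 * Returns 0 on success, or the errno value from close() on failure. */
int closeChecked(int fd) {
   if (close(fd) == 0)
      return 0;
   int err = errno;          /* save it before fprintf can change errno */
   fprintf(stderr, "close(%d) failed: %s\n", fd, strerror(err));
   return err;
}
```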
    