Is there a good way to detect a stale NFS mount

后端未结

关注

 7  1108

别跟我提以往

I have a procedure I want to initiate only if several tests complete successfully.

One test I need is that all of my NFS mounts are alive and well.

Can I do

相关标签:

7条回答

我在风中等你

2020-12-23 15:24

Writing a C program that checks for ESTALE is a good option if you don't mind waiting for the command to finish because of the stale file system. If you want to implement a "timeout" option the best way I've found to implement it (in a C program) is to fork a child process that tries to open the file. You then check if the child process has finished reading a file successfully in the filesystem within an allocated amount of time.

Here is a small proof of concept C program to do this:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/wait.h>


void readFile();
void waitForChild(int pid);


int main(int argc, char *argv[])
{
  int pid;

  pid = fork();

  if(pid == 0) {
    // Child process.
    readFile();
  }
  else if(pid > 0) {
    // Parent process.
    waitForChild(pid);
  }
  else {
    // Error
    perror("Fork");
    exit(1);
  }

  return 0;
}

void waitForChild(int child_pid)
{
  int timeout = 2; // 2 seconds timeout.
  int status;
  int pid;

  while(timeout != 0) {
    pid = waitpid(child_pid, &status, WNOHANG);
    if(pid == 0) {
      // Still waiting for a child.
      sleep(1);
      timeout--;
    }
    else if(pid == -1) {
      // Error
      perror("waitpid()");
      exit(1);
    }
    else {
      // The child exited.
      if(WIFEXITED(status)) {
        // Child was able to call exit().
        if(WEXITSTATUS(status) == 0) {
          printf("File read successfully!\n");
          return;
        }
      }
      printf("File NOT read successfully.\n");
      return;
    }
  }

  // The child did not finish and the timeout was hit.
  kill(child_pid, 9);
  printf("Timeout reading the file!\n");
}

void readFile()
{
  int fd;

  fd = open("/path/to/a/file", O_RDWR);
  if(fd == -1) {
    // Error
    perror("open()");
    exit(1);
  }
  else {
    close(fd);
    exit(0);
  }
}

0 讨论(0)

长发绾君心

2020-12-23 15:25
A colleague of mine ran into your script. This doesn't avoid a "brute force" approach, but if I may in Bash:
```
while read _ _ mount _; do 
  read -t1 < <(stat -t "$mount") || echo "$mount timeout"; 
done < <(mount -t nfs)
```
mount can list NFS mounts directly. read -t (a shell builtin) can time out a command. stat -t (terse output) still hangs like an ls*. ls yields unnecessary output, risks false positives on huge/slow directory listings, and requires permissions to access - which would also trigger a false positive if it doesn't have them.
```
while read _ _ mount _; do 
  read -t1 < <(stat -t "$mount") || lsof -b 2>/dev/null|grep "$mount"; 
done < <(mount -t nfs)
```
We're using it with lsof -b (non-blocking, so it won't hang too) in order to determine the source of the hangs.

Thanks for the pointer!
- test -d (a shell builtin) would work instead of stat (a standard external) as well, but read -t returns success only if it doesn't time out and reads a line of input. Since test -d doesn't use stdout, a (( $? > 128 )) errorlevel check on it would be necessary - not worth the legibility hit, IMO.
0 讨论(0)
发布评论:

提交评论
- 加载中...
渐次进展

2020-12-23 15:26
In addition to previous answers, which hangs under some circumstances, this snippet checks all suitable mounts, kills with signal KILL, and is tested with CIFS too:
```
grep -v tracefs /proc/mounts | cut -d' ' -f2 | \
  while read m; do \
    timeout --signal=KILL 1 ls -d $m > /dev/null || echo "$m"; \
  done
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
生来不讨喜

2020-12-23 15:27

I wrote https://github.com/acdha/mountstatus which uses an approach similar to what UndeadKernel mentioned, which I've found to be the most robust approach: it's a daemon which periodically scans all mounted filesystems by forking a child process which attempts to list the top-level directory and SIGKILL it if it fails to respond in a certain timeout, with both successes and failures recorded to syslog. That avoids issues with certain client implementations (e.g older Linux) which never trigger timeouts for certain classes of error, NFS servers which are partially responsive but e.g. won't respond to actual calls like listdir, etc.

I don't publish them but the included Makefile uses fpm to build rpm and deb packages with an Upstart script.

0 讨论(0)
发布评论:

提交评论
- 加载中...

悲&欢浪女

2020-12-23 15:33

Another way, using shell script. Works good for me:

#!/bin/bash
# Purpose:
# Detect Stale File handle and remove it
# Script created: July 29, 2015 by Birgit Ducarroz
# Last modification: --
#

# Detect Stale file handle and write output into a variable and then into a file
mounts=`df 2>&1 | grep 'Stale file handle' |awk '{print ""$2"" }' > NFS_stales.txt`
# Remove : ‘ and ’ characters from the output
sed -r -i 's/://' NFS_stales.txt && sed -r -i 's/‘//' NFS_stales.txt && sed -r -i 's/’//' NFS_stales.txt

# Not used: replace space by a new line
# stales=`cat NFS_stales.txt && sed -r -i ':a;N;$!ba;s/ /\n /g' NFS_stales.txt`

# read NFS_stales.txt output file line by line then unmount stale by stale.
#    IFS='' (or IFS=) prevents leading/trailing whitespace from being trimmed.
#    -r prevents backslash escapes from being interpreted.
#    || [[ -n $line ]] prevents the last line from being ignored if it doesn't end with a \n (since read returns a non-zero exit code when it encounters EOF).

while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "Unmounting due to NFS Stale file handle: $line"
    umount -fl $line
done < "NFS_stales.txt"
#EOF

0 讨论(0)

终归单人心

2020-12-23 15:34

You could write a C program and check for ESTALE.

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <iso646.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(){
    struct stat st;
    int ret;
    ret = stat("/mnt/some_stale", &st);
    if(ret == -1 and errno == ESTALE){
        printf("/mnt/some_stale is stale\n");
        return EXIT_SUCCESS;
    } else {
        return EXIT_FAILURE;
    }
}

0 讨论(0)

1 2 下一页