Attempt to access remote folder mounted with CIFS hangs when disconnected

Submitted by 百般思念 on 2019-12-06 09:11:24

Question


This question is an extension of that question.

Again: I'm working on CentOS 6.0 and have a remote Windows 7 folder, mounted with:

mount -t cifs //PC128/mnt /media/net -o "username=WORKGROUP\user,password=pwd,rw,noexec,soft,uid=user,gid=user"

When the remote folder is not available (e.g. the network cable is unplugged), any attempt to access it locks up the application I'm working on. At first I found that QDir::exists() blocked for 20 to 90 seconds (I still can't work out why the duration varies so much); later I found that any call to stat() blocks the application.

I followed the advice given in that question and moved the QDir::exists() call (and later the stat() call) to another thread, but this didn't solve the problem. The application still hangs when the connection is suddenly lost. The Qt stack trace shows that it is blocked somewhere in the kernel:

0   __kernel_vsyscall
1   __xstat64@GLIBC_2.1               /lib/libc.so.6
2   QFSFileEnginePrivate::doStat      stat.h

I also tried checking whether the remote share is still mounted before accessing the folder itself, but it didn't help. Approaches such as:

mount | grep /media/net

show that the shared folder is still mounted even when there is no network connection.
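For reference, the same check can be made from inside the application by reading the kernel's mount table directly instead of running mount. A minimal sketch in C, assuming glibc's setmntent()/getmntent() from <mntent.h>; the helper name is_listed_in_mount_table is hypothetical. It never touches the share itself, so it should not block on a dead server, but for exactly that reason it confirms the limitation described above: a listed entry says nothing about whether the server is reachable.

```c
#define _GNU_SOURCE
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if mount_point is listed in the mount
   table file `table` (normally /proc/mounts) with filesystem type
   `fstype` (or any type if fstype is NULL), 0 if it is not listed,
   and -1 if the table cannot be opened. Only the local table is read;
   the CIFS server is never contacted. */
int is_listed_in_mount_table(const char *table, const char *mount_point,
                             const char *fstype) {
    FILE *f = setmntent(table, "r");
    if (f == NULL)
        return -1;
    struct mntent *m;
    int found = 0;
    while ((m = getmntent(f)) != NULL) {
        if (strcmp(m->mnt_dir, mount_point) == 0 &&
            (fstype == NULL || strcmp(m->mnt_type, fstype) == 0)) {
            found = 1;
            break;
        }
    }
    endmntent(f);
    return found;
}
```

A typical call would be is_listed_in_mount_table("/proc/mounts", "/media/net", "cifs"); it answers "is it mounted?", not "is it reachable?".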

Checking whether the folder's filesystem type differs from that of its parent directory, e.g.:

[ "$(stat -fc%t:%T /media/net/)" != "$(stat -fc%t:%T /media/net/..)" ]

also hangs for ~20 seconds.

So I have several questions:

  1. Is there any way to change CIFS timeouts? I tried to find out, but there seem to be no suitable parameters and no CIFS config file.
  2. How can I check whether the remote folder is still mounted without getting blocked?
  3. How can I check whether the folder exists, also without getting blocked?
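One common workaround for questions 2 and 3 is to run stat() in a detached worker thread and stop waiting after a deadline, so that only the worker gets stuck. The question doesn't say how the Qt thread was waited on; if the caller waited for it with no timeout (for example by joining it), it would block just as before. A minimal sketch in plain C with POSIX threads rather than Qt; path_exists_timed and its three-valued return code are hypothetical names chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* State shared by the caller and the worker. It is freed by whichever
   side drops the last reference, so the caller can return early while
   the worker is still stuck inside stat(). */
struct stat_job {
    pthread_mutex_t lock;
    pthread_cond_t done_cv;
    int done;    /* set to 1 once stat() has returned */
    int result;  /* return value of stat() */
    int refs;    /* 2 while both caller and worker hold the job */
    char path[]; /* private copy of the path */
};

static void job_release(struct stat_job *job) {
    pthread_mutex_lock(&job->lock);
    int left = --job->refs;
    pthread_mutex_unlock(&job->lock);
    if (left == 0) {
        pthread_cond_destroy(&job->done_cv);
        pthread_mutex_destroy(&job->lock);
        free(job);
    }
}

static void *stat_worker(void *arg) {
    struct stat_job *job = arg;
    struct stat st;
    int rc = stat(job->path, &st); /* may block for a long time on a dead CIFS mount */
    pthread_mutex_lock(&job->lock);
    job->result = rc;
    job->done = 1;
    pthread_cond_signal(&job->done_cv);
    pthread_mutex_unlock(&job->lock);
    job_release(job);
    return NULL;
}

/* Hypothetical helper: 1 if stat(path) succeeded, 0 if it failed
   (e.g. ENOENT), -1 if it did not return within timeout_sec seconds
   or the worker could not be started. */
int path_exists_timed(const char *path, int timeout_sec) {
    size_t len = strlen(path);
    struct stat_job *job = malloc(sizeof *job + len + 1);
    if (job == NULL)
        return -1;
    memcpy(job->path, path, len + 1);
    pthread_mutex_init(&job->lock, NULL);
    pthread_cond_init(&job->done_cv, NULL);
    job->done = 0;
    job->result = -1;
    job->refs = 2;

    pthread_t tid;
    if (pthread_create(&tid, NULL, stat_worker, job) != 0) {
        pthread_cond_destroy(&job->done_cv);
        pthread_mutex_destroy(&job->lock);
        free(job);
        return -1;
    }
    pthread_detach(tid); /* never joined, so the caller cannot get stuck on it */

    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline); /* default clock of pthread_cond_timedwait */
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&job->lock);
    int wait_rc = 0;
    while (!job->done && wait_rc == 0) /* 0 also covers spurious wakeups */
        wait_rc = pthread_cond_timedwait(&job->done_cv, &job->lock, &deadline);
    int answer = job->done ? (job->result == 0 ? 1 : 0) : -1;
    pthread_mutex_unlock(&job->lock);
    job_release(job);
    return answer;
}
```

For example, a result of -1 from path_exists_timed("/media/net", 5) could be treated as "share unreachable". The caveat is that the abandoned worker stays blocked until the kernel gives up on the server, each timed-out call leaves one such thread behind, and the process may not fully terminate while one is still stuck.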

Answer 1:


Your problem, "an unreachable network filesystem", is a well-known example of something that triggers a Linux hung task, which is not the same thing as a zombie process at all (killing the parent PID won't do anything).

A hung task is a task that made a system call which ran into a problem in the kernel, so that the system call never returns. Its distinguishing feature is that the scheduler puts the task in the "D" state, meaning the program is in uninterruptible sleep. This means you cannot do anything to stop the program: you can send it any signal and it will not respond. Sending hundreds of SIGTERM/SIGKILL signals does nothing!
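The "D" state can be observed from user space, for example in the STAT column of ps or as the third field of /proc/<pid>/stat. A small sketch in C, assuming Linux procfs; the helper names are hypothetical. The command name in the second field is wrapped in parentheses and may itself contain spaces or ')', so the parser looks for the last ')'.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: given the contents of /proc/<pid>/stat, return
   the one-letter scheduler state (R, S, D, Z, ...) or '?' if the line
   is malformed. */
char proc_state_from_stat_line(const char *line) {
    const char *close = strrchr(line, ')'); /* end of the "(comm)" field */
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '?';
    return close[2];
}

/* Hypothetical helper: read /proc/<pid>/stat and return the state,
   or '?' if the file cannot be read. */
char proc_state(pid_t pid) {
    char path[64];
    char buf[512];
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return '?';
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return proc_state_from_stat_line(buf);
}
```

A task that keeps reporting 'D' while the share is unreachable is in the situation described here.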

This was the case with my old kernel: when my NFS server crashed, I had to reboot the client to kill the tasks using the filesystem. I compiled that kernel a long time ago (I still have the build tree on my disk), and during configuration I saw this in lib/Kconfig.debug:

config DETECT_HUNG_TASK
    bool "Detect Hung Tasks"
    depends on DEBUG_KERNEL
    default LOCKUP_DETECTOR
    help
      Say Y here to enable the kernel to detect "hung tasks",
      which are bugs that cause the task to be stuck in
      uninterruptible "D" state indefinitely.

      When a hung task is detected, the kernel will print the
      current stack trace (which you should report), but the
      task will stay in uninterruptible state. If lockdep is
      enabled then all held locks will also be reported. This
      feature has negligible overhead.

It only offered to detect such tasks, or to panic when one is detected. I haven't checked whether recent kernels can actually solve the problem (your question suggests the problem is still there), but I didn't think it was worth enabling.

There is a second problem: by default, detection only happens after 120 seconds, though I also saw a Kconfig option for this:

config DEFAULT_HUNG_TASK_TIMEOUT
    int "Default timeout for hung task detection (in seconds)"
    depends on DETECT_HUNG_TASK
    default 120
    help
      This option controls the default timeout (in seconds) used
      to determine when a task has become non-responsive and should
      be considered hung.

      It can be adjusted at runtime via the kernel.hung_task_timeout_secs
      sysctl or by writing a value to
      /proc/sys/kernel/hung_task_timeout_secs.

      A timeout of 0 disables the check.  The default is two minutes.
      Keeping the default should be fine in most cases.
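As the help text says, the timeout can also be changed at runtime through /proc/sys/kernel/hung_task_timeout_secs (the file only exists on kernels built with DETECT_HUNG_TASK, and writing it needs root). A minimal sketch in C; read_sysctl_long and write_sysctl_long are hypothetical helpers that take the path as a parameter so they can also be tried on an ordinary file. Changing the value only changes when the kernel reports a hung task; as the first help text says, the task itself stays in the uninterruptible state.

```c
#include <stdio.h>

/* Hypothetical helper: read one integer from a sysctl-style file such
   as /proc/sys/kernel/hung_task_timeout_secs. Returns -1 on error. */
long read_sysctl_long(const char *path) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    long value;
    int ok = fscanf(f, "%ld", &value) == 1;
    fclose(f);
    return ok ? value : -1;
}

/* Hypothetical helper: write one integer to a sysctl-style file.
   Returns 0 on success, -1 on error (e.g. not root). For /proc/sys
   files the kernel may only report errors at fclose(), so check it. */
int write_sysctl_long(const char *path, long value) {
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%ld\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, read_sysctl_long("/proc/sys/kernel/hung_task_timeout_secs") would return 120 on a kernel that kept the default, and writing 0 disables the check.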

This also applies to kernel threads. For example, set up a loop device backed by a file on a FUSE filesystem, then crash the userspace program serving that FUSE filesystem: you should get a kernel thread named loopX (where X normally matches your loop device number) that is hung!

Web links:

https://unix.stackexchange.com/questions/5642/what-if-kill-9-does-not-work (look at the answer written by ultrasawblade)

http://www.linuxquestions.org/questions/linux-general-1/kill-a-hung-task-when-kill-9-doesn't-help-697305/

http://forums-web2.gentoo.org/viewtopic-t-811557-start-0.html

http://comments.gmane.org/gmane.linux.kernel/1189978

http://comments.gmane.org/gmane.linux.kernel.cifs/7674 (This is a case similar to yours)

As for your three questions, you already have the answer: the hang is most likely caused by a well-known bug in the Linux kernel's VFS layer! (There are no CIFS timeouts.)



Source: https://stackoverflow.com/questions/18085868/attempt-to-access-remote-folder-mounted-with-cifs-hangs-when-disconnected
