Question
I have started seeing this issue in the last couple of days: the Ganglia gmetad process gets terminated within 5 minutes of starting with SIGSEGV (segfault).
This had been stable for the last few months, so I'm not sure what changed.
Version - gmetad 3.7.1
I don't see any core dump or anything specific to gmetad in /var/log/messages or /var/log/secure either.
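If no core dump is being written, core files may simply be disabled. A minimal sketch for enabling them so the next SIGSEGV leaves something to debug (paths are typical Linux defaults and should be treated as assumptions for your system):

```shell
# Lift the per-process core size limit for processes started from this shell
ulimit -c unlimited
# Show where the kernel writes core files (a bare pattern like "core" means
# the process's working directory; a leading "|" means it pipes to a handler)
cat /proc/sys/kernel/core_pattern
# After the next crash, a backtrace can be pulled with gdb, e.g.:
#   gdb /usr/sbin/gmetad /path/to/core -batch -ex bt
# (on systemd hosts that pipe to systemd-coredump, `coredumpctl` is the tool)
```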
System snapshot (from top) at the time of this event:
load average: 1.97, 0.99, 0.42
Memory also looks fairly OK:
free -m
                    total   used   free  shared  buffers  cached
Mem:                 7989   3624   4364       0      333    2562
-/+ buffers/cache:           728   7260
Swap:                4095      0   4095
I have a supervisord process that forks and watches gmetad.
Here is the supervisor log:
2016-10-20 14:34:55,707 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:55,707 INFO received SIGCLD indicating a child quit
2016-10-20 14:34:57,712 INFO spawned: 'gmetad' with pid 24561
2016-10-20 14:34:59,929 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:59,929 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:02,932 INFO spawned: 'gmetad' with pid 24593
2016-10-20 14:35:04,897 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:04,897 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:08,903 INFO spawned: 'gmetad' with pid 24618
2016-10-20 14:35:11,257 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:11,257 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:12,257 INFO gave up: gmetad entered FATAL state, too many start retries too quickly
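The FATAL line is supervisord giving up after its retry budget is exhausted. A hedged sketch of the kind of program section that produces this behavior (the command path and values here are assumptions, not taken from the question):

```ini
[program:gmetad]
command=/usr/sbin/gmetad -d 1   ; assumed path; debug>0 keeps gmetad in the foreground
autorestart=unexpected          ; respawn on unexpected exits (SIGSEGV qualifies)
startsecs=5                     ; must stay up this long to count as "started"
startretries=3                  ; after this many rapid failures -> FATAL state
```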
Has anyone faced this kind of issue with gmetad in particular? I'd appreciate any pointers.
Answer 1:
I was able to identify and resolve the issue.
Some key steps/findings:
- Set 'debug_level' to a value greater than 1 in gmetad.conf to run gmetad in the foreground and have it emit verbose logging about what it is doing.
- I found that the gmetad process was getting killed at the exact same point every time: while trying to process a file for a particular node of a particular data_source.
- You can comment out all the other 'data_source' lines in gmetad.conf to isolate which data_source/node is problematic.
- After identifying the problematic node, I just deleted /path/to/rrd/node_dir/file_with_issue, or the entire directory itself. (A better way is needed, as this means data loss.)
- Change debug_level back and restart gmetad!
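A safer variant of the deletion step above is to quarantine the suspect RRD instead of removing it, so it can be restored if it turns out to be fine. This is a sketch; `rrd_quarantine` and all paths here are hypothetical, not part of Ganglia:

```shell
# Move a suspect RRD out of gmetad's tree instead of deleting it.
rrd_quarantine() {
  local rrd="$1" backup_dir="$2"
  mkdir -p "$backup_dir"
  mv "$rrd" "$backup_dir/"      # move, not rm: reversible if we were wrong
}

# Demonstration with a scratch file standing in for the real RRD path:
tmp=$(mktemp -d)
touch "$tmp/part_max_used.rrd"
rrd_quarantine "$tmp/part_max_used.rrd" "$tmp/quarantine"
ls "$tmp/quarantine"            # the file is now out of gmetad's way
```

After quarantining, restart gmetad via the supervisor and watch whether the crash recurs; if it does not, the quarantined file was the trigger.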
In my case, to pinpoint it: a file named 'part_max_used.rrd' under /path/to/ganglia/rrds/node_name was the root cause of the SIGSEGV.
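One way to flag a damaged file like this before deleting anything is `rrdtool info`, which dumps an RRD's header and exits non-zero on a file it cannot parse. A hedged sketch (the helper name and directory argument are assumptions; this only flags suspects, it does not prove the crash cause):

```shell
# Print the path of every RRD in a directory that rrdtool cannot read.
check_rrds() {
  local dir="$1" f
  for f in "$dir"/*.rrd; do
    [ -e "$f" ] || continue                     # skip if the glob matched nothing
    if ! rrdtool info "$f" >/dev/null 2>&1; then
      echo "suspect: $f"
    fi
  done
}
# Usage (hypothetical path):
#   check_rrds /path/to/ganglia/rrds/node_name
```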
Hope this helps :-)
Source: https://stackoverflow.com/questions/40162219/ganglia-gmetad-process-is-getting-terminated-by-sigsegv