Gearman PHP Extension: Dead Job Server = Slow Response from all Workers

I started with this question: Gearman: 3 seconds between client request and worker receive. Is this normal?

Environment:

Ubuntu 12.04 Desktop
PHP 5.3.10
Gearman (libgearman 1.1.5 with PHP Extension 1.1.1)
Multiple servers on LAN

I couldn't get a worker response time below 3 seconds and couldn't figure out why. I narrowed it down to a wrapper class I'd built, and then further to a specific method within that class. Long story short, the real problem seems to lie in the addServer method of GearmanWorker in the PHP extension.

My wrapper class was attempting to connect to 3 Gearman job servers, of which only 2 are actually up and running. When I attempt to connect to all 3, I get a warning that the 3rd could not be reached, and the worker response time is 3 seconds. When I skip the addServer call for the job server that is currently down, voila, the worker response time drops to about 0.003 seconds.

Now you might ask, why don't you just remove the down server from the list of servers to connect to? Well, first, it won't always be down. Second, what happens when one of the servers that is currently up, or was up 5 minutes ago, no longer is? Wham, all jobs now take a minimum of 3 seconds. I figure there is probably a way to configure that timeout down to 1 second, but a better solution, IMO, would be a way to remove the dead server from the list of servers the worker is trying to get jobs from.

In my research I found an addServer method and an addFunction method, plus an unregister method for removing a function from the list a given worker handles. However, I see no removeServer method.

So, is there a way to cull the list of job servers in GearmanWorker or do I need to kill the object, re-instantiate it, and reconnect to the new, culled, list of available job servers? Killing and restarting the GearmanWorker seems far from ideal.

What is the best way to scan for (and connect to) all active job servers while avoiding the timeout inherent with a job server that has died?

Thanks

It appears that I'm not the only one with this issue, and no one on the Gearman Google group could point to a solution either. In the end I wrote my own code (taking pieces from Gearman Monitor) to determine which job servers were up and running and which weren't.

        // This fragment runs once per job server, inside the loop over $gHosts.
        try {
            // fsockopen() returns FALSE on failure; the @ suppresses the
            // warning it raises when the server cannot be reached.
            $cxn = @fsockopen($ip, $gHosts->ports[$host], $errCode, $errMsg, $timeout);

            /* Using new \Net_Gearman_Manager on a dead job server kept leading to
             * an uncaught fatal error, which crashed the script and left the
             * server status without an update.
             */
            //$gearmanManager = new \Net_Gearman_Manager($ip . ':' . $gHosts->ports[$host], 1);

            if ($cxn === FALSE) {
                write_log($fLog, 'Connection FAILED: ' . $errMsg);
                $output[$host] = FAILURE;
            } else {
                // Only reachability matters here, so release the socket right away.
                fclose($cxn);
                write_log($fLog, 'Connection Succeeded');
                $output[$host] = SUCCESS;
            }
        } catch (Net_Gearman_Exception $e) {
            write_log($fLog, $e->getMessage());
            $output[$host] = FAILURE;
        } catch (Exception $e) {
            write_log($fLog, $e->getMessage());
            $output[$host] = FAILURE;
        }

$gHosts is an instance of a configuration class that holds the IPs and ports for each of my potential Gearman job servers. I spin through each potential job server in $gHosts and test it.
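For anyone who wants the whole loop rather than the fragment, here is a minimal, self-contained sketch of the same idea. The function name probe_job_servers and the shape of the $servers array are my own placeholders, not the actual $gHosts class, and the 0.5 second default timeout is just an assumption you should tune.

```php
<?php
/**
 * Probe each job server with a short TCP connect and report which ones answer.
 *
 * $servers is a hypothetical map: name => array('ip' => ..., 'port' => ...).
 * Returns a map: name => true (reachable) or false (unreachable).
 */
function probe_job_servers(array $servers, $timeout = 0.5)
{
    $status = array();
    foreach ($servers as $name => $server) {
        $errCode = 0;
        $errMsg  = '';
        // @ suppresses the warning fsockopen() raises when the connect fails.
        $cxn = @fsockopen($server['ip'], $server['port'], $errCode, $errMsg, $timeout);
        if ($cxn === false) {
            $status[$name] = false;
        } else {
            // Only reachability matters, so release the socket at once.
            fclose($cxn);
            $status[$name] = true;
        }
    }
    return $status;
}
```

Note that this only tells you the port accepts TCP connections, not that gearmand is healthy behind it, which was good enough for my purposes.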

I then write the output from this to memcache and a text file. Memcache alone worked fine until I started putting real load on the machine; then the memcache connection would repeatedly fail. Now I use the text file as a backup and the problems have disappeared.

I store the last attempt to connect to each Gearman Job Server in an array where the key is the server's name and the value is the time stamp of the last attempt. If the attempt was successful the time stamp is positive. If the attempt failed the time stamp is negative. The time stamps allow me to determine if the data is stale or fresh.
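The signed time stamp scheme can be sketched roughly like this. The function names and the $maxAge parameter (how many seconds a result counts as fresh) are illustrative assumptions, not the names from my real code.

```php
<?php
// Record a probe result: positive time stamp = success, negative = failure.
function encode_attempt($succeeded, $now)
{
    return $succeeded ? $now : -$now;
}

// The entry is stale, whatever its sign, once it is older than $maxAge seconds.
function is_stale($stamp, $now, $maxAge)
{
    return ($now - abs($stamp)) > $maxAge;
}

// A server is usable only if its last attempt succeeded and is still fresh.
function is_usable($stamp, $now, $maxAge)
{
    return $stamp > 0 && !is_stale($stamp, $now, $maxAge);
}

// Example array keyed by server name, as described above.
$now = time();
$lastAttempt = array(
    'gm1' => encode_attempt(true, $now),
    'gm2' => encode_attempt(false, $now),
);
```

A single integer per server carries both the outcome and its age, so the status array stays small enough to drop into memcache or a text file as-is.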

Then in the scripts that use Gearman I have a Client and Worker wrapper class around the PHP extension classes. They handle updating the connections on the time frame I want automatically. That way Gearman Job Servers that stop responding stop being used and the script, while potentially slow for a short period of time, typically runs quite fast.
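As a rough sketch of what the wrapper side can look like: since there is no removeServer, one option is to build a fresh worker from only the servers whose last probe succeeded. The helper names, the $servers layout, and the example IPs below are placeholders I made up for illustration; addServers() with a comma-separated host:port list is the real extension method.

```php
<?php
/**
 * Build the "host:port,host:port" string accepted by
 * GearmanWorker::addServers(), keeping only servers whose last probe
 * succeeded (positive time stamp) within the last $maxAge seconds.
 */
function live_server_list(array $servers, array $lastAttempt, $now, $maxAge)
{
    $live = array();
    foreach ($servers as $name => $server) {
        $stamp = isset($lastAttempt[$name]) ? $lastAttempt[$name] : 0;
        if ($stamp > 0 && ($now - $stamp) <= $maxAge) {
            $live[] = $server['ip'] . ':' . $server['port'];
        }
    }
    return implode(',', $live);
}

/**
 * Create a worker connected to the live servers only. Returns null when no
 * server is live or the gearman extension is not loaded.
 */
function build_worker($serverList)
{
    if ($serverList === '' || !class_exists('GearmanWorker')) {
        return null;
    }
    $worker = new GearmanWorker();
    $worker->addServers($serverList);
    return $worker;
}

// Hypothetical configuration and probe results (gm3 failed its last probe).
$servers = array(
    'gm1' => array('ip' => '192.0.2.10', 'port' => 4730),
    'gm2' => array('ip' => '192.0.2.11', 'port' => 4730),
    'gm3' => array('ip' => '192.0.2.12', 'port' => 4730),
);
$lastAttempt = array('gm1' => 1000, 'gm2' => 1000, 'gm3' => -1000);
$list = live_server_list($servers, $lastAttempt, 1010, 30);
```

Calling build_worker($list) whenever the status data has been refreshed gives you a worker that never waits on the dead server; you still have to re-register your functions with addFunction on the new object.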

Hope this helps someone out there.

Source: https://stackoverflow.com/questions/19037958/gearman-php-extension-dead-job-server-slow-response-from-all-workers

Tags

php

gearman