Highly Concurrent Apache Async HTTP Client IOReactor issues


Question


Application description :

  • I'm using Apache HTTP Async Client (version 4.1.1) wrapped by Comsat's Quasar FiberHttpClient (version 0.7.0) to run a highly concurrent Java application that uses fibers to internally send HTTP requests to multiple HTTP endpoints
  • The application runs on top of Tomcat (however, fibers are used only for internal request dispatching; Tomcat servlet requests are still handled in the standard blocking way)
  • Each external request opens 15-20 fibers internally; each fiber builds an HTTP request and uses the FiberHttpClient to dispatch it (a rough sketch of this dispatch pattern appears after this list)
  • I'm using a c4.4xlarge server (16 cores) to test my application
  • The endpoints I'm connecting to terminate keep-alive connections preemptively, meaning that if I try to reuse sockets, connections get closed during request execution attempts. Therefore, I disable connection recycling.
  • Given the points above, here's the tuning for my fiber HTTP client (of which I'm of course using a single instance):

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.NoConnectionReuseStrategy;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager;
    import org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor;
    import org.apache.http.impl.nio.reactor.IOReactorConfig;
    // plus the FiberHttpClientBuilder import from comsat-httpclient
    
    // One I/O reactor thread per core; keep-alive, linger and address reuse disabled
    PoolingNHttpClientConnectionManager connectionManager =
            new PoolingNHttpClientConnectionManager(
                    new DefaultConnectingIOReactor(
                            IOReactorConfig.custom()
                                    .setIoThreadCount(16)
                                    .setSoKeepAlive(false)
                                    .setSoLinger(0)
                                    .setSoReuseAddress(false)
                                    .setSelectInterval(10)
                                    .build()));
    
    // Pool limits set far above the expected concurrency
    connectionManager.setDefaultMaxPerRoute(32768);
    connectionManager.setMaxTotal(131072);
    
    // Connection reuse disabled because the endpoints close keep-alive connections
    CloseableHttpClient client = FiberHttpClientBuilder
            .create()
            .setDefaultRequestConfig(
                    RequestConfig.custom()
                            .setSocketTimeout(1500)
                            .setConnectTimeout(1000)
                            .build())
            .setConnectionReuseStrategy(NoConnectionReuseStrategy.INSTANCE)
            .setConnectionManager(connectionManager)
            .build();
    
  • ulimits for open files are set very high (131072 for both soft and hard limits)
  • Eden is set to 18 GB; total heap size is 24 GB
  • The OS TCP stack is also well tuned:

    kernel.printk = 8 4 1 7
    kernel.printk_ratelimit_burst = 10
    kernel.printk_ratelimit = 5
    net.ipv4.ip_local_port_range = 8192 65535
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.core.rmem_default = 16777216
    net.core.wmem_default = 16777216
    net.core.optmem_max = 40960
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    net.core.netdev_max_backlog = 100000
    net.ipv4.tcp_max_syn_backlog = 100000
    net.ipv4.tcp_max_tw_buckets = 2000000
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.tcp_fin_timeout = 10
    net.ipv4.tcp_slow_start_after_idle = 0
    net.ipv4.tcp_sack = 0
    net.ipv4.tcp_timestamps = 1
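
For context, here is a rough sketch of the per-fiber dispatch pattern described above. It is a minimal illustration, assuming Comsat's FiberHttpClient exposes the standard blocking HttpClient interface; `client` is the instance built above, while `endpointUrls` and the plain GET request are placeholders rather than the actual application code.

    import java.io.IOException;
    import co.paralleluniverse.fibers.Fiber;
    import co.paralleluniverse.strands.SuspendableRunnable;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.util.EntityUtils;
    
    // 'endpointUrls' stands in for the 15-20 endpoints hit per external request
    for (String url : endpointUrls) {
        new Fiber<Void>((SuspendableRunnable) () -> {
            HttpGet request = new HttpGet(url);   // hypothetical GET; real requests may differ
            try {
                HttpResponse response = client.execute(request);
                // consume the entity so the connection is released back to the pool
                EntityUtils.consumeQuietly(response.getEntity());
            } catch (IOException e) {
                // log/handle the failure for this endpoint
            }
        }).start();
    }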

Problem description

  • Under low to medium load all is well: connections are leased, closed, and the pool replenishes
  • Beyond some concurrency point, the I/O reactor threads (16 of them) seem to stop functioning properly, prior to dying.
  • I've written a small thread that fetches the pool stats and prints them every second (see the sketch after this list). At around 25K leased connections, actual data is no longer sent over the socket connections, and the Pending stat climbs to a sky-rocketing 30K pending connection requests
  • This situation persists and basically renders the application useless. At some point the I/O reactor threads die; I'm not sure when, and I haven't been able to catch the exceptions so far
  • lsof-ing the Java process, I can see it has tens of thousands of file descriptors, almost all of them in CLOSE_WAIT (which makes sense, as the I/O reactor threads die/stop functioning and never get around to actually closing them)
  • During the time the application breaks, the server is not heavily overloaded or CPU-stressed
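
For reference, the stats-printing thread mentioned above can be as simple as the following sketch, which polls the connection manager once per second; the scheduling and output format here are my own, not taken from the question.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.http.pool.PoolStats;
    
    // 'connectionManager' is the PoolingNHttpClientConnectionManager configured above
    ScheduledExecutorService statsReporter = Executors.newSingleThreadScheduledExecutor();
    statsReporter.scheduleAtFixedRate(() -> {
        PoolStats stats = connectionManager.getTotalStats();
        System.out.printf("leased=%d pending=%d available=%d max=%d%n",
                stats.getLeased(), stats.getPending(),
                stats.getAvailable(), stats.getMax());
    }, 0, 1, TimeUnit.SECONDS);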

Questions

  • I'm guessing I am reaching some sort of boundary somewhere, though I'm rather clueless as to what or where it may reside, except for the following
  • Is it possible I'm reaching an OS port limit (all applicative requests originate from a single internal IP, after all), and that this produces an error that causes the I/O reactor threads to die (something similar to open-file limit errors)?

Answer 1:


Forgot to answer this, but I figured out what was going on roughly a week after posting the question:

  1. There was some sort of misconfiguration that caused the I/O reactor to spawn with only 2 threads.

  2. Even after providing more reactor threads, the issue persisted. It turns out that our outgoing requests were mostly SSL. Apache's SSL connection handling delegates the core work to the JVM's SSL facilities, which simply are not efficient enough to handle thousands of SSL connection requests per second. Being more specific, some methods inside SSLEngine (if I recall correctly) are synchronized. Taking thread dumps under high load shows the IOReactor threads blocking each other while trying to open SSL connections.

  3. Even trying to create a pressure-release valve in the form of a connection lease timeout didn't work, because the backlogs created were too large, rendering the application useless (a sketch of such a timeout appears after this list).

  4. Offloading outgoing SSL request handling to nginx performed even worse: because the remote endpoints terminate the connections preemptively, SSL client session caching could not be used (the same goes for the JVM implementation).
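
For reference, here is a minimal sketch of what the lease-timeout attempt from point 3 might look like, assuming it was applied via connectionRequestTimeout (the maximum time a request waits to lease a connection from the pool); the 200 ms value is illustrative, not the one actually used.

    import org.apache.http.client.config.RequestConfig;
    
    // Fail fast when the pool cannot lease a connection quickly enough,
    // instead of letting requests queue up indefinitely
    RequestConfig requestConfig = RequestConfig.custom()
            .setConnectionRequestTimeout(200)   // lease timeout in ms (illustrative value)
            .setConnectTimeout(1000)
            .setSocketTimeout(1500)
            .build();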

I wound up putting a semaphore in front of the entire module, limiting the number of in-flight requests to ~6000 at any given moment, which solved the issue. A minimal sketch follows.
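
The sketch below assumes a plain java.util.concurrent.Semaphore wrapped around the dispatch call (in a Quasar fiber context a fiber-blocking equivalent would be preferable); the field and method names are illustrative, and `client` is the single HTTP client instance from the question.

    import java.io.IOException;
    import java.util.concurrent.Semaphore;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.methods.HttpUriRequest;
    import org.apache.http.util.EntityUtils;
    
    // Caps the number of in-flight outgoing requests for the whole module at ~6000
    private static final Semaphore OUTGOING_LIMIT = new Semaphore(6000);
    
    void dispatch(HttpUriRequest request) throws IOException, InterruptedException {
        OUTGOING_LIMIT.acquire();                  // block until a slot is free
        try {
            HttpResponse response = client.execute(request);
            EntityUtils.consumeQuietly(response.getEntity());
        } finally {
            OUTGOING_LIMIT.release();              // always return the slot
        }
    }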



Source: https://stackoverflow.com/questions/40180877/highly-concurrent-apache-async-http-client-ioreactor-issues
