How to handle a Thread Issue in ZeroMQ + Ruby?

Asked 2020-12-10 16:02

Stumbled upon this while reading the ZeroMQ FAQ about thread safety:

My multi-threaded program keeps crashing in weird places inside the ZeroMQ library. What am I doing wrong?

2 Answers
  • 2020-12-10 16:12

    No one ought to risk application robustness by putting it on thin ice

    Forgive this story being a rather long read, but the author's life-long experience shows that the reasons why are far more important than any few SLOCs of ( potentially doubtful, mystically-looking or root-cause-ignorant ) attempts to experimentally find out how.

    Initial note

    While ZeroMQ has for many years been promoted on a Zero-Sharing philosophy ( plus Zero-Blocking, ( almost )-Zero-Latency and a few more design-maxims; the best place to read about the pros & cons are Pieter HINTJENS' books, not just the fabulous "Code Connected, Volume 1", but also the ones on advanced design & engineering in the real social-domain ), the most recent API documentation has introduced and advertises some features that, IMHO, have a rather relaxed relation to these corner-stone principles of distributed-computing and no longer whistle on Zero-Sharing so loud. This said, I still remain a Zero-Sharing guy, so kindly view the rest of this post in this light.

    Answer 1:
    No, sir. -- or better -- Yes and No, sir.

    ZeroMQ does not ask one to use Mutex/Semaphore barriers; doing so would contradict the ZeroMQ design maxims.

    Yes, recent API changes have started to mention that ( under some additional conditions ) one may start using shared sockets ... with ( many ) additional measures ... so the implication was reversed: if one "wants" to share, one also takes all the additional steps and measures ( and pays all the initially hidden design & implementation costs of "allowing" shared toys to ( hopefully ) survive the principal ( and unnecessary ) battle with the rest of the uncontrollable distributed-system environment -- thus suddenly also bearing a risk of failure, which for many wise reasons was not the case in the initial ZeroMQ Zero-sharing evangelisation ). So the user decides which path to take. That is fair.

    Sound & robust designs IMHO are still better developed as per the initial ZeroMQ API & evangelism, where Zero-sharing was a principle.

    Answer 2:
    There is by design always a principal uncertainty about ZeroMQ data-flow ordering; one of the ZeroMQ design-maxims keeps designers from relying on unsupported assumptions about message ordering and many other things ( exceptions apply ). There is just a certainty that any message dispatched into the ZeroMQ infrastructure is either delivered as a complete message, or not delivered at all. So one can be sure only of the fact that no fragmented wrecks ever appear on delivery. For further details, read below.
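
    As a small, hedged illustration of this all-or-nothing delivery ( a sketch only, using the ffi-rzmq binding rather than the question's Celluloid::ZMQ; the inproc endpoint name and the frame contents are mere placeholders ), the three frames below either land together as one complete message, or not at all:

    require 'ffi-rzmq'

    ctx  = ZMQ::Context.new

    pull = ctx.socket( ZMQ::PULL )
    pull.bind( 'inproc://atomicity-demo' )        # inproc requires bind before connect

    push = ctx.socket( ZMQ::PUSH )
    push.connect( 'inproc://atomicity-demo' )

    # send_strings() dispatches the frames as one logical multipart message --
    # the receiving side gets all of them, or none of them
    push.send_strings( [ 'message_id', 'routing', 'payload' ] )

    frames = []
    pull.recv_strings( frames )                   # blocks until the complete message lands
    puts frames.inspect                           # => ["message_id", "routing", "payload"]

    push.close
    pull.close
    ctx.terminate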


    ThreadId does not prove anything ( unless the inproc transport-class is used )

    Given the internal design of the ZeroMQ data-pumping engines, the instantiation of a
    zmq.Context( number_of_IO_threads ) decides how many threads get spawned for handling the future data-flows. This can be anywhere in { 0, 1: default, 2, .. } up to almost depleting the kernel-fixed max-number-of-threads. The value of 0 is a reasonable choice not to waste resources in cases where the inproc:// transport-class is used, which is actually a direct-memory-region-mapped handling of data-flow ( the data actually never flows and gets nailed down directly into the landing-pad of the receiving socket-abstraction :o) ), so no thread is ever needed for such a job.
    Next to this, <aSocket>.setsockopt( zmq.AFFINITY, <anIoThreadEnumID#> ) permits one to fine-tune the data-related IO-"hydraulics", so as to prioritise, load-balance and performance-tweak the thread-loads onto the enumerated pool of the zmq.Context()-instance's IO-threads, and to gain from better and best settings in the above listed design & data-flow operation aspects.
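
    A minimal, hedged sketch of both knobs in the Ruby ffi-rzmq binding ( the endpoint and the thread count are illustrative assumptions only, not anything from the question ):

    require 'ffi-rzmq'

    # one I/O thread is the default; 0 would do for a pure inproc:// design
    ctx  = ZMQ::Context.new( 1 )

    push = ctx.socket( ZMQ::PUSH )
    # ZMQ_AFFINITY is a bitmask of I/O thread IDs -- the value 1 pins all
    # connections made after this call onto the first I/O thread
    push.setsockopt( ZMQ::AFFINITY, 1 )
    push.connect( 'tcp://127.0.0.1:5555' )        # placeholder endpoint

    push.send_string( 'routed via the pinned I/O thread' )

    push.close
    ctx.terminate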


    The cornerstone element is the Context()'s instance,
    not a Socket()'s one

    Once a Context()'s instance got instantiated and configured ( ref. above why and how ), it is ( almost ) free to be shared ( if the design cannot resist sharing, or has a need to avoid setting up a fully fledged distributed-computing infrastructure ).

    In other words, the brain is always inside the zmq.Context()'s instance -- all the socket-related dFSA-engines are set up / configured / operated there ( yes, even though the syntax is <aSocket>.setsockopt(...), the effect of such a call is implemented inside The Brain -- in the respective zmq.Context -- not in some wire-from-A-to-B ).
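
    A hedged sketch of this Zero-sharing discipline in plain Ruby threads ( ffi-rzmq assumed, the inproc endpoint is a placeholder ): one shared Context ( "The Brain" ), while each thread instantiates and owns its own socket, so nothing socket-shaped ever crosses a thread boundary:

    require 'ffi-rzmq'

    shared_ctx = ZMQ::Context.new( 1 )            # the "brain" -- safe to share

    pull = shared_ctx.socket( ZMQ::PULL )         # owned by the main thread only
    pull.bind( 'inproc://fan-in' )

    workers = 4.times.map do |i|
      Thread.new do
        push = shared_ctx.socket( ZMQ::PUSH )     # each thread owns its own socket
        push.connect( 'inproc://fan-in' )
        push.send_string( "greetings from thread #{i}" )
        push.close
      end
    end

    4.times { pull.recv_string( msg = '' ); puts msg }

    workers.each( &:join )
    pull.close
    shared_ctx.terminate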

    Better never share <aSocket> ( even if API-4.2.2+ promises you could )

    So far, one might have seen a lot of code-snippets where a ZeroMQ Context and its sockets get instantiated and disposed of in a snap, serving just a few SLOCs in a row, but -- this does not mean that such practice is wise or justified by any need other than that of a very academic example ( one that had to get printed in as few SLOCs as possible because of the book publisher's policies ).

    Even in such cases, a fair warning about the indeed immense costs of zmq.Context infrastructure setup / tear-down ought to be present, so as to avoid any generalisation of, let alone any copy/paste replicas of, code that was used short-handedly just for such illustrative purposes.

    Just imagine the realistic setup steps needed to take place for any single Context instance -- getting a pool of the respective dFSA-engines ready, maintaining all their respective configuration setups, plus all the socket-end-point pools' related transport-class-specific hardware + external O/S-service handlers, round-robin event-scanners, buffer-memory-pool allocations + their dynamic allocators etc., etc. This all takes both time and O/S resources, so handle these ( natural ) costs wisely, and with care for adjusted overheads, if performance is not to suffer.

    If still in doubt why this is worth mentioning, just imagine if anybody insisted on tearing down all the LAN-cables right after a packet was sent, and then had to wait until new cabling got installed right before the need to send the next packet appears. Hopefully this "reasonable-instantiation" view can now be better perceived, and serves as an argument for sharing ( if at all ) the zmq.Context()-instance(s), without any further fights for trying to share ZeroMQ socket-instances ( even if they have newly become ( almost ) thread-safe per se ).
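
    A hedged sketch of that contrast ( ffi-rzmq assumed, placeholder endpoints ) -- the first shape pays the full Context + socket setup / tear-down for every single message, the second pays it exactly once:

    require 'ffi-rzmq'

    # the "tear down the LAN-cables after every packet" shape:
    # a full Context + socket setup and tear-down per message
    def send_wasteful( payload )
      ctx  = ZMQ::Context.new
      push = ctx.socket( ZMQ::PUSH )
      push.setsockopt( ZMQ::LINGER, 0 )           # do not hang on terminate
      push.connect( 'ipc:///tmp/data_pipe.ipc' )  # placeholder endpoint
      push.send_string( payload )
      push.close
      ctx.terminate
    end

    # the reasonable shape: pay the infrastructure cost once, reuse it for all traffic
    class Sender
      def initialize( endpoint )
        @ctx  = ZMQ::Context.new
        @push = @ctx.socket( ZMQ::PUSH )
        @push.connect( endpoint )
      end

      def deliver( payload )
        @push.send_string( payload )              # no per-message setup cost
      end

      def close
        @push.close
        @ctx.terminate
      end
    end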

    The ZeroMQ philosophy is robust if taken as an advanced design evangelism for high-performance distributed-computing infrastructures. Tweaking just one ( minor ) aspect typically does not pay off: on the global view of how to design safe and performant systems, the result would not move a single bit for the better if just this one detail got changed ( and even absolutely-shareable, risk-free ( if that were ever possible ) socket-instances would not change this, whereas all the benefits of sound design, clean code and reasonably achievable testability & debugging would get lost ). So rather pull another wire from an existing brain to such a new thread, or equip the new thread with its own brain, which will locally handle its own resources and be allowed to connect its own wires back to all the other brains it needs to communicate with in the distributed-system.

    If still in doubt, try to imagine what would happen to your national olympic hockey-team if it had to share one single hockey-stick during the tournament. Or how you would like it if all the neighbours in your home-town shared the same phone number to answer all the many incoming calls ( yes, with all the phones and mobiles sharing the same number ringing at the same time ). How well would that work?


    Language bindings need not reflect all the API-features available

    Here one can object, and in some cases be correct, that not all ZeroMQ language-bindings or popular framework-wrappers keep all the API-details exposed to the user for application-level programming ( the author of this post has struggled for a long time with such legacy conflicts, which remained unresolvable for exactly this reason, and had to scratch his head a lot to find a feasible way around this fact -- so it is ( almost ) always doable ).


    Epilogue:

    It is fair to note that recent versions of the ZeroMQ API, 4.2.2+, have started to creep away from the initially evangelised principles.

    Nevertheless, it is worth remembering the ancient memento mori

    ( emphases added, capitalisation not )

    Thread safety

    ØMQ has both thread safe socket type and not thread safe socket types. Applications MUST NOT use a not thread safe socket from multiple threads except after migrating a socket from one thread to another with a "full fence" memory barrier.

    Following are the thread safe sockets:
    * ZMQ_CLIENT
    * ZMQ_SERVER
    * ZMQ_DISH
    * ZMQ_RADIO
    * ZMQ_SCATTER
    * ZMQ_GATHER

    While this text might sound to some ears like a promise, calling barriers into service is the worst thing one can do in designing advanced distributed-computing systems where performance is a must.

    The last thing one would like to see is one's own code blocked, as such an agent gets into a principally uncontrollable blocking-state, from which no one can heal it ( neither the agent per se internally, nor anyone from outside ), in case a remote agent never delivers the just-expected event ( which in distributed-systems can happen for so many reasons, or under so many circumstances, that are outside of one's control ).

    Building a system that is prone to hang itself ( with the broad smile of a supported ( but naively employed ) syntax-possibility ) is indeed nothing to be happy about, the less a serious design job.
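
    One hedged way around such self-inflicted blocking ( a sketch only, ffi-rzmq assumed, placeholder endpoint ) is to poll with a finite deadline instead of parking the agent in a blind, infinite receive:

    require 'ffi-rzmq'

    ctx  = ZMQ::Context.new
    pull = ctx.socket( ZMQ::PULL )
    pull.connect( 'ipc:///tmp/data_pipe.ipc' )    # placeholder endpoint

    poller = ZMQ::Poller.new
    poller.register( pull, ZMQ::POLLIN )

    10.times do
      poller.poll( 250 )                          # wait at most ~250 ms, never forever
      if poller.readables.include?( pull )
        pull.recv_string( msg = '' )
        puts "got: #{msg}"
      else
        # nothing arrived in time -- the agent stays in control and decides what next
        puts 'no event within the deadline, carrying on'
      end
    end

    pull.close
    ctx.terminate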

    One would also not be surprised here that many additional ( initially not visible ) restrictions apply down the line of the new moves into using the shared-{ hockey-stick | telephones } API:

    ZMQ_CLIENT sockets are threadsafe. They do not accept the ZMQ_SNDMORE option on sends nor ZMQ_RCVMORE on receives. This limits them to single part data. The intention is to extend the API to allow scatter/gather of multi-part data.

    c/a

    Celluloid::ZMQ does not report any of these new-API ( the-sin-of-sharing-almost-forgiving ) socket types in its section on supported socket types, so no good news is to be expected a priori, and Celluloid::ZMQ master activity seems to have faded out somewhere in 2015, so expectations ought to be somewhat realistic from this corner.

    This said, one interesting point might be found behind a notice:

    before you go building your own distributed Celluloid systems with Celluloid::ZMQ, be sure to give DCell a look and decide if it fits your purposes.


    Last but not least, combining an event-loop system inside another event-loop is a painful job. Trying to integrate an embedded hard-real-time system into another hard-real-time system could even prove, mathematically, to be impossible.

    Similarly, building a multi-agent system using another agent-based component brings additional kinds of collisions and race-conditions, whenever the same resources are harnessed ( be it knowingly or "just" via some functional side-effect ) from both ( multiple ) agent-based frameworks.

    Un-salvageable mutual dead-locks are just one kind of these collisions, introducing initially unseen troubles down the line of unaware design attempts. The very first step outside of a single-agent system design makes one lose many more warranties that went unnoticed before going multi-agent ( distributed ), so an open mind, a readiness to learn many "new" concepts, and concentration on the many new concerns to be carefully watched for and fought against are quite an important prerequisite, so as not to ( unknowingly ) introduce patterns that are actually anti-patterns in the distributed-systems ( multi-agent ) domain.

    At least
    You have been warned
    :o)

  • 2020-12-10 16:30

    This answer isn't a good solution to your problem, and you should definitely go with what user3666197 suggests. I think this solution has the potential to work, but at a larger scale there may also be performance costs due to mutex contention.

    Question 1: Assuming that async spawns a new Thread (every time), and write_socket is shared between all the threads, and ZeroMQ says its sockets are not thread safe. I can certainly see write_socket running into thread safety issues. (Btw, haven't faced this issue in all end-to-end testing thus far.) Is my understanding correct on this?

    From my understanding of the documentation, yes, this could be an issue because the sockets are not thread safe. Even if you are not experiencing the issue, it could pop up later.

    Question 2: Context switching can happen (anywhere) inside (even in a critical section)

    Yes, so one way we could potentially get around this is with a mutex/semaphore to make sure we don't have a context switch happen at the wrong time.

    I would do something like this, but there might be a slightly better approach depending on which of the methods being called are not thread safe:

    Celluloid::ZMQ.init
    module Scp
      module DataStore
        class DataSocket
          include Celluloid::ZMQ

          def initialize
            # one mutex guarding every touch of the non-thread-safe sockets
            @mutex = Mutex.new
          end

          def pull_socket(socket)
            Thread.new do
              @mutex.synchronize do
                @read_socket = Socket::Pull.new.tap do |read_socket|
                  ## IPC socket
                  read_socket.connect(socket)
                end
              end
            end.join
          end

          def push_socket(socket)
            Thread.new do
              @mutex.synchronize do
                @write_socket = Socket::Push.new.tap do |write_socket|
                  ## IPC socket
                  write_socket.connect(socket)
                end
              end
            end.join
          end

          def run
            # Missing socket (endpoint) arguments here, as in the original question
            pull_socket and push_socket and loopify!
          end

          def loopify!
            Thread.new do
              loop {
                # hold the mutex only around the socket read, not the whole loop
                data = @mutex.synchronize { @read_socket.read_multipart }
                async.evaluate_response(data)
              }
            end.join
          end

          def evaluate_response(data)
            # assuming the incoming multipart frames are [message_id, routing, payload]
            message_id, routing, payload = data
            return_response(message_id, routing, Parser.parser(payload))
          end

          def return_response(message_id, routing, object)
            data = object.to_response
            # the write socket is shared across async calls, so guard it too
            @mutex.synchronize { @write_socket.send([message_id, routing, data]) }
          end
        end
      end
    end

    Scp::DataStore::DataSocket.new.run
    