infiniband

How to: Azure OpenMPI with Infiniband - Linux

不想你离开。 posted on 2020-01-04 02:02:23
Question: I am new to using Microsoft Azure for scientific computing and have encountered a few issues while setting up. I have a jump box set up that acts as a license server for the software that I wish to use; it also has a common drive to store all of the software. 6 compute nodes are also set up (16 cores/node), and I can 'ssh' from the jump box to the compute nodes without issue. The jump box and compute nodes are using CentOS with OpenMPI 1.10.3. I have created a script that is stored on
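For a layout like the one described (6 compute nodes with 16 cores each), Open MPI is usually told about the nodes through a hostfile with one `host slots=N` line per node. The following is a minimal sketch of generating such a file; the node names `compute1` to `compute6` are hypothetical, since the question does not give the real ones.

```python
def build_hostfile(hostnames, slots_per_node):
    """Return Open MPI hostfile text: one 'host slots=N' line per node."""
    lines = [f"{host} slots={slots_per_node}" for host in hostnames]
    return "\n".join(lines) + "\n"


# Hypothetical node names (assumption); 6 nodes and 16 cores/node come from the question.
nodes = [f"compute{i}" for i in range(1, 7)]
hostfile_text = build_hostfile(nodes, 16)
total_slots = len(nodes) * 16

print(hostfile_text, end="")
print(f"total slots: {total_slots}")
```

Saved to a file, this could then be passed to `mpirun` with `--hostfile <file> -np <count>`; the file name and the program being launched are left open here because the original post is cut off before describing them.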

Hadoop: File … could only be replicated to 0 nodes, instead of 1

匆匆过客 posted on 2019-12-25 06:37:30
Question: I am trying to deploy Hadoop-RDMA on an 8-node IB (OFED-1.5.3-4.0.42) cluster and ran into the following problem (a.k.a. File ... could only be replicated to 0 nodes, instead of 1): frolo@A11:~/hadoop-rdma-0.9.8> ./bin/hadoop dfs -copyFromLocal ../pg132.txt /user/frolo/input/pg132.txt Warning: $HADOOP_HOME is deprecated. 14/02/05 19:06:30 WARN hdfs.DFSClient: DataStreamer Exception: java.lang.reflect.UndeclaredThrowableException at com.sun.proxy.$Proxy1.addBlock(Unknown Source) at sun.reflect

Java Sockets on RDMA (JSOR) vs jVerbs performance in Infiniband

一个人想着一个人 posted on 2019-12-18 18:34:39
Question: I have a basic understanding of both JSOR and jVerbs. Both handle the limitations of JNI and use a fast path to reduce latency. Both of them use the user Verbs RDMA interface to avoid context switches and provide fast-path access. Both also have options for zero-copy transfer. The difference is that JSOR still uses the Java Socket interface, while jVerbs provides a new interface. jVerbs also has something called Stateful Verbs Call to avoid repeated serialization of RDMA requests, which they say reduces

MPI_SEND takes huge part of virtual memory

北慕城南 posted on 2019-12-18 16:45:26
Question: While debugging my program on a large number of cores, I ran into a very strange error of insufficient virtual memory. My investigation led to a piece of code where the master sends small messages to each slave. I then wrote a small program in which 1 master simply sends 10 integers with MPI_SEND and all slaves receive them with MPI_RECV. Comparing the file /proc/self/status before and after MPI_SEND showed that the difference between the memory sizes is huge! The most interesting thing (which crashes my program)
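The before/after comparison of /proc/self/status described above can be sketched as follows. This is only an illustration of the measurement, not the asker's program: the helper names are made up for this example, and the sample status text uses invented numbers rather than real measurements.

```python
import re

# Matches lines such as "VmSize:\t  150000 kB" from /proc/<pid>/status.
_VM_LINE = re.compile(r"^(Vm\w+):\s+(\d+)\s+kB$")


def parse_vm_fields(status_text):
    """Extract the Vm* fields (reported in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        match = _VM_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def vm_delta_kb(before_text, after_text):
    """Return after - before for every Vm* field present in both snapshots."""
    before = parse_vm_fields(before_text)
    after = parse_vm_fields(after_text)
    return {name: after[name] - before[name] for name in after if name in before}


def read_self_status():
    """Read the current process's status file (Linux only)."""
    with open("/proc/self/status") as handle:
        return handle.read()


# Invented sample snapshots (assumption), standing in for reads taken
# immediately before and after the MPI_SEND call.
sample_before = "Name:\tdemo\nVmPeak:\t  200000 kB\nVmSize:\t  150000 kB\nVmRSS:\t   12000 kB\n"
sample_after = "Name:\tdemo\nVmPeak:\t  900000 kB\nVmSize:\t  850000 kB\nVmRSS:\t   14000 kB\n"

for name, delta in vm_delta_kb(sample_before, sample_after).items():
    print(f"{name}: {delta:+d} kB")
```

In a real run, `read_self_status()` would be called on each rank around the send so that the growth in `VmSize` can be separated from the growth in `VmRSS`.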

How to use GPUDirect RDMA with Infiniband

若如初见. posted on 2019-12-18 06:55:16
Question: I have two machines. There are multiple Tesla cards on each machine. There is also an InfiniBand card on each machine. I want to communicate between GPU cards on different machines through InfiniBand. Just point-to-point unicast would be fine. I definitely want to use GPUDirect RDMA so I can spare myself extra copy operations. I am aware that there is a driver available now from Mellanox for its InfiniBand cards, but it doesn't come with a detailed development guide. I am also aware that OpenMPI

Infiniband in Java

时光总嘲笑我的痴心妄想 posted on 2019-12-10 15:59:23
Question: As you all know, OFED's Sockets Direct Protocol (SDP) is deprecated, and OFED's 3.x releases do not come with SDP at all. Hence, Java's SDP support also fails to work. I was wondering what the proper method is to program InfiniBand in Java. Is there any portable solution other than just writing JNI code? My requirement is to achieve RDMA among a collection of InfiniBand-powered machines. Answer 1: jVerbs might be what you're looking for. Here's a little bit of documentation. Answer 2: jVerbs looks interesting otherwise you

RDMA API for Linux kernel

删除回忆录丶 posted on 2019-12-08 07:08:44
Question: Is there an API for RDMA (InfiniBand) that can be used in kernel space? Most of the APIs that I have found are for user space. kDAPL and kAL can be used in the Linux kernel; however, I have not yet found sample code that uses these APIs. Can somebody help me with sample code for RDMA in kernel space? Answer 1: You can check the "krping" test; it is just what you need. It uses RDMA-CM to establish a connection and run some RDMA traffic. Download it from the OpenFabrics website. Source: https://stackoverflow.com

RDMA program randomly hangs

我的未来我决定 posted on 2019-12-06 11:19:00
Question: Is there anyone out there who has done RDMA programming using the RDMA_CM library? I'm having a hard time finding even simple examples to study. There's an rdma_client & rdma_server example in librdmacm, but it doesn't run in a loop (rping does loop, but it's written using IB verbs directly instead of rdma_cm functions). I've put together a trivial ping-pong program, but it locks up anywhere after 1 - 100 bounces. I found that adding a sleep inside the client makes it work longer before hanging, which

Infiniband addressing - host names to IB address without IBoIP

风格不统一 posted on 2019-12-06 01:57:13
Question: I've just started getting familiar with InfiniBand and I want to understand the methods you can use to address InfiniBand nodes. Based on the code in the example from "RDMA read and write with IB verbs", I can address individual nodes by IP or hostname using IPoIB. Another way is to use a port GUID address directly, but it looks like you'd have to look those up, and it is more similar to Ethernet MAC addressing. Then there is something called an LID address, a 16-bit local address assigned
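To make the address sizes mentioned above concrete, here is a small sketch of how the identifiers relate: a port GUID is a 64-bit value, an LID is a 16-bit value, and a port's default GID is the 128-bit combination of the link-local subnet prefix fe80::/64 with the port GUID. The GUID used below is a hypothetical value chosen for illustration, and the unicast LID range (0x0001 to 0xBFFF) is stated from the InfiniBand specification rather than from the question.

```python
import ipaddress


def format_guid(guid):
    """Format a 64-bit port GUID as four colon-separated 16-bit hex groups."""
    if not 0 <= guid < 2**64:
        raise ValueError("a port GUID must fit in 64 bits")
    raw = f"{guid:016x}"
    return ":".join(raw[i:i + 4] for i in range(0, 16, 4))


def link_local_gid(guid):
    """Build the default 128-bit GID: prefix fe80::/64 followed by the port GUID."""
    if not 0 <= guid < 2**64:
        raise ValueError("a port GUID must fit in 64 bits")
    prefix = 0xFE80 << 112
    return ipaddress.IPv6Address(prefix | guid)


def is_unicast_lid(lid):
    """Check whether a 16-bit LID falls in the unicast range 0x0001-0xBFFF."""
    return 0x0001 <= lid <= 0xBFFF


# Hypothetical port GUID (assumption), not taken from any real device.
example_guid = 0x0002C90300A1B2C3

print(format_guid(example_guid))
print(link_local_gid(example_guid))
print(is_unicast_lid(0x0004), is_unicast_lid(0xC000))
```

The GID is printed in IPv6 notation because it has the same 128-bit layout, which is why tools commonly display GIDs that way.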