[Pkg-ofed-commits] [pkg-ofed] 01/02: add howto from subversion

Ana Beatriz Guerrero López ana at moszumanska.debian.org
Mon Jun 30 13:15:29 UTC 2014


This is an automated email from the git hooks/post-receive script.

ana pushed a commit to branch master
in repository pkg-ofed.

commit 6baca2b0e7d27dc5f5cedce37a19bea594ea86af
Author: Ana Guerrero López <ana at ekaia.org>
Date:   Mon Jun 30 15:14:24 2014 +0200

    add howto from subversion
---
 howto/Makefile              |   9 +
 howto/footer.html           |  11 +
 howto/infiniband-howto.sgml | 998 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1018 insertions(+)

diff --git a/howto/Makefile b/howto/Makefile
new file mode 100644
index 0000000..fd2aff4
--- /dev/null
+++ b/howto/Makefile
@@ -0,0 +1,9 @@
+all: html txt
+
+html:
+	sgml2html infiniband-howto.sgml
+txt:
+	sgml2txt infiniband-howto.sgml
+
+analytics:
+	sgml2html -F footer.html infiniband-howto.sgml
diff --git a/howto/footer.html b/howto/footer.html
new file mode 100644
index 0000000..0ee5f71
--- /dev/null
+++ b/howto/footer.html
@@ -0,0 +1,11 @@
+<script type="text/javascript">
+var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
+document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
+</script>
+<script type="text/javascript">
+try {
+var pageTracker = _gat._getTracker("UA-9207539-2");
+pageTracker._trackPageview();
+} catch(err) {}</script>
+</body>
+</html>
diff --git a/howto/infiniband-howto.sgml b/howto/infiniband-howto.sgml
new file mode 100644
index 0000000..0aa6f47
--- /dev/null
+++ b/howto/infiniband-howto.sgml
@@ -0,0 +1,998 @@
+<!doctype linuxdoc system>
+<article>
+<title>Infiniband HOWTO
+<author>Guy Coates 
+
+<abstract>
+This document describes how to install and configure the OFED infiniband software on Debian.
+</abstract>
+
+<toc>
+
+<sect>Introduction
+<p>
+This document describes how to install and configure the OFED infiniband software on Debian. It is intended
+to show you how to configure a simple Infiniband network as quickly as possible. It is not a replacement
+for the detailed documentation provided in the ofed-docs package!
+
+<sect1> The latest version
+<p>
+The latest version of the howto can be found on the pkg-ofed alioth website:
+
+<url url="http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html" 
+name="http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html">
+
+Source is kept in the SVN repository:
+
+<url url="http://svn.debian.org/wsvn/pkg-ofed/"
+name="http://svn.debian.org/wsvn/pkg-ofed/">
+
+
+<sect1>What is OFED?
+<p>
+OFED (OpenFabrics Enterprise Distribution) is the de facto Infiniband software stack on Linux. OFED
+provides a consistent set of kernel modules and userspace libraries which have been tested together.
+
+Further details of the OpenFabrics Alliance and OFED can be found here: <url url="http://www.openfabrics.org/"
+name="http://www.openfabrics.org">
+
+
+<sect>Installing the OFED Software
+<p>
+Before you can use your infiniband network you will need to install the OFED software on your infiniband client machines.
+You can choose to use the prebuilt packages on alioth, or build your own packages straight from the alioth SVN repository.
+<sect1>Installing prebuilt packages
+<p>
+
+Add the following lines to your sources.list file:
+<tscreen>
+<verb>
+deb http://pkg-ofed.alioth.debian.org/apt/ofed ./
+deb-src http://pkg-ofed.alioth.debian.org/apt/ofed ./
+</verb>
+</tscreen>
+and run:
+<tscreen>
+<verb>
+aptitude update
+aptitude install ofed
+</verb>
+</tscreen>
+
+<sect1>Building packages from source
+<p>
+If you wish to build the OFED packages from the alioth svn repository, use the following procedure.
+
+<sect2>Install the prerequisite development packages
+<p>
+<tscreen>
+<verb>
+aptitude install svn-buildpackage build-essential devscripts
+</verb>
+</tscreen>
+<sect2>Check out the svn tree
+<p>
+<tscreen>
+svn co svn://svn.debian.org/pkg-ofed/
+</tscreen>
+
+<sect2>Install the upstream source (optional)
+<p>
+The upstream source tarballs need to be available if you
+want to build proper Debian packages suitable for inclusion
+upstream. If you are simply building packages for your own use,
+you can ignore this step.
+<tscreen>
+<verb>
+cd pkg-ofed
+mkdir tarballs
+</verb>
+</tscreen>
+
+Original source tarballs can be downloaded from the repository:
+<tscreen>
+<verb>
+  apt-get source libibverbs
+</verb>
+</tscreen>
+
+Alternatively, you can grab the source code directly from upstream.
+
+http://www.openfabrics.org/downloads/OFED/
+
+Upstream source is distributed via SRPMS; you can use alien to convert them into tarballs.
+
+<sect2> Build the packages.
+<p>
+Change into the directory of the package you wish to build, e.g. for libibcommon:
+<tscreen>
+ cd pkg-ofed/libibcommon
+</tscreen>
+Link in the upstream tarballs directory (optional)
+<tscreen>
+ ln -s -f ../tarballs .
+</tscreen>
+Run svn-buildpackage from within the trunk directory.
+<tscreen><verb>
+ cd pkg-ofed/libibcommon/trunk
+ svn-buildpackage -uc -us -rfakeroot 
+</verb>
+</tscreen>
+The build process will generate a deb in the build-area directory. 
+
+Repeat the process for the rest of the packages. Note that some packages have build dependencies on other OFED packages. The suggested build order is:
+<tscreen>
+<verb>
+ libibverbs  
+ libnes       
+ libcxgb3  
+ libipathverbs  
+ libmlx4  
+ libmthca  
+ librdmacm  
+ libibcm
+ libibcommon
+ libibumad
+ libibmad
+ libsdp
+ dapl
+ opensm
+ infiniband-diags
+ ibutils
+ mstflint
+ perftest
+ qlvnictools
+ qperf
+ rds-tools
+ sdpnetstat
+ srptools
+ tvflash
+ ibsim
+ mpitests
+ ofed-docs
+ ofa_kernel
+ ofed
+</verb>
+</tscreen>
+
+
+
+<sect>Install the kernel modules
+<p>
+You now need to build a set of OFED kernel modules which match the version of the OFED software you have installed.
+
+The Debian kernel contains a set of OFED infiniband drivers, but they may not match the OFED userspace version you have installed.
+Consult the table below to determine what OFED version the Debian kernel contains. 
+
+<tscreen>
+<verb>
+Debian Kernel Version      OFED Version
+<=2.6.26                       1.3
+>=2.6.27                       1.4
+</verb>
+</tscreen>
+
+
+If the Debian kernel modules are the incorrect version, you can build a new set of modules using the ofa-kernel-source package.
+If your kernel already includes the correct OFED kernel modules you can skip the rest of this section. If you are in doubt, you should
+build a new set of modules rather than relying on the modules shipped with the kernel.
+
+<sect1>Building new kernel modules
+<p>
+You can build new kernel modules using module-assistant.
+<tscreen>
+<verb>
+aptitude install module-assistant
+</verb>
+</tscreen>
+
+Ensure you have the ofa-kernel-source package installed, and then run:
+<tscreen>
+ <verb>
+ module-assistant prepare
+ module-assistant clean ofa-kernel
+ module-assistant build ofa-kernel
+</verb>
+</tscreen>
+
+This procedure will create an ofa-kernel-modules deb in /usr/src. You can then install the deb using dpkg or by running:
+<tscreen>
+<verb>
+ module-assistant install ofa-kernel
+</verb>
+</tscreen>
+The deb can also be copied to your other infiniband hosts and installed using dpkg.
+
+As the deb contains replacements for existing kernel modules you will need to either manually remove 
+any infiniband modules which have already been loaded, or reboot the machine, before you can use the new modules. 
+
+The new kernel modules will be installed into /lib/modules/&lt;kernel-version&gt;/updates. They will not overwrite the original kernel modules, but the module
+loader will pick up the modules from the updates directory in preference. You can verify that the system is using the new kernel modules by running the 
+modinfo command.
+
+<tscreen>
+<verb>
+# modinfo ib_core
+filename:       /lib/modules/2.6.22.19/updates/kernel/drivers/infiniband/core/ib_core.ko
+author:         Roland Dreier
+description:    core kernel InfiniBand API
+license:        Dual BSD/GPL
+vermagic:       2.6.22.19 SMP mod_unload 
+</verb>
+</tscreen>
+
+Note that if you wish to rebuild the kernel modules for any reason (e.g. for a new kernel version or to continue an interrupted build), you must issue
+the "module-assistant clean" command before trying a new build.
+
+
+<sect>Setting up a basic infiniband network   
+<p>
+This section describes how to set up a basic infiniband network and test its functionality.
+
+<sect1>Upgrade your Infiniband card and switch firmware
+<p>
+Before proceeding you should ensure that the firmware in your switches and infiniband cards is at the latest release. 
+Older firmware versions may cause interoperability and fabric stability issues. Do not assume that hardware
+fresh from the factory has the latest firmware on it.
+
+You should follow the documentation from your vendor as to how the firmware should be updated.
+
+<sect1>Physically Connect the network
+<p>
+Connect up your hosts and switches.
+
+<sect1>Choose a Subnet Manager
+<p>
+Each infiniband network requires a subnet manager.  You can choose to run the OFED opensm subnet manager on one of the
+Linux clients, or you may choose to use an embedded subnet manager running on one of the switches in your fabric. Note
+that not all switches come with a subnet manager; check your switch documentation.
+
+
+<sect1>Load the kernel modules
+<p>
+Infiniband kernel modules are not loaded automatically. You should add them to /etc/modules so that they are automatically loaded on machine
+bootup. You will need to include the hardware specific modules and the protocol modules.
+
+
+/etc/modules:
+<verb>
+# Hardware drivers
+# Choose the appropriate modules from
+# /lib/modules/<kernel-version&gt;/updates/kernel/drivers/infiniband/hw
+#
+#mlx4_ib  # Mellanox ConnectX cards
+#ib_mthca # some mellanox cards
+#iw_cxgb3 # Chelsio T3 cards
+#iw_nes # NetEffect cards
+#
+# Protocol modules
+# Common modules
+rdma_ucm
+ib_umad
+ib_uverbs
+# IP over IB
+ib_ipoib
+# scsi over IB 
+ib_srp
+# IB SDP protocol
+ib_sdp
+</verb>
+
+
+<sect1>(optional) Start opensm
+<p>
+If you are going to use the opensm subnet manager, edit /etc/default/opensm and add the port
+GUIDs of the interfaces on which you wish to start opensm.
+
+You can find the port GUIDs of your cards with the ibstat -p command:
+<tscreen>
+<verb>
+# ibstat -p
+0x0002c9030002fb05
+0x0002c9030002fb06
+</verb>
+</tscreen>
+
+/etc/default/opensm:
+<tscreen>
+<verb>
+PORTS="0x0002c9030002fb05 0x0002c9030002fb06"
+</verb>
+</tscreen>
+
+Note that if you want to start opensm on all ports, you can use PORTS="ALL" instead.
+
+Start opensm:
+
+<verb>
+#/etc/init.d/opensm start
+</verb>
+
+If opensm has started correctly you should see SUBNET UP messages in the opensm logfile (/var/log/opensm.&lt;PORTID&gt;.log).
+
+<verb>
+Mar 04 14:56:06 600685 [4580A960] 0x02 -> SUBNET UP
+</verb>
+
+Note that you can start opensm on multiple nodes; one node will be the active subnet manager and the others will put themselves into standby.
+
+
+<sect1>Check network health
+<p>
+You can now check the status of the local IB link with the ibstat command. Connected links should be in the "LinkUp" state. The following
+output is from a dual ported card with only one port (port 1) connected.
+
+<tscreen><verb>
+# ibstat
+CA 'mlx4_0'
+        CA type: MT25418
+        Number of ports: 2
+        Firmware version: 2.3.0
+        Hardware version: a0
+        Node GUID: 0x0002c9030002fb04
+        System image GUID: 0x0002c9030002fb07
+        Port 1:
+                State: Active
+                Physical state: LinkUp
+                Rate: 20
+                Base lid: 2
+                LMC: 0
+                SM lid: 1
+                Capability mask: 0x02510868
+                Port GUID: 0x0002c9030002fb05
+        Port 2:
+                State: Down
+                Physical state: Polling
+                Rate: 10
+                Base lid: 0
+                LMC: 0
+                SM lid: 0
+                Capability mask: 0x02510868
+                Port GUID: 0x0002c9030002fb06
+</verb></tscreen>
+
+<sect1>Check the extended network connectivity
+<p>
+Once the host is connected to the infiniband network you can check the health of all of the other network components with the ibhosts, ibswitches and iblinkinfo commands.
+
+ibhosts displays all of the hosts visible on the network.
+
+<tscreen><verb>
+# ibhosts
+Ca      : 0x0008f1040399d3d0 ports 2 "Voltaire HCA400Ex-D"
+Ca      : 0x0008f1040399d370 ports 2 "Voltaire HCA400Ex-D"
+Ca      : 0x0008f1040399d3fc ports 2 "Voltaire HCA400Ex-D"
+Ca      : 0x0008f1040399d3f4 ports 2 "Voltaire HCA400Ex-D"
+Ca      : 0x0002c9030002faf4 ports 2 "MT25408 ConnectX Mellanox Technologies"
+Ca      : 0x0002c9030002fc0c ports 2 "MT25408 ConnectX Mellanox Technologies"
+Ca      : 0x0002c9030002fc10 ports 2 "MT25408 ConnectX Mellanox Technologies"
+</verb></tscreen>
+
+ibswitches will display all of the switches in the network.
+<tscreen><verb>
+# ibswitches
+Switch  : 0x0008f104004121fa ports 24 "ISR9024D-M Voltaire" enhanced port 0 lid 1 lmc 0
+</verb></tscreen>
+
+iblinkinfo will show the status and speed of all of the links in the network.
+<tscreen><verb>
+#iblinkinfo.pl 
+Switch 0x0008f104004121fa ISR9024D-M Voltaire:
+      1    1[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       2    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1    2[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      13    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1    3[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       4    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1    4[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      26    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1    5[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      27    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1    6[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      24    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1    7[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      28    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1    8[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      25    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1    9[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      31    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1   10[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      32    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1   11[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      33    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1   12[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      29    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+      1   13[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      30    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
+          14[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  )
+      1   15[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       3    1[  ] "Voltaire HCA400Ex-D" (  )
+      1   16[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      10    1[  ] "Voltaire HCA400Ex-D" (  )
+          17[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  )
+          18[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  )
+      1   19[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       7    2[  ] "Voltaire HCA400Ex-D" (  )
+      1   20[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       6    2[  ] "Voltaire HCA400Ex-D" (  )
+      1   21[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       5    2[  ] "Voltaire HCA400Ex-D" (  )
+      1   22[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      21    1[  ] "Voltaire HCA400Ex-D" (  )
+      1   23[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       9    2[  ] "Voltaire HCA400Ex-D" (  )
+      1   24[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       8    1[  ] "Voltaire HCA400Ex-D" (  )
+</verb></tscreen>
+
+<sect1>Testing connectivity with ibping
+<p>
+ibping is an infiniband equivalent to the ICMP ping command. Choose a node on the fabric and run an ibping server:
+<tscreen>
+#ibping -S
+</tscreen>
+
+Choose another node on your network, and then ping the port GUID of the server. (ibstat on the server will list the port GUID).
+
+<tscreen>
+<verb>
+#ibping -G 0x0002c9030002fc1d
+Pong from test.example.com (Lid 13): time 0.072 ms
+Pong from test.example.com (Lid 13): time 0.043 ms
+Pong from test.example.com (Lid 13): time 0.045 ms
+Pong from test.example.com (Lid 13): time 0.045 ms
+</verb>
+</tscreen>
+
+<sect1>Testing RDMA performance
+<p>
+
+You can test the latency and bandwidth of a link with the ib_rdma_lat and ib_rdma_bw commands.
+
+To test the latency, start the server on a node:
+<tscreen>
+#ib_rdma_lat
+</tscreen>
+and then start a client on another node, giving it the hostname of the server.
+<tscreen>
+<verb>
+#ib_rdma_lat  hostname-of-server
+   local address: LID 0x0d QPN 0x18004a PSN 0xca58c4 RKey 0xda002824 VAddr 0x00000000509001
+  remote address: LID 0x02 QPN 0x7c004a PSN 0x4b4eba RKey 0x82002466 VAddr 0x00000000509001
+Latency typical: 1.15193 usec
+Latency best   : 1.13094 usec
+Latency worst  : 5.48519 usec
+</verb>
+</tscreen>
+
+You can test the bandwidth of the link using the ib_rdma_bw command. Start the server on a node:
+<tscreen>
+#ib_rdma_bw
+</tscreen>
+and then start a client on another node, giving it the hostname of the server.
+<tscreen>
+<verb>
+#ib_rdma_bw  hostname-of-server
+855: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 |
+855: Local address:  LID 0x0d, QPN 0x1c004a, PSN 0xbf60dd RKey 0xde002824 VAddr 0x002aea4092b000
+855: Remote address: LID 0x02, QPN 0x004a, PSN 0xaad03c, RKey 0x86002466 VAddr 0x002b8a4e191000
+
+
+855: Bandwidth peak (#0 to #955): 1486.85 MB/sec
+855: Bandwidth average: 1486.47 MB/sec
+855: Service Demand peak (#0 to #955): 1970 cycles/KB
+855: Service Demand Avg  : 1971 cycles/KB
+
+</verb>
+</tscreen>
+
+The perftest package contains a number of other similar benchmarking programs to test various aspects of your network.
+
+
+<sect>IP over Infiniband (IPoIB)
+<p>
+The OFED stack allows you to run TCP/IP over your infiniband network, allowing you to run non-infiniband aware applications across
+your network. Several native infiniband applications also use IPoIB for host resolution (e.g. Lustre and SDP).
+
+<sect1>List the network devices
+<p>
+Check that the IPoIB module is loaded.
+
+<tscreen>
+#modprobe ib_ipoib 
+</tscreen>
+You will now have an "ib" network interface for each of your infiniband cards.
+<tscreen>
+<verb>
+#ifconfig -a
+
+<snip>
+ib0       Link encap:UNSPEC  HWaddr 80-06-00-48-FE-80-00-00-00-00-00-00-00-00-00-00  
+          BROADCAST MULTICAST  MTU:2044  Metric:1
+          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
+          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
+          collisions:0 txqueuelen:256 
+          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
+
+ib1       Link encap:UNSPEC  HWaddr 80-06-00-49-FE-80-00-00-00-00-00-00-00-00-00-00  
+          BROADCAST MULTICAST  MTU:2044  Metric:1
+          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
+          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
+          collisions:0 txqueuelen:256 
+          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
+<snip>
+</verb>
+</tscreen>
+
+<sect1>IP Configuration
+<p>
+You can now configure the ib network devices using /etc/network/interfaces.
+<tscreen>
+<verb>
+auto ib0
+iface ib0 inet static
+  address 172.31.128.50
+  netmask 255.255.240.0
+  broadcast 172.31.143.255
+</verb>
+</tscreen>
+Bring the network device up, as normal.
+<tscreen>
+ifup ib0
+</tscreen>
+
+<sect1>Connected vs Unconnected Mode
+<p>
+IPoIB can run over two infiniband transports: Unreliable Datagram (UD) mode or Connected mode (CM). The differences between
+these two modes are described in:
+<verb>
+RFC4392 - IP over InfiniBand (IPoIB) Architecture
+RFC4391 - Transmission of IP over InfiniBand (IPoIB) (UD mode)
+RFC4755 - IP over InfiniBand: Connected Mode
+</verb>
+ADDME: Pro/cons of these two methods?
+
+You can switch between these two modes at runtime with:
+<tscreen>
+<verb>  
+ echo datagram > /sys/class/net/ibX/mode 
+ echo connected > /sys/class/net/ibX/mode
+</verb>
+</tscreen>
+
+The default is datagram (UD) mode. If you wish to use CM then you can add a script to /etc/network/if-up.d to
+automatically set CM mode on your interfaces when they are configured.
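+
+As a minimal sketch of an alternative to a separate if-up.d script, the mode can also be set from the interface
+stanza itself using the ifupdown post-up option. This assumes the interface is named ib0 and reuses the example
+addresses from the IP Configuration section; adjust both for your site.
+<tscreen>
+<verb>
+# Assumption: ib0 is switched to connected mode once it has been brought up
+auto ib0
+iface ib0 inet static
+  address 172.31.128.50
+  netmask 255.255.240.0
+  broadcast 172.31.143.255
+  post-up echo connected > /sys/class/net/ib0/mode
+</verb>
+</tscreen>
+You can confirm the active mode afterwards by reading /sys/class/net/ib0/mode.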
+
+
+<sect1>TCP tuning
+<p>
+In order to obtain maximum IPoIB throughput you may need to tweak the MTU and various kernel TCP buffer and window settings. 
+See the details in the ipoib_release_notes.txt document in the ofed-docs package.
+
+<sect1>ARP and dual ported cards
+<p>
+If you have a dual ported card with both ports on the same IB subnet, but different IP subnets, you
+will need to tweak the ARP settings for the IPoIB interfaces.  See  ipoib_release_notes.txt in the ofed-docs package for a full
+discussion of this issue.
+
+<tscreen>
+<verb>
+   sysctl -w net.ipv4.conf.ib0.arp_ignore=1
+   sysctl -w net.ipv4.conf.ib1.arp_ignore=1
+</verb>
+</tscreen>
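+Settings made with "sysctl -w" are lost on reboot. A minimal sketch for making them persistent, assuming that
+/etc/sysctl.conf is applied at boot on your system and that the interfaces are named ib0 and ib1. Note that the
+ib interfaces must already exist when the file is applied, so the IPoIB module needs to be loaded first.
+<tscreen>
+<verb>
+# /etc/sysctl.conf (assumed location)
+net.ipv4.conf.ib0.arp_ignore=1
+net.ipv4.conf.ib1.arp_ignore=1
+</verb>
+</tscreen>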
+
+<sect>OpenMPI
+<p>
+
+This section describes how to configure OpenMPI to use Infiniband.
+
+<sect1>Configure IPoIB
+<p>
+OpenMPI uses IPoIB for job startup and tear-down. You should configure IPoIB on all of your hosts.
+
+<sect1>Load the modules
+<p>
+Ensure the rdma_ucm module is loaded.
+<tscreen>
+modprobe rdma_ucm
+</tscreen>
+
+<sect1>Check permissions and limits
+<p>
+Users who want to run MPI jobs will need to have write permissions for the following devices:
+<tscreen>
+<verb>
+/dev/infiniband/uverbs*
+/dev/infiniband/rdma_cm*
+</verb>
+</tscreen>
+The simplest way to do this is to add the users to the rdma group. If that is not suitable for
+your site, you can change the permissions and ownership of these devices by editing the following
+udev rules:
+<tscreen>
+<verb>
+/etc/udev/rules.d/50-udev.rules
+/etc/udev/rules.d/91-permissions.rules
+</verb>
+</tscreen>
+
+<p>
+OpenMPI will need to pin memory. Edit /etc/security/limits.conf and add the line:
+<tscreen>
+*               hard    memlock         unlimited
+</tscreen>
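+The hard limit only sets the ceiling; the soft limit in effect for a login session is configured separately.
+A minimal sketch, assuming you want both raised for all users (the first field could instead be restricted to a
+group such as @rdma if that suits your site better):
+<tscreen>
+<verb>
+# /etc/security/limits.conf
+*               soft    memlock         unlimited
+*               hard    memlock         unlimited
+</verb>
+</tscreen>
+After logging in again, "ulimit -l" should report "unlimited".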
+
+<sect1>Install the mpi test programs
+<p>
+Check the mpitests package is installed. 
+<tscreen>
+aptitude install mpitests
+</tscreen>
+
+<sect1>Configure Host Access
+<p>
+OpenMPI uses ssh to spawn jobs on remote hosts. You should configure a public/private keypair to ensure that you
+can ssh between hosts without entering a password. You should also ensure that your login process is silent.
+
+<sect1>Run the MPI PingPong benchmark
+<p>
+
+We will use the MPI PingPong benchmark for our testing. By default, openmpi should use infiniband networks in preference to any tcp networks it finds. However, we will force mpi to ignore tcp networks to ensure that it is using the infiniband network.
+
+
+<verb>
+#!/bin/bash
+#Infiniband MPI test program
+#Edit the hosts below to match your test hosts
+cat > /tmp/hostfile.$$.mpi <<EOF
+hostA slots=1
+HostB slots=1
+EOF
+
+mpirun --mca btl_openib_verbose 1 --mca btl ^tcp -n 2 -hostfile /tmp/hostfile.$$.mpi IMB-MPI1 PingPong
+</verb>
+
+If all goes well you should see openib debugging messages from both hosts, together with the job output.
+
+<tscreen>
+<verb>
+<snip>
+# PingPong
+[HostB][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
+[HostB][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
+[HostA][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
+[HostA][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
+
+#---------------------------------------------------
+# Benchmarking PingPong 
+# #processes = 2 
+#---------------------------------------------------
+       #bytes #repetitions      t[usec]   Mbytes/sec
+            0         1000         1.53         0.00
+            1         1000         1.44         0.66
+            2         1000         1.42         1.34
+            4         1000         1.41         2.70
+            8         1000         1.48         5.15
+           16         1000         1.50        10.15
+           32         1000         1.54        19.85
+           64         1000         1.79        34.05
+          128         1000         3.01        40.56
+          256         1000         3.56        68.66
+          512         1000         4.46       109.41
+         1024         1000         5.37       181.92
+         2048         1000         8.13       240.25
+         4096         1000        10.87       359.48
+         8192         1000        15.97       489.17
+        16384         1000        30.54       511.68
+        32768         1000        55.01       568.12
+        65536          640       122.20       511.46
+       131072          320       207.20       603.27
+       262144          160       377.10       662.96
+       524288           80       706.21       708.00
+      1048576           40      1376.93       726.25
+      2097152           20      1946.00      1027.75
+      4194304           10      3119.29      1282.34
+</verb>
+</tscreen>
+
+If you encounter any errors, read the excellent OpenMPI troubleshooting guide: <url url="http://www.openmpi.org"
+name="http://www.openmpi.org">
+
+If you want to compare infiniband performance with your ethernet/TCP networks, you can re-run the tests using flags to tell openmpi to use your ethernet network. (The example below assumes that your test nodes are connected via eth0).
+
+<verb>
+#!/bin/bash
+#TCP MPI test program
+#Edit the hosts below to match your test hosts
+cat > /tmp/hostfile.$$.mpi <<EOF
+hostA slots=1
+HostB slots=1
+EOF
+mpirun --mca btl ^openib --mca btl_tcp_if_include eth0 --hostfile /tmp/hostfile.$$.mpi -n 2 IMB-MPI1 PingPong
+</verb>
+
+You should notice significantly higher latencies than for the infiniband test.
+
+<sect>SDP
+<p>
+Sockets Direct Protocol (SDP) is a network protocol which provides an RDMA accelerated
+alternative to TCP over infiniband networks. OFED provides an LD_PRELOADable library 
+(libsdp.so) which allows programs which use TCP to use the more efficient SDP protocol instead.  
+The use of an LD_PRELOADable library means that the switch in protocol is transparent,
+and does not require the application to be recompiled.
+
+
+<sect1>Configuration
+<p>
+SDP uses IPoIB for address resolution, so you must configure IPoIB before using SDP.
+
+You should also ensure the ib_sdp kernel module is loaded.
+<verb>
+modprobe ib_sdp
+</verb>
+
+
+You can use libsdp in two ways; you can either manually LD_PRELOAD the library whilst invoking your application, or
+create a config file which specifies which applications will use SDP.
+
+To manually LD_PRELOAD a library, simply set the LD_PRELOAD variable before invoking your application.
+<verb>
+LD_PRELOAD=libsdp.so ./path/to/your/application ...
+</verb>
+If you wish to choose which programs will use SDP you can edit /etc/sdp.conf and specify which programs, ports and
+addresses are eligible for use.
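+
+As an illustration only, rules in this file take the general form "use &lt;family&gt; &lt;role&gt; &lt;program&gt; &lt;address&gt;:&lt;ports&gt;".
+The sketch below is based on the sample configuration distributed with libsdp; the program name and subnet are
+hypothetical, and you should check the comments in your installed copy for the exact syntax your version supports.
+<tscreen>
+<verb>
+# Hypothetical: outgoing connections from NPtcp to the IPoIB subnet use SDP
+use sdp connect NPtcp 10.0.0.0/24:*
+# Hypothetical: everything else stays on TCP
+use tcp connect * *:*
+use tcp server * *:*
+</verb>
+</tscreen>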
+
+
+<sect1>Example Using SDP with Netpipe
+<p>
+The following example shows how to use libsdp to make the TCP benchmarking application, netpipe, use SDP rather than TCP.
+NodeA is the server and NodeB is the client. IPoIB is configured on both nodes, and NodeA's IPoIB address is 10.0.0.1.
+
+Install netpipe on both nodes.
+<verb>
+aptitude install netpipe-tcp
+</verb>
+
+First, run the netpipe benchmark over TCP in order to obtain a baseline number.
+
+<tscreen>
+<verb>
+nodeA# NPtcp
+nodeB# NPtcp -h 10.0.0.1
+Send and receive buffers are 16384 and 87380 bytes
+(A bug in Linux doubles the requested buffer sizes)
+Now starting the main loop
+  0:       1 bytes   2778 times -->      0.22 Mbps in      34.04 usec
+  1:       2 bytes   2937 times -->      0.45 Mbps in      33.65 usec
+  2:       3 bytes   2971 times -->      0.69 Mbps in      33.41 usec
+<snip>
+121: 8388605 bytes      3 times -->   2951.89 Mbps in   21680.99 usec
+122: 8388608 bytes      3 times -->   3008.08 Mbps in   21276.00 usec
+123: 8388611 bytes      3 times -->   2941.76 Mbps in   21755.66 usec
+</verb>
+</tscreen>
+
+Now repeat the test, but force netpipe to use SDP rather than TCP.
+
+<tscreen>
+<verb>
+nodeA# LD_PRELOAD=libsdp.so NPtcp 
+nodeB# LD_PRELOAD=libsdp.so  NPtcp -h 10.0.0.1
+Send and receive buffers are 16384 and 87380 bytes
+(A bug in Linux doubles the requested buffer sizes)
+Now starting the main loop
+  0:       1 bytes   9765 times -->      1.45 Mbps in       5.28 usec
+  1:       2 bytes  18946 times -->      2.80 Mbps in       5.46 usec
+  2:       3 bytes  18323 times -->      4.06 Mbps in       5.63 usec
+<snip>
+121: 8388605 bytes      5 times -->   7665.51 Mbps in    8349.08 usec
+122: 8388608 bytes      5 times -->   7668.62 Mbps in    8345.70 usec
+123: 8388611 bytes      5 times -->   7629.04 Mbps in    8389.00 usec
+</verb>
+</tscreen>
+You should see a significant increase in performance when using SDP.
+
+<sect>SRP
+<p>
+SRP (SCSI RDMA Protocol, also known as SCSI Remote Protocol) is a protocol that allows the use of SCSI devices across
+infiniband. If you have infiniband storage, you can use SRP to access the devices.
+<sect1>Configuration
+<p>
+Ensure that your infiniband storage is presented to the host in question. Check your storage controller documentation.
+Ensure that the ib_srp kernel module is loaded and that the srptools package is installed.
+
+<tscreen>
+modprobe ib_srp
+</tscreen>
+
+<sect1>SRP daemon configuration
+<p>
+srp_daemon is responsible for discovering and connecting to SRP targets. The default configuration shipped with srp_daemon is to ignore all presented
+devices; this is a failsafe to prevent devices from being mounted by accident on the wrong hosts.
+
+The srp_daemon config file /etc/srp_daemon.conf has a simple syntax, and is described in the srp_daemon(1) manpage. Each line in this file is a rule which either
+allows or disallows connection, according to the first character in the line (a or d respectively) and the ID of the storage device.
+
+<sect2>Determine the IDs of presented devices
+<p>
+You can determine the IDs of SRP devices presented to your hosts by running the ibsrpdm -c command.
+<tscreen>
+<verb>
+# ibsrpdm -c
+id_ext=50001ff10005052a,ioc_guid=50001ff10005052a,dgid=fe8000000000000050001ff10005052a,pkey=ffff,service_id=2a050500f11f0050
+</verb>
+</tscreen>
+
+<sect2>Configure srp_daemon to connect to the devices
+<p>
+Once we have the IDs of the devices, we can add them to /etc/srp_daemon.conf. You can also specify other SRP-related
+options for the target, such as max_cmd_per_lun and max_sect. These are storage specific; check your vendor documentation
+for recommended values.
+<tscreen>
+<verb>
+# This rule allows connection to our target
+a id_ext=50001ff10005052a,ioc_guid=50001ff10005052a,max_cmd_per_lun=32,max_sect=65535
+# This rule disallows everything else
+d
+</verb>
+</tscreen>
+Restart the srp_daemon and the storage target should now become visible; check the kernel log to see if the disk has been detected.
+
+
+<verb>
+/etc/init.d/srptools restart
+</verb>
+
+In the example kernel log output the disk has been discovered as SCSI device sdb.
+<tscreen>
+<verb>
+scsi 3:0:0:1: Direct-Access     IBM      DCS9900          5.03 PQ: 0 ANSI: 5
+sd 3:0:0:1: [sdb] 1953458176 4096-byte hardware sectors (8001365 MB)
+sd 3:0:0:1: [sdb] Write Protect is off
+sd 3:0:0:1: [sdb] Mode Sense: 97 00 10 08
+sd 3:0:0:1: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
+sd 3:0:0:1: [sdb] 1953458176 4096-byte hardware sectors (8001365 MB)
+sd 3:0:0:1: [sdb] Write Protect is off
+sd 3:0:0:1: [sdb] Mode Sense: 97 00 10 08
+sd 3:0:0:1: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
+ sdb:<6>scsi4 : SRP.T10:50001FF10005052A
+ unknown partition table
+sd 3:0:0:1: [sdb] Attached SCSI disk
+sd 3:0:0:1: Attached scsi generic sg5 type 0
+</verb>
+</tscreen>
+
+<sect1>Multipathing, LVM and formatting
+<p>
+The newly detected SRP device can be treated like any other SCSI device. If you have multiple infiniband adapters you can use multipath-tools
+on top of the SRP devices to protect against a network failure. If you are not using multipathed IO you can simply format the device as normal.
+
+<sect>Building Lustre against OFED
+<p>
+Lustre is a scalable cluster filesystem popular on high performance compute clusters. See <url url="http://www.lustre.org" name="http://www.lustre.org">
+for more information. Lustre can use infiniband as one of its network transports in order to increase performance. This section describes how to compile Lustre
+against the OFED infiniband stack.
+<sect1>Check Compatibility
+<p>
+Not all Lustre versions are compatible with all OFED or kernel versions. Read the Lustre release notes to find out which combinations are supported.
+
+
+<sect1>Build a Lustre-patched kernel
+<p>
+Build a Lustre-patched kernel as per the instructions on the Lustre wiki. Once you have built the kernel, keep the configured source tree.
+It is required for the next step.
+
+
+<sect1>Build OFED modules for the Lustre-patched kernel
+<p>
+Build the OFED modules against the newly built Lustre-patched kernel.
+
+<tscreen>
+<verb>
+ module-assistant prepare
+ module-assistant clean ofa-kernel
+ module-assistant -k/path/to/lustre/patched/kernel build ofa-kernel
+</verb>
+</tscreen>
+
+Do not issue a "module-assistant clean" command after the build. The ofa-kernel-module source tree is needed for the
+next step.
+
+
+<sect1>Configure Lustre
+<p>
+
+You can now configure Lustre to build against the Lustre-patched kernel and the ofa-kernel-module sources.
+
+<tscreen>
+<verb>
+ cd lustre-source
+ ./configure --with-o2ib=/usr/src/modules/ofa-kernel  --with-linux=/path/to/patched/linux/source \
+ --other-options
+</verb>
+</tscreen>
+
+
+<sect>Troubleshooting
+<p>
+This section covers general troubleshooting and commonly reported problems.
+<sect1>General fabric troubleshooting
+<p>
+The ibdiagnet program can be used to troubleshoot potential issues with your infiniband fabric.
+<tscreen>
+ibdiagnet -r
+</tscreen>
+
+<sect1>ib_query_gid() failed errors on mlx4 platforms
+<p>
+ibstat or opensm hangs and the following kernel messages are printed:
+
+<tscreen>
+<verb>
+kernel: [   78.170077] ib0: ib_query_gid() failed
+kernel: [   89.272789] ib0: ib_query_port failed
+</verb>
+</tscreen>
+
+Fix: Load the mlx4_core module with the msi_x=0 option.
+
+<tscreen>
+<verb>
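+# modprobe only reads files in /etc/modprobe.d whose names end in .conf
+cat > /etc/modprobe.d/mlx4_core.conf <<EOF
+options mlx4_core msi_x=0
+EOF
+
+# rebuild the initramfs so the option also applies when the module is loaded early
+update-initramfs -u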
+</verb>
+</tscreen>
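+
+After a reboot (or after reloading the module) the running value can be checked through sysfs. The following is a minimal
+sketch: it assumes the driver exposes the parameter as /sys/module/mlx4_core/parameters/msi_x, and the function name
+check_msi_x and its optional root-directory argument are illustrative only.

```shell
#!/bin/sh
# Report whether mlx4_core is running with msi_x=0.
# $1 is an optional alternative sysfs root, useful for a dry run; the
# parameter path is an assumption about how the driver exposes msi_x.
check_msi_x() {
    param="${1:-/sys}/module/mlx4_core/parameters/msi_x"
    if [ ! -r "$param" ]; then
        echo "cannot read $param (is mlx4_core loaded?)"
        return 2
    fi
    value=$(cat "$param")
    if [ "$value" = "0" ]; then
        echo "msi_x=0: option is active"
    else
        echo "msi_x=$value: option is not active"
        return 1
    fi
}
```

+If the option is reported as not active, check that the modprobe configuration file is being read and that the initramfs was rebuilt.
+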
+
+<sect1>Missing XRC support
+<p>
+If you see error messages about missing XRC support, your kernel modules and userspace libraries are mismatched.
+<tscreen>
+<verb>
+mlx4: There is a mismatch between the kernel and the userspace  
+libraries: Kernel does not support XRC. Exiting.
+</verb>
+</tscreen>
+Fix: Make sure that you build and install the OFED kernel modules as described in section X.
+
+
+<sect>Tips and Tricks
+<p>
+This section details an assortment of miscellaneous tips.
+<sect1>Descriptive node names
+<p>
+You can give your hosts descriptive names by echoing text to the following file:
+<tscreen>
+<verb>
+echo `uname -n` > /sys/class/infiniband/<driver>/node_desc
+</verb>
+</tscreen>
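+
+On hosts with more than one adapter, the same command can be applied to every device in a loop. This is only a sketch:
+the function name set_node_desc and the optional root-directory argument are illustrative, and writing to the real /sys
+requires root privileges.

```shell
#!/bin/sh
# Write the host name into node_desc for every InfiniBand device found.
# $1 is an optional alternative sysfs root (handy for a dry run).
set_node_desc() {
    root="${1:-/sys}"
    for f in "$root"/class/infiniband/*/node_desc; do
        [ -e "$f" ] || continue      # the glob matched nothing
        uname -n > "$f"
        echo "$f: $(cat "$f")"
    done
}
```

+The setting does not survive a reboot or a driver reload, so the function would need to be run again, for example from a boot script.
+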
+
+<sect>Further Information
+<p>
+Extensive documentation on the OFED software is present in the ofed-docs package.
+
+The OpenFabrics Alliance web page can be found here:
+
+<url url="http://www.openfabrics.org/" name="http://www.openfabrics.org/">
+
+
+The following mailing lists are also useful:
+
+<url url="http://lists.alioth.debian.org/mailman/listinfo/pkg-ofed-devel" name="http://lists.alioth.debian.org/mailman/listinfo/pkg-ofed-devel">:
+pkg-ofed-devel: Discussion of Debian-specific problems or issues.
+
+
+<url url="http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general" name="http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general">:
+ofa-general: General discussion of the OFED software.
+
+Books:
+<verb>
+Infiniband Network Architecture
+by MindShare, Inc.; Tom Shanley
+Publisher: Addison-Wesley Professional
+Pub Date: October 31, 2002
+Print ISBN-10: 0-321-11765-4
+</verb>
+</article>
+
+
+<!--  LocalWords:  infiniband ofed svn ibping Pong RDMA ib rdma lat hostname bw
+ -->
+<!--  LocalWords:  tx iters cma QPN PSN RKey VAddr sec KB Avg IP IPoIB SDP inet
+ -->
+<!--  LocalWords:  IBoIP modprobe ipoib iface netmask ifup UD MTU txt OpenMPI
+ -->
+<!--  LocalWords:  ucm MPI memlock openmpi dev mpicc filesystem ssh keypair mca
+ -->
+<!--  LocalWords:  hostnames hostfile mpirun btl openib tcp sdp LD PRELOAD Mbps
+ -->
+<!--  LocalWords:  config netpipe nodeA NPtcp nodeB usec SRP srp IDs ibsrpdm ff
+ -->
+<!--  LocalWords:  ext ioc guid cmd scsi DCS PQ sd sdb DPO FUA sg Multipathing
+ -->
+<!--  LocalWords:  LVM multipath multipathed wiki
+ -->

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/pkg-ofed/pkg-ofed.git


