i40iw Linux* Driver for Intel(R) Ethernet Connection X722
===============================================================================

April 6, 2018

===============================================================================

Contents
--------

- Prerequisites
- Building and Installation
- Testing
- Virtualization
- Interoperability
- RDMA Statistics
- Known Issues


================================================================================


Prerequisites
-------------

- A supported kernel configuration, choose from the following:
    1) A Linux distribution supported by OFED 3.18-3 or OFED 4.8 (recommended).
       Use OFED if it is required by software you wish to run.
    2) An upstream kernel v4.8-v4.14, if you require fixes not in OFED.
       For example, NVMe over Fabrics (NVMeoF).
    3) RHEL 7.4 or SLES 12 SP3 with infiniband support installed, if you
       do not want to install OFED or upstream kernel.
- For OFED 3.18-3/OFED 4.8, install it with ./install.pl --all
- For OFED 4.8 or Linux Kernel v4.8-v4.14, download and install the
  latest rdma-core from https://github.com/linux-rdma/rdma-core/releases

NOTE: Internet Wide Area RDMA Protocol (iWARP) is not supported with the
i40iwvf driver running on Microsoft* Hyper-V.


Building and Installation
-------------------------

OFED 3.18-3
-----------
1. Untar i40iw-<version>.tar.gz, i40iwvf-<version>.tar.gz and
  libi40iw-<version>.tar.gz.
2. Install the PF driver as follows:
  cd i40iw-<version>
  ./build.sh <absolute path to i40e driver directory> 3
  For example: ./build.sh /opt/i40e-2.3.3 3
3. Install the VF driver as follows:
  cd i40iwvf-<version>
  ./build.sh <absolute path to i40evf driver directory> 3
  For example: ./build.sh /opt/i40evf-3.2.3 3
4. Install user-space library as follows:
  cd libi40iw-<version>
  ./build.sh
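
The four steps above can be collected into one shell sketch. The /opt
paths and version numbers are the examples from the steps above;
substitute the locations actually present on your system:

```shell
# Sketch of the full OFED 3.18-3 build sequence. Replace <version> and
# the /opt paths with the versions actually unpacked on your system.
tar xzf i40iw-<version>.tar.gz
tar xzf i40iwvf-<version>.tar.gz
tar xzf libi40iw-<version>.tar.gz
(cd i40iw-<version>    && ./build.sh /opt/i40e-2.3.3 3)    # PF driver
(cd i40iwvf-<version>  && ./build.sh /opt/i40evf-3.2.3 3)  # VF driver
(cd libi40iw-<version> && ./build.sh)                      # user-space library
```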

OFED 4.8
--------
1. Untar i40iw-<version>.tar.gz and i40iwvf-<version>.tar.gz.
2. Install the PF driver as follows:
  cd i40iw-<version>
  ./build.sh <absolute path to i40e driver directory> 4
  For example: ./build.sh /opt/i40e-2.3.3 4
3. Install the VF driver as follows:
  cd i40iwvf-<version>
  ./build.sh <absolute path to i40evf driver directory> 4
  For example: ./build.sh /opt/i40evf-3.2.3 4
4. OFED 4.8 ships with an older version of the rdma-core user-space
  package. Please download the latest release from
  https://github.com/linux-rdma/rdma-core/releases and follow its
  installation procedure.

Linux Kernel v4.8-v4.14/RHEL 7.4/SLES 12 SP3
--------------------------------------------
1. Untar i40iw-<version>.tar.gz and i40iwvf-<version>.tar.gz.
2. Install the PF driver as follows:
  cd i40iw-<version>
  ./build.sh <absolute path to i40e driver directory> k
  For example: ./build.sh /opt/i40e-2.3.3 k
3. Install the VF driver as follows:
  cd i40iwvf-<version>
  ./build.sh <absolute path to i40evf driver directory> k
  For example: ./build.sh /opt/i40evf-3.2.3 k
4. Please download the latest rdma_core user-space package from
  https://github.com/linux-rdma/rdma-core/releases and follow its
  installation procedure.
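
As a hedged sketch, a release tarball from that page typically builds as
follows; the build.sh helper is part of the rdma-core source tree, the
version number is only an example, and the tarball's own README lists the
required development packages (cmake, libnl, and so on):

```shell
# Sketch: build a downloaded rdma-core release tarball.
# rdma-core-15 is only an example; use the release you downloaded.
tar xzf rdma-core-15.tar.gz
cd rdma-core-15
bash build.sh    # builds the libraries and providers into ./build
```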


Adapter and Switch Flow Control Setting
---------------------------------------
We recommend enabling link-level flow control (both TX and RX) on the
X722 and on the connected switch.

To enable flow control on the X722, use the ethtool -A command. For
example:
  ethtool -A p4p1 rx on tx on
Confirm the setting with the ethtool -a command. For example:
  ethtool -a p4p1
You should see this output:
  Pause parameters for p4p1:
  Autonegotiate: off
  RX: on
  TX: on

To enable link-level flow control on the switch, please consult your switch
vendor's documentation. Look for flow-control and make sure both TX and RX are
set. Here is an example for a generic switch to enable both TX and RX flow
control on port 45:

  enable flow-control tx-pause ports 45
  enable flow-control rx-pause ports 45


================================================================================


Virtualization
--------------

To enable SR-IOV support, load i40iw with the following parameters
and then create VFs with i40e.
Note: This may have performance and scaling impacts as the number of
queue pairs and other RDMA resources are decreased.

  resource_profile=2 max_rdma_vfs=<number of VFs with RDMA support (0-32)>

  For example:
  modprobe i40iw resource_profile=2 max_rdma_vfs=32

NOTE: Once the VFs are running, do not change the PF configuration.
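
As a sketch of the full sequence, using the standard kernel SR-IOV sysfs
attribute to create the VFs (the interface name p4p1 and the VF count are
examples; run as root):

```shell
# Load i40iw with RDMA VF support, then create the VFs through i40e's
# SR-IOV sysfs attribute. p4p1 is an example interface name.
modprobe i40iw resource_profile=2 max_rdma_vfs=4
echo 4 > /sys/class/net/p4p1/device/sriov_numvfs
```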


Interoperability
----------------

To interoperate with Chelsio iWARP devices with OFED 4.8 or Linux
Kernels v4.8-v4.14:

Load Chelsio T4/T5 RDMA driver (iw_cxgb4) with parameter
dack_mode set to 0.

modprobe iw_cxgb4 dack_mode=0

If iw_cxgb4 is loaded at system boot, create a /etc/modprobe.d/iw_cxgb4.conf
file with the following entry:

options iw_cxgb4 dack_mode=0

Reload iw_cxgb4 for the new parameters to take effect.


RDMA Statistics
---------------

Use the following command to read RDMA Protocol statistics:
  cd /sys/class/infiniband/i40iw0/proto_stats; for f in *; do echo -n
  "$f: "; cat "$f"; done; cd

The following counters will increment when RDMA applications are
transferring data over the network:
  - ipInReceives
  - tcpInSegs
  - tcpOutSegs
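
The one-liner above can also be written as a small POSIX shell function;
the i40iw0 device name in the default path is an example and should match
your device:

```shell
#!/bin/sh
# Print every counter in a proto_stats directory as "name: value".
# Defaults to the first i40iw device; pass another directory to override.
print_proto_stats() {
    dir="${1:-/sys/class/infiniband/i40iw0/proto_stats}"
    for f in "$dir"/*; do
        printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

Called with no argument it reads the default path; the tcpInSegs and
tcpOutSegs counters should climb while a transfer is running.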


Memory Requirements
-------------------

A default i40iw load requires a minimum of 6 GB of memory for initialization.

For applications where the amount of memory is constrained, you can
decrease the required memory by lowering the available resources to
the i40iw driver. To do this, load the driver with the following profile
setting.

Note: This can have performance and scaling impacts as the number of
queue pairs and other RDMA resources are decreased in order to lower
memory usage to approximately 1.2 GB.

  modprobe i40iw resource_profile=2
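
To make the setting persist across reboots, the option can go in a
modprobe configuration file (the file name i40iw.conf is arbitrary):

```
# /etc/modprobe.d/i40iw.conf
options i40iw resource_profile=2
```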


Scaling Limits
--------------

Intel(R) Ethernet Connection X722 has limited RDMA resources, including
the number of Queue Pairs (QPs), Completion Queues (CQs) and Memory
Regions (MRs). In highly scaled environments or highly interconnected
HPC-style applications such as all-to-all, users may experience QP failure
errors once they reach the RDMA resource limits.

Below are the per-physical port limits for 4-port devices for the three
resources associated with the default i40iw driver load:
  QPs: 16384
  CQs: 32768
  MRs: 2453503

Other resource profiles allocate resources differently. If the i40iw
driver is loaded with resource_profile=2, then resources are more limited.

The example below shows the per-physical port resource limits when you
load the driver with modprobe i40iw resource_profile=2. (Note that these
may increase if you load fewer than 32 VFs using the max_rdma_vfs module
parameter.)
  QPs: 2048
  CQs: 3584
  MRs: 6143
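
To check the limits your device actually advertises, the verbose mode of
ibv_devinfo from libibverbs reports them (the device name i40iw0 is an
example):

```shell
# Print the advertised QP/CQ/MR limits for the first i40iw device.
ibv_devinfo -d i40iw0 -v | grep -E 'max_(qp|cq|mr):'
```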


Flow Control Recommendation
---------------------------

For better performance, enable flow control on all the nodes and on the
switch they are connected to.

To enable flow control on a node, run:
  ethtool -A <iwarp_interface> rx on tx on

===========================================
Recommended Settings for Intel MPI 2017.0.x
===========================================
Note: The following instructions assume that Intel MPI is installed using
default locations. Refer to Intel MPI documentation for further details
on parameters and general instructions.

1. Add or modify the following line in /etc/dat.conf, changing
<iwarp_interface> to match your interface name:
  ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0
  "<iwarp_interface> 0" ""

2. To select the iWARP device, add the following to mpiexec command:

  -genv I_MPI_FALLBACK_DEVICE disable
  -genv I_MPI_DEVICE rdma:ofa-v2-iwarp

Example
  mpiexec command line for uDAPL-2.0:
  mpiexec -machinefile <pathto>mpd.hosts_impi
  -genv I_MPI_FALLBACK_DEVICE disable
  -genv I_MPI_DEVICE rdma:ofa-v2-iwarp
  -ppn <number of processes per node> -n <number of nodes>
  <path to mpi application> <optional_parameters>

Note: mpd.hosts_impi is a text file with a list of the nodes' qualified
hostnames or IP addresses, one per line, in the MPI ring.

Note: Recommended optional_parameters if running IMB-MPI1 benchmark:
  -time 1000000 (specifies that a benchmark will run at most that many
  seconds per message size) -mem 2GB (specifies that at most that many
  GBytes are allocated per process for the message buffers)

========================================
Recommended Settings for Open MPI 3.x.x
========================================
Note: The following instructions assume that Open MPI is installed using
default locations. Refer to Open MPI documentation at open-mpi.org for
further details on parameters and general instructions.

Note: There is more than one way to specify MCA parameters in Open MPI.
Please visit this link and use the best method for your environment:
  http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

Required parameters for the mpirun command: -mca btl openib,self,vader
This selects openib (OpenFabrics device), send-to-self semantics, and
shared memory (vader).

  -mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128
Set the receive queue sizes. This is especially useful for interoperation
between iWARP RDMA vendors, because the queue sizes can differ per vendor
in the file "<path>openmpi/mca-btl-openib-device-params.ini".

  -mca oob ^ud
  Do not use UD QPs

Example mpirun command line:
  mpirun -np <number of processes per node> -hostfile <pathto>mpd.hosts_ompi
    --map-by node --allow-run-as-root --display-map -v -tag-output
    -mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128
    -mca btl openib,self,vader
    -mca mpi_leave_pinned 0
    -mca oob ^ud
    <path>/openmpi_benchmarks/3.x.x/benchmark [optional_parameters]

Note: mpd.hosts_ompi is a text file with a list of the nodes' qualified
hostnames or IP addresses and "slots=<total number of logical cores per
node>", one per line, in the MPI ring. The slots parameter is required
for <total number of logical cores per node> greater than 72. Refer
to Open MPI documentation for more details.

Note: Underscores are not allowed in hostnames.
  Example:
  QA0094-1-0 slots=72
  QA0096-1-0 slots=72

Recommended optional_parameters for IMB-MPI1 benchmark:
  -time 1000000 (specifies that a benchmark will run at most that
  many seconds per message size)


================================================================================


Known Issues/Troubleshooting
----------------------------


* You may experience a kernel crash using OFED 3.18-3 under heavy load.
This is fixed in the upstream kernel by commit dafb558717.


Incompatible Drivers in initramfs
---------------------------------

There may be incompatible drivers in the initramfs image. You can either
update the image or remove the drivers from initramfs.

Specifically look for i40e, ib_addr, ib_cm, ib_core, ib_mad, ib_sa, ib_ucm,
ib_uverbs, iw_cm, rdma_cm, rdma_ucm in the output of the following command:
  lsinitrd |less
If you see any of those modules, rebuild the initramfs with the following
command, including the names of those modules in the quoted list. Below
is an example:
  dracut --force --omit-drivers "i40e ib_addr ib_cm ib_core ib_mad ib_sa
  ib_ucm ib_uverbs iw_cm rdma_cm rdma_ucm"
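
A quick filter (a hypothetical helper, not part of any package) can
narrow the lsinitrd listing down to the module names of interest; an
empty result means none of them are present:

```shell
#!/bin/sh
# Extract RDMA-related module names from a listing such as lsinitrd output.
filter_rdma_modules() {
    grep -oE 'i40e|ib_[a-z]+|iw_cm|rdma_[a-z]+' | sort -u
}
```

For example: lsinitrd | filter_rdma_modules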


================================================================================


Support
-------
For general information, go to the Intel support website at:
http://www.intel.com/support/

or the Intel Wired Networking project hosted by Sourceforge at:
http://sourceforge.net/projects/e1000
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
to e1000-rdma@lists.sourceforge.net.


================================================================================


License
-------

This software is available to you under a choice of one of two
licenses. You may choose to be licensed under the terms of the GNU
General Public License (GPL) Version 2, available from the file
COPYING in the main directory of this source tree, or the
OpenFabrics.org BSD license below:

  Redistribution and use in source and binary forms, with or
  without modification, are permitted provided that the following
  conditions are met:

  - Redistributions of source code must retain the above
    copyright notice, this list of conditions and the following
    disclaimer.

  - Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials
    provided with the distribution.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================================================


Trademarks
----------
Intel and Itanium are trademarks or registered trademarks of Intel Corporation
or its subsidiaries in the United States and/or other countries.

* Other names and brands may be claimed as the property of others.


