				       README Notes
				Broadcom bnxtroce VMware Driver

				  Broadcom Inc.
				15101 Alton Parkway,
				   Irvine, CA 92618

			  Copyright (c) 2024 Broadcom Limited
				   All rights reserved


Introduction
============
This file describes the bnxtroce VMware RoCE driver for the Broadcom NXE Ethernet
Network Controllers.

Limitations
===========
- The current version of the driver runs on ESXi 7.0/8.0 GA and their respective Update builds.
- The current version of the driver does not support NVMe over RDMA. This feature will be
  enabled in subsequent releases.
- Error recovery is not supported in the current version; the driver cannot recover from a
  firmware crash or error.

Enable RDMA in NVM
===================
Set NVM configuration options 161 and 506 to 1 to enable RDMA on the controller.
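For example, the options can be set with the Broadcom bnxtnvm utility (a hypothetical
invocation; the exact option syntax and device name depend on the tool version, and NVM
configuration changes take effect only after a reboot):

      # bnxtnvm -dev=vmnic0 setoption=161#1
      # bnxtnvm -dev=vmnic0 setoption=506#1
      # reboot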

BNXT RoCE Driver Dependencies
==============================
The BNXTNET L2 driver must be installed with the 'disable_roce=0' module parameter before
installing this driver. By default, disable_roce is set to 0 on ESXi 7.0 and 8.0.
Customers who want to use RoCE support are recommended to upgrade to compatible L2 and
RoCE drivers; otherwise, RoCE functionality may not work as expected.
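For example, the parameter can be set and verified with standard esxcfg-module usage
(a driver reload or host reboot is required for the setting to take effect):

      # esxcfg-module -s "disable_roce=0" bnxtnet
      # esxcfg-module -g bnxtnet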

BNXT RoCE Driver Installation
=============================
1. Copy the <bnxtroce>-<driver version>.vib file to /var/log/vmware
2. $ cd /var/log/vmware
3. $ esxcli software vib install --no-sig-check -v <bnxtroce>-<driver version>.vib
4. Reboot the machine
5. Verify that the driver is correctly installed:
6. $ esxcli software vib list | grep bnxtroce
7. To disable ECN (enabled by default) for RoCE traffic, use the "tos_ecn=0" module parameter for bnxtroce.
8. To enable delayed arming (disabled by default), which some applications need in order to work in event mode, use the "delayed_arm=1" module parameter (see the example below).
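
For example, both parameters can be passed in a single esxcfg-module invocation; the settings
take effect after a reboot, after which the RDMA devices can be listed with the standard
esxcli rdma namespace:

      # esxcfg-module -s "tos_ecn=0 delayed_arm=1" bnxtroce
      # reboot
      # esxcli rdma device list

Note that esxcfg-module -s replaces the module's entire option string, so all desired
parameters must be specified together.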

PFC / ECN configuration:
========================
   1. VLAN-PCP based PFC only
   2. DSCP based PFC + CC mode
   3. DSCP based CC only
   4. DSCP based PFC only
   5. VLAN-PCP based PFC + CC mode - currently not supported in ESX.

   Common settings:
   ----------------
      a. Enable FW based DCBx for all modes listed below.
            #esxcfg-module  -s "disable_dcb=0 enable_host_dcbd=0" bnxtnet
      b. Crosscheck the DCBx settings:
            Ex: #esxcli network  nic dcb status get -n vmnic2
      c. If the switch does not support RoCE-App-TLV, specify the same PRI via the module parameter.
            Ex: #esxcfg-module  -s "roce_pri=3" bnxtnet
      d. Enable RoCE-App-TLV & ECN on the external switch ports for all modes listed below.
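      Note that esxcfg-module settings take effect only after a driver reload or host reboot,
      and the currently configured options can be cross-checked with:
            #esxcfg-module -g bnxtnet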

   1. VLAN-PCP based PFC only:
   ---------------------------
      a. Do not enable DSCP
            #esxcfg-module  -s "dscp_trust=-1" bnxtroce
                   (OR)
            #esxcfg-module  -s "" bnxtroce
      b. Enable PFC on the switch.
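      For example, on a Mellanox Onyx switch PFC can be enabled as follows (a sketch; the exact
      commands vary with the switch OS version, and priority 3 is assumed to carry RoCE traffic):
            switch (config) # dcb priority-flow-control enable force
            switch (config) # dcb priority-flow-control priority 3 enable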

   2. DSCP based PFC + CC mode:
   ----------------------------
      a. Configure DSCP with the desired value
            #esxcfg-module  -s "dscp_trust=48" bnxtroce
      b. Check and configure the DSCP-to-PRI mapping and the DSCP of CNP using the key-val interface.
         Ex:
            /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -i "key_val_vmrdma0/broadcom" -k dscp_to_pri_map -g
            /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -i "key_val_vmrdma0/broadcom" -k dscp_to_pri_map -s "25 5"
      c. Check and configure the CNP settings.
         Ex:
           #esxcli bnxtroce cnp get -n vmrdma0
           #esxcli bnxtroce cnp  set -p 3 -C 1 -P 34 -E 0x1 -n vmrdma0
      d. Enable DSCP, PFC & CC / ECN on the external switch ports.
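      For example, on a Mellanox Onyx switch ECN can be enabled per traffic class (a sketch derived
      from the 'no' form shown in #4 below; the port, traffic class, and thresholds are placeholders):
            switch (config) # interface ethernet 1/5 traffic-class 1 congestion-control ecn minimum-absolute 150K maximum-absolute 1500K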

   3. DSCP based CC only:
   ----------------------
         The same configuration as mentioned in #2, but disable PFC on the external switch ports.
         For example, Mellanox switch command: no dcb priority-flow-control enable force

   4. DSCP based PFC only:
   -----------------------
         The same configuration as mentioned in #2, but disable CC / ECN on the external switch ports.
         For example, Mellanox switch command: no interface ethernet 1/5 traffic-class 1 congestion-control

Known Issues:
-------------
1. A maximum of 512 QPs can be created (vRDMA limit).
2. Applications like NFS-RDMA and vMotion are not yet supported (vRDMA issue).
3. The vrdma driver on the ESXi host should be reloaded after host boot-up or a bnxtroce driver reload (vRDMA - DCPN: 00042503).
4. The pVRDMA interface is not persistent across reboots (vRDMA issue).
        Workaround:
        -----------
        1. Reload the pvrdma driver in the guest VM each time after reboot,
        2. or add the lines below to the file /etc/rc.d/rc.local in the guest VM:
                modprobe rdma_ucm
                insmod <path to pvrdma.ko>
           and set executable permissions on the file /etc/rc.d/rc.local (see the example below).
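        For example, in the guest VM (assuming a distribution that executes /etc/rc.d/rc.local at boot):
                chmod +x /etc/rc.d/rc.local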
5. A link speed of 1 Gbps will be displayed as 2.5 Gbps, since the minimum link speed in IB is 2.5G, i.e. 'Single Data Rate' (SDR).
6. CTRL-23708, 23536: 'ifconfig down' or an MTU change on the pvrdma interface in the guest VM will leave
   the interface down unless the pvrdma driver is reloaded.
7. A maximum of 256 QPs can be created per VF when 64 VFs are enabled.
8. UD QPs over pvrdma have issues and are not completely supported (DCPN 00044079).
9. The pvrdma max_mr_size is 0x10000000. To test larger message sizes with multiple QPs, it is recommended
    to use one MR per QP; for perftest, this can be done with the --mr_per_qp option (see the example below).
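    For example, a perftest bandwidth run using one MR per QP (a sketch; the QP count and the
    server address are placeholders):
        ib_write_bw -q 8 --mr_per_qp                 (server side)
        ib_write_bw -q 8 --mr_per_qp <server-ip>     (client side)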
10. When NPAR and SR-IOV are enabled with 16 NPAR partitions, RoCE can be supported on a VF only when the number
    of VFs created per partition is less than 6.
    This is because the L2 drivers in the PF and VF will consume all the available completion rings.
    To support RoCE on a partition or VF, at least 2 free completion rings must be available on that partition/VF
    after the L2 driver is loaded.
	Example (1):
	------------
	With 16 PFs and 8 VFs each: RoCE cannot be supported.
		esxcfg-module -s 'max_vfs=8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8 disable_roce=0' bnxtnet

	Example (2):
	------------
	With 16 PFs and 5 VFs each: RoCE can be supported (provided the host has the required number of CPUs).
		esxcfg-module -s 'max_vfs=5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5 disable_roce=0' bnxtnet
11. RoCE test tools like krdma/srdma do not support the operations listed below while those modules are loaded.
    The workaround is to unload the krdma/srdma module before attempting these operations (see the example below).
	A. bnxtnet / bnxtroce driver unload
	B. Link-down
	C. Removing an uplink from a vSwitch (esxcfg-vswitch -U vmnicX vSwitch)
	D. Any other operation that triggers the above. For example, operations like enabling / disabling VFs or
	   changing num_VFs from the host client (GUI) can cause a driver reload.
    Applications like nvmerdma have a well-designed mechanism to handle RDMA driver events (e.g. link down, reload).
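    For example, before a bnxtnet / bnxtroce driver unload (vmkload_mod is the standard ESXi
    module utility; the module name follows the test tool named above):
	vmkload_mod -u krdma
	vmkload_mod -u bnxtroce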
12. A PSOD or QUERY_FUNC HWRM command timeout is observed on 'AMD Rome' platforms with IOMMU IR enabled.
    This issue can be observed on AMD platforms with IOMMU IR enabled where the platform requires >=512 interrupts.
    Symptoms:
	1. bnxtnet driver: No warning messages are seen during driver load, but a few PFs may fail to run I/O because
           interrupts do not get routed to the driver. As a side effect of this, RING_FREE HWRM timeouts can be
	   seen during bnxtnet driver unload, leading to a PSOD.
	2. bnxtroce driver: QUERY_FUNC HWRM timeout messages can be seen in the vmkernel logs during driver load for a few PFs.
    Workaround suggested by VMware:
	Disable IR (interrupt remapping) using the boot option 'iovDisableIR=TRUE' or the esxcli command:
	# esxcli system settings kernel set -s iovDisableIR -v true
	Please contact VMware if any other side effects are observed because of this workaround.
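	The change requires a host reboot to take effect; the current value can be verified with:
	# esxcli system settings kernel list -o iovDisableIR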
13. RoCE v1 is not supported, and the disable_rocev2 module parameter has been deprecated.
    The driver ignores any disable_rocev2 module parameter setting specified by the user and always enables RoCE v2.
    The disable_rocev2 module parameter will be removed in the next release.
14. The warning messages below can be observed on link down during RoCE traffic.
    The L2 driver notifies the RoCE driver of all link-down events. On link down, the
    RoCE driver notifies the stack with VMK_RDMA_EVENT_PORT_ERR and sets the driver
    state to QUIESCE. The stack may still invoke verb calls even after
    VMK_RDMA_EVENT_PORT_ERR has been reported. Users can ignore these warning messages;
    there is no functional impact.
    Example Warning Messages:
    2023-09-20T16:10:14.510Z cpu3:2098457)WARNING: bnxtroce-throttled: bnxtroce_dev_ah_create:901:
	                     [vmrdma1 : 0x4520a7ad0000] Create AH failed, RoCE device is quiesced
    2023-09-20T16:10:14.515Z cpu17:2098460)WARNING: bnxtroce-throttled: bnxtroce_dev_ah_create:901:
	                     [vmrdma1 : 0x4520a7ad0000] Create AH failed, RoCE device is quiesced
