Nak Errors

Introduction

Under certain error conditions, the responder QP's RQ Logic returns a Nak (Negative Ack) rather than a positive Ack, an RDMA Read response, or an Atomic response. The following types of RC-related Naks are currently defined in the specification:

  • PSN Sequence Error Nak. May be retried.

  • Remote Access Error Nak. Results in an error completion and may not be retried.

  • Invalid Request Nak. Results in an error completion and may not be retried.

  • Remote Operational Error Nak. Results in an error completion and may not be retried.

  • Receiver Not Ready (RNR) Nak. May be retried.
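The retry rules in the list above can be captured in a small lookup. This is only an illustrative sketch: the enum name `NakType`, its member names, and the helper `may_retry` are hypothetical, and the string values are labels rather than the on-the-wire Nak codes (those are defined in the tables referenced under "Nak Packet Format").

```python
from enum import Enum


class NakType(Enum):
    """RC-related Nak types listed above (labels only, not wire encodings)."""
    PSN_SEQUENCE_ERROR = "PSN Sequence Error"
    REMOTE_ACCESS_ERROR = "Remote Access Error"
    INVALID_REQUEST = "Invalid Request"
    REMOTE_OPERATIONAL_ERROR = "Remote Operational Error"
    RNR = "Receiver Not Ready"


# Only these two Nak types may be retried by the requester; the other
# three result in an error completion.
RETRYABLE = frozenset({NakType.PSN_SEQUENCE_ERROR, NakType.RNR})


def may_retry(nak: NakType) -> bool:
    """Return True if the requester is allowed to retry after this Nak."""
    return nak in RETRYABLE
```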

Nak Packet Format

A Nak packet consists of an Ack packet (see Figure 17-6 on page 369) containing a Nak error code (see Table 17-3 on page 370 and Table 17-4 on page 371).

General Rules

The following rules apply to Nak errors:

  • When Nak'ing an RDMA Read:

    - An Ack packet containing the appropriate Nak error code is substituted for the RDMA Read response packet being Nak'd.

    - The Nak packet's PSN is the one that would have been in the corresponding RDMA Read response packet if it had been returned.

    - None of the bad data payload is returned.

  • When generating an RNR (Receiver Not Ready) Nak, the Nak packet PSN = PSN of the request packet that the responder QP's RQ Logic is not ready to handle.

  • After the responder returns an RNR Nak, it returns to waiting for the requester to send a request packet that contains the ePSN.

  • The responder must follow these rules:

    - Once a Nak is returned indicating a PSN Sequence Error, the responder QP's RQ Logic ignores all subsequently received new requests until it receives a valid request with a PSN = ePSN. In the interim, however, it does respond to duplicate requests.

    - The responder must not return any other Nak packets, except in response to a valid request packet with a PSN = ePSN.

    - The responder must continue to respond to duplicate requests. However, it must not return a Nak in response to an error condition that may occur while it's processing a duplicate request.
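The responder rules above can be sketched as a small state machine. This is a simplified model, not an implementation: the class `ResponderSketch`, its method, and the returned action strings are hypothetical; every request is assumed to consume exactly one PSN; and the caller is assumed to have already decided whether a packet falls in the duplicate range.

```python
PSN_MASK = (1 << 24) - 1  # PSNs are 24-bit values and wrap around


class ResponderSketch:
    """Tracks whether a PSN Sequence Error Nak is already outstanding."""

    def __init__(self, epsn: int) -> None:
        self.epsn = epsn
        self.seq_nak_sent = False  # True until a request with PSN == ePSN arrives

    def on_request(self, psn: int, is_duplicate: bool) -> str:
        if psn == self.epsn:
            # Valid request: normal processing resumes (assumes one PSN per request).
            self.seq_nak_sent = False
            self.epsn = (self.epsn + 1) & PSN_MASK
            return "execute"
        if is_duplicate:
            # Duplicates are still answered, but never with a Nak.
            return "respond_to_duplicate"
        if self.seq_nak_sent:
            # A sequence error Nak was already returned: ignore new requests.
            return "drop"
        self.seq_nak_sent = True
        return "nak_psn_sequence_error"
```

With an ePSN of 10, a request carrying PSN 12 is Nak'd once, PSN 13 is then silently dropped, and processing resumes only when PSN 10 arrives.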

PSN Sequence Error Nak

Before Transmitting PSN Sequence Error Nak

Any request packets received prior to the errant request must be executed, completed, and responded to before the Nak is issued. This is important since the Nak effectively coalesces responses to earlier outstanding requests and acts as an implicit response for prior outstanding Sends, RDMA Writes, Atomic operations or RDMA Read requests.

Reason for the PSN Sequence Error Nak

A PSN Sequence Error Nak is returned when the responder QP's RQ Logic detects a request packet with a PSN that is neither equal to the ePSN nor within the duplicate packet range (see Figure 17-16 on page 395). The PSN returned in the Nak packet = the responder's ePSN value.

When Nak'ing an RDMA Read, an Ack packet containing the PSN Sequence Error Nak is substituted for the RDMA Read response packet being Nak'd. The Nak packet's PSN is the one that would have been in the corresponding RDMA Read response packet if it had been returned.
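The three-way decision described above (expected, duplicate, or out of sequence) can be sketched with modular arithmetic. The sketch assumes a 24-bit circular PSN space in which the duplicate range is the 2^23 PSNs immediately preceding the ePSN; the referenced figure remains the authoritative definition of that range. The function name and return strings are hypothetical.

```python
PSN_BITS = 24
PSN_MASK = (1 << PSN_BITS) - 1
HALF_RANGE = 1 << (PSN_BITS - 1)  # assumed size of the duplicate range


def classify_psn(psn: int, epsn: int) -> str:
    """Classify an inbound request PSN relative to the responder's ePSN."""
    delta = (psn - epsn) & PSN_MASK  # distance ahead of ePSN, modulo 2^24
    if delta == 0:
        return "expected"
    if delta >= HALF_RANGE:
        # The PSN lies behind the ePSN: treat it as a duplicate request.
        return "duplicate"
    # The PSN lies ahead of the ePSN: one or more requests were missed.
    return "sequence_error"
```

In the `sequence_error` case, the Nak that is returned carries the responder's ePSN, as stated above.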

RQ Logic Behavior after Returning PSN Sequence Error Nak

The responder QP's RQ Logic behavior after returning a PSN Sequence Error Nak is as follows:

  • The sequence error has no impact on the responder QP's RQ and no RQ WQEs are used. After returning the Nak, the RQ Logic resumes waiting for an inbound request packet with the correct ePSN.

  • After returning the Nak, the RQ Logic discards all subsequently received request packets with a PSN that is not equal to the ePSN (except for valid duplicate requests).

If the responder receives any duplicate requests, they are handled as described earlier in “Effects of Retry on Requester and Responder” on page 393.

Requester's Reaction on Receipt of PSN Sequence Error Nak

Upon receipt of a PSN Sequence Error Nak, the requester QP's SQ Logic rewinds its request packet output pointer to at least the point of failure and begins resending request packets from that point forward. It is also legal to start resending from a point earlier than the point of failure; the responder QP's RQ Logic treats those earlier request packets as duplicate requests.

SQ Logic Retries until Successful or Exhausted
Before Exhaustion

Before its Retry Count is exhausted, the requester QP's SQ Logic behaves as follows each time that it receives a PSN Sequence Error Nak for a given request:

- The SQ Logic decrements its 3-bit Retry Counter each time it receives a PSN Sequence Error Nak for a given request packet. Assuming that the count is not exhausted, the SQ Logic starts retransmitting request packets from the point to which it has rewound.

- The counter is reloaded with its initial retry count whenever a given outstanding request is cleared (by receiving a proper response).

On Exhaustion

There are three possible scenarios:

  1. Automatic Path Migration (APM) is supported and the Retry Count is exhausted. In this case, the CA must attempt migration. After migration, the requester QP's SQ Logic reloads its Retry Counter and begins the process over again.

  2. APM is supported and the migration has already occurred, and then Retry Count exhaustion is experienced again.

  3. APM is not supported and the Retry Count has been exhausted.

In cases 2 and 3, the SQ Logic takes the following actions:

- Retires the SQ WQE and creates a SQ CQE reporting a locally detected “Transport Retry Counter Exceeded” error.

- The requester QP is transitioned to the Error state to prevent a race condition that could occur if software were to post any further WQEs to the SQ before it (i.e., software) discovers that the error has occurred.

- All WQEs that were posted to the SQ after the failed WQE are retired and corresponding SQ CQEs are created indicating that they have “Completed - Flushed in Error” status.
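The retry and exhaustion behavior described above can be sketched as follows. The class, field, and action names are hypothetical. One assumption is made explicit: the count is treated as exhausted when the counter reaches zero after a decrement, which is one reading of the text; an implementation may count the permitted retries differently.

```python
from dataclasses import dataclass, field


@dataclass
class SeqErrorRetrySketch:
    initial_retry_count: int            # 3-bit Retry Count, 0..7
    apm_supported: bool
    migrated: bool = False              # True once path migration was attempted
    retry_counter: int = field(init=False)

    def __post_init__(self) -> None:
        if not 0 <= self.initial_retry_count <= 7:
            raise ValueError("Retry Count is a 3-bit value (0..7)")
        self.retry_counter = self.initial_retry_count

    def on_proper_response(self) -> None:
        # An outstanding request was cleared: reload the counter.
        self.retry_counter = self.initial_retry_count

    def on_psn_sequence_nak(self) -> str:
        if self.retry_counter > 0:
            self.retry_counter -= 1
        if self.retry_counter > 0:
            return "retransmit"
        if self.apm_supported and not self.migrated:
            # Scenario 1: migrate, reload the counter, and start over.
            self.migrated = True
            self.retry_counter = self.initial_retry_count
            return "migrate_then_retransmit"
        # Scenarios 2 and 3: error CQE, QP to Error state, flush later WQEs.
        return "error_transport_retry_counter_exceeded"
```

For example, with an initial count of 2 and APM supported, four consecutive Naks yield: retransmit, migrate, retransmit, error.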

Remote Access Error Nak

Before Transmitting Remote Access Error Nak

Any request packets received prior to the errant request must be executed, completed, and responded to before the Nak is issued. This is important since the Nak effectively coalesces responses to earlier outstanding requests and acts as an implicit response for prior outstanding Sends, RDMA Writes, Atomic operations, or RDMA Read requests.

Reason for the Remote Access Error Nak

The responder QP's RQ Logic returns a Remote Access Error Nak when any of the following conditions is detected for an RDMA Read, an RDMA Write, or an Atomic operation:

  • The request packet's R_Key field is invalid.

  • The virtual memory start address (VA), the transfer length, or the type of access (read or write) is not permitted using the specified R_Key.

When the RQ Logic returns a Remote Access Error Nak, the Nak packet's PSN must be the same as the PSN of the request packet that caused the Remote Access Error.
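A minimal sketch of the checks listed above, assuming the responder keeps a table of registered regions indexed by R_Key. The `RegionSketch` type, its fields, and the function name are hypothetical and only model read and write access; real CAs perform these checks in hardware against their own memory translation and protection tables.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RegionSketch:
    base_va: int       # first virtual address covered by the R_Key
    length: int        # number of bytes covered
    allow_read: bool   # remote read permitted
    allow_write: bool  # remote write permitted


def remote_access_ok(regions: Dict[int, RegionSketch], r_key: int,
                     va: int, length: int, access: str) -> bool:
    """Return False when a Remote Access Error Nak would be returned."""
    region = regions.get(r_key)
    if region is None:
        return False  # invalid R_Key
    end = region.base_va + region.length
    if va < region.base_va or va + length > end:
        return False  # start address or transfer length not permitted
    if access == "read":
        return region.allow_read
    if access == "write":
        return region.allow_write
    return False      # access types not modeled here are rejected
```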

RQ Logic Behavior after Returning Remote Access Error Nak

The responder QP's RQ Logic behavior after returning a Remote Access Error Nak is as follows:

  • The responder does not update its ePSN.

  • The responder QP transitions to the Error state.

  • Any new inbound request packets are dropped.

  • There are two cases:

    1. If the current request does not use a RQ WQE (i.e., it is not a Send or an RDMA Write With Immediate), then the responder QP generates an Affiliated Asynchronous Error (see “Affiliated Asynchronous Errors” on page 293) and the Affiliated Asynchronous Event Handler is called (see “Registering a Handler” on page 292).

    2. If the current request does use a RQ WQE (i.e., it is a Send or an RDMA Write With Immediate), then:

      - The currently active RQ WQE is retired.

      - A RQ CQE is created reporting the error completion status as indicated in Table 17-5 on page 406.

      - All WQEs that were posted to the RQ after the failed WQE are retired and corresponding RQ CQEs are created indicating that they have “Completed - Flushed in Error” status.

It should be noted that some of the WQEs posted to the SQ after the failed WQE may have begun execution and their respective request packet(s) may have already been transmitted and may even have been executed and completed by the responder QP's RQ Logic. This possibility cannot be prevented, so the responder QP's local state must be considered unknown.

Requester's Reaction on Receipt of Remote Access Error Nak

The requester QP's SQ Logic is not permitted to retry a request packet that results in this error. Upon receipt of a Remote Access Error Nak, the requester QP's SQ Logic takes the following actions:

  • It retires the SQ WQE and creates a SQ CQE reporting a “Processing-Remote Access” error.

  • The requester QP is transitioned to the Error state to prevent a race condition that could occur if software were to post any further WQEs to the SQ before it (i.e., software) discovers that the error has occurred.

  • All WQEs that were posted to the SQ after the failed WQE are retired and corresponding SQ CQEs are created indicating that they have “Completed - Flushed in Error” status.
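The three requester actions above can be sketched as a single function. The function name, the tuple layout of a CQE, and the string statuses are hypothetical; `sq_wqes` is assumed to hold the outstanding SQ WQEs in posting order, with any WQEs ahead of the failed one already completed.

```python
from typing import List, Tuple


def complete_after_fatal_nak(sq_wqes: List[str], failed_index: int,
                             error_status: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the new QP state and the CQEs generated after a non-retryable Nak."""
    # The failed WQE is retired with the error reported by the Nak.
    cqes = [(sq_wqes[failed_index], error_status)]
    # Every WQE posted after it is retired as flushed.
    cqes += [(wqe, "Completed - Flushed in Error")
             for wqe in sq_wqes[failed_index + 1:]]
    # The QP moves to the Error state so that later posts cannot race the error.
    return "Error", cqes
```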

Invalid Request Nak

Before Transmitting Invalid Request Nak

Any request packets received prior to the errant request must be executed, completed, and responded to before the Nak is issued. This is important since the Nak effectively coalesces responses to earlier outstanding requests and acts as an implicit response for prior outstanding Sends, RDMA Writes, Atomic operations, or RDMA Read requests.

Reason for Invalid Request Nak

The PSN returned in the Nak packet must be the responder's ePSN (i.e., the PSN of the invalid request). The responder QP's RQ Logic returns the Invalid Request Nak under the following circumstances:

  • The BTH:Opcode is not supported by the responder.

  • An RDMA request packet was received and the responder QP doesn't support RDMAs.

  • The BTH:Opcode is reserved.

  • A Send request was received with a length that exceeds the local memory buffer space defined by the top RQ WQE's Scatter Buffer List.

  • An invalid Opcode sequence was detected. For example, the previously received request packet was a “middle” and the current request packet is a “first.”

  • The virtual address in an Atomic request packet must be a quadword-aligned address. An Invalid Request Nak is returned if it isn't.

  • If the BTH:Opcode indicates that this is a “first” or a “middle” request packet, then BTH:PadCnt must be 00b, indicating no pad bytes are present. If the pad count bits are nonzero, an Invalid Request Nak is returned.

  • If the request is an RDMA Read or an Atomic request and the responder QP's RQ Logic has insufficient space in the queue into which it latches this type of operation, an Invalid Request Nak is returned.

  • For an RDMA Write request, the responder QP's RQ Logic may optionally check the RETH:DMALen field to ensure that it does not specify a transfer length greater than 2^31 bytes. It may also, at the end of the transfer, verify that the sum of the packet payloads equals the specified DMALen. If either check fails, the responder may treat the request as an invalid request.

  • For an inbound RDMA Read request, the DMALen field is checked. If the request is for more than 2^31 bytes, an Invalid Request Nak is returned.
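Three of the checks in the list above lend themselves to a short sketch: the pad count rule, Atomic address alignment, and the RDMA Read length limit. The function and parameter names are hypothetical. The sketch assumes that a quadword is 8 bytes and that the length limit is 2^31 bytes; the remaining conditions in the list depend on responder state and are not modeled.

```python
from typing import Optional

MAX_DMA_LEN = 1 << 31   # assumed limit: 2^31 bytes
QUADWORD_BYTES = 8      # assumed quadword size


def invalid_request_reason(position: str, pad_cnt: int,
                           atomic_va: Optional[int] = None,
                           rdma_read_len: Optional[int] = None) -> Optional[str]:
    """Return a reason for an Invalid Request Nak, or None if these checks pass.

    position is "first", "middle", "last", or "only".
    """
    if position in ("first", "middle") and pad_cnt != 0:
        return "nonzero PadCnt in a first or middle packet"
    if atomic_va is not None and atomic_va % QUADWORD_BYTES != 0:
        return "Atomic address is not quadword-aligned"
    if rdma_read_len is not None and rdma_read_len > MAX_DMA_LEN:
        return "RDMA Read DMALen exceeds the maximum"
    return None
```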

RQ Logic Behavior after Returning Invalid Request Nak

After returning the Invalid Request Nak, the responder QP's RQ Logic behaves as follows:

  • The request is not executed.

  • The responder does not update its ePSN.

  • The responder QP transitions to the Error state.

  • Any new inbound request packets are dropped.

  • There are two cases:

    1. If the current request does not use a RQ WQE (i.e., it is not a Send or an RDMA Write With Immediate), then the QP generates an Affiliated Asynchronous Error (see “Affiliated Asynchronous Errors” on page 293) and the Affiliated Asynchronous Event Handler is called (see “Registering a Handler” on page 292).

    2. If the current request does use a RQ WQE (i.e., it is a Send or an RDMA Write With Immediate), then:

      - The currently active RQ WQE is retired.

      - A RQ CQE is created reporting the error completion status as indicated in Table 17-5 on this page.

      - All WQEs that were posted to the RQ after the failed WQE are retired and corresponding RQ CQEs are created indicating that they have “Completed - Flushed in Error” status.

It should be noted that some of the WQEs posted to the SQ after the failed WQE may have begun execution and their respective request packet(s) may have already been transmitted and may even have been executed and completed by the responder QP's RQ Logic. This possibility cannot be prevented, so the responder QP's local state must be considered unknown.

Table 17-5. RQ CQE Completion Status

Completion Status: Local Length Error
Cause: The sum of the Scatter Buffer lengths is too small to receive a valid incoming Send message.

Completion Status: Remote Invalid Request Error
Cause (any of the following):

  • The operation is not supported by this RQ (e.g., an RDMA or Atomic operation).

  • The RQ Logic has insufficient buffer space to accept an incoming RDMA Read or Atomic request.

  • The RETH:DMALen specified in an RDMA request is greater than 2^31 bytes.

  • In an RDMA access or an Atomic operation the size of the local memory region or memory window identified by the R_Key is insufficient to handle the request. The number of bytes transferred into the buffer is indeterminate. However, the CA must not write beyond the region or window bounds.

Completion Status: Remote Access Error
Cause: A protection error (bad R_Key, or a read/write violation) occurred on a remote data buffer to be read by an RDMA Read, written to by an RDMA Write, or accessed by an Atomic operation. This error is reported only on RDMA operations or Atomic operations.

Requester's Reaction on Receipt of Invalid Request Nak

The requester QP's SQ Logic is not permitted to retry a request packet that results in this error. Upon receipt of an Invalid Request Nak, the requester QP's SQ Logic takes the following actions:

  • It retires the SQ WQE and creates a SQ CQE reporting a “Processing-Remote Invalid Request” error.

  • The requester QP is transitioned to the Error state to prevent a race condition that could occur if software were to post any further WQEs to the SQ before it (i.e., software) discovers that the error has occurred.

  • All WQEs that were posted to the SQ after the failed WQE are retired and corresponding SQ CQEs are created indicating that they have “Completed - Flushed in Error” status.

Remote Operational Error Nak

A remote operational error occurs when the responder QP's RQ Logic encounters a situation that prevents its RQ from completing the current request. The error conditions detectable by the responder and reportable as a Remote Operational Error are implementation-specific. Remote operational errors cannot be caused by anything the requester may have done. Rather, they reflect a problem within the responder.

Before Transmitting Remote Operational Error Nak

Any request packets received prior to the errant request must be executed, completed, and responded to before the Nak is issued. This is important since the Nak effectively coalesces responses to earlier outstanding requests and acts as an implicit response for prior outstanding Sends, RDMA Writes, Atomic operations or RDMA Read requests.

Reason for Remote Operational Error Nak

The Nak packet's BTH:PSN must contain the PSN of the request packet being executed at the time that the responder QP's RQ Logic detected the operational error.

Possible causes include:

  • The responder QP's RQ Logic detected a malformed RQ WQE while processing the current request packet.

  • The responder QP's RQ Logic detected a QP-related error while executing the current request packet. The error prevented the responder from completing the request.

RQ Logic Behavior after Returning Remote Operational Error Nak

After returning the Remote Operational Error Nak, the responder QP's RQ Logic behaves as follows:

  • The request is not executed.

  • The responder QP transitions to the Error state.

  • Any new inbound request packets are dropped.

  • The currently active RQ WQE is retired.

  • A RQ CQE is created reporting the error completion status as an interface type Internal Consistency error. This indicates that an internal QP consistency error was detected while processing this WQE.

  • All WQEs that were posted to the RQ after the failed WQE, as well as all WQEs currently posted to the SQ, are retired and corresponding CQEs are created indicating that they have “Completed - Flushed in Error” status.

Requester's Reaction on Receipt of Remote Operational Error Nak

The requester QP's SQ Logic is not permitted to retry a request packet that results in this error. Upon receipt of a Remote Operational Error Nak, the requester QP's SQ Logic takes the following actions:

  • It retires the SQ WQE and creates a SQ CQE reporting a “Processing-Remote Operation Error.”

  • The requester QP is transitioned to the Error state to prevent a race condition that could occur if software were to post any further WQEs to the SQ before it (i.e., software) discovers that the error has occurred.

  • All WQEs that were posted to the SQ after the failed WQE are retired and corresponding SQ CQEs are created indicating that they have “Completed - Flushed in Error” status.

Receiver Not Ready (RNR) Nak

Reason for RNR Nak

On some occasions, the responder QP's RQ Logic may be temporarily unable to accept an inbound request packet:

  • As an example, the inbound message may be a Send or an RDMA Write With Immediate, both of which require a RQ WQE to handle the inbound message. The absence of a RQ WQE would render the RQ Logic unable to handle the inbound message.

  • It is also possible that the responder may be temporarily unable to handle other types of requests, as well.

Under such circumstances, the RQ Logic is permitted to issue an RNR Nak in response to the current request packet. Note that an RNR Nak should be issued only rarely.

RNR Nak Packet Contains Minimum Retry Delay Period
RNR Nak Packet Format

When returning an RNR Nak, the Nak packet's PSN is the PSN of the request packet being RNR Nak'd.

Refer to Figure 17-6 on page 369 and Table 17-3 on page 370. When the Nak packet's Syndrome field indicates that an RNR Nak is being returned, the lower five bits of the Syndrome contain a Timer value (see Table 17-6 on page 411). This is the minimum amount of time that the requester QP's SQ Logic must wait before retransmitting (i.e., retrying) the current request packet.

Timeout Supplied During QP Setup

During QP setup, software supplies the QP Context with the timeout value to be returned when an RNR Nak is issued due to the lack of RQ WQE to handle the current request. If the RNR Nak is being returned for any other reason, the timeout value returned in the Nak packet's Syndrome field is implementation-specific.

On QP Setup, RNR Nak Retry Count Is Set

During the connection establishment process, the CMs in the two CAs exchange RNR Nak Retry Counts in the REQ and REP messages. This value is stored in the QP Context of each of the two QPs. Note that a count of 7d (111b) indicates infinite retries.

After Delay, Requester May Retransmit the Request Packet

On receiving an RNR Nak, the requester may, after waiting for at least the interval specified in the RNR Nak, retry the same request packet (with the same PSN returned by the responder in the RNR Nak packet). If the requester fails to wait for at least the indicated time interval before retransmitting the same request packet, the responder may silently drop the packet.

During Timeout, RQ Logic Prepares to Receive Request

After issuing the RNR Nak packet containing the minimum RNR Retry timeout, the RQ Logic takes action to prepare for the retransmission of the request packet. As an example, the RNR Nak may have been issued in response to an incoming Send or RDMA Write With Immediate request operation. Both of these operations require the use of a RQ WQE. If none is currently present in the RQ, the RNR Nak is transmitted and the RQ Logic takes action to get a WR posted to the RQ to handle the operation about to be retried. Typically, this would consist of requesting that a software application associated with the CA post a WR to the QP's RQ.

RQ Logic Behavior after Returning RNR Nak

The responder QP's RQ Logic behavior after returning an RNR Nak is as follows:

  • There is no impact on the responder QP's RQ and no RQ WQEs are used. After returning the Nak, the RQ Logic resumes waiting for an inbound request packet with the correct ePSN.

  • After returning the Nak, the RQ Logic discards all subsequently received request packets with a PSN that is not equal to the ePSN (except for valid duplicate requests).

If the responder receives any duplicate requests, they are handled as described earlier in “Effects of Retry on Requester and Responder” on page 393.

Requester Receives RNR Nak and Retries Request

When an RNR Nak response is received, the requester QP's SQ Logic checks its RNR Nak Retry Counter. A value of seven indicates that infinite retries are permitted, and the counter is left unchanged. For any other value, the requester decrements the RNR Nak Retry Counter. If the Retry Counter is still non-zero, the requester may reissue the request.

If, when the request is retried, a proper response (rather than an RNR Nak) is received, the Retry Counter is reloaded with its initial value.

When Requester Has Exhausted Retries

If the requester has exhausted the RNR Nak Retry Counter and still has not successfully sent the request packet in question, it takes the following actions:

  • It retires the SQ WQE and creates a SQ CQE reporting a “Processing-RNR Retry Counter Exceeded” error.

  • The requester QP is transitioned to the Error state to prevent a race condition that could occur if software were to post any further WQEs to the SQ before it (i.e., software) discovers that the error has occurred.

  • All WQEs that were posted to the SQ after the failed WQE are retired and corresponding SQ CQEs are created indicating that they have “Completed - Flushed in Error” status.
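The RNR retry accounting described above can be sketched as follows. The class, method, and action names are hypothetical. The sketch follows the text literally: a counter value of seven is never decremented, and any other value is decremented and then tested for non-zero. The requester is assumed to wait at least the delay from the RNR Nak before acting on `retry_after_delay`.

```python
INFINITE_RETRIES = 7  # a count of 111b means retries are unlimited


class RnrRetrySketch:
    def __init__(self, initial_count: int) -> None:
        if not 0 <= initial_count <= 7:
            raise ValueError("RNR Nak Retry Count is a 3-bit value (0..7)")
        self.initial_count = initial_count
        self.counter = initial_count

    def on_rnr_nak(self) -> str:
        if self.counter == INFINITE_RETRIES:
            return "retry_after_delay"       # counter is left untouched
        if self.counter > 0:
            self.counter -= 1
        if self.counter > 0:
            return "retry_after_delay"
        # Exhausted: error CQE, QP to Error state, later WQEs flushed.
        return "error_rnr_retry_counter_exceeded"

    def on_proper_response(self) -> None:
        # A retried request finally succeeded: reload the initial value.
        self.counter = self.initial_count
```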

Table 17-6. RNR Nak's Syndrome Timer Encoding

Value (binary)    Delay
00000             655.36 ms
00001             0.01 ms
00010             0.02 ms
00011             0.03 ms
00100             0.04 ms
00101             0.06 ms
00110             0.08 ms
00111             0.12 ms
01000             0.16 ms
01001             0.24 ms
01010             0.32 ms
01011             0.48 ms
01100             0.64 ms
01101             0.96 ms
01110             1.28 ms
01111             1.92 ms
10000             2.56 ms
10001             3.84 ms
10010             5.12 ms
10011             7.68 ms
10100             10.24 ms
10101             15.36 ms
10110             20.48 ms
10111             30.72 ms
11000             40.96 ms
11001             61.44 ms
11010             81.92 ms
11011             122.88 ms
11100             163.84 ms
11101             245.76 ms
11110             327.68 ms
11111             491.52 ms
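The timer encoding in Table 17-6 can be applied with a simple lookup on the lower five bits of the Syndrome field. The tuple and function names below are hypothetical; the delay values are copied from the table, and the upper Syndrome bits (which identify the packet as an RNR Nak) are simply masked off rather than interpreted.

```python
# Minimum RNR retry delay in milliseconds, indexed by the 5-bit Timer value.
RNR_TIMER_MS = (
    655.36, 0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.12,
    0.16, 0.24, 0.32, 0.48, 0.64, 0.96, 1.28, 1.92,
    2.56, 3.84, 5.12, 7.68, 10.24, 15.36, 20.48, 30.72,
    40.96, 61.44, 81.92, 122.88, 163.84, 245.76, 327.68, 491.52,
)


def rnr_min_delay_ms(syndrome: int) -> float:
    """Return the minimum wait, in ms, encoded in an RNR Nak's Syndrome field."""
    return RNR_TIMER_MS[syndrome & 0x1F]  # keep only the lower five bits
```

Note that the all-zeros encoding selects the longest delay (655.36 ms), not the shortest.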
