VNX LUN Trespassing
Brief description:
LUNs on a storage system are allocated to servers. A storage
admin creates a LUN on a RAID Group or Storage Pool and assigns it to a server.
The platform team discovers the LUN, formats it, mounts it or assigns it a drive
letter, and starts to use it. One important aspect is LUN ownership: which storage
processor (SP) processes the I/O for that specific LUN?
A newly created LUN is accessed through its default SP owner. The
ownership can be changed from one SP to the other. This process is known as
trespassing.
Failover
A procedure by which a system automatically transfers
control to a duplicate system when it detects a fault or failure.
Four failover modes (0 through 3) are described first; Failover Mode 4 (ALUA) is covered afterwards.
Failover Mode 0 – LUN Based Trespass Mode
This failover mode is the default and
works in conjunction with the Auto-trespass feature. Auto-trespass is a mode of
operation that is set on a LUN by LUN basis. If Auto-Trespass is enabled
on the LUN, the non-owning SP will report that the LUN exists and is available
for access. The LUN will trespass to the SP where the I/O request is
sent. Every time the LUN is trespassed a Unit Attention message is
recorded. If Auto-trespass is disabled, the non-owning SP will report
that the LUN exists but it is not available for access.
Failover Mode 1 – Passive Not Ready Mode
In this mode of operation the non-owning
SP will report that all non-owned LUNs exist and are available for
access. Any I/O request that is made to the non-owning SP will be
rejected.
Failover Mode 2 – DMP Mode
In this mode of operation the non-owning
SP will report that all non-owned LUNs exist and are available for access. This
is similar to Failover Mode 0 with Auto-trespass Enabled. Any I/O requests made
to the non-owning SP will cause the LUN to be trespassed to the SP that is
receiving the request.
Failover Mode 3 – Passive Always Ready Mode
In this mode of operation the non-owning
SP will report that all non-owned LUNs exist and are available for
access. Any I/O requests sent to the non-owning SP will be
rejected. This is similar to Failover Mode 1. However, any Test Unit
Ready command sent from the server will return with a success message, even to
the non-owning SP.
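The four modes above can be summarised as a small lookup. The following is a minimal Python sketch, not array code: the names NonOwnerBehavior and non_owner_behavior are hypothetical, and the values only restate what the descriptions above say about the non-owning SP. Where the text does not state an outcome, the sketch says "unspecified" rather than guessing.

```python
from typing import NamedTuple


class NonOwnerBehavior(NamedTuple):
    reports_available: bool   # does the non-owning SP report the LUN as available?
    io_outcome: str           # "trespass", "reject", or "unspecified"
    tur_success_stated: bool  # True only where the text says Test Unit Ready succeeds


def non_owner_behavior(mode: int, auto_trespass: bool = False) -> NonOwnerBehavior:
    """Return how the non-owning SP treats a LUN under failover modes 0-3.

    auto_trespass only matters for mode 0; it is ignored for modes 1-3.
    """
    if mode == 0:
        if auto_trespass:
            return NonOwnerBehavior(True, "trespass", False)
        # The text only says the LUN is reported as not available in this case.
        return NonOwnerBehavior(False, "unspecified", False)
    if mode == 1:
        return NonOwnerBehavior(True, "reject", False)
    if mode == 2:
        return NonOwnerBehavior(True, "trespass", False)
    if mode == 3:
        return NonOwnerBehavior(True, "reject", True)
    raise ValueError(f"unsupported failover mode: {mode}")


for m in (0, 1, 2, 3):
    print(m, non_owner_behavior(m, auto_trespass=True))
```

Written this way, the similarity noted above is easy to see: mode 2 returns the same tuple as mode 0 with Auto-trespass enabled, and mode 3 differs from mode 1 only in the Test Unit Ready flag.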
How does trespassing work using ALUA (Failover Mode 4) on a
VNX/CLARiiON storage system?
Resolution:
Since FLARE 26, Asymmetric Active/Active has provided a new way
for CLARiiON arrays to present LUNs to hosts, eliminating the need for hosts to
deal with the LUN ownership model. Prior to FLARE 26, all CLARiiON arrays used
the standard active/passive presentation, in which one SP "owns"
the LUN and all I/O to that LUN is sent only to that SP. If all paths to that SP
failed, the ownership of the LUN was 'trespassed' to the other SP and the
host-based path management software adjusted the I/O path accordingly.
Asymmetric Active/Active introduces a new initiator Failover Mode
(Failover mode 4) where initiators are permitted to send I/O to a LUN
regardless of which SP actually owns the LUN.
Manual trespass:
When a manual trespass is issued (using Navisphere Manager or the CLI) for a LUN on an SP that is accessed by a host with Failover Mode 1, subsequent I/O for that LUN is rejected by the SP on which the manual trespass was issued. The failover software redirects I/O to the SP that now owns the LUN.
A manual trespass operation changes the ownership of a given LUN from one SP to the other. If the LUN is accessed by an ALUA host (Failover Mode set to 4) and I/O is sent to the SP that does not currently own the LUN, the I/O is redirected. In that situation, the array changes the ownership of the LUN based on how many I/Os the LUN processes on each SP (the threshold is a difference of approximately 64,000 I/Os).
Path, HBA, switch failure:
If a host is configured with Failover Mode 1 and all the paths to the SP that owns a LUN fail, the LUN is trespassed to the other SP by the host’s failover software.
With Failover Mode 4, in the case of a path, HBA, or switch failure, when I/O routes to the non-owning SP, the LUN may not trespass immediately (depending on the failover software on the host). If the host's failover software does not trespass the LUN, FLARE will trespass the LUN to the SP that receives the most I/O requests for that LUN. The array does this by keeping track of how many I/Os a LUN processes on each SP. If the non-optimized SP processes at least 64,000 more I/Os than the optimized SP, the array changes the ownership to the non-optimized SP, making it the optimized one.
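The counting rule described above can be illustrated with a minimal Python sketch. It is a simplified model, not FLARE's implementation: the class name Lun, the method record_io, and the choice to reset both counters after an ownership change are assumptions made for illustration. Only the 64,000 I/O difference comes from the text.

```python
THRESHOLD = 64_000  # I/O difference taken from the description above


class Lun:
    """Toy model of per-SP I/O counting for one LUN (SPs are "A" and "B")."""

    def __init__(self, owner: str = "A") -> None:
        self.owner = owner
        self.io_count = {"A": 0, "B": 0}

    def record_io(self, sp: str, n: int = 1) -> bool:
        """Count n I/Os received on sp; return True if ownership changed."""
        self.io_count[sp] += n
        peer = "B" if self.owner == "A" else "A"
        if self.io_count[peer] - self.io_count[self.owner] >= THRESHOLD:
            self.owner = peer
            # Assumption: counters start over after the ownership change.
            self.io_count = {"A": 0, "B": 0}
            return True
        return False


lun = Lun(owner="A")
print(lun.record_io("B", 63_999), lun.owner)  # still below the threshold
print(lun.record_io("B", 1), lun.owner)       # difference reaches 64,000
```

In this model, I/O that keeps arriving on the owning SP offsets the peer's count, so a brief burst on the non-owning SP does not move the LUN.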
SP failure:
In case of an SP failure for a host configured as Failover Mode 1, the failover software trespasses the LUN to the surviving SP.
With Failover Mode 4, if an I/O arrives from an ALUA initiator on the surviving SP (non-optimal), FLARE initiates an internal trespass operation. This operation changes ownership of the target LUN to the surviving SP since its peer SP is dead. Hence, the host (failover software) must have access to the secondary SP so that it can issue an I/O under these circumstances.
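The difference between the two modes on an SP failure is mainly who initiates the trespass. The following minimal Python sketch restates that; the function name owner_after_sp_failure and its parameters are hypothetical, and it assumes the LUN was owned by the SP that failed.

```python
from typing import Optional, Tuple


def owner_after_sp_failure(
    failed_sp: str, failover_mode: int, host_reaches_survivor: bool
) -> Tuple[str, Optional[str]]:
    """Return (owning SP, trespass initiator) for a LUN owned by failed_sp."""
    if failover_mode not in (1, 4):
        raise ValueError("only Failover Modes 1 and 4 are modelled here")
    if not host_reaches_survivor:
        # Without a path to the surviving SP no I/O can arrive there,
        # so nothing triggers the ownership change.
        return failed_sp, None
    survivor = "B" if failed_sp == "A" else "A"
    if failover_mode == 1:
        return survivor, "host failover software"
    return survivor, "FLARE internal trespass"


print(owner_after_sp_failure("A", 1, True))
print(owner_after_sp_failure("A", 4, True))
print(owner_after_sp_failure("A", 4, False))
```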
Single backend failure:
Before FLARE Release 26, if the failover software was misconfigured (for example, a single attach configuration), a single back-end failure (for example, an LCC or BCC failure) would generate an I/O error since the failover software would not be able to try the alternate path to the other SP with a stable backend.
With FLARE Release 26, regardless of the Failover Mode for a given host, when the SP that owns the LUN cannot access that LUN due to a back-end failure, I/O is redirected through the other SP by the lower redirector. In this situation, FLARE trespasses the LUN to the SP that can access it. After the failure is corrected, the LUN is trespassed back to the SP that previously owned it. See the “Enabler for masking back-end failures” section of the white paper cited in the note below for more information.
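The trespass and trespass-back sequence for a back-end failure can be sketched as a tiny state model. This is an illustrative Python sketch only: the class BackendLun and the two event handlers are hypothetical names, and the model ignores the redirector itself and tracks only which SP owns the LUN.

```python
from typing import Optional


class BackendLun:
    """Tracks the current owner and the owner to restore after a repair."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.previous_owner: Optional[str] = None


def peer_of(sp: str) -> str:
    return "B" if sp == "A" else "A"


def on_backend_failure(lun: BackendLun, affected_sp: str) -> None:
    """The affected SP lost back-end access; move the LUN if it owns it."""
    if lun.owner == affected_sp:
        lun.previous_owner = affected_sp
        lun.owner = peer_of(affected_sp)


def on_backend_restored(lun: BackendLun, restored_sp: str) -> None:
    """Back-end access is repaired; return the LUN to its previous owner."""
    if lun.previous_owner == restored_sp:
        lun.owner = restored_sp
        lun.previous_owner = None


lun = BackendLun(owner="A")
on_backend_failure(lun, "A")
print(lun.owner)   # LUN is now served by the peer SP
on_backend_restored(lun, "A")
print(lun.owner)   # LUN is back on the SP that previously owned it
```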
Note: Information in this solution is taken from the White
Paper "EMC CLARiiON. Asymmetric Active/Active Feature"