I'm troubleshooting an issue we're having with three ESX 3.0.1 hosts connected to a FAS3050c via Fibre Channel.
After a takeover of the filer node that has active VMs/LUNs running on it, the following appears in the /var/log/vmkernel file of the ESX host that owns the active VMs:
Device vmhba2:0:2 has disappeared but is currently in use and could not be removed.
Device vmhba2:0:3 has disappeared but is currently in use and could not be removed.
On the other two ESX hosts, this appears in the /var/log/vmkernel file:
Device vmhba2:0:2 has disappeared and has been removed.
Device vmhba2:0:3 has disappeared and has been removed.
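A quick way to compare hosts after a takeover is to pull these messages out of each host's log and list which devices were dropped and which are still held. The following is a minimal sketch, not a vetted tool: the function name summarize_lost_devices is made up for illustration, and it assumes each message sits on its own log line with exactly the wording quoted above.

```shell
#!/bin/sh
# Summarise "has disappeared" messages from a vmkernel log.
# Usage: summarize_lost_devices [logfile]   (defaults to /var/log/vmkernel)
# NOTE: hypothetical helper; the message wording is taken from this thread.
summarize_lost_devices() {
    log="${1:-/var/log/vmkernel}"
    # Pattern for a device name such as vmhba2:0:2
    dev='vmhba[0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*'
    # Print "in-use <device>" for devices the kernel could not remove and
    # "removed <device>" for devices it dropped; duplicates are collapsed.
    sed -n \
        -e "s/.*Device \($dev\) has disappeared but is currently in use.*/in-use \1/p" \
        -e "s/.*Device \($dev\) has disappeared and has been removed.*/removed \1/p" \
        "$log" | sort -u
}
```

Running it on each of the three hosts should show at a glance which host still holds the LUNs ("in-use") and which hosts have let them go ("removed").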
Now, the VMs survive the takeover and are still up and running, but if I attempt a giveback, all three hosts lose access to these LUNs and the VMs go down.
It appears to me that the ESX host that owns the active VMs holds a SCSI reservation on these LUNs, and a takeover is not enough to make it reset and release that reservation (even though the paths fail over properly).
This type of behavior is mentioned in http://www.vmware.com/community/thread.jspa?messageID=752442.
All of the timeout values etc. have been verified.
Thanks in advance for the help,
--Carl
Hi Carl,
I assume you have the host attach kit etc. installed on your ESX hosts, following the best-practice implementation.
To get rid of the reservations, a possible short-term workaround is to shut down the respective FC interface towards the front-end SAN for a few minutes, e.g. for 0c: "fcp config 0c down" ... "fcp config 0c up". Or even, as a brute-force approach, restart the FCP service ("fcp stop" / "fcp start").
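The port bounce described above could be scripted roughly as follows. This is only a sketch under assumptions: the function name bounce_fcp_port and the FILER_CMD variable are hypothetical, the 180-second default is just one reading of "a few minutes", and how you reach the filer console (ssh, rsh, serial) is left to you. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Bounce one FC target port on the filer.
# FILER_CMD is an assumed prefix used to send a command to the filer console.
# It defaults to "echo", so nothing is sent unless you override it.
FILER_CMD="${FILER_CMD:-echo}"

bounce_fcp_port() {
    adapter="$1"            # FC target adapter, e.g. 0c
    wait_secs="${2:-180}"   # pause between down and up; the value is a guess
    if [ -z "$adapter" ]; then
        echo "usage: bounce_fcp_port <adapter> [wait-seconds]" >&2
        return 1
    fi
    $FILER_CMD fcp config "$adapter" down
    sleep "$wait_secs"
    $FILER_CMD fcp config "$adapter" up
}
```

For example, "bounce_fcp_port 0c" prints the down and up commands three minutes apart. Setting FILER_CMD to something like "ssh root@filer1" (a made-up host name) would send them instead of printing them.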
A better, proactive way to avoid this issue could be to set the FC ports on the FC switch to "F-port" (if they are currently set to "auto").
I'd be glad to hear whether those hints helped to solve the issue (or not). In any case it's wise to involve our support and open a case in parallel.
Best Regards, Chris
-----Original Message----- From: Carl Howell [mailto:chowell@uwf.edu] Sent: Wednesday, 19 September 2007 23:52 To: toasters@mathworks.com Subject: ESX FC Give & Take
I'm troubleshooting an issue we're having with three ESX 3.0.1 hosts connected to a FAS3050c via Fibre Channel.
After a takeover of the filer node that has active VMs/LUNs running on it, the following appears in the /var/log/vmkernel file of the ESX host that owns the active VMs:
Device vmhba2:0:2 has disappeared but is currently in use and could not be removed.
Device vmhba2:0:3 has disappeared but is currently in use and could not be removed.
On the other two ESX hosts, this appears in the /var/log/vmkernel file:
Device vmhba2:0:2 has disappeared and has been removed.
Device vmhba2:0:3 has disappeared and has been removed.
Now, the VMs survive the takeover and are still up and running, but if I attempt a giveback, all three hosts lose access to these LUNs and the VMs go down.
It appears to me that the ESX host that owns the active VMs holds a SCSI reservation on these LUNs, and a takeover is not enough to make it reset and release that reservation (even though the paths fail over properly).
This type of behavior is mentioned in http://www.vmware.com/community/thread.jspa?messageID=752442.
All of the timeout values etc. have been verified.
Thanks in advance for the help,
--Carl
I have a client with the exact same problem. VMware's response:
" Please try the following changes:
On your NetApp system, enter the following commands:
"fcp config 0c down" and wait a few minutes.
Enable this port manually by entering the "fcp config 0c up" command.
Alternatively, executing an "fcp stop" and "fcp start" in short succession (no waiting time required) also resolves the issue.
To ensure proper operation, you must reset the FC connection after entering the "cf giveback" command. This can be done on the VMware ESX host by entering the following commands:
esxcfg-module -s qlport_down_retry=60 <HBA-name>
esxcfg-advcfg -s 0 /Disk/UseLunReset
esxcfg-advcfg -s 0 /Disk/UseDeviceReset
These settings will result in the following type of message on the console after a giveback:
Thu Oct 12 09:26:24 CEST [lk-san1a: scsitarget.ispfct.targetReset:CRITICAL]: FCP Target: Target Reset (from port 210000e08b0e922a), aborting all SCSI commands
Once the connection is reset, it should work properly.
Last Updated: 13 OCT 2006 "
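For anyone applying the ESX-side settings from the quoted response on several hosts, they could be wrapped up along these lines. This is a hedged sketch, not something VMware or NetApp supplied: the function name apply_giveback_settings and the RUN variable are invented here, and the <HBA-name> placeholder is deliberately left for the caller to fill in rather than guessed. By default the commands are only printed.

```shell
#!/bin/sh
# Apply the three ESX settings quoted above.
# RUN defaults to "echo" (dry run). Set RUN="" explicitly to execute for real.
RUN="${RUN-echo}"

apply_giveback_settings() {
    hba_name="$1"   # stands in for the <HBA-name> placeholder; not guessed here
    if [ -z "$hba_name" ]; then
        echo "usage: apply_giveback_settings <HBA-name>" >&2
        return 1
    fi
    # Commands exactly as given in the quoted VMware response.
    $RUN esxcfg-module -s qlport_down_retry=60 "$hba_name"
    $RUN esxcfg-advcfg -s 0 /Disk/UseLunReset
    $RUN esxcfg-advcfg -s 0 /Disk/UseDeviceReset
}
```

Called with the dry-run default, it just echoes the three commands for review. As far as I know, "esxcfg-advcfg -g /Disk/UseLunReset" reads a value back, which is a reasonable check before and after changing it.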
VMware's "SAN Configuration Guide" (p. 118) says you should set Disk.UseDeviceReset=1. Has this changed?
Also, from this discussion (http://www.vmware.com/community/thread.jspa?messageID=749388) it appears that this might be a known issue that can't be fixed?
--Carl
-----Original Message----- From: Glenn Dekhayser [mailto:gdekhayser@voyantinc.com] Sent: Thursday, September 20, 2007 8:24 AM To: Carl Howell Cc: toasters@mathworks.com Subject: RE: ESX FC Give & Take
I have a client with the exact same problem. VMware's response:
" Please try the following changes:
On your NetApp system, enter the following commands:
"fcp config 0c down" and wait a few minutes.
Enable this port manually by entering the "fcp config 0c up" command.
Alternatively, executing an "fcp stop" and "fcp start" in short succession (no waiting time required) also resolves the issue.
To ensure proper operation, you must reset the FC connection after entering the "cf giveback" command. This can be done on the VMware ESX host by entering the following commands:
esxcfg-module -s qlport_down_retry=60 <HBA-name>
esxcfg-advcfg -s 0 /Disk/UseLunReset
esxcfg-advcfg -s 0 /Disk/UseDeviceReset
These settings will result in the following type of message on the console after a giveback:
Thu Oct 12 09:26:24 CEST [lk-san1a: scsitarget.ispfct.targetReset:CRITICAL]: FCP Target: Target Reset (from port 210000e08b0e922a), aborting all SCSI commands
Once the connection is reset, it should work properly.
Last Updated: 13 OCT 2006 "
Did you try what Glenn or Chris suggested? It would seem that bringing up a new FCP connection would establish a new session.
James
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Carl Howell Sent: Thursday, September 20, 2007 7:20 AM To: Glenn Dekhayser Cc: toasters@mathworks.com Subject: RE: ESX FC Give & Take
VMware's "SAN Configuration Guide" (p. 118) says you should set Disk.UseDeviceReset=1. Has this changed?
Also, from this discussion (http://www.vmware.com/community/thread.jspa?messageID=749388) it appears that this might be a known issue that can't be fixed?
--Carl
We are also experiencing this issue with a NetApp 920c filer configuration running Data ONTAP 7.2.3 in SSI mode. We currently have tickets open with VMware and NetApp. It is good to know that others are experiencing this issue, so I can stop trying to figure out whether there is a configuration error on my end.
Thanks for the feedback Brian. Is there anyone else suffering from this as well? Does anyone know if there is already a BURT(?) at NetApp to address this?
--Carl
-----Original Message----- From: Wilkinson, Brent [mailto:BWilkinson@CoBank.com] Sent: Thursday, September 20, 2007 10:48 AM To: Carl Howell; Glenn Dekhayser Cc: toasters@mathworks.com Subject: RE: ESX FC Give & Take
We are also experiencing this issue with a NetApp 920c filer configuration running Data ONTAP 7.2.3 in SSI mode. We currently have tickets open with VMware and NetApp. It is good to know that others are experiencing this issue, so I can stop trying to figure out whether there is a configuration error on my end.
How do I get off this fun list?
-----Original Message----- From: "Carl Howell" chowell@uwf.edu
Date: Thu, 20 Sep 2007 10:51:56 To:"Wilkinson, Brent" BWilkinson@CoBank.com, "Glenn Dekhayser" gdekhayser@voyantinc.com Cc:toasters@mathworks.com Subject: RE: ESX FC Give & Take
Thanks for the feedback Brian. Is there anyone else suffering from this as well? Does anyone know if there is already a BURT(?) at NetApp to address this?
--Carl