ESX FC Give & Take

Carl Howell

19 Sep 2007

9:51 p.m.

I'm troubleshooting an issue we're having with three ESX 3.0.1 hosts connected to a FAS3050c via Fibre Channel.

After a takeover of the filer node that serves the LUNs for the active VMs, the following appears in /var/log/vmkernel on the ESX host that owns those VMs:

Device vmhba2:0:2 has disappeared but is currently in use and could not be removed.
Device vmhba2:0:3 has disappeared but is currently in use and could not be removed.

On the other two ESX hosts, this appears in the /var/log/vmkernel file:

Device vmhba2:0:2 has disappeared and has been removed.
Device vmhba2:0:3 has disappeared and has been removed.

Now, the VMs survive the takeover and are still up and running, but if I attempt a giveback, all three hosts lose access to these LUNs and the VMs go down.

It appears to me that the ESX host that owns the active VMs holds a SCSI reservation on these LUNs, and a takeover is not enough to cause it to reset and release that reservation (even though the paths fail over properly).

This type of behavior is mentioned in http://www.vmware.com/community/thread.jspa?messageID=752442.

All of the timeout values have been verified, etc.

Thanks in advance for the help,

--Carl
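
The two vmkernel message shapes quoted in the message above distinguish a host that still holds the device open from one that dropped it cleanly. As a rough sketch only, a log scan could separate them like this; the regular expression is derived solely from the two messages quoted in this thread, and the function name and the "stuck"/"removed" labels are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern built only from the two messages quoted in this thread (an
# assumption). Real vmkernel lines carry a prefix before "Device", which
# the unanchored search tolerates.
DEVICE_GONE = re.compile(
    r"Device (vmhba\d+:\d+:\d+) has disappeared "
    r"(but is currently in use and could not be removed|and has been removed)"
)

def classify_disappeared(lines):
    """Group device paths by outcome: 'stuck' (still in use) or 'removed'."""
    result = defaultdict(list)
    for line in lines:
        # findall copes with several messages sharing one line
        for device, outcome in DEVICE_GONE.findall(line):
            key = "stuck" if outcome.startswith("but") else "removed"
            result[key].append(device)
    return dict(result)

sample = [
    "Device vmhba2:0:2 has disappeared but is currently in use and could not be removed.",
    "Device vmhba2:0:3 has disappeared and has been removed.",
]
print(classify_disappeared(sample))
```

To check a real host, pass the open log file instead of the sample list, e.g. classify_disappeared(open("/var/log/vmkernel")).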

Pinsker, Christian

20 Sep 2007

1:02 p.m.

Hi Carl,

I assume you have the host attach kit etc. installed on your ESX hosts, following the best-practice implementation.

To get rid of the reservations, a short-term workaround might be to shut down the respective FC interface towards the front-end SAN for a few minutes, e.g. for 0c: "fcp config 0c down" ... "fcp config 0c up". Or, as a brute-force approach, restart the FCP service (fcp stop / fcp start).

The better, proactive approach to avoid this issue could be to set the FC ports on the FC switch to "f-port" (if they are set to "auto").

I'd be glad to hear whether those hints helped to solve the issue (or not). In any case, it's wise to involve our support and open a case in parallel.

Best Regards, Chris

Glenn Dekhayser

1:24 p.m.

I have a client with the exact same problem. VMware's response:

" Please try the following changes:

On your Netapp system, enter the following commands

"fcp config 0c down" and wait a few minutes.

Enable this port manually by entering the "fcp config 0c up" command.

Alternatively, executing an "fcp stop" and "fcp start" in short succession (no waiting time required) also resolves the issue.

To ensure proper operation, you must reset the FC connection after entering the "cf giveback" command. This can be done on the VMWare ESX host by entering the following commands:

esxcfg-module -s qlport_down_retry=60 <HBA-name>
esxcfg-advcfg -s 0 /Disk/UseLunReset
esxcfg-advcfg -s 0 /Disk/UseDeviceReset

These settings will result in the following type message on the console after a giveback:

Thu Oct 12 09:26:24 CEST [lk-san1a: scsitarget.ispfct.targetReset:CRITICAL]: FCP Target: Target Reset (from port 210000e08b0e922a), aborting all SCSI commands

Once the connection is reset, it should work properly.

Last Updated: 13 OCT 2006 "
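
To confirm after a giveback that the reset described in the quoted response actually happened, the filer console messages could be scanned for the Target Reset line. The following is only a sketch: the pattern is inferred from the single sample message quoted above, and the function name and returned tuple layout are hypothetical.

```python
import re

# Pattern inferred from the one console message quoted in this thread
# (an assumption; other Data ONTAP releases may word it differently).
TARGET_RESET = re.compile(
    r"\[(?P<filer>[^:\]]+): scsitarget\.ispfct\.targetReset:(?P<severity>\w+)\]: "
    r"FCP Target: Target Reset \(from port (?P<wwpn>[0-9a-fA-F]{16})\)"
)

def find_target_resets(lines):
    """Return (filer, severity, port WWPN) for each Target Reset message."""
    hits = []
    for line in lines:
        m = TARGET_RESET.search(line)
        if m:
            hits.append((m.group("filer"), m.group("severity"), m.group("wwpn")))
    return hits

sample = (
    "Thu Oct 12 09:26:24 CEST [lk-san1a: scsitarget.ispfct.targetReset:CRITICAL]: "
    "FCP Target: Target Reset (from port 210000e08b0e922a), aborting all SCSI commands"
)
print(find_target_resets([sample]))
```

The WWPN in the result can then be matched against the HBA port names of the ESX hosts to see which host issued the reset.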

Carl Howell

2:20 p.m.

VMware's "SAN Configuration Guide" (p. 118) says you should set Disk.UseDeviceReset=1. Has this changed?

Also, from the discussion at http://www.vmware.com/community/thread.jspa?messageID=749388 it appears that this might be a known issue that can't be fixed?

--Carl

Johnson, James A [HDS]

3:38 p.m.

Did you try what Glenn or Chris suggested? It would seem that bringing up a new FCP connection would establish a new session.

James

Wilkinson, Brent

3:47 p.m.

We are also experiencing this issue with a NetApp 920c filer configuration running Data ONTAP 7.2.3 in SSI mode. We currently have tickets open with VMware and NetApp. It is good to know that others are experiencing this too, rather than having to figure out whether there is a configuration error on my end.

Carl Howell

3:51 p.m.

Thanks for the feedback, Brent. Is anyone else suffering from this as well? Does anyone know if there is already a BURT(?) at NetApp to address this?

--Carl

dritschel62＠yahoo.com

4:49 p.m.

How do I get off this fun list?
