The easiest solution, which also requires minimal downtime, is to use ndmpcopy and rsync together...
This will work only if you have free volumes on the new machine...
1. Copy a baseline of the separate parts of your current volume to the target volumes, using ndmpcopy/rsync/tar/whatever (ndmpcopy will probably be the fastest). This can be done while users are working, a few hours or days before the migration downtime is planned.
2. A few hours before the downtime, use rsync and/or ndmpcopy to incrementally copy the changes since the baseline.
3. DOWNTIME - while in downtime, just do a final sync, again using incremental ndmpcopy/rsync, and change the relevant settings: NIS maps, the filer's /etc/exports, /etc/quotas...
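As a rough sketch of the three steps using rsync run as root from an NFS client (the mount points here are made up, and rsync will not carry CIFS file attributes, so use ndmpcopy where those matter):

    # 1. baseline copy while users are still working
    rsync -aH /mnt/oldfiler/vol0/home/ /mnt/newfiler/vol1/home/

    # 2. a few hours before the downtime, pick up what has changed since
    rsync -aH --delete /mnt/oldfiler/vol0/home/ /mnt/newfiler/vol1/home/

    # 3. during the downtime: one final pass, then update the NIS maps
    #    and the new filer's /etc/exports and /etc/quotas
    rsync -aH --delete /mnt/oldfiler/vol0/home/ /mnt/newfiler/vol1/home/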
Eyal.
=====
Yours,
Eyal Traitel
eTraitel@yahoo.com, Home: 972-3-5290415 (Tel Aviv)
*** eTraitel - it's the new eBuzzword ! ***
Eyal Traitel wrote:
- A few hours before the downtime, use rsync and/or ndmpcopy to incrementally copy the changes since the baseline.
Hi,
There's a bug with incremental NDMPCopy that affects files that are now shorter than they were in the initial level 0 dump. The bug is in the restore phase. Fixed in 5.3.5.
Luckily we were only migrating ~60GB and the system wasn't running 24x7 (in a recent previous existence)
GB
--
Garrett Burke, Operations Manager
Eircom Multimedia Infrastructure
Internet House, 26-34 Temple Bar.
Eyal Traitel wrote:
- Copy a baseline of the separate parts of your current volume to the target volumes, using ndmpcopy/rsync/tar/whatever (ndmpcopy will probably be the fastest). This can be done while users are working, a few hours or days before the migration downtime is planned.
- A few hours before the downtime, use rsync and/or ndmpcopy to incrementally copy the changes since the baseline.
- DOWNTIME - while in downtime, just do a final sync, again using incremental ndmpcopy/rsync, and change the relevant settings: NIS maps, the filer's /etc/exports, /etc/quotas...
In theory, this is the way to go, and it is what we ended up doing when we copied a volume to a new volume a couple of weeks ago. We did a level 0 dump/restore with ndmpcopy while the filer was up, and during the downtime we picked up the changes with rsh filer dump | rsh filer restore. We didn't use rsync because we didn't want to lose any CIFS file attributes.
Let me warn you about some problems we ran into. We've submitted bug reports where appropriate.
1) We are running 5.3.5R2P2 and there is a known bug in incremental NDMP where it fails to dump files that should be dumped, because NDMP dump looks only at the mtime rather than both the mtime and the ctime. To work around the bug, which is only in NDMP dump, we used "rsh filer dump | rsh filer restore". We ran into some serious problems where the restore would stop working and drop into an infinite loop. This happened often, but not always. I'm talking to netapp about this one.
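Something along these lines (filer names and paths are illustrative; the exact flags, and how you point restore at the destination directory, vary by ONTAP release, so check the dump and restore man pages on your version):

    # level 0 baseline while the filer is still serving users, then a
    # level 1 during the downtime; restore reads the dump stream on stdin
    rsh oldfiler dump 0f - /vol/vol0/home | rsh newfiler restore rf -
    rsh oldfiler dump 1f - /vol/vol0/home | rsh newfiler restore rf -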
2) We discovered a bug where a level 0 subtree dump skips files dated before 00:00:00 Jan 1, 1970 GMT, i.e., files with negative timestamps. Don't ask me how they got there; they are user files and we had over 500 of them. This problem does not happen with a full volume dump, just a subtree. So I suggest either doing a full volume dump or running a find to locate all such files and "touch" them to a time after 1970.
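One way to hunt them down from an NFS client (a sketch; the mount point is made up and it assumes perl is available; depending on the client, a negative mtime may instead show up as a date far in the future):

    # list regular files whose mtime is negative (or wrapped far into the future)
    find /mnt/vol0 -type f -print | \
        perl -lne 'my $m = (lstat($_))[9]; print if defined $m && ($m < 0 || $m > time() + 315360000)'
    # then simply touch each one to bring it forward to the present:
    # touch "/mnt/vol0/path/to/offending file"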
3) Very minor bug -- one of our users has the unix uid 65535, which is the bit pattern 0xffff, i.e., -1 as a 16-bit signed int. After the restore, this user's files were all owned by root instead of 65535. Nearby uids both above and below worked correctly.
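A quick spot check after the copy (the uid and mount points are illustrative):

    # compare the lists of files owned by uid 65535 on the source and the copy
    ( cd /mnt/oldvol && find . -user 65535 -print | sort > /var/tmp/old.65535 )
    ( cd /mnt/newvol && find . -user 65535 -print | sort > /var/tmp/new.65535 )
    diff /var/tmp/old.65535 /var/tmp/new.65535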
4) Our original filesystem was created under DOT 5.0.2 and the filer has been upgraded at least twice since. We run a mixed NFS and CIFS environment. In some of the older versions of DOT were some bugs in the WAFL directory format. One bug allowed two different files to be created in the same directory with the exact same name. Even though our version of DOT no longer has this bug, our volume still had three such pairs of files on it. The full dump/restore had no problems with these files, but the incremental restore -r could not cope and failed without restoring anything. If you can find these file pairs, the fix is simple: just rename one of the files. You have no control over which of the two the filer picks, but afterward you can see both files.

This is a particularly insidious problem because any incremental dump of a volume with duplicate filenames CANNOT BE RESTORED with "restore -r". You will be forced to use "restore -x", which is less than desirable. Dump does not issue any errors, either. It's only when you restore that you discover the problem.
You can locate duplicate files like this:
find dir -print | sort > out1
sort -u out1 > out2
diff out1 out2
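An equivalent that prints only the duplicated names directly (uniq -d prints one copy of each repeated line):

    find dir -print | sort | uniq -d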
MORAL: Before a big volume copy, do some dry runs to be sure you won't run into problems during your downtime. Be sure to test both the full dump/restore and the incremental dump/restore. We discovered many of these problems during the two weeks leading up to our downtime. Even so, we hit a couple of snags during the downtime that cost us at least 2 hours.
One final tip: Before running restore -r on an incremental dump, be sure to save a copy of the restore_symboltable file since restore -r modifies it. If the restore modifies the file and then fails, you can't rerun the "restore -r" unless you put back the original file. Even then you may have problems, and will need to use "restore -x".
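Something along these lines, from an NFS client with the target volume mounted (the paths are illustrative):

    # before running the incremental restore -r
    cp /mnt/newvol/restore_symboltable /var/tmp/restore_symboltable.save
    # ... run the incremental restore ...
    # if it fails partway, put the original back before retrying restore -r
    cp /var/tmp/restore_symboltable.save /mnt/newvol/restore_symboltable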
Steve Losen scl@virginia.edu phone: 804-924-0640
University of Virginia ITC Unix Support
On Tue, 29 Aug 2000, Steve Losen wrote:
In some of the older versions of DOT were some bugs in the WAFL directory format. One bug allowed two different files to be created in the same directory with the exact same name. Even though our version of DOT no longer has this bug, our volume still had three such pairs of files on it.
Anybody have the burt# for this, and in what releases it has been fixed? I recall in early 1999 we ran into this a few times, but some user would remove one of the files before we nailed it down with tech support, so my notes are incomplete.
Besides the tedious `ls|sort` method described, is there any other way to detect this error that may be latent on our filers? Would a `wack` detect and fix it? (Whoops...'cuse me, "WAFL_check.")
Until next time...
The Mathworks, Inc.                        508-647-7000 x7792
3 Apple Hill Drive, Natick, MA 01760-2098  508-647-7001 FAX
tmerrill@mathworks.com                     http://www.mathworks.com
It was burt 18506, and is fixed in 5.3.5 (and all subsequent releases).
Wafl_check will not detect it.
Joan Pearson
I forgot to mention earlier that when running our giant "find" to look for duplicate filenames, we hit some files that could not be accessed via NFS. All the files had characters in their names with numeric codes greater than 0177 (decimal 127), i.e., the high order bit was set. Not all files with such characters displayed the problem. The filenames had extensions such as .doc, which indicates they were probably created with CIFS.
You could list the directory and see the files, but if you tried anything that accessed that particular file, you got "file not found". So if you did "ls" you saw the file, but if you did "ls -l", you got the error "foo: file not found". This is because ls -l must stat() each file to get owner, permissions, etc., and the stat() call was failing.
Fortunately, dump/restore did not have problems with these files. We needed to move the old volume and after we copied it to a new volume the files worked properly on the new volume. I presume this is another WAFL directory format bug left over from an earlier release.
So from our experience, it appears that volumes created prior to 5.3.5 may have some "cruft" in them, and one way to ferret it out is to run a big find and see if it reports any "not found" errors or lists any duplicates. It might be best to run this on a snapshot, since a user could remove a file out from under find on the live filesystem.
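A sketch of that sweep, run against a snapshot from an NFS client (the mount point and snapshot name are made up):

    SNAP=/mnt/vol0/.snapshot/nightly.0
    # duplicate names show up in the sorted list; entries that cannot be
    # looked up show up as errors on stderr
    find "$SNAP" -print > /var/tmp/names 2> /var/tmp/errors
    sort /var/tmp/names | uniq -d
    cat /var/tmp/errors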
We have found that running several finds in parallel on different subtrees of the volume is much faster than a single find.
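One way to run them in parallel over the top-level directories (same illustrative snapshot path as above):

    SNAP=/mnt/vol0/.snapshot/nightly.0
    i=0
    for d in "$SNAP"/*; do
        i=`expr $i + 1`
        # each subtree gets its own output files so the writes don't interleave
        find "$d" -print > /var/tmp/names.$i 2> /var/tmp/errors.$i &
    done
    wait
    sort /var/tmp/names.* | uniq -d
    cat /var/tmp/errors.*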
Unfortunately, the "inaccessible from NFS" file problem cannot be fixed with NFS. In our case, copying the volume fixed it. You may need to use ndmpcopy to make a new copy of the afflicted directory or you may be able to fix it by copying or renaming the file with CIFS.
When netapp fixes a bug in WAFL, it would be nice if they would also provide some warning that old volumes may still have problems, and a means to remedy them. As I pointed out earlier, incremental dumps of volumes with duplicate filenames cannot be restored with "restore -r". Anyone backing up their filers with incremental dumps will want to be sure that their volumes are free of duplicates.
Anyone backing up their filers with NFS will want to be sure that all files are accessible from NFS. We only saw the problem on regular files, but I don't see why a directory name could not have the same problem, making that whole subtree inaccessible.
Steve Losen scl@virginia.edu phone: 804-924-0640
University of Virginia ITC Unix Support