This happens when a resolution log is full. The SrvLog will usually show which volume is affected; take down its volume id (you may need to consult /vice/vol/VRList on the SCM to do this). Kill the dead (zombied) server and restart it. The moment it is up, run:
# filcon isolate -s this_server
We need to prevent clients from overwriting the log again
# volutil setlogparms volid reson 4 logsize 16384
# filcon clear -s this_server
If this happens you have several options. If the server crashed during salvaging, it will not come up by simply trying again; you must either repair the damaged volume or not attach that volume.
Not attaching the volume is done as follows. Find the volume id of the damaged volume in the SrvLog. Create a file named /vice/vol/skipsalvage with the lines:
You can also try to repair the volume with norton. Norton is invoked as:
norton [LOG] [DATA] [DATA-SIZE]
These parameters can be found in /vice/srv.conf. See norton(8) for detailed information about norton's operation. Built-in help is also available while running norton.
Often corruption is replicated. This means that if you find a server has crashed and does not want to salvage a volume, your other replicas may suffer the same fate: the risk is that you may have to go back to tape (you do make tapes, right?). Therefore first copy out good data from the available replicas, then attend to repairing or skipping them in salvage.
Very often you have to take both a volume and its most recent clone (generated during backup) offline, since corruption in a volume is inherited by the clone.
On Tuesday I lost my email folder: the whole volume moose:braam.life was corrupted on server moose and it wouldn't salvage. Here is how I got it back.
First I tried mounting moose.braam.life.0.backup but this was corrupted too.
On the SCM in /vice/vol/VRList I found the replicated volume number f0000427 and the volume number ce000011 (fictitious) for the volume.
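The VRList lookup can be scripted. The following is a minimal sketch, not a Coda tool: the function name vrlist_lookup is hypothetical, and it assumes each VRList line has the form "name replicated-id replica-count replica-id ...". Check that layout against your own /vice/vol/VRList before relying on it.

```shell
# Print the replicated volume id and the replica ids for one volume name.
# ASSUMPTION: each VRList line looks like
#   <name> <replicated-id> <replica-count> <replica-id> ...
# Usage: vrlist_lookup <volume-name> [vrlist-file]
vrlist_lookup() {
    name=$1
    file=${2:-/vice/vol/VRList}
    awk -v name="$name" '
        $1 == name {
            printf "replicated=%s", $2
            # Field 3 is taken to be the number of replica ids that follow.
            for (i = 4; i < 4 + $3 && i <= NF; i++) printf " replica=%s", $i
            printf "\n"
            found = 1
        }
        # Exit status 1 when the volume name was not found.
        END { exit !found }
    ' "$file"
}
```

Run it on the SCM as "vrlist_lookup <volume-name>". It prints one line per matching entry and returns a non-zero status when the name is not listed.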
I logged in as root to bison, our backup controller. I read the backup log for Tuesday morning in /vice/backuplogs/backuplog.DATE and saw that the incremental dump for August 31st had been fine. At the end of that log, I saw the name f0000427.ce000011 listed as dumped under /backup (a mere symlink) and /backup2 as the spool directory with the actual file. The backup log almost shows how to move the tape to the correct place and invoke restore:
# cd /backup2
# mt -f /dev/nst0 rewind
# restore -b 500 -f /dev/nst0 -s 3 -i
Value after -s depends upon which /backup volume we pick to restore backup.
restore> cd 31Aug1998
restore> add viotti.coda.cs.cmu.edu-f0000427.ce000011
restore> extract
Specify volume #: 1
In /vice/db/dumplist I saw that the last full backup had been on Friday, Aug 28. I went to the machine room and inserted that tape (recent tapes are above bison). This time f0000427.ce000011 was a 200MB file (the last full dump) in /backup3. I extracted the file as above.
Then I merged the two dumps:
# merge /restore/peter.mail /backup2/28Aug1998/f0000427.ce000011 \
> /backup3/31Aug1998/f0000427.ce000011
This took a minute or two to create /restore/peter.mail. Now all that was needed was to upload that to a volume:
# volutil -h moose restore /restore/peter.mail /vicepa vio:braam.mail.restored
Back to the SCM, to update the volume databases:
# bldvldb.sh viotti
Now I could mount the restored volume:
# cfs mkm restored-mail vio:braam.mail.restored
If createvol_rep reports RPC2_NOBINDING when you try to create volumes, it is an indication that the server is not (yet) accepting connections.
It is useful to look at /vice/srv/SrvLog: the server performs the equivalent of fsck on startup, which might take some time. Only when the server logs "Fileserver Started" in SrvLog does it start accepting incoming connections.
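Rather than watching the log by hand, you can poll it before running createvol_rep. This is a minimal sketch: the function name wait_for_fileserver and the default timeout of 600 seconds are illustrative choices, and it assumes the log holds only messages from the current start (an old "Fileserver Started" line left in the same file would match immediately).

```shell
# Wait until the server log reports that the file server has started.
# Usage: wait_for_fileserver [logfile] [timeout-seconds]
# Returns 0 once the message appears, 1 if the timeout expires first.
wait_for_fileserver() {
    log=${1:-/vice/srv/SrvLog}
    timeout=${2:-600}          # illustrative default, not a Coda setting
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # The log may not exist yet right after a restart; ignore that error.
        if grep -q 'Fileserver Started' "$log" 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

For example, "wait_for_fileserver && createvol_rep ..." only attempts the volume creation once the server reports that it has started.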
Another reason could be that an old server is still around, blocking the new server from accessing the network ports.
Some process has the UDP port open which rpc2portmap or auth2 is trying to obtain. In most cases this is an already running copy of rpc2portmap or auth2. Kill all running copies of the program in question and restart them.
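To find the copies that need killing, you can filter the process table by command name. The helper below is a sketch with a hypothetical name (pids_of); it reads "pid command" pairs on standard input so that it does not depend on any particular ps implementation. How ps reports the command name (truncated, or with a path) varies between systems, so check the output before killing anything.

```shell
# Read "pid command" pairs on stdin and print the pids whose command
# name is exactly the first argument.
# Usage: ps -e -o pid= -o comm= | pids_of rpc2portmap
pids_of() {
    awk -v want="$1" '$2 == want { print $1 }'
}
```

The printed pids can then be passed to kill before restarting rpc2portmap or auth2.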
Servers can crash when they are given inconsistent or bad data-files. You should check whether updateclnt and updatesrv are both running on the SCM and the machine that has crashed. You can kill and restart them. Then restart codasrv and it should come up.
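The check for the update daemons can be sketched as a small filter. The function name report_missing is hypothetical; it reads command names, one per line, on standard input and reports any of the named daemons that do not appear. It only tells you whether a process with that name exists, not whether it is working correctly.

```shell
# Read command names (one per line) on stdin and report every daemon
# named as an argument that is absent from that list.
# Returns 0 if all are present, 1 otherwise.
# Usage: ps -e -o comm= | report_missing updateclnt updatesrv
report_missing() {
    running=$(cat)
    rc=0
    for daemon in "$@"; do
        # -x: match the whole line, -F: treat the name as a fixed string.
        if ! printf '%s\n' "$running" | grep -qxF "$daemon"; then
            echo "not running: $daemon"
            rc=1
        fi
    done
    return "$rc"
}
```

Run it on both the SCM and the crashed machine, restart whatever is reported missing, and then restart codasrv.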