The Coda filesystem is still under development, and there certainly are
several bugs which can crash both clients and servers. However, many problems
users observe are related to semantic differences between the Coda filesystem
and the well-known NFS or SMB network filesystems.
This section will point out several logs to look at for identifying the cause
of problems. Even if the source of the problem cannot be found, the
information gathered from Coda's logging mechanisms will make it easier for
people on the Coda mailing list
to assist in solving the problem(s).
Some of the more common problems are illustrated in detail. At the end of this
section some of the more involved debugging techniques will be addressed. This
will be helpful to developers to isolate problems more easily.
At the end there is a separate section describing how to solve some problems with
Windows95 (only the Coda-related ones).
Most problems can be solved, or at least recognized, by using the information
logged by the clients and servers. The first step in finding out where a
problem stems from is doing a tail -f on the logfiles.
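The tail -f step can be wrapped so that it only follows the logfiles that actually exist on a given machine. The sketch below is an illustration and not part of Coda: readable_logs is a hypothetical helper name, and the paths in the usage comment are the client logfiles named later in this section.

```sh
# readable_logs FILE...
# Print, one per line, those of the given files that exist and are readable.
# (Hypothetical helper for illustration; not shipped with Coda.)
readable_logs() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}

# Usage on a client (assumes the paths contain no spaces):
#   logs=$(readable_logs /usr/coda/etc/console /usr/coda/venus.cache/venus.log)
#   [ -n "$logs" ] && tail -f $logs
```

Filtering first avoids tail aborting or complaining about a logfile that has not been created yet, for example on a freshly installed client.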
It must also be noted that when Coda clients and servers crash they do not
`dump core', but start sleeping so that developers can attach debuggers.
As a result, a crashed client or server still shows up in the ps auxwww
output, and only the combination of lack of file-service and error messages
in logfiles indicates that something is really wrong.
Since release 5.3.4, servers actually exit when they crash; create a
/vice/srv/ZOMBIFY to force a server to go into an endless
Client debugging output
- codacon is a program which connects to venus and provides the user
with run-time information. It is the initial source of information, but cannot
be used to look back into the history. It is therefore advisable to always
have a codacon running in a dedicated xterm.
client$ xterm -e codacon
/usr/coda/etc/console is a logfile which contains mostly error
or warning messages, and is the place to look for errors which might have
occurred. When assertions in the code fail, they are logged here.
/usr/coda/venus.cache/venus.log contains more in-depth
information about the running system, which can be helpful to find out what
the client is or was doing.
- cmon is an ncurses program that can be run on a client to gather
and display statistics from a group of servers. When a server goes down it
will not respond to the statistics requests, which makes this a simple method
for monitoring server availability.
client$ xterm -e cmon server1 server2 server3 ...
/vice/srv/SrvLog and /vice/srv/SrvErr are the server logfiles.
Other logfiles that could be helpful in discovering problems are:
Client does not connect to
When you have set up your client for the first time and it cannot connect to
the testserver at CMU, there are a few possible reasons. You might be
running an old release of Coda; check the Coda web-site to see what the latest
Another common reason is that your site is behind a firewall which blocks UDP
traffic, or allows only outgoing UDP traffic. Either try Coda on a machine
outside of the firewall, or set up your own server.
The third reason is that the testserver might be down, for maintenance or
upgrades. That does not happen often, but you can check whether it is up, and
how long it has been running using cmon.
Venus comes up but prints cannot find RootVolume
All of the reasons in the previous item could be the cause. It is
also possible that your /etc/services file is incomplete or incorrect.
It needs the entries:
# Iana allocated Coda filesystem port numbers
rpc2portmap 369/udp # Coda portmapper
codaauth2 370/udp # Coda authentication server
venus 2430/tcp # codacon port
venus 2430/udp # Venus callback/wbc interface
venus-se 2431/tcp # tcp side effects
venus-se 2431/udp # udp sftp side effect
codasrv 2432/tcp # not used
codasrv 2432/udp # server port
codasrv-se 2433/tcp # tcp side effects
codasrv-se 2433/udp # udp sftp side effect
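Whether a given services file contains all of the entries listed above can be checked mechanically. The following is a minimal sketch, assuming the usual "name port/protocol" layout of /etc/services; check_coda_services is a hypothetical function name and not a Coda tool.

```sh
# check_coda_services FILE
# Print one "missing: name port/proto" line for every Coda entry from the
# list above that FILE does not contain. Returns 0 if nothing is missing.
# (Hypothetical helper for illustration; not shipped with Coda.)
check_coda_services() {
    file=$1
    missing=0
    for entry in \
        "rpc2portmap 369/udp" "codaauth2 370/udp" \
        "venus 2430/tcp" "venus 2430/udp" \
        "venus-se 2431/tcp" "venus-se 2431/udp" \
        "codasrv 2432/tcp" "codasrv 2432/udp" \
        "codasrv-se 2433/tcp" "codasrv-se 2433/udp"
    do
        name=${entry% *}    # text before the space, e.g. "venus-se"
        port=${entry#* }    # text after the space, e.g. "2431/udp"
        # The name must start the line and be followed by whitespace, so
        # that "venus" does not accidentally match a "venus-se" line.
        if ! grep -Eq "^${name}[[:space:]]+${port}([[:space:]]|#|\$)" "$file"; then
            echo "missing: $entry"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_coda_services /etc/services
```

The check only looks at the service name and port/protocol pair; the trailing comments in the listing are ignored.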
Trying to access a file returns Connection timed out (ETIMEDOUT).
The main reason for getting Connection timed out errors is that the volume
where the file is located is disconnected from the servers. However, it can
also occur in some cases when the client is in write-disconnected mode, and
there is an attempt to read a file which is open for writing. See
Volume is disconnected/Volume is write-disconnected
for more information.
Commands do not return, except by using ^C.
When commands are hanging it is likely that venus has crashed. Check
Venus fails when restarted.
If venus complains (in venus.log) about not being able to open /dev/cfs0, it is
because /coda is still mounted. Unmount it before restarting venus:
# umount /coda
Another reason for venus not restarting is that another copy of venus is still
around, and the new venus is unable to open its network socket. In this case there
will be a message in venus.log stating that RPC2_CommInit has failed.
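Before restarting venus it can be useful to confirm whether /coda is in fact still mounted. The sketch below assumes a Linux client where /proc/mounts lists one mount per line with the mount point in the second field; is_mounted is a hypothetical helper name.

```sh
# is_mounted DIR [MOUNTS_FILE]
# Succeed (exit status 0) if DIR appears as a mount point in MOUNTS_FILE,
# which defaults to /proc/mounts (a Linux assumption).
# (Hypothetical helper for illustration; not shipped with Coda.)
is_mounted() {
    dir=$1
    mounts=${2:-/proc/mounts}
    # Field 2 of every line is the mount point; exit 0 only on a match.
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"
}

# Usage (as root):
#   if is_mounted /coda; then umount /coda; fi
```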
Venus doesn't start.
One possible reason is that you do not have the correct kernel module. This can be
tested by inserting the module by hand, and then listing the loaded
modules; `coda' should show up in that listing. Otherwise reinstall (or
recompile) the module.
# depmod -a
# insmod coda.o
# lsmod
Module Size Used by
coda 50488 2
If the kernel-module can be loaded without errors, check
message stating `Cannot get rootvolume name' indicates either a misconfigured
server or that the codasrv/codasrv-se ports are not defined in
/etc/services. See above for the entries needed.
I'm disconnected and Venus doesn't start
Put the hostnames of your servers in
I cannot get tokens while disconnected.
Take a vacation until we release a version of Coda which uses its
telepathic abilities to contact the auth2 server. We will add this
Hoard doesn't work
Make sure you have version 5.0 of Coda or later. Before you can hoard you must make sure that:
- You started Venus with the flag
- You have tokens
The server crashed and prints messages about "AllocViaWrapAround"
This happens when you have a resolution log that is full. In the
SrvLog file you will usually be able to see which volume is
affected; take down its volume id (you may need to consult
/vice/vol/VRList on the SCM to do this). Kill the dead
(zombied) server, and restart it. The moment it is up, do:
filcon isolate -s "this server" # to prevent clients from again
# overwriting the log
volutil setlogparms "volid" reson 4 logsize 16384
filcon clear -s "this server"
Unless you do "huge" things 16k will be plenty.
Server doesn't start due to salvaging problems
If this happens you have several options. If the server has crashed
during salvaging it will not come up by trying again; you must either
repair the damaged volume or not attach that volume.
Not attaching the volume is done as follows. Find the volume id of
the damaged volume in the SrvLog. Create a file named
/vice/vol/skipsalvage with the lines:
1
0xdd000123
Here 1 indicates that a single volume is to be skipped and
0xdd000123 is the volume id of the replica that should not be
attached. If this volume is a replicated volume, take all replicas
offline, since otherwise the clients will get very confused.
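The file layout described above (a count, then a volume id) can be produced by a small helper. This is only a sketch: write_skipsalvage is a hypothetical name, and extending the format to several volumes (count on the first line, then one id per line) is an assumption extrapolated from the single-volume description, not something this text confirms.

```sh
# write_skipsalvage FILE VOLID...
# Write a skipsalvage-style file: the number of volumes on the first line,
# followed by one volume id per line.
# ASSUMPTION: the multi-volume layout is extrapolated from the
# single-volume case described in the text.
# (Hypothetical helper for illustration; not shipped with Coda.)
write_skipsalvage() {
    out=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: write_skipsalvage FILE VOLID..." >&2
        return 1
    fi
    {
        echo "$#"               # first line: number of volumes to skip
        for id in "$@"; do
            echo "$id"          # then one volume id per line
        done
    } > "$out"
}

# Usage (as root on the affected server):
#   write_skipsalvage /vice/vol/skipsalvage 0xdd000123
```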
You can also try to repair the volume with
norton. Norton is invoked as:
norton LOG DATA DATA-SIZE
These parameters can be found in /vice/srv.conf.
The Norton manual pages give details about norton's operation and
there is online guidance available which is possibly more helpful.
- Often corruption is replicated. This means that if you find a
server has crashed and does not want to salvage a volume, your other
replicas may suffer the same fate: the risk is that you may have to go
back to tape (you do make tapes, right?). Therefore first copy
out good data from the available replicas, then attend to
repairing or skipping them in salvage.
- Very often you have to take both a volume and its most recent
clone (generated during backup) offline, since corruption in a volume
is inherited by the clone.
- If you find that a replica of a volume is corrupt, do not
attempt to merely replace that replica. We have found that this
corrupts the volume databases. It is better to make a new replicated
volume and copy the data from the healthy replicas (keep the server
with the bad replica down).
How to restore a backup from tape
Tuesday I lost my email folder - the whole volume
moose:braam.life was corrupted on server moose, it wouldn't
salvage. Here is how I got it back.
First I tried mounting
moose.braam.life.0.backup but this was
On the SCM in
/vice/vol/VRList I found the replicated volume
f0000427 and the volume number
for the volume.
I logged in as root to bison, our backup controller. I read the
backuplog for Tuesday morning in /vice/backuplogs/backuplog.DATE and
saw that the incremental dump for August 31st had been fine. At the
end of that log, I saw the name
f0000427.ce000011 listed as dumped
under /backup (a mere symlink) and /backup2 as spool directory with
the actual file. The backup log almost shows how to move the tape to
the correct place and invoke restore:
mt -f /dev/nst0 rewind
restore -b 500 -f /dev/nst0 -s 3 -i
The -s 3 option varies according to which /backup
volume the backup is restored from. This invokes the restore
command. Typing help showed me how to add and then extract the file I
wanted. It took a little while before the file was back. From the
restore prompt do:
restore> cd 31Aug1998
restore> add viotti.coda.cs.cmu.edu-f0000427.ce000011
Specify volume #: 1
In /vice/db/dumplist I saw that the last full backup had been on
Friday Aug28. I went to the machine room and inserted that tape
(recent tapes are above bison). This time f0000427.ce000011 was a
200MB file (the last full dump) in /backup3. I extracted the file as
Then I merged the two dumps:
merge /restore/peter.mail /backup2/28Aug1998/f0000427.ce000011 \
This took a minute or two to create /restore/peter.mail. Now all that
was needed was to upload that to a volume:
volutil -h moose restore /restore/peter.mail /vicepa vio:braam.mail.restored
Back to the SCM, to update the volume databases:
Now I could mount the restored volume:
cfs mkm restored-mail vio:braam.mail.restored
and copy it into a read-write volume using cpio or tar.
createvol_rep reports RPC2_NOBINDING.
When trying to create volumes, and createvol_rep reports RPC2_NOBINDING, it is
an indication that the server is not (yet) accepting connections.
It is useful to look at
/vice/srv/SrvLog; the server performs the
fsck on startup, which might take some time. Only when
the server logs `Fileserver Started' in SrvLog does it start accepting incoming
Another reason is that an old server is still around, blocking the new server
from accessing the network ports.
RPC2_DUPLICATESERVER in the rpc2portmap/auth2 logs
Some process has the UDP port open which rpc2portmap or auth2 is trying to
obtain. In most cases this is an already running copy of rpc2portmap or auth2.
Kill all running copies of the program in question and restart them.
Server crashed shortly after updating files in
Servers can crash when they are given inconsistent or bad data-files. You
should check whether
updatesrv are both running on
the SCM and the machine that has crashed. You can kill and restart them. Then
codasrv and it should come up.
Users cannot authenticate or created volumes are not mountable.
Check whether auth2, updateclnt, and updatesrv are running on all fileservers.
Also check their logfiles for possible errors.
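When going through the logfiles by hand becomes tedious, the error strings quoted in the items above can be searched for in one pass. The sketch below is an illustration only: scan_coda_log is a hypothetical helper, and the pattern list contains just the strings mentioned in this section. Which logfile each string shows up in depends on the program that reports it.

```sh
# scan_coda_log FILE
# For each error string quoted in this section, print "string: N" where N
# is the number of lines in FILE that contain it. Strings that do not
# occur are not printed.
# (Hypothetical helper for illustration; not shipped with Coda.)
scan_coda_log() {
    log=$1
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    for pat in "Cannot get rootvolume name" "RPC2_CommInit" \
               "AllocViaWrapAround" "RPC2_DUPLICATESERVER"
    do
        # -F: match a fixed string, -c: count matching lines.
        # grep exits non-zero when the count is 0, hence "|| true".
        n=$(grep -Fc -- "$pat" "$log" || true)
        if [ "$n" -gt 0 ]; then
            echo "$pat: $n"
        fi
    done
    return 0
}

# Usage:
#   scan_coda_log /vice/srv/SrvLog
#   scan_coda_log /usr/coda/venus.cache/venus.log
```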
As most common problems are related to the semantic differences arising
as a result of `involuntary' disconnections, this section contains some
background information on why volumes become disconnected or
write-disconnected, and how to get them to reconnect.
Volume is fully disconnected.
There are several reasons why a coda client may have disconnected some or all
volumes from an accessible server.
- Pending reintegration.
When modifications have been made to the volume in disconnected mode, the
client will not reconnect the volume until all changes have been
reintegrated. Also, reintegration will not occur without proper user
authentication tokens. Furthermore, reintegration is suspended as long as
there are objects in conflict.
The most important item here is to have a codacon process running, since
it will give up-to-date information on what venus is doing. Venus will inform
the user about missing coda authentication tokens,
`Reintegration: pending tokens for user <uid>'. In this case the user
should authenticate using the clog command.
Conflicts, which require us to use the repair tool, are conveyed using
the `local object <pathname> inconsistent' message. Otherwise codacon
should show messages about backfetches, and how many modifications were
- Access permissions.
The client may also disconnect when a server reports an error for an
operation which, according to the client, is valid. One cause
is authentication failure; check tokens using ctokens and
optionally obtain new tokens using clog. Another is an inconsistency between the
data cached on the client and the actual data stored on the server; this will
reveal itself as an inconsistent object during subsequent reintegration.
- Lost connections.
Sometimes the client does not receive a prompt reply from an accessible
server, and marks the server as dead. This will of course disconnect the volume
if the last server is lost. Once every five minutes, the client automatically
verifies connectivity with all known servers, and can thus recover from lost
connections. However, this action can also be triggered by the user by
executing the cfs checkservers command.
If cfs checkservers reports that servers are unreachable, it might be
interesting to check with cmon if the server is responding at all, since
we might be faced with a crashed server. When a server was considered
unreachable, but is successfully contacted after `cfs checkservers',
reintegration will automatically start (when a user has tokens, and there are
Volume is write-disconnected.
Write-disconnected operation is used as often as
weakly connected mode to describe this volume state, and they are
effectively the same. This is the special situation where a client observes
weak connectivity with a server, and therefore forces the associated volumes
into weakly connected mode. Weakly connected volumes postpone writing to the
server to significantly reduce waiting on a slow network connection. Read
operations are still serviced by the local cache and the servers, as in fully
connected mode, which is why this mode of operation is also called
The write operations are effectively a continuous reintegration
(trickle-reintegration) in the background. This mode, therefore, requires
users to be authenticated and gives more chance for possible file conflicts.
The following points are several reasons for write-disconnected operation.
- Weak network connectivity.
Venus uses bandwidth estimates made by the rpc2 communication layer to decide
on the quality of the network connection with the servers. As soon as the
connectivity to one of the servers drops below the weakly connected
threshold (currently 50 KB/s), it will force all volumes associated with that
server into weakly-connected mode. The cfs wr command can be used to
force the volumes back into fully connected mode, and immediately reintegrate
To avoid switching to weakly connected mode, use cfs strong. This
way venus ignores bandwidth estimates. cfs adaptive will make venus
revert to interpreting bandwidth estimates.
When the user was not authenticated, or conflicts were created during the
write-disconnected operation, the user must first obtain proper authentication
tokens or repair any inconsistent objects before the volume becomes fully
connected again. Here again codacon is an invaluable tool for obtaining
insight into the client's behaviour.
- User requested write-disconnect mode.
Users can ask venus to force volumes into write-disconnected mode, exchanging
high consistency for significantly improved performance. By using the
-time flags on the cfs wd commandline, some control is
given over the speed at which venus performs the trickle-reintegration.
For instance, by default only mutations to the filesystem older than 15
minutes are reintegrated. To perform the trickle-reintegration more quickly,
you could use cfs wd -age 5, which will attempt to
reintegrate all mutations older than 5 seconds.
- Pending reintegration.
When a volume is write-disconnected, it will stay write-disconnected until a
user properly authenticates using clog.
rpc2tcpdump is the regular tcpdump, which is modified to decode rpc2
protocol headers. This makes it a very useful tool for analyzing why programs
fail to work.
All traffic between
venus and the coda servers can be viewed using the
# tcpdump -s120 -Trpc2 port venus or port venus-se
To identify problems with
clog, for instance which server it is trying to
get tokens from.
# tcpdump -s120 -Trpc2 port codaauth2
debugging with gdb
To be able to debug programs that use RVM, most Coda-related applications will
go into an endless sleep when something goes really wrong. They print their
process-id in the log (e.g. SrvLog), and a user can
attach a debugger to the crashed, but still running, program.
# gdb /usr/sbin/venus `pidof venus`
This makes it possible to get a stack backtrace (where), go to a specific
stack frame (frame <x>), or view the contents of variables
(print <varname>). By installing the coda sources in the same place as
where the binaries were initially built from, it is possible to view the
surrounding code fragment from within the debugger using the
When using RedHat Linux rpms, you can install the sources in the right place
by installing the coda source rpm file.
# rpm -i coda-x.x.x.src.rpm
# rpm -bp /usr/src/redhat/SPECS/coda.spec
On other platforms look at the paths reported in the backtrace and unpack the
source tarball in the correct place.
#0 CommInit () at /usr/local/src/coda-4.6.5/coda-src/venus/comm.cc:175
#1 0x80fa8c3 in main (argc=1, argv=0xbffffda4)
# cd /usr/local/src
# tar -xvzf coda-4.6.5.tgz
- Unable to shutdown Windows95.
Check the DOS Windows settings of Venus and Relay. The check box
Properties->Misc->Termination must be unticked.
- I cannot reboot Windows95 and I think it is due to the VXDs loaded for
Boot your system in DOS mode by pressing F8 at boot time. Cd to the windows
directory and type
edit system.ini. In the section
you will find the entries
Comment them out by using a
; in front of the lines. Try to restart
- How can I find out why
See troubleshooting venus.
When this happens it might not be possible to restart Venus if it is still mounted.
In this case try to unmount by typing
If that does not work, you will have to reboot the machine.
- How can I find out more about what has happened
Look in the file
c:\vxd.log. The file system driver
codadev.vxd prints information about all requests and answers in this
file. The information is only stored if the debug level has been turned on. The debug
level is specified in the registry
Set the debug level higher than 0 to receive messages in the debug file.
- I hook my running machine off the network and the explorer blocks.
Venus switches to disconnected mode after a short timeout. After that it
should work fine. If it doesn't, check if you have 'network connections' set
up in the explorer (e.g. samba drive). 'Network connections' block your
system, when no network is available.
- Most command line tools, that talk to Venus through the ioctl interface of the
Coda kernel module seem to work even when they print error messages.
- Handling large files (in particular executables) does not work well in a low
hoard.exe use absolute pathnames so far.
- Short filenames are not supported under the DOS environment yet. You can access
files, but you need to use the long filenames.