Investigation of gridftp directory problems and RSV probe failures
Background
The RSV
gridftp-simple-probe
has been known to fail from time to time.
This probe does the following:
- use
globus-url-copy
to copy a file from local disk to dCache
- use
globus-url-copy
to copy the file back to local disk
- use
diff
to compare the original and retrieved files
Currently, the code has some
sleep
commands inserted into it, for which
the purpose is unclear - the authors of the code don't supply adequate justification,
so the
sleep
calls can considered to be kludges.
There is a 5-second
sleep
between the first and second steps, as if we can't trust
that the file was stored after the first copy command returned. In fact, the
gridftp server should not return a success code until the file is stored, so this
sleep
should not be necessary, and is likely masking some other bug.
There is another, even more troubling
sleep
between the second and third steps,
this one is currently set at 30 seconds. Since the file has been stored to local disk,
there should be no latency at all here (nothing like NFS is involved), so there
is no reason for a 30-second sleep. Furthermore, this is troubling in its arbitrariness -
the figure of 30 seconds was arrived at because an earlier value of 5 was found not
to work. Clearly, it is necessary to take a closer look and find out what is really
going on.
Investigation
The failing probe
The code which is failing looks like this:
sub Do_Gridftp_File_Transfer_From {
my %h_tmp = %{$_[0]};
my %h_tmp_gridftp = %{$_[1]};
$h_tmp{'callingRoutine'} = "Do_Gridftp_File_Transfer_From()";
my $gridftp_cmd = "$h_tmp{'globusurlcopyCmd'} $h_tmp{'globusurlcopyRemoteTestfileUri' $h_tmp{'globusurlcopyBackTestfileUri'} 2>&1";
print STDERR " VERBOSE: Sleeping for $h_tmp{'gridftpDelaySeconds'} seconds before attempting to transfer file back ... \n" if (defined($h_tmp{'verbose'}));
sleep ($h_tmp{'gridftpDelaySeconds'});
%h_tmp_gridftp = %{&Run_Command (\%h_tmp, \%h_tmp_gridftp, $gridftp_cmd, "backtick")};
sleep 30;
%h_tmp_gridftp = %{&Check_Gridftp_Transfer_Status (\%h_tmp, \%h_tmp_gridftp, "false")};
return \%h_tmp_gridftp;
}
It's a little hard to see what this code is doing. The
sleep
calls are evident, but the call to
globus-url-copy
is indirect, via the function
Run_Command
.
In the same file, there is also a comment which reads:
## KNOWN BUGS
## <<1>> The Run_Command () routine has a bug that might leave zombies/stray
## processes if a child process times out and is killed. It appears to be
## well known problem in Perl programming circles with no easy fix.
## <<2>> The Run_Command () routine has a bug where in, the cmdOut is not
## set properly in certainly situations if the system() option is used.
##
A simple test
In order to debug in a simpler environment, without the complexities of the above
Perl, I tried the simplest possible test in shell:
ls > /tmp/testfile
for x in {1..100}; do
dst=gsiftp://uct2-dc1.uchicago.edu/pnfs/uchicago.edu/test/cgw/testfile.$x
globus-url-copy file:///tmp/testfile $dst
globus-url-copy $dst file:///tmp/testfile.$x
diff /tmp/testfile /tmp/testfile.$x
done
This runs 100 passes without error. (In fact, a slightly different version ran over 2800
passes before I hit ctrl-C).
Improving the test
I wanted to turn the above snippet of shell into a usable script, so
I changed hardcoded
/pnfs
paths into variables. I also didn't want
to assume the existence of the target directory, so I added a call to
mkdir
. The improved script looks like this:
gridftp_test_00.sh.
This is where I ran across a problem in dCache.
Apparently, the dCache gridftp door over-aggressively caches directory
information, so that when new dirs are created, the gridftp door doesn't
"know" about them right away. (Similar problems with creating directories
in dCache have been noted previously, e.g.
here)
If the test script creates the directory "up-front", then does some number of
loops of "copy in, copy out, diff", then the first few iterations will often fail,
after which the test works reliably. One might be tempted to
sleep
, but
this is masking a fundamental error, and also one is faced with the arbitrariness
of how long to sleep - is 5 seconds enough? Is 30 seconds enough?
Instead, I decided to try to provoke the error as much as possible by deleting
and creating the directory inside the loop.
I also created equivalent tests which use
dccp
and
srmcp
to perform the
same "copy in, copy out, diff" operation.
All of the test scripts are available at
http://repo.mwt2.org/viewvc/tests
The scripts have various small enhancements compared to the code example
above, for example, reporting elapsed real-time, and cleaning up the test files,
unless an error occurred, in which case the test files are kept as debugging aids.
Results
Here's the result of
runtests.sh
on
uct2-grid6
dccp_test.sh
Finished 20 passes, all OK
real 0m33.790s
gridftp_test.sh
Finished 20 passes, 9 failure(s)
real 0m44.949s
srmcp_test.sh
Finished 20 passes, all OK
real 6m25.256s
The tests with
dccp
and
srmcp
did not exhibit the same directory
creation bug as
gridftp
. The
gridftp_test
failure rate varies,
from 0 to 100% but typically about 50% of the test transfers fail
when the destination directory has been freshly created.
To investigate further, I inserted an
ls
in between the copy-in and the
copy-out, and at this point, the file is present and has non-zero size according
to
/pnfs
.
Here's the output of a test run where the commands are being echoed as they
are excuted:
globus-url-copy file:///tmp/gridftp-test/testfile gsiftp://uct2-dc1.uchicago.edu /pnfs/uchicago.edu/test/cgw/tmp/gridftp-test/testfile7
ls -l /pnfs/uchicago.edu/test/cgw/tmp/gridftp-test/testfile7
-rw-r--r-- 1 usatlas3 usatlas 1024 Sep 1 16:05 /pnfs/uchicago.edu/test/cgw/tmp/gridftp-test/testfile7
globus-url-copy gsiftp://uct2-dc1.uchicago.edu/pnfs/uchicago.edu/test/cgw/tmp/gridftp-test/testfile7 file:///tmp/gridftp-test/testfile7
error: globus_ftp_client_state.c:globus_l_ftp_client_connection_error:4580:
the server responded with an error
550 File not found
diff --brief /tmp/gridftp-test/testfile /tmp/gridftp-test/testfile7
Files /tmp/gridftp-test/testfile and /tmp/gridftp-test/testfile7 differ
This is the "smoking gun" - the file is there, according to
ls
, but attempting to retrieve it via
globus-url-copy
fails. The file
/tmp/gridftp-test/testfile7
is present but empty (0 bytes). This is similar to the failure mode observed
in the RSV probes.
Conclusions
- Directory creation is problematic for dcache gridftp and the failure can be reproduced reliably.
- Other than this new directory problem, there is no other evidence of dcache or local filesytem latency, hence no need for
sleep
in RSV probes.
- Relative performance of
dccp
, globus-url-copy
, srmcp
(20 transfers of 1K file):
dccp_test.sh: Finished 20 passes, all OK 0m19.240s
gridftp_test.sh: Finished 20 passes, all OK 0m54.562s
srmcp_test.sh: Finished 20 passes, all OK 7m34.572s
-
- Note that
globus-url-copy
is about 3 times slower than dccp
, and srmcp
is about 20 times slower than dccp
.
Suggestions
There is a bug related to creation of directories in dCache which only affects the
gridftp door. Tests with
dccp
and
srmcp
work correctly and show no
latency effects or failures. The problem is specific to the gridftp door, and only
occurs after creation of a new directory. This is not a new issue.
We should work with OSG-STORAGE to get this dCache bug resolved. I might
be able to help, since I helped write the original dCache gridftp implementation -
but I'd need access to the current source.
It would also be good to review the RSV probe implementation. I suggest eliminating
the use of the buggy Run_Command function (possibly moving from Perl to shell),
and removing all unexplained/unjustified
sleep
commands.
Workaround
Brian Bockelman suggests:
This problem is completely solved if you switch the PermissionHandlerDataSource for the GridFTP door,
which suggests something might be wrong in PnfsFileMetaDataSource orPnfsFile
Adding the line
PermissionHandlerDataSource=diskCacheV111.services.PnfsManagerFileMetaDataSource
to
/opt/d-cache/config/gridftpdoorSetup
(and restarting the gridFTP door) has
fixed the problem.
gridftp_test.sh
runs 200 passes with no error.
--
CharlesWaldman - 01 Sep 2008