Investigation of gridftp directory problems and RSV probe failures

Background

The RSV gridftp-simple-probe has been known to fail from time to time.

This probe does the following:
  1. use globus-url-copy to copy a file from local disk to dCache
  2. use globus-url-copy to copy the file back to local disk
  3. use diff to compare the original and retrieved files

Currently, the code has some sleep commands inserted into it, for which the purpose is unclear - the authors of the code don't supply adequate justification, so the sleep calls can considered to be kludges.

There is a 5-second sleep between the first and second steps, as if we can't trust that the file was stored after the first copy command returned. In fact, the gridftp server should not return a success code until the file is stored, so this sleep should not be necessary, and is likely masking some other bug.

There is another, even more troubling sleep between the second and third steps, this one is currently set at 30 seconds. Since the file has been stored to local disk, there should be no latency at all here (nothing like NFS is involved), so there is no reason for a 30-second sleep. Furthermore, this is troubling in its arbitrariness - the figure of 30 seconds was arrived at because an earlier value of 5 was found not to work. Clearly, it is necessary to take a closer look and find out what is really going on.

Investigation

The failing probe

The code which is failing looks like this:

sub Do_Gridftp_File_Transfer_From {
    my %h_tmp         = %{$_[0]};
    my %h_tmp_gridftp = %{$_[1]};
    $h_tmp{'callingRoutine'} = "Do_Gridftp_File_Transfer_From()";
    
    my $gridftp_cmd = "$h_tmp{'globusurlcopyCmd'} $h_tmp{'globusurlcopyRemoteTestfileUri' $h_tmp{'globusurlcopyBackTestfileUri'} 2>&1";
    print STDERR " VERBOSE: Sleeping for $h_tmp{'gridftpDelaySeconds'} seconds before attempting to transfer file back ... \n" if (defined($h_tmp{'verbose'})); 
    sleep ($h_tmp{'gridftpDelaySeconds'});
    %h_tmp_gridftp = %{&Run_Command (\%h_tmp, \%h_tmp_gridftp, $gridftp_cmd, "backtick")};
    sleep 30;
    %h_tmp_gridftp =  %{&Check_Gridftp_Transfer_Status (\%h_tmp, \%h_tmp_gridftp, "false")};
    return \%h_tmp_gridftp;
}

It's a little hard to see what this code is doing. The sleep calls are evident, but the call to globus-url-copy is indirect, via the function Run_Command.

In the same file, there is also a comment which reads:

## KNOWN BUGS
## <<1>> The Run_Command () routine has a bug that might leave zombies/stray 
##   processes if a child process times out and is killed. It appears to be 
##   well known problem in Perl programming circles with no easy fix. 
## <<2>> The Run_Command () routine has a bug where in, the cmdOut is not
##   set properly in certainly situations if the system() option is used.
##

A simple test

In order to debug in a simpler environment, without the complexities of the above Perl, I tried the simplest possible test in shell:

ls > /tmp/testfile 
for x in {1..100}; do  
    dst=gsiftp://uct2-dc1.uchicago.edu/pnfs/uchicago.edu/test/cgw/testfile.$x 
    globus-url-copy file:///tmp/testfile $dst 
    globus-url-copy $dst file:///tmp/testfile.$x 
    diff /tmp/testfile /tmp/testfile.$x 
done 

This runs 100 passes without error. (In fact, a slightly different version ran over 2800 passes before I hit ctrl-C).

Improving the test

I wanted to turn the above snippet of shell into a usable script, so I changed hardcoded /pnfs paths into variables. I also didn't want to assume the existence of the target directory, so I added a call to mkdir. The improved script looks like this: gridftp_test_00.sh.

This is where I ran across a problem in dCache.

Apparently, the dCache gridftp door over-aggressively caches directory information, so that when new dirs are created, the gridftp door doesn't "know" about them right away. (Similar problems with creating directories in dCache have been noted previously, e.g. here)

If the test script creates the directory "up-front", then does some number of loops of "copy in, copy out, diff", then the first few iterations will often fail, after which the test works reliably. One might be tempted to sleep, but this is masking a fundamental error, and also one is faced with the arbitrariness of how long to sleep - is 5 seconds enough? Is 30 seconds enough?

Instead, I decided to try to provoke the error as much as possible by deleting and creating the directory inside the loop.

I also created equivalent tests which use dccp and srmcp to perform the same "copy in, copy out, diff" operation.

All of the test scripts are available at http://repo.mwt2.org/viewvc/tests The scripts have various small enhancements compared to the code example above, for example, reporting elapsed real-time, and cleaning up the test files, unless an error occurred, in which case the test files are kept as debugging aids.

Results

Here's the result of runtests.sh on uct2-grid6
dccp_test.sh
Finished 20 passes, all OK
real    0m33.790s

gridftp_test.sh
Finished 20 passes, 9 failure(s)
real    0m44.949s

srmcp_test.sh
Finished 20 passes, all OK
real    6m25.256s

The tests with dccp and srmcp did not exhibit the same directory creation bug as gridftp. The gridftp_test failure rate varies, from 0 to 100% but typically about 50% of the test transfers fail when the destination directory has been freshly created.

To investigate further, I inserted an ls in between the copy-in and the copy-out, and at this point, the file is present and has non-zero size according to /pnfs.

Here's the output of a test run where the commands are being echoed as they are excuted:

globus-url-copy file:///tmp/gridftp-test/testfile gsiftp://uct2-dc1.uchicago.edu /pnfs/uchicago.edu/test/cgw/tmp/gridftp-test/testfile7

ls -l /pnfs/uchicago.edu/test/cgw/tmp/gridftp-test/testfile7
-rw-r--r--  1 usatlas3 usatlas 1024 Sep  1 16:05 /pnfs/uchicago.edu/test/cgw/tmp/gridftp-test/testfile7

globus-url-copy gsiftp://uct2-dc1.uchicago.edu/pnfs/uchicago.edu/test/cgw/tmp/gridftp-test/testfile7 file:///tmp/gridftp-test/testfile7
error: globus_ftp_client_state.c:globus_l_ftp_client_connection_error:4580: 
the server responded with an error
550 File not found

diff --brief /tmp/gridftp-test/testfile /tmp/gridftp-test/testfile7
Files /tmp/gridftp-test/testfile and /tmp/gridftp-test/testfile7 differ

This is the "smoking gun" - the file is there, according to ls, but attempting to retrieve it via globus-url-copy fails. The file /tmp/gridftp-test/testfile7 is present but empty (0 bytes). This is similar to the failure mode observed in the RSV probes.

Conclusions

  • Directory creation is problematic for dcache gridftp and the failure can be reproduced reliably.
  • Other than this new directory problem, there is no other evidence of dcache or local filesytem latency, hence no need for sleep in RSV probes.
  • Relative performance of dccp, globus-url-copy, srmcp (20 transfers of 1K file):
        dccp_test.sh:    Finished 20 passes, all OK  0m19.240s
        gridftp_test.sh: Finished 20 passes, all OK  0m54.562s
        srmcp_test.sh:   Finished 20 passes, all OK  7m34.572s
    • Note that globus-url-copy is about 3 times slower than dccp, and srmcp is about 20 times slower than dccp.

Suggestions

There is a bug related to creation of directories in dCache which only affects the gridftp door. Tests with dccp and srmcp work correctly and show no latency effects or failures. The problem is specific to the gridftp door, and only occurs after creation of a new directory. This is not a new issue.

We should work with OSG-STORAGE to get this dCache bug resolved. I might be able to help, since I helped write the original dCache gridftp implementation - but I'd need access to the current source.

It would also be good to review the RSV probe implementation. I suggest eliminating the use of the buggy Run_Command function (possibly moving from Perl to shell), and removing all unexplained/unjustified sleep commands.

Workaround

Brian Bockelman suggests:
 
This problem is completely solved if you switch the PermissionHandlerDataSource for the GridFTP door, 
which suggests something might be wrong in PnfsFileMetaDataSource orPnfsFile

Adding the line
PermissionHandlerDataSource=diskCacheV111.services.PnfsManagerFileMetaDataSource 
to /opt/d-cache/config/gridftpdoorSetup (and restarting the gridFTP door) has fixed the problem. gridftp_test.sh runs 200 passes with no error.

-- CharlesWaldman - 01 Sep 2008
Topic revision: r4 - 08 Sep 2008, CharlesWaldman
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback