Monitoring

Monitoring2

Logs and tickets

Documents

ATLAS at BNL

Production caches and important files:

In the 1st one are the release Distribution Kits and AtlasPoint1, in the 2nd one are the TRFs. The 3rd one is a cache.

Analysis

Old servers were http://panda.cern.ch:25980/server/pandamon/ and http://pandamon.usatlas.bnl.gov:25880/.

Clients

Meetings

Monitoring notes and gotchas

Some notes about time and timestamps on the different monitoring sites:
  • The Panda monitor uses UTC (check with date -u). Chicago is 6 hours behind UTC, 5 hours during DST.
  • The ARDA dashboard (DDM) uses CERN time. Chicago is 7 hours behind CERN.
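
When comparing timestamps between the two monitors it can help to print the same instant in all three zones. The following is a minimal sketch, assuming GNU date and an installed time zone database; the function name show_times is hypothetical, and Europe/Zurich is used here as a stand-in for CERN local time.

```shell
# Print one UTC timestamp in UTC, Chicago and CERN local time.
# Assumes GNU date (for -d) and the system time zone database.
show_times() {
    ts="$1"    # e.g. "2010-01-15 12:00", interpreted as UTC
    for zone in UTC America/Chicago Europe/Zurich; do
        # TZ selects the zone used for output; the input is pinned to UTC
        printf '%-16s %s\n' "$zone" \
            "$(TZ="$zone" date -d "$ts UTC" '+%Y-%m-%d %H:%M %Z')"
    done
}

show_times "2010-01-15 12:00"
```

Note that the US and European DST changes do not fall on the same dates, so for a few weeks each year the Chicago to CERN difference is 6 hours rather than 7.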

ATLAS membership

See the documents.

Tasks

Check site

To check if a site is available

Report Panda monitor problems

The Panda monitor sometimes shows problems such as errors when you press certain buttons, time-outs, or slow responses. To help the experts debug and fix the issue, please report the following information the next time you see such a problem in your browser:

1. Any error in Panda monitor received through the browser.
  • please cut and paste the error-text in your browser
  • URL-string you use in the browser
  • your ip-address
  • type of the browser you use
2. Time-outs in Panda-monitor/browser
  • cut and paste time-out error, including the time if available
  • whether the time-out is reproducible
  • URL-string you use in the browser
  • your ip-address
  • type of the browser you use
3. Slow response in Panda-monitor (no error)
  • How long does it take to execute the function in the browser
  • URL-string you use in the browser
  • button you use in Panda-mon (or the query)
  • your IP-address and the type of the browser
From Yury, email, 230608

Pilot queue control

This command sets the queue length (tpmes=setnqueue&nqueue=). The example below sets nqueue=10 for the queue UC_ATLAS_MWT2-condor:
$ curl 'http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=setnqueue&nqueue=10&queue=UC_ATLAS_MWT2-condor'
Each queue can be set to manual or auto (which defines how the queue parameters are updated) and to online or offline (which defines whether ATLAS jobs are assigned to the queue). The possible values of tpmes= are therefore setmanual, setauto, setoffline and setonline:
$ curl -sS 'http://gridui07.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=setmanual&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'
$ curl -sS 'http://gridui07.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=setoffline&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'
$ curl -sS 'http://gridui07.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=setonline&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'
$ curl -sS 'http://gridui07.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=setauto&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'
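
To avoid typos when composing these URLs by hand, the queue-control commands above could be wrapped in a small helper. This is only a sketch: the function name queue_url and the variable PANDAMON_BASE are hypothetical, the helper only builds the URL (it sends nothing), and it assumes queue names contain only URL-safe characters, as in the examples above.

```shell
# Base query URL, taken from the setnqueue example above.
PANDAMON_BASE='http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query'

# Print the URL for a queue-control action without sending it.
# Usage: queue_url ACTION QUEUE [NQUEUE]
queue_url() {
    action="$1"; queue="$2"; nqueue="${3:-}"
    case "$action" in
        setmanual|setauto|setoffline|setonline)
            printf '%s?tpmes=%s&queue=%s\n' "$PANDAMON_BASE" "$action" "$queue" ;;
        setnqueue)
            # setnqueue is the only action that takes a queue length
            if [ -z "$nqueue" ]; then
                echo "setnqueue needs a queue length" >&2
                return 1
            fi
            printf '%s?tpmes=setnqueue&nqueue=%s&queue=%s\n' \
                "$PANDAMON_BASE" "$nqueue" "$queue" ;;
        *)
            echo "unknown action: $action" >&2
            return 1 ;;
    esac
}

queue_url setnqueue UC_ATLAS_MWT2-condor 10
```

To actually apply a change, the printed URL can be passed to curl, for example: curl -sS "$(queue_url setoffline Taiwan-LCG2-lcg00125-atlas-lcgpbs)"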

Find task bugs

A failure is not always a site-related issue. For example, if all jobs of a task failed with the same error, up to 7 attempts each, at different sites (MWT2_UC, BNL_ATLAS_1, SLACXRD, OU_OCHEP_SWT2, AGLT2, etc.), the task probably needs to be ABORTED.

Before submitting a new bug about the task in question, check whether a bug related to this task is already in Savannah. In the Panda monitor, use "Quick Search" in the left yellow bar: type the task ID (for example 23381) in "Task request" and you will get http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?qTID=23381&mode=taskquery&qsubmit=QuerySubmit. "Bug report" shows the corresponding bug number in Savannah (any previous number was overwritten by the current one).
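
The same task query can also be opened directly by URL. A minimal sketch, assuming the query parameters shown in the example URL above stay unchanged; the function name task_query_url is hypothetical.

```shell
# Print the Panda monitor task-query URL for a numeric task ID.
task_query_url() {
    case "$1" in
        # reject empty input and anything that is not all digits
        ''|*[!0-9]*) echo "task ID must be a number" >&2; return 1 ;;
    esac
    printf 'http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?qTID=%s&mode=taskquery&qsubmit=QuerySubmit\n' "$1"
}

task_query_url 23381
```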

Another way to check whether the bug was already submitted is a direct search in Savannah: https://savannah.cern.ch/bugs/?func=search&group=validation

IM use

There is an 'atlasshift' chat room with no password in Google Talk (using partychat)

To join it:
  1. If you have no Google account, go to http://www.google.com/talk/, download a client or open the web one, and register
  2. Open your favorite Google Talk client
  3. Add partychat#@gmail.com (# is any digit 0-9) as your buddy: this is not a real person; it is a service gateway. Steps 2-3 are also documented at http://techwalla.googlepages.com/
  4. Open a message window to talk with partychat#@gmail.com (double click, IM, ...)
  5. Send the message '/join atlasshift'. If you are using Pidgin or another client that supports IRC-like commands (you will see 'Unknown command error' instead of your message), type '/say /join atlasshift' instead (this protects your message for partychat). Partychat will reply to you saying that you joined the chat room
  6. Send '/list' to see who is there
  7. Type something to talk with all the other users in the chat room
  8. Send '/exit' to leave the chat room

'atlasshift' is a temporary name. If you don't like it I can change it, as long as we agree on something.

Production servers

Here is a document on logging in to RACF (BNL) machines: https://www.racf.bnl.gov/docs/howto/interactive/login

I needed to contact Dantong and provide my public key. atlasgw forwards the ssh-agent (the key used to log in there).

gridui03, 7 and 9 are the ones that run autopilots. gridui05, 6 and 7 also need healthy proxies, because the Panda server and monitor run there (the monitor needs a proxy for DQ2 and LFC access and for logfile retrieval). gridui01 is now just an alias; the former gridui01 is now a development machine. The new servers gridui05+ require that each person requiring access to sm be specifically authorized. John or Dantong can do this (and John has said shifters can and should do this) - Torre

Release updates

Installed releases, from 14.1 onward, can be checked in the BDII.

To know which tags (options) are possible in ATLAS kit setup you can check AtlasLogin and AtlasSetup files: https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasLogin

Sometimes these files need to be updated.

Pacman commands

pacman -update-check can be used to check for updates in packages. It is not intrusive (it writes to log files only). For more information on updating carefully, see: http://physics.bu.edu/pacman/htmls/Updating.html

To update selectively you can mention the package names:

% pacman -update AtlasSettings AtlasLogin

Subversion (SVN)

Help: Browsing:
export SVNROOT=svn+ssh://mambelli@svn.cern.ch/reps/atlasoff
cmt co [-r ] [/]
svn co $SVNROOT/[/]/trunk 

Dist Analysis

For support: hn-atlas-dist-analysis-help@cern.ch

MWT2 info

Best practices

Check for installed releases

https://atlas-install.roma1.infn.it/atlas_install/

Ticket submission

GGUS tickets submitted during the weekend are not useful (they are not routed until Monday). It is better to route the ticket directly and/or add the site contact in CC: the site will receive the request and GGUS will still track the problem.

Notes

Sure we can add, but could you be more specific? I did not follow your
discussion so I need more explanations which services you want to send
notifications to this address.

And by the way: there are already pages which show various open tickets
on www, without the need to logon to rt. go to rt page

https://rt-racf.bnl.gov/rt/

and select one of the links on the bottom half of the page.

Or you can go directly to

http://rt-racf.bnl.gov/rss/tier2.rss

for tier2 related tickets, to

http://rt-racf.bnl.gov/rss/nagios.rss

for nagios tickets, and to

http://rt-racf.bnl.gov/rss/itb.rss

for osg tickets.

You can see the text of the original ticket and the first response. This
should be enough for fast browsing. To see more you will have to logon
to rt.

Please let me know if this is what you want.

To do

  • check cacti
  • Differential page

  • Error page with results like:
http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?mode=archive&site=MWT2_IU&hours=12&sort=endTime
http://gridui05.usatlas.bnl.gov:25880/server/pandamon/query?hours=6&overview=errorlist

-- MarcoMambelli - 06 Jun 2008
Topic revision: r63 - 02 Feb 2010, MarcoMambelli