Links
Monitoring
Monitoring2
Logs and tickets
Documents
ATLAS at BNL
Production caches and important files:
The first holds the release Distribution Kits and AtlasPoint1,
the second holds the TRFs, and the third is a cache.
Analysis
Old servers were
http://panda.cern.ch:25980/server/pandamon/ and
http://pandamon.usatlas.bnl.gov:25880/.
Clients
Meetings
Monitoring notes and gotchas
Some notes about time and timestamps in different monitoring sites:
- Panda monitor uses UTC (date -u); Chicago is 6 hours behind, 5 during DST
- ARDA dashboard (DDM) uses CERN time; Chicago is 7 hours behind
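To compare timestamps across the monitoring sites, it can help to print the same instant in each zone. The following is a minimal sketch, assuming GNU date (for the -d @SECONDS form) and an installed time zone database. The function name show_time is made up for this example, and Europe/Zurich is used here as the zone name for CERN time.

```shell
#!/bin/sh
# Show one instant (seconds since the epoch) in a given time zone.
# show_time is a hypothetical helper; it needs GNU date for -d @SECONDS.
show_time() {
    # $1 = time zone name, $2 = seconds since the epoch
    TZ="$1" date -d "@$2" '+%Y-%m-%d %H:%M:%S %Z'
}

now=$(date +%s)
show_time UTC "$now"              # what Panda monitor shows
show_time Europe/Zurich "$now"    # CERN time, as on the ARDA dashboard
show_time America/Chicago "$now"  # local time in Chicago
```

Passing a fixed epoch value instead of $now converts a timestamp copied from one of the monitors.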
ATLAS membership
See the documents.
Tasks
Check site
To check if a site is available
Report Panda monitor problems
The Panda monitor sometimes shows problems such as errors when you press a button, time-outs, or slow response. To help the experts debug and fix the issue, please report the following information the next time you see such a problem in your browser:
1. Any error in Panda monitor received through the browser.
- cut and paste the error text from your browser
- URL string you used in the browser
- your IP address
- type of browser you use
2. Time-outs in Panda monitor/browser
- cut and paste the time-out error, including the time if available
- whether the time-out is reproducible
- URL string you used in the browser
- your IP address
- type of browser you use
3. Slow response in Panda monitor (no error)
- how long it takes to execute the function in the browser
- URL string you used in the browser
- button you used in the Panda monitor (or the query)
- your IP address and the type of browser you use
From Yury, email, 230608
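The items requested above can be collected into a short text block before mailing the experts. This is only a sketch: the function name panda_report and the field labels are invented for this example, and the browser, IP address and symptom in the sample call are placeholders.

```shell
#!/bin/sh
# Print a problem report skeleton for the Panda monitor experts.
# panda_report and its field labels are hypothetical, not an official format.
panda_report() {
    # $1 = URL, $2 = browser, $3 = IP address, $4 = error text or symptom
    printf 'Time (UTC): %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S')"
    printf 'URL:        %s\n' "$1"
    printf 'Browser:    %s\n' "$2"
    printf 'IP address: %s\n' "$3"
    printf 'Problem:    %s\n' "$4"
}

# Sample call with placeholder values
panda_report \
    'http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?mode=archive&site=MWT2_IU&hours=12&sort=endTime' \
    'Firefox' '192.0.2.10' 'time-out, reproducible'
```

The time is stamped in UTC so that it matches the Panda monitor timestamps.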
Pilot queue control
This command controls the queue length (tpmes=setnqueue&nqueue=). The example below sets nqueue=10 for the queue UC_ATLAS_MWT2-condor:
$ curl 'http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=setnqueue&nqueue=10&queue=UC_ATLAS_MWT2-condor'
Each queue can be set manual/auto (defines how queue parameters are updated) and online/offline (defines whether ATLAS jobs are assigned to the queue), so the possible values of tpmes are setmanual, setauto, setoffline and setonline:
curl -sS 'http://gridui07.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=setmanual&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'
curl -sS 'http://gridui07.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=setoffline&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'
curl -sS 'http://gridui07.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=setonline&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'
curl -sS 'http://gridui07.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=setauto&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'
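Because these URLs change the state of a production queue, it may be safer to build the URL with a small helper, read it, and only then send it. The sketch below assumes the query URL layout seen in the commands above. The names PANDAMON and queue_url are made up for this example.

```shell
#!/bin/sh
# Base query URL, taken from the commands above; override it to use another server.
PANDAMON=${PANDAMON:-http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query}

# Build (but do not send) a pilot queue control URL.
# queue_url is a hypothetical helper, not part of Panda.
queue_url() {
    # $1 = action, $2 = queue name, $3 = queue length (setnqueue only)
    case "$1" in
        setmanual|setauto|setoffline|setonline)
            printf '%s?tpmes=%s&queue=%s\n' "$PANDAMON" "$1" "$2" ;;
        setnqueue)
            printf '%s?tpmes=setnqueue&nqueue=%s&queue=%s\n' "$PANDAMON" "$3" "$2" ;;
        *)
            echo "unknown action: $1" >&2
            return 1 ;;
    esac
}

# Print the URL first; send it only once it looks right.
queue_url setnqueue UC_ATLAS_MWT2-condor 10
# curl -sS "$(queue_url setnqueue UC_ATLAS_MWT2-condor 10)"
```

Rejecting unknown actions catches a mistyped tpmes value before anything reaches the server.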
Find task bugs
Some failures are not site-related. E.g. all jobs of a task failed with the same error, up to 7 attempts each, at different sites:
MWT2_UC,BNL_ATLAS_1,SLACXRD,OU_OCHEP_SWT2,AGLT2,etc.
Such a task probably needs to be ABORTED.
Before submitting a new bug about the task in question, check whether a bug related to this
task is already in Savannah. You can use the Panda monitor "Quick Search" in the left yellow bar. Type the task ID (e.g. 23381) in "Task request" and you will get
http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?qTID=23381&mode=taskquery&qsubmit=QuerySubmit
"Bug report" shows the corresponding # in Savannah (previous one was overwritten by the current one).
There is another way to check whether the bug was already submitted - it's a direct search in Savannah:
https://savannah.cern.ch/bugs/?func=search&group=validation
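The task query URL shown above can also be built from the command line, which is handy when checking several task IDs. This is a minimal sketch assuming the URL layout of the task 23381 example. The names PANDAMON and task_url are made up for this example.

```shell
#!/bin/sh
# Base query URL, taken from the task query example above.
PANDAMON=${PANDAMON:-http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query}

# Build the Panda monitor task query URL for one task ID.
# task_url is a hypothetical helper, not part of Panda.
task_url() {
    # $1 = numeric task ID
    case "$1" in
        ''|*[!0-9]*)
            echo "task ID must be a number: $1" >&2
            return 1 ;;
    esac
    printf '%s?qTID=%s&mode=taskquery&qsubmit=QuerySubmit\n' "$PANDAMON" "$1"
}

task_url 23381
# Open the printed URL in a browser, or fetch the page with:
# curl -sS "$(task_url 23381)"
```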
IM use
There is an 'atlasshift' chat room with no password in Google Talk (using partychat)
To join it:
- If you have no Google account, go to http://www.google.com/talk/, download a client or open the web one, and register
- open your favorite Google Talk client
- add partychat#@gmail.com (# is any digit 0-9) as your buddy: this is not a real person; it is a service gateway. Steps 2-3 are documented also at http://techwalla.googlepages.com/
- open a message window to talk with partychat#@gmail.com (double click, IM, ...)
- send the message '/join atlasshift'. If you are using Pidgin or another client that supports IRC-like commands (you will see 'Unknown command error' instead of
your message), type '/say /join atlasshift' instead (this protects your message for partychat). Partychat will reply that you have joined the chat room
- '/list' to see who is there
- type something to talk with all the other users in the chat room
- '/exit' to go out (exit the chat room)
'atlasshift' is a temporary name. If you don't like it I can change it, as long as we agree on something.
Production servers
Here is a document on logging in to RACF (BNL) machines:
https://www.racf.bnl.gov/docs/howto/interactive/login
I needed to contact Dantong and provide my public key. atlasgw forwards the ssh-agent (the key used to log in there).
gridui03,7,9 are the ones that run autopilots. gridui05,6,7 also need healthy proxies: the Panda server and monitor run there (the monitor needs a proxy for DQ2
and LFC access and logfile retrieval). gridui01 is now just an alias; the former gridui01 is now a development machine. The new servers gridui05+ require that
each person requiring access to sm be specifically authorized. John or Dantong can do this (and John has said shifters can and should do this)
- Torre
Release updates
Releases installed from 14.1 onward can be checked in BDII
To find out which tags (options) are possible in the ATLAS kit setup, check the AtlasLogin and AtlasSetup files:
https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasLogin
Sometimes these files need to be updated
Pacman commands
pacman -update-check
can be used to check for updates in packages. It is not intrusive (it writes to log files only). For more information on updating carefully:
http://physics.bu.edu/pacman/htmls/Updating.html
To update selectively you can mention the package names:
% pacman -update AtlasSettings AtlasLogin
Subversion (SVN)
Help:
Browsing:
export SVNROOT=svn+ssh://mambelli@svn.cern.ch/reps/atlasoff
cmt co [-r ] [/]
svn co $SVNROOT/[/]/trunk
Dist Analysis
For support:
hn-atlas-dist-analysis-help@cern.ch
MWT2 info
Best practices
Check for installed releases
https://atlas-install.roma1.infn.it/atlas_install/
Ticket submission
GGUS tickets submitted during the weekend are not useful (they are not routed until Monday).
It is better to route the ticket directly and/or add the site contact in CC: the site will receive the request and GGUS will still track the problem
Notes
Sure we can add, but could you be more specific? I did not follow your
discussion so I need more explanations which services you want to send
notifications to this address.
And by the way: there are already pages which show various open tickets
on www, without the need to logon to rt. go to rt page
https://rt-racf.bnl.gov/rt/
and select one of the links on the bottom half of the page.
Or you can go directly to
http://rt-racf.bnl.gov/rss/tier2.rss
for tier2 related tickets,
to
http://rt-racf.bnl.gov/rss/nagios.rss
for nagios tickets, and to
http://rt-racf.bnl.gov/rss/itb.rss
for osg tickets.
You can see the text of the original ticket and the first response. This
should be enough for fast browsing. To see more you will have to logon
to rt.
Please let me know if this is what you want.
To do
- check cacti
- Differential page
- Error page with results like:
http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?mode=archive&site=MWT2_IU&hours=12&sort=endTime
http://gridui05.usatlas.bnl.gov:25880/server/pandamon/query?hours=6&overview=errorlist
--
MarcoMambelli - 06 Jun 2008