projects:bpm-sis18:status

Differences

This shows you the differences between two versions of the page.

projects:bpm-sis18:status [2012/07/11 17:23]
rhaseitl
projects:bpm-sis18:status [2012/07/12 19:00]
rhaseitl
Line 2: Line 2:
  
 Errors occurring sporadically:
-  * the Liberas lose their connection: they appear red in the detailed status panel, giving the status "Software Error" or "BPM Communication Error". The Libera cannot be controlled from TOPOS any more. Most of the time, a reboot of the Liberavia ssh helps.
+  * the Liberas lose their connection: they appear red in the detailed status panel, giving the status "Software Error" or "BPM Communication Error". The Libera cannot be controlled from TOPOS any more. Most of the time, a reboot of the Libera via ssh helps.
 In at least one case, I also had to restart the FESA classes on the CCCPs to make the system work again.
 This happens sometimes during beamtime, or after the system was not used for a while and is started again (= the GUI was closed for a while and is started again).
Line 22: Line 22:
   * connection to BPM established/lost/reconnected
   * debug output at every status change of the system (Initializing, Start, Stop,...)
+  * logging should use the Log4j framework (possible from within the FESA class with SDLog) (HBr)
+  * the GUI should **not** encapsulate exceptions thrown by cmw / rda into its own Exception class (HBr)
+  * a detailed documentation of the meaning of and reasons for each error message, exception etc. should be made (HBr)
  
 \\
 Log on the generic servers (with timestamps!):
   * version number (or similar) at startup
-  * internal register values (when changed, on start, on stop)
+  * internal register values (when changed, on start trigger, on stop trigger)
   * when a start or stop trigger arrives
   * when the ring buffer is full
-  * operating mode (raw, bunch to bunch)
+  * operating mode (raw, bunch to bunch, calibrations, log every change from - to)
   * log any other useful events
+  * log buffer overflows
+  * there seem to be logs on the Liberas under /var/log, but without timestamps. When a separate network for the Liberas is used, the time can't be queried from a global NTP server. -> Set up a "proxy" on the concentrators?
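
The timestamped logging asked for in this list could look roughly like the sketch below. It is written in Python purely for illustration: the real generic servers and FESA classes are not Python, and the notes above suggest Log4j / SDLog for the actual system. The names `make_logger`, `ModeTracker` and `gen_server_sketch` are hypothetical. The sketch shows a timestamp on every entry, a flag to switch logging off at startup, and operating mode changes logged as "from - to".

```python
import logging
import sys


def make_logger(stream=None, enabled=True):
    """Return a logger that prefixes every entry with a timestamp.

    `enabled` plays the role of a startup flag to en-/disable logging.
    """
    logger = logging.getLogger("gen_server_sketch")  # hypothetical name
    logger.handlers.clear()      # avoid duplicate handlers on repeated setup
    logger.propagate = False     # keep output on our own handler only
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream or sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.disabled = not enabled
    return logger


class ModeTracker:
    """Remember the operating mode and log every change as 'from -> to'."""

    def __init__(self, logger, initial="raw"):
        self.logger = logger
        self.mode = initial

    def set_mode(self, new_mode):
        if new_mode == self.mode:
            return  # nothing changed, nothing to log
        self.logger.info("operating mode changed from %s to %s",
                         self.mode, new_mode)
        self.mode = new_mode
```

The same pattern (log only on change, always with a timestamp) would apply to register values and trigger events.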
  
 \\
Line 38: Line 43:
   * is this a lot of work? does it require changes in the gen servers / FPGA code?
 
-**Would it make sense to have simple standalone tool to see if the FESA servers, the BPMs, the gen servers are up and running without an error flag?! In principle this information is provided by the default status panel, but you have to be an expert to interpret some errors.**
+\\
+Connection to the PTIF:
+  Display the connection status and whether a command that has been sent was acknowledged by the PTIF. This can give a hint when the PTIF seems to be reachable via TCP/IP but the FESA class is not sending commands. Test case: pull the network cable and reconnect. The system should be able to detect if the PTIF cannot be controlled afterwards.
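
One way to separate "PTIF reachable" from "commands are actually acknowledged" is to track each sent command until its acknowledgement arrives. The following Python sketch is only an illustration under assumptions: `AckTracker`, the command identifiers and the 2 second timeout are hypothetical, and nothing here reflects the real PTIF protocol.

```python
import time


class AckTracker:
    """Track sent commands and report those not acknowledged in time."""

    def __init__(self, timeout_s=2.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock        # injectable so the logic can be tested
        self.pending = {}         # command id -> time it was sent

    def sent(self, command_id):
        """Record that a command was sent now."""
        self.pending[command_id] = self.clock()

    def acknowledged(self, command_id):
        """Record that the acknowledgement for a command arrived."""
        self.pending.pop(command_id, None)

    def overdue(self):
        """Return the ids of commands still unacknowledged after the timeout."""
        now = self.clock()
        return sorted(cid for cid, sent_at in self.pending.items()
                      if now - sent_at > self.timeout_s)
```

A status panel could show a warning whenever `overdue()` is non-empty while the TCP/IP connection itself still looks healthy, which is the situation of the cable test case.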
+
+Have a standalone tool to see ALL system components directly: FESA server classes, CCCPs, Liberas (pingable), the gen servers (are up and running?), and show error flags. It might be a good idea to have a button for each component to perform a check, e.g. a ping for the Liberas and CCCPs, or a data query from the FESA classes.
+Some of this information is provided by the detailed status panel, but you have to be an expert to interpret some errors.
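
A minimal sketch of such a standalone checker, in Python and under stated assumptions: each component gets one check function, and the tool reports a plain result per component. `tcp_reachable` and `run_checks` are hypothetical names. A TCP connection attempt is used here in place of a real ICMP ping, because ICMP usually needs elevated privileges; host names and ports are up to the caller and are not taken from the real system.

```python
import socket


def tcp_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks):
    """Run one check per component and return a readable result for each.

    `checks` maps a component name to a callable returning True or False.
    A check that raises is reported as an error instead of stopping the run.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "OK" if check() else "FAILED"
        except Exception as exc:
            results[name] = f"ERROR: {exc}"
    return results
```

Each entry in `checks` corresponds to one "button" in the proposed tool, for example `{"libera": lambda: tcp_reachable(host, 22)}` with `host` supplied by the operator. Reporting only OK / FAILED / ERROR keeps the output readable for non-experts.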
  
 == Goals ==
Line 44: Line 55:
 Add logging output to all system components to know what is going on in each component for each status change. There should be a flag to en-/disable logging at startup.
 Provide tools to observe the health status of the system components.
+
+
+
+Some more considerations (MSchw):
+  * I strongly support having as much logging information as possible, e.g. written to a text file
+  * From my point of view it is very important that we clearly understand what SHOULD happen, e.g. when the user presses a button, BEFORE we try to understand why something we intend to do DOES NOT happen.
+  * I would recommend having one or several (very basic, synoptic, NOT code-level) diagrams of the internal process flow. These should be created by SD (Rainer?) together with Cosylab. The diagrams should use the same status names/exceptions as the logfiles proposed above, thus helping to understand the history of error states.
+  * I also support the idea of a stand-alone diagnostic tool as described above; just make sure the displayed information is clearly defined and leaves little room for misinterpretation.
+
  
  
  
  
projects/bpm-sis18/status.txt · Last modified: 2012/07/13 09:11 by klang