A Space Systems Testbed for Situated Agent Observability and Interaction
Sam Siewert
Gary Nutt
Abstract
Future space systems must be considerably more autonomous to enable continuing exploration, science investigation, and commercial utilization of space. The need for greater autonomy stems from shrinking government budgets to support scientific missions and from commercial interest in cost-effectively operating large constellations of communications and remote sensing satellites. Most automated space systems are described as "telerobotic". Telerobotic systems include both automated sensor sampling of device and environment state and automated device actuation and control. The highest level of automation is "full autonomy", where a device would simply attempt to meet goals specified prior to a mission with no operator interaction. It is clear that full autonomy is difficult to achieve, especially if the mission is reasonably complicated and if the environment in which the device must operate is challenging (i.e., difficult to model and predict). The goal of the research presented here is to identify an application design approach and the system support required to build semi-autonomous systems. Semi-autonomous systems can be characterized as requiring operator management by exception only. To achieve this, an approach that links automated event detection to intelligent goal-oriented reaction is analyzed in this paper. Management by exception mitigates the risk associated with full autonomy, yet presents some problems related to operator interaction with the semi-autonomous system. These problems are described here in detail with proposed solutions. In order to demonstrate and validate the concepts presented, a distributed systems VE (Virtual Environment) tailored for semi-autonomous systems management is being implemented. The basic concepts will also be tested on a semi-autonomous Space Shuttle Hitchhiker payload, manifested for 1997, that is being built at the University of Colorado.
Introduction
Classically, four levels of human attention required for operation of a remote device have been defined in robotics: teleoperations, telesensing, telerobotics, and full autonomy. The lowest level, teleoperations, requires constant and direct control and monitoring by an operator. The next level, telesensing, includes automated device sensor sampling of device and environment state with abstract presentation of state to the operator. The third level, telerobotics, includes both automated sensing and device actuation and control. Finally, with full autonomy, a device would simply carry out a goal-oriented specification of a mission with no operator interaction. It is clear that full autonomy is difficult to achieve due to unpredictable variations of the environment in which the device must operate. Traditionally, systems are designed with some pre-determined level of autonomous operation in mind rather than with the idea of building generic system support to enable development of flexible semi-autonomous systems applications. A situated agent design is considered here with enhancements for use within a distributed environment. Situated agents simply link event detection to goal-oriented reactions in order to achieve specified mission goals [5]. Management by exception is therefore enabled by situated agents. However, in order for such a system to work reliably, some inherent problems must first be solved.
The fundamental problem is observability of situated agent execution status and state to enable effective operator interaction with distributed agents. In many traditional systems, variations from expected behavior are handled exclusively by human operators. For semi-autonomous system operation, the variations must be handled jointly by the system itself and human operators. This paper and the research behind it focus on determining the system support required to apply situated agent technology within a distributed system. The "DATA" (Distributed Automation Technology Advancement) Space Shuttle Hitchhiker payload incorporates sophisticated methods to detect events [2 & 3], and a real-time rule-based inferencing engine for reaction [1]. The distributed environment in which the semi-autonomous DATA system must operate is depicted in Figure 1. The DATA system and the proposed VE, both described in more detail later in this paper, share properties which are hypothesized to be common to this class of application in general. The DATA system is described first along with properties and associated problems derived from it. The VE is being used to further test properties identified with DATA and to test alternatives and solutions.
Figure 1: DATA System
DATA Semi-Autonomous Operations System Prototype Description
The DATA testbed prototype distributed situated agent system has been built using an "off-the-shelf" forward-chaining rule-based inferencing and task control system called SCL (Spacecraft Command Language) [1], which runs both as a Unix application and as an embedded RTOS (Real-Time Operating System) application. The RTOS used is RTEMS (Real-Time Executive for Military Systems). SCL has been integrated with the SELMON (SELective MONitoring) application developed at NASA JPL (Jet Propulsion Laboratory) [2 & 3]. SELMON acts as the detection layer and SCL as the reaction layer to implement a situated agent interfaced to the DATA sensor and actuator devices. The DATA system is currently composed of an embedded system situated agent and a remote ground-segment situated agent, which communicate through an RS232 interface in the testbed and will communicate through the GSFC (Goddard Space Flight Center) "ACCESS" telemetry and uplink system during the Shuttle flight of the payload. Commands will flow from the UPOCC (University Payload Operations Control Center) to CGSE (Customer Ground Support Equipment) or directly from CGSE for ACCESS uplink. The DATA embedded system situated agent, shown in Figure 2, will downlink both full and compressed status to enable validation of the compressed status scheme. The ground-segment situated agent is identical to the embedded situated agent except that it also includes data management of full status history and a planner and scheduler developed at NASA JPL called "Plan-IT II". Plan-IT II makes the ground situated agent more of a goal-oriented agent since it schedules tasks and activates or deactivates event-triggered tasks according to a model which optimizes mission goals within system constraints. Also, within the ground segment, the ground agent interfaces to virtual devices that are simulations of the embedded-segment devices. From this system a basic set of observability and interaction properties (problems) was identified.
Figure 2: DATA Distributed Situated Agent Architecture
VE Semi-autonomous Systems Testbed
The VE is a "ring" architecture with servers that manage situated agents and agent surrogates (Figure 3). The agent surrogates are simply simulation models of remotely situated agents which may be updated with minimal state information from the real agent being modeled. When a reaction is issued from a local VE user interface, it is first received by the local agent to which that user interface is connected, and the local surrogate is updated (assuming the update passes constraints). The reaction is then either forwarded to the WAN server for transport to a remote WAN server and the appropriate agent, or simply forwarded to a local agent connected to the local server. If the reaction is ultimately rejected by the agent to which it is addressed, then the local surrogate will be restored to its state prior to the reaction when the rejection message is propagated around the server ring.
Figure 3: VE Architecture
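The surrogate update and restore behavior described above can be illustrated with a minimal sketch. This is not the VE implementation; the class name `Surrogate`, its methods, and the dictionary-based state are all hypothetical, and the sketch assumes only one reaction is outstanding per surrogate at a time.

```python
from typing import Callable, Dict

State = Dict[str, float]


class Surrogate:
    """Local simulation model of a remote agent (hypothetical structure)."""

    def __init__(self, state: State) -> None:
        self.state: State = dict(state)
        self._saved: Dict[int, State] = {}  # snapshots keyed by reaction id

    def apply(self, reaction_id: int, update: State,
              constraint: Callable[[State], bool]) -> bool:
        """Tentatively apply a reaction if the resulting state passes constraints."""
        candidate = {**self.state, **update}
        if not constraint(candidate):
            return False  # fails local constraints; surrogate is left unchanged
        self._saved[reaction_id] = dict(self.state)  # snapshot for later restore
        self.state = candidate
        return True

    def confirm(self, reaction_id: int) -> None:
        """The addressed agent accepted the reaction; drop the snapshot."""
        self._saved.pop(reaction_id, None)

    def reject(self, reaction_id: int) -> None:
        """The addressed agent rejected the reaction; restore the prior state."""
        prior = self._saved.pop(reaction_id, None)
        if prior is not None:
            self.state = prior
```

In this sketch, `reject` would be called when the rejection message arrives back at the local server after propagating around the ring. Handling several overlapping reactions would need a more careful undo scheme than a single snapshot.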
Observability and Interaction Properties and Problems
Given the semi-autonomous operational scheme, the ability of human operators to observe the status of situated agents and interact with them effectively is critical. From analysis of DATA, observability and interaction problems and general properties have been formalized. These properties are given in dependency order from most primitive to most complex. The property of "emergent behavior" is not discussed in this paper, but would be a level of complexity even greater than the six identified here, and can perhaps be analyzed in terms of these lower level properties. Each property identified here is discussed in terms of DATA and the VE in the six sections which follow this one. The properties and related problems identified include:
1) Dispatch Latency and Preemptability:
The dispatch latency for a reaction must be predictable. For a multi-tasking system this requires preemptability, task priority inversion handling, elimination of hidden scheduling, and a programmer's interface for specifying real-time tasks. This is important since these system features affect the time between detection of an event and the start of the reaction. Both RT-Mach and Solaris 2.x have such features and both are being evaluated and used.
2) Reaction Time:
The response time of a triggered reaction must be able to meet real-time deadlines. This requires predictability in execution and scheduling. Reactions will often have time constraints for completion and in many cases a "late" reaction may be useless or even detrimental. Again, RT-Mach and Solaris 2.x are being evaluated and used here since they provide such features.
3) Distributed Perception:
Given an object, X, that can be perceived by several different agents, including Ai as fi(X), what are the properties that can be guaranteed about the observation fi(X)? The object X is considered to be a part of the system in a particular segment (ground or space for example) for which sensor-based state information is available to an agent which monitors and controls X. Differences in observation fi(X) from fj(X) will be due to latency and synchronization factors resulting from the cost of observing X by Ai and Aj from their respective operational segments.
4) Reaction Order and Reliability:
If an observation, fi(X), by agent Ai is linked to an automated reaction, expressed fi(X) -> gi(X), then there is a race condition between fi(X) -> gi(X) and a possible Aj reaction fj(X) -> gj(X). Also, since detection is imperfect, classifying the effect of the reaction is extremely important. Because a reaction may be triggered by a false positive detection, the decision to react must be based on confidence in the detection and on the impact of an incorrect reaction relative to the impact of no reaction.
5) Reaction Observability and Incompleteness:
How does Ai verify that gi(X) executed successfully? If Ai is a complex agent such that it has fi(X) -> gi,1(X); gi,2(X); ... ; gi,n(X), then it is preferable to observe intermediate completion of this transaction with X and to have a well-defined transaction abort protocol.
6) Agent Localization:
An agent Ai that can observe X with the least cost will always have the best observability of X, but may not be the best agent to manipulate X based on its fi(X) -> gi(X) reaction knowledge base and ability to process observation data from X. In this case, migration of agents, detection monitors, or reactions is desirable. Such agent migration is also desirable for system verification and evolution of the system to higher levels of autonomy since agent automation could, for example, first be deployed in the operator segment with the requirement for operator approval of reactions, and later migrated closer to X.
Dispatch Latency and Preemptability
For both the DATA system and the VE, operating systems are being used which include kernel features to minimize the latency of dispatching a task which executes a reaction. For DATA, the commercially available Solaris 2.x monolithic kernel operating system is being used. For the VE, however, the RT-Mach microkernel operating system is going to be used. Both systems provide "soft" real-time performance such that the time to dispatch can be statistically estimated. For example, Solaris 2.4 provides 2 millisecond dispatching in most situations [8]. The dispatch latency is directly related to the preemptability of the system itself when it is running "kernel mode" code. Well-known scheduling problems such as priority inversion, in which a lower priority task continues to execute in place of a higher priority one, will also affect dispatch latency. Both RT-Mach and Solaris 2.x deal with these problems, but there are many differences in their policy and implementation (microkernel compared to monolithic). More work must still be done to measure performance and to identify which features most affect semi-autonomous systems.
Reaction Time
The ability of the operating system kernel to meet real-time deadlines for tasks which are reactions to detected events requires predictability in execution and scheduling. Reactions will often have time constraints for completion and a "late" reaction may be useless or even detrimental. It would be useful to predict whether there is time to react given the current system load. This is exceedingly difficult to do, especially if the reaction code is not completely deterministic (i.e., it has state-dependent code branching). Therefore, it is more realistic to predict the ability to meet deadlines statistically rather than deterministically. When a task is very unlikely to meet a deadline, the scheduler can discontinue the task so that other tasks do not miss deadlines they would otherwise miss. Solaris supports reaction time only with a real-time scheduling class and a non-real-time class. RT-Mach is much more sophisticated and provides methods to predict if a scheduled task will be able to meet hard or soft deadlines (based upon system load characteristics) [7]. More work must be completed here as with dispatch latency and preemptability.
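As a rough illustration of predicting deadlines statistically rather than deterministically, the sketch below estimates the chance that a reaction finishes in the time remaining from a history of measured execution times. It is only an illustration and does not represent the prediction methods of RT-Mach or Solaris; the function names and the 5% threshold are assumptions chosen for the example.

```python
from typing import Sequence


def prob_meets_deadline(samples_ms: Sequence[float], time_left_ms: float) -> float:
    """Empirical fraction of past executions that finished within time_left_ms."""
    if not samples_ms:
        raise ValueError("need at least one execution-time sample")
    on_time = sum(1 for sample in samples_ms if sample <= time_left_ms)
    return on_time / len(samples_ms)


def should_continue(samples_ms: Sequence[float], time_left_ms: float,
                    min_probability: float = 0.05) -> bool:
    """Keep the task only if it is not 'very unlikely' to meet its deadline.

    min_probability is a hypothetical tuning parameter, not a value
    taken from the paper.
    """
    return prob_meets_deadline(samples_ms, time_left_ms) >= min_probability
```

A scheduler using such an estimate could discontinue a reaction for which `should_continue` returns false, freeing time for other tasks. Execution times measured under one system load may not transfer to another, so the samples would need to reflect current load.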
Distributed Perception
As already noted, semi-autonomous operation of robotic systems is complicated by the unpredictability of the environment and the system itself. Such deviations from expectation (often called anomalies or faults) are manifested by event detections. It is interesting to note that not all anomalies are negative events; some may in fact represent opportunities to increase overall performance. Either way, the identification of positive or negative events starts with simple state observation, which is complicated by the distributed nature of the systems considered here. This problem of distributed state observation is well known [6], but is the root of even more complex problems identified with what is called distributed perception here. Observability of complex systems requires more than simple observation of state, and may in fact involve detection of changes in state observations over time such that behavioral changes are detected. In general, detection performance is dependent upon observation latency, observation frequency, and behavioral references. Bayes' rule provides a method to quantify detection performance in terms of probability of a false alarm P(A|E~), probability of an event P(E), probability of correct detection P(A|E), and probability of an event given an alarm P(E|A), where E is an occurrence of the event to be detected, E~ is the absence of that event, and A is an occurrence of an alarm raised to the reaction system, such that:
P(E|A) = [P(A|E)*P(E)] / [P(A|E)*P(E) + P(A|E~)*P(E~)]
The figure of merit is P(E|A), since this is a quantification of the reliability of the detection method in terms of how likely the presence of the event to be detected is given that the detector says it is present. Furthermore, assuming P(E) can be estimated for an environment and that this event can be simulated to determine P(A|E) and P(A|E~) with event simulation, then P(E|A) can be used as a measure of confidence in detection. Finally, these figures can actually be refined "on-line" during operations such that confidence can be adjusted based on occasional review of the correctness of detection methods. This ability to detect and determine confidence is called perception in this paper.
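A short worked example may help show how the figure of merit behaves. The sketch below evaluates the expression for P(E|A) given above; the function name and the numeric values are assumptions made up for illustration, not measurements from DATA.

```python
def prob_event_given_alarm(p_event: float,
                           p_alarm_given_event: float,
                           p_alarm_given_no_event: float) -> float:
    """Return P(E|A) from P(E), P(A|E), and P(A|E~) using Bayes' rule."""
    for p in (p_event, p_alarm_given_event, p_alarm_given_no_event):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    numerator = p_alarm_given_event * p_event
    denominator = numerator + p_alarm_given_no_event * (1.0 - p_event)
    if denominator == 0.0:
        raise ValueError("an alarm can never occur; P(E|A) is undefined")
    return numerator / denominator


# Hypothetical detector: the event is rare (1%), the detector catches 95%
# of true events, and it raises false alarms 5% of the time.
confidence = prob_event_given_alarm(p_event=0.01,
                                    p_alarm_given_event=0.95,
                                    p_alarm_given_no_event=0.05)
print(round(confidence, 3))  # 0.161
```

With these assumed numbers, only about one alarm in six corresponds to a real event even though the detector itself looks accurate, because the event is rare. This is why P(E|A), rather than P(A|E) alone, is the useful measure of confidence.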
In the DATA system, detection performance is quantified using Bayes' rule prior to the mission with testbed simulation and with previous mission data from a related payload with common devices. The latency problem is dealt with in DATA by placing reactions that are most time critical in the segment where the cost of observation is the lowest. For the VE system, a ring topology is proposed with servers that serve local agents having minimum cost of observation for a subset of the entire distributed system state. The servers communicate with local client agents and remote servers servicing remote client agents selectively with status "culling" and status compression methods [2, 3, & 4]. For example, an agent which monitors a device interfaced to an embedded system may communicate with another embedded system agent and maintain a very accurate state of this local device (and vice versa). In contrast, a remote agent may receive "change only" data compressed to behavioral changes to reduce latency and conserve bandwidth. In this case, remote servers keep a "surrogate" model of the remote agents and update this surrogate model with the compressed updates and occasional check-pointing of the full state.
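The "change only" update with occasional check-pointing can be sketched as follows. This is a deliberately simple stand-in: it uses a fixed numeric deadband to decide what counts as a change, whereas the compression referred to above is based on behavioral changes. The names `StatusSender`, `SurrogateState`, `deadband`, and `checkpoint_every` are hypothetical.

```python
from typing import Dict, Tuple

State = Dict[str, float]


class StatusSender:
    """Emits a full checkpoint every N messages and change-only data otherwise."""

    def __init__(self, deadband: float, checkpoint_every: int) -> None:
        self.deadband = deadband
        self.checkpoint_every = checkpoint_every
        self._last_sent: State = {}
        self._count = 0

    def encode(self, current: State) -> Tuple[str, State]:
        is_checkpoint = self._count % self.checkpoint_every == 0
        self._count += 1
        if is_checkpoint:
            self._last_sent = dict(current)
            return "full", dict(current)
        # Compare against what the receiver last saw, not the previous sample,
        # so that slow drift eventually exceeds the deadband and is reported.
        changes = {key: value for key, value in current.items()
                   if key not in self._last_sent
                   or abs(value - self._last_sent[key]) > self.deadband}
        self._last_sent.update(changes)
        return "delta", changes


class SurrogateState:
    """Receiver-side copy of remote status, rebuilt from the messages."""

    def __init__(self) -> None:
        self.state: State = {}

    def decode(self, kind: str, payload: State) -> None:
        if kind == "full":
            self.state = dict(payload)   # checkpoint replaces everything
        else:
            self.state.update(payload)   # delta overwrites changed entries only
```

Between checkpoints the surrogate can differ from the true state by up to the deadband on each parameter; the periodic full status bounds how long such differences, or the effect of a lost delta, can persist.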
Reaction Order and Reliability
Reaction order can be controlled using priorities on reactions within a given segment. Between segments a priority scheme may also be used such that non-local reactions always have lower priority than local reactions. Determination of priorities to achieve a specific order and interaction is not trivial, given that reactions may be added and priorities changed during operations. The reliability of a reaction may be determined and coded according to failure modes and effects analysis of the system to be controlled. For example, the confidence in detection can be considered in conjunction with the impact of reaction to a false alarm, the impact of no reaction to a true alarm, and additional state information used in the decision boundary logic of the reaction. Finally, constraints on reactions may be defined, activated, and deactivated in order to protect resources managed by a local agent from remote and local reactions. This prevents, for example, a reaction which may have been valid when issued at a remote site from executing when conditions have changed since issue. The DATA system incorporates priority-based execution of local reactions. Non-local reactions are limited to simple atomic commands taking precedence over all local reactions or remote execution of more complex reactions according to local priorities. Reaction reliability is handled by encoding decision boundary logic into rule antecedents and by local activation and deactivation of constraints. The VE system will also likely employ a priority-based scheme where local reactions and remote reactions are strictly classified and ordered such that local reactions take precedence over remote. Reaction decision boundary logic will be updated dynamically in the proposed VE, in addition to impact and confidence estimates. Observability windows will allow a reaction to request "upgraded" observation of state prior to reaction.
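One way to picture the decision boundary logic and the local-before-remote ordering is the sketch below. It assumes an expected-cost rule (react when the expected cost of a wrong reaction is lower than the expected cost of not reacting), which is only one possible boundary and is not claimed to be the logic encoded in the DATA rules. The `Reaction` type, the cost parameters, and the example reaction names are hypothetical.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    name: str
    priority: int      # larger value means more urgent within its class
    is_remote: bool    # True if the reaction was issued from another segment


def should_react(confidence: float,
                 cost_false_reaction: float,
                 cost_missed_reaction: float) -> bool:
    """Expected-cost decision boundary (an assumed rule, for illustration).

    confidence is P(E|A): the probability the event is real given the alarm.
    """
    expected_cost_of_reacting = (1.0 - confidence) * cost_false_reaction
    expected_cost_of_waiting = confidence * cost_missed_reaction
    return expected_cost_of_reacting < expected_cost_of_waiting


def execution_order(reactions: List[Reaction]) -> List[Reaction]:
    """Order reactions so every local reaction precedes every remote one,
    with higher priority first inside each class."""
    return sorted(reactions, key=lambda r: (r.is_remote, -r.priority))


pending = [
    Reaction("remote_safe_mode", priority=9, is_remote=True),
    Reaction("local_heater_off", priority=2, is_remote=False),
    Reaction("local_valve_close", priority=5, is_remote=False),
]
ordered_names = [r.name for r in execution_order(pending)]
```

Here `ordered_names` places both local reactions ahead of the remote one regardless of the remote reaction's numeric priority, matching the strict classification proposed for the VE. Under the expected-cost rule, a low-confidence alarm can still justify a reaction when missing a true event is far more costly than a false reaction.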
Reaction Observability and Incompleteness
In general, a simple reaction may be observed by noting the change of a logical completion flag in conjunction with a verifying cause and effect state change (where sensor values change as would be expected when a device is actuated). Reactions may however be arbitrarily complex sequences of atomic operations. Observability of the execution status of reactions must therefore include a series of such verifications, and may be further generalized to include status for commands sent between segments such as reaction sent, reaction received, reaction accepted, reaction executed, and reaction verified. The remaining problem is what to do when a complex reaction sequence encounters a problem part way through its execution. This can be handled by treating a complex reaction as an atomic transaction which can first be executed logically within the local segment (to verify it does not violate constraints or terminate due to state changes). Following logical verification, the command can then be committed and actually executed, or otherwise aborted and not committed with all logical changes "undone". In the DATA system all reactions from remote segments are atomic or are atomic executions of local complex reactions. Local complex reactions may include intermediate verification of individual commands, but cannot abort when an error is encountered. Human intervention is required in such a case. For the VE system, a proposed improvement is to treat all reactions as transactions so that they may be aborted if they cannot be completed and the effects of logically completed commands up to the point of the abort can be "undone".
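The idea of executing a complex reaction logically before committing it can be sketched as below. The function `run_as_transaction` and its arguments are hypothetical, and the sketch covers only logical state held in memory; undoing a physical actuation that has already taken place is a separate problem it does not address.

```python
from typing import Callable, Dict, Sequence

State = Dict[str, float]
Command = Callable[[State], None]  # mutates the state it is given; may raise


def run_as_transaction(state: State,
                       commands: Sequence[Command],
                       constraint: Callable[[State], bool]) -> bool:
    """Run commands on a working copy; commit only if every step succeeds.

    Returns True if the reaction was committed, False if it was aborted.
    On abort, 'state' is left exactly as it was before the call.
    """
    trial = dict(state)  # logical execution happens on a copy
    for command in commands:
        try:
            command(trial)
        except Exception:
            return False  # abort: a command failed part way through
        if not constraint(trial):
            return False  # abort: an intermediate state violates constraints
    state.clear()
    state.update(trial)   # commit all logical changes at once
    return True
```

Because all changes are made to the copy, "undoing" an aborted reaction amounts to discarding the copy. Checking the constraint after each command corresponds to the intermediate verification of individual commands described above.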
Agent Localization
In order to localize agents according to performance such that reactions can be moved to segments with the best observability of the parameters they incorporate, reactions should be code that can be migrated between agents, and agents themselves should be migratable between segments. This enables reactions which initially may be triggered in a remote segment to be migrated to a segment where they can be executed locally rather than as a remote reaction. Detection monitors should likewise be migratable to allow for complete localization and load balancing within the system. The other possibility is to allow complete agents encapsulating an entire set of reactions and detection monitors to be migrated to a remote segment. In the DATA system, rules, constraints, and scripts may be added to any agent in a given segment of operations by explicit command to add this code to an executable code database along with any new state variables. Agents themselves cannot be migrated, but with controlled update to existing agents, they can effectively be metamorphosed as desired. Currently rules, constraints, and scripts cannot be deleted, but rules and constraints can be deactivated. In the VE system, the capabilities of the DATA system will be enhanced so that agent rules, constraints, and scripts will be migratable and removable. Agents themselves will also be migratable.
VE Testbed Planned Implementation
The VE testbed will incorporate SELMON for complex detection, but will be coded to be fully migratable with encapsulated state, reactions, and detection methods within a single multi-threaded RT-Mach task [7]. RT-Mach should provide a consistent real-time environment that can support desktop Unix-like systems as well as embedded systems with a common kernel. Furthermore, RT-Mach provides scheduling and task management features to manage real-time deadlines. Beyond these indispensable features, the most complex aspect of this design by far is the use of surrogates. Surrogates are intended to be used for local situated agent modeling in order to reduce WAN server status messaging. If a remote agent is too complex to warrant a surrogate (or simplified surrogate), then a local surrogate will simply be modeled with full remote status as available.
Conclusion
The problems associated with virtual situated agents found in the VE and robotic situated agents in DATA are common to both. From this perspective, the proposed architecture has appeal as a generic distributed situated agent system for management of distributed devices and tasks. In essence, it is an architecture for a distributed operating system services layer built on top of common local kernels, which is useful for distributed environments incorporating both physically situated and virtually situated agents.
References
[1] Buckley, B. and Wheatcraft, L., "Spacecraft Command Language - A Smart Control System," Interface and Control Systems, Melbourne Florida, March 1991.
[2] Doyle, R., "Determining the Loci of Anomalies Using Minimal Causal Models," International Joint Conference on Artificial Intelligence, Montreal, Canada, August, 1995.
[3] Doyle, R., Chien, S., Fayyad, U., and Wyatt, E., "Focused Real-Time Systems Monitoring Based on Multiple Anomaly Models," unpublished manuscript, Artificial Intelligence Group, Jet Propulsion Laboratory, 1992.
[4] Funkhouser, T., "RING: A Client-Server System for Multi-User Environments," 1995 Symposium on Interactive 3D Graphics, Association for Computing Machinery, New York, 1995.
[5] Maes, P., "Situated Agents Can Have Goals," Robotics and Autonomous Systems, Vol. 6, 1990.
[6] Singhal, M. and Shivaratri, N., Advanced Concepts in Operating Systems, McGraw-Hill, Inc., New York, 1994.
[7] Tokuda, H. and Mercer, C., "ARTS: A Distributed Real-Time Kernel", ACM Operating Systems Review, Vol. 23, No. 3, July 1989.
[8] Vahalia, U., Unix Internals: The New Frontiers, Prentice-Hall, Inc., Upper Saddle River, N.J., 1996.