The Conceptual Description of Physical Activities

A system has been designed to translate connected sequences of visual images of physical activities into conceptual descriptions. The representation of such activities is based on a canonical verb of motion so that the conceptual description will be compatible with semantic networks in natural language understanding systems. A case structure is described which is derived from the kinds of information obtainable in image data. A possible solution is presented to the problem of segmenting the temporal information stream into linguistically and physically meaningful events. An example is given for a simple scenario, showing part of the derivation of the lowest level events. The results of applying certain condensations to these events show how details can be systematically eliminated to produce simpler, more general, and hence shorter, descriptions.

This research was primarily supported by Canadian Defense Research Board grant 9820-11, and partially by National Science Foundation grant ENG75-10535.

If we view a motion picture such as illustrated in Figure 1, we are able to give a description of the physical activities in the scenario. This description is linguistic in the sense that the words used express our recognition of objects and movements as conceptual entities. A system for performing a sizeable part of this transformation of visual data into conceptual descriptions has been designed. It is described in Badler (1975); here we will present one small part of the system which is concerned with the organization of abstracted data from successive images of the scenario.

We are interested in a possible solution to the following problem: Given that a conceptual description of a scenario is to be generated, how is it decided where one verb instance starts and another ends? In other words, we seek computational criteria which separate visual experience into discrete chunks or events. By organizing the representation of an event into a case structure for a canonical motion verb, events can be described in linguistic terms. Verbs of motion have been investigated directly or indirectly by Miller (1972), Hendrix et al (1973a, 1973b), Martin (1973), and Schank (1973); semantic databases using variants of case structure verb representations (Fillmore(1968)) include Winograd (1972), Rumelhart et al (1972), and Simmons (1973).

We are concerned with physical movements of rigid or jointed objects so that motions may be restricted to translations and rotations. Objects may appear or disappear and the observer is free to move about. The resulting activities are combinations of these where observer motions are factored out if at all possible. We assume that the scenarios contain recognizable objects exhibiting physically possible, and preferably natural, motions.

A particular activity might consist of a single event, a sequence of events, sets of event sequences, or hierarchic organizations of events. The concept of walking is a good example of the last. Events are the basic building blocks of the conceptual description, and our events indicate the motions of objects. The interpretation of motion in terms of causal relationships is generally beyond the scope of the current system, although a semantic inference component could be included, Our descriptions consist mostly of observation of motion in context rather than explanation of why motion occurred.

Figure 1 - The moving car scenario

Table 1 - Adverbials

Type	Relationship	Set of Concepts
1	between the orientation and trajectory or axis of an object	BACKWARD, FORWARD, SIDEWAYS, AROUND, OVER, CLOCKWISE, COUNTERCLOCKWISE
2	between the trajectory of an object and fixed world directions	DOWNWARD, UP(WARD), NORTHWARD, SOUTHWARD, EASTWARD, WESTWARD
3	changing between objects	ACROSS ,AGAINST,ALONG, APART, AROUND, AWAY ,AWAY-FROM, BEHIND, BY, FROM, IN, INTO, OFF, OFF-OF, ON, ONTO, OUT, OUT-OF, OVER, THROUGH, TO, TOGETHER, UNDER
4	indicative of source and target	AWAY-FROM, IN-THE-DIRECTION-OF, IN(WARD , OUT(WARD), TOWARD
5	between the path of an object and other (moving) objects	AFTER, AHEAD-OF, ALONG, APART, TOGETHER, WITH
6	between an event and a previous event	BACK-AND-FORTH, TO-AND-FRO, UP-AND-DOWN, BACK, THROUGH

The general descriptive methodology is to keep only one static relational description of the scenario, that of the current image. Changes between it and the next sequential image are described by storing the names of changes in event nodes in a semantic network. In general, names of changes correspond to adverbs or prepositions (adverbials) describing directions or changing static relationships. Computational definitions for the set of adverbials in Table 1 appear in Badler (1975). We are only concerned with the senses of the adverbials pertaining to movement. Definitions are implemented as demons: procedures which are activated, the executed, by the successive appearance of certain assertions in the image description or current conceptual database. These demons are related to those of Charniak (1972), although our use of them, their numbers, and their organization are simplified and restricted. They are used to recognize or classify properties or changes and to generate the hierarchic descriptive structure. An essential feature of this methodology is that the descriptions are continually condensed by this change abstraction process; descriptions grow in depth rather than length.

The semantic information stored for each object in the scenario includes its TYPE, structural SUB-PARTs, VISIBILITY, MOBILITY, LOCATION, ORIENTATION, and SIZE. Most of these properties are determined from the image sequence, but some are stored in object models (indexed by TYPE) in the semantic network.

The events are also nodes in the semantic network. Each object is potentially the SUBJECT of an event node. A sequence of event nodes forms a history of movement of an object; only the latest node in the sequence is active. The set of active event nodes describes the current events in the scenario seen so far. The cases of the event node along with their approximate definitions follow.

These cases differ from Miller's (1972) primarily in the lack of a permissive case and our separation of the TRAJECTORY and AXIS cases. REFERENCE is new; one of its uses is to resolve descriptions of the same event from different viewpoints. The explicit times could be replaced by temporal relations. Miller's reflexive/objective distinction is not needed as each moving object has its own event nodes, regardless of the AGENT.

A few necessary definitions follow before the presentation of the event generation algorithm.

A null event node has all its cases NIL or zero except START-TIME, END-TIME, and perhaps NEXT.

The function CREATE-EVENT-NODE {property pairs) creates an event node with the indicated case values, returning the node as a result.

To compare successive values of numerical properties , a queue is associated with the case in current event nodes only. The front of the queue is represented by *; the place where new information is stored. The queues have length three; the three positions will be referenced by prefixing the case name with either NEW, CURRENT, or LAST, A function SHIFT manipulates property queues when they require updating:

Now we can present the algorithm for the demon which controls the construction of the entire event graph. It is executed once for each image when all lower level demons have finished; it creates, terminates, or updates each current event node.

A.1.1. An event node E is created when a mobile object first becomes visible and identifiable as an object.

The NIL START-TIME has the interpretation that we do not know what was happening to this object prior to time TN.

A.1.2. An event node E is created when a jointed part of the parent object with current event node EP is first observed to move relative to the parent, for example, an arm relative to a person's body.

This is interpreted as the parent object moving the part using the joint as the instrument. Any appropriate attributes are placed in the NEW-property positions. The node E is then immediately terminated (A.1.3).

A.1.3. An event node E2 is created whenever another event node E1 is terminated.

SUBJECT, AGENT, INSTRUMENT, REFERENCE, and DIRECTION are those which were present at termination of the previous node, subject to any additional conditions that changes in these may require.

A.2. Terminating event nodes. An event node E is terminated when there are significant changes in its properties. All queue structures are deleted.

The DIRECTION 1ist is unaltered except that the terminating adverbial s) may be added to DIRECTION(E) rather than to DIRECTION(NEXT(E)) (see A.2.5.).

A.2.1. Changes in SUBJECT. The assumptions of object rigidity and permanence preclude changes in an object.

A.2.2/3. Changes in AGENT and INSTRUMENT. These must be preceded by changes in CONTACT relations between objects and the SUBJECT. See A.2.5 on DIRECTION.

A.2.4. Changes in REFERENCE. A change in the REFERENCE features forces termination of every event node referencing those features, as such changes are usually caused by spatial or temporal discontinuities in the scenario,

A.2.5. Changes in DIRECTION, Changes in type (1) adverbials must be preceded by changes in TRAJECTORY, VELOCITY, AXIS, or ANGULAR-VELOCITY, because a relationship between an orientation and a TRAJECTORY or AXIS cannot change without at least one of the four cases changing. Changes in BACKWARD, FORWARD, and SIDEWAYS cause termination; this may occur with no orientation change if the TRAJECTORY has a non-zero derivative. For example, move a box in a circle while keeping its orientation constant.

Changes in type (2) adverbials must be preceded by a change in TRAJECTORY, but some of these changes may be too slight to cause termination from the TRAJECTORY criteria. (A.2.6.). Changes from UP to DOWN or vice versa are the only ones in this group causing termination.

Changes in type (3) adverbials terminate event nodes if and only if there is a change in a CONTACT relation or a VISIBILITY property. If the CONTACT is made or the VISIBILITY established, the adverbial goes into the new node's DIRECTION list. If the CONTACT is broken or VISIBILITY lost, the adverbial remains on the front of the terminated node's DIRECTION list.

Since the type (4) adverbials are only indicators of current source and target, these do not change unless the path of the SUBJECT changes or the target object moves. Therefore no terminations arise from this group.

The type (5) adverbials relate paths of the SUBJECT to other objects. They cause termination when they come into effect, and terminate their own nodes when they cease to describe the path.

The type (6) adverbials include higher level events and the basic repetitions. These all terminate the current event node. The repeated events {for example, BACK-AND-FORTH) are terminated when the repetition appears to cease.

A.2.6. Changes in TRAJECTORY. The changes in TRAJECTORY that are most important are those which change its derivative significantly. A change in the derivative from or to zero can be used {the start or end of a turn), but only the start is actually used for termination. Once the turn is begun, how it ends is unimportant since the final (current) trajectory is always saved.

The other termination case watches for a momentarily large derivative which settles back to smaller values. This indicates a probable collision. It is of crucial importance in inferring CONTACT relations between objects when none were (or could be) directly observed.

A.2.7. Changes in VELOCITY. A change in VELOCITY from zero to a positive value (from a positive value to zero) terminates the current event node and enters STARTS (STOPS) in the new node's (old node's) VELOCITY RATES list.

A.2.8. Changes in AXIS. A reversal of rotation terminates the event node. This corresponds to a change in AXIS to the opposite direction, with no intermediate values.

A.2.9. Changes in ANGULAR-VELOCITY. A change in ANGULAR_VELOCITY from zero to a positive value (from a positive value to zero) terminate the current event node and enters STARTS {STOPS) in the new node's (old node's) ANGULAR-VELOCITY RATES list.

A.2.11/12. Changes in START-TIME and END-TIME are not meaningful. the appropriate sub-event node of a REPEAT-PATH event node E, E is terminated. The definition of match for the basic repetitions appears

A.2.13. Changes in REPEAT-PATH. When new data fails to match the appropriate sub-event node of a REPEAT-PATH event node E, E is terminated. The definition of match for the basic repetitions appears in Badler (1975). The problem, in general, remains open. See, for example, Becker (1973).

A.3. Maintaining event nodes. If the new assertions do not cause termination of the event node, the property queues are merely shifted:

What does an event mean? This algorithm motivates a theorem that the events generated are the finest meaningful partition of the movements in the image sequence into distinct activities. The hypothesis of the assertion is the natural environment being observed and the linguistically-based conceptual description desired. The conclusion is that an event node produced from this algorithm describes either the lack of motion or else an unimpeded, simple linear or smoothly curving (or rotating) motion of the SUBJECT with no CONTACT changes. In addition, the orientation of the SUBJECT does not change much with respect to the trajectory. The proof of this assertion follows directly from the choice of termination conditions.

We will apply this algorithm to data obtained from each of the images in Figure 1. The lower front edge of the house is arbitrarily chosen as the REFERENCE feature; NORTH is toward the right of each image. We will not discuss the computation of the static relations from each image, only list in Table 2 the changes in the static description from image-to-image. Trajectory and rotation data are omitted for simplicity, although changes of significance are indicated.

If we write out the event node sequence using the canonical motion verbs MOVES and TURNS with the adverbial phrases from the RATES and DIRECTION lists, we obtain the following lengthy, but accurate, description:

Table 2 - Selected assertions and changes involved in the description of Figure 1.

Time	Action	Static Assertion	Event Assertion	Result
1	ADD	IN-FRONT-OF(CAR OBSERVER)		create C1
	ADD	IN-BACK-OF(CAR HOUSE)
	ADD	RIGHT-OF(CAR HOUSE)
	ADD	NEAR-TO(CAR HOUSE)
	ADD	SURROUNDED-BY(CAR DRIVEWAY)
	ADD	LEFT-OF (CAR DRIVEWAY)
	ADD	IN-BACK-OF(CAR DRIVEWAY)
	ADD	RIGHT-OF (CAR DRIVEWAY)
	ADD	AT(CAR DRIVEWAY)
	ADD	SUPPORTED-BY(CAR DRIVEWAY)
3	DELETE	IN-BACK-OF(CAR HOUSE)	VELOCITY(STARTS)	terminate C1 (A.2.7.)
			EASTWARD	--
			TOWARD OBSERVER	--
5	DELETE	IN-BACK-OF(CAR DRIVEWAY)	TRAJECTORY change	terminate C2 (A.2.6.)
	ADD	SUPPORTED-BY(CAR ROAD)	ONTO ROAD	terminate C2 (A.2.5.)
	ADD	IN-FRONT-OF(CAR DRIVEWAY)	ANGULAR-VELOCITY (STARTS)	terminate C2 (A.2.9.)
6	ADD	IN-FRONT-OF(CAR HOUSE)	NORTHWARD-AND-EASTWARD	---
7	DELETE	LEFT-OF(CAR DRIVEWAY)	OUT-OF DRIVEWAY	---
	DELETE	SURROUNDED-BY(CAR DRIVEWAY)	OUT-OF DRIVEWAY	---
	DELETE	AT(CAR DRIVEWAY)	FROM DRIVEWAY	---
	ADD	NEAR-TO(CAR DRIVEWAY)	FORWARD	---
8	DELETE	SUPPORTED-BY(CAR DRIVEWAY)	OFF-OF DRIVEWAY	terminate C3 (A.2.5.)
9			NORTHWARD	---
10	DELETE	NEAR-TO(CAR DRIVEWAY)	AROUND HOUSE	---
	ADD	LEFT-OF(CAR HOUSE)	AWAY-FROM DRIVEWAY	---
	ADD	FAR-FROM(CAR DRIVEWAY)	AWAY-FROM DRIVEWAY	---
12	DELETE	NEAR-TO(CAR HOUSE)	AWAY-FROM HOUSE	---
12	ADD	FAR-FROM(CAR HOUSE)	ANGULAR-VELOCITY (STOPS)	terminate C4(A.2.9.)
15	DELETE	VISIBILITY(CAR VISIBLE)	AWAY	terminate C5 (A.2.5.)

Notes: Relations with HOUSE use the house front orientation, not the observer's front.

The canonical form follows easily from the case representation and the DIRECTION list orderings. The directional adverbials FORWARD, BACKWARD and SIDEWAYS are interpreted as lasting the duration of the event, hence are written as while GOING ... clauses. STARTS is always interpreted at the beginning of the sentence, STOPS at the end. The termination conditions assure its correctness.

There is much redundancy in this description, but it is only the lowest level, after all, and many activities span several events. Two sets of condensations are applied by demons that watch over terminated event nodes. The first set is mostly concerned with interpreting certain null events caused by the image sampling rate and removing trajectory changes which prove to be insignificant. The second set of demons removes adverbials referring to directions in the support plane, removes RATES terms except STOPS, and generalizes redundant adverbials referring to the same object. The result of applying these condensations is:

Note that FROM the DRIVEWAY follows ONTO the ROAD. This is due to the pictorial configuration: the car is on the road before it leaves the driveway. The position of the while GOING FORWARD phrase could be shifted backwards in time to the beginning of the translatory motion, but this may be risky in general. We will leave it where it is, since this is primarily a higher level linguistic matter.

By applying demons which recognize instances of specific motion verbs to the individual event nodes, then condensing as above, we get:

The major awkwardness with this last description is that it relates the car to every other object in the scene. Normally one object or another would be the focus of attention and statements would be made regarding its role. Such manipulations of the descriptions are yet unclear.

In conclusion, we have outlined a small part of a system designed to translate sequences of images into linguistic semantic structures. Space permitted us only one example, but the method also yields descriptions for scenarios containing observer movement and jointed objects (such as walking persons). The availability of low level data has significantly shaped the definitions of the adverbials and motion verbs. Further work on these definitions, especially motion verbs, is anticipated. We expect that the integration of vision and language systems will benefit both domains by sharing in the specification of representational structures and description processes.

References

1. Badler, N (1975) Temporal scene analysis: Conceptual descriptions of object movements. University of Toronto, Department of Computer Science, Technical Report No. 80, February 1975.

2. Becker, J (1973) A model for the encoding of experiential information. In Computer Models of Thought and Languge, Schank, R. and Colby, K. (eds), W.H. Freeman & Co., San Francisco, 1973, pp 396-434.

3. Charniak, E (1972) Toward a model of children's story comprehension. MIT Artificial Intelligence Report TR-266, December 1972.

4. Fillmore, C (1968) The case for case. In Universals in Linguistic Theory, Bach, E. and Harms, R. (eds), Holt, Rinehart, and Winston, Inc., Chicago, 1968.

5. Hendrix, G (1973a) Modeling simultaneous actions and continuous processes. Artificial Intelligence 4, Winter 1973, pp 145-180.

6. Hendrix, G, Thompson, C, Slocum, J (1973b). Language processing via canonical verbs and semantic models. Third International Joint Conference on Artificial Intelligence, August 1973, pp 262-269.

7. Martin, W (1973) The things that really matter - A Theory of prepositions, semantic cases, and semantic type checking. Automatic Programming Group, Internal Memo 13, MIT Project MAC, 1973.

8. Miller, G (1972) English verbs of motion: A case study in semantics and lexical memory. In Coding Processes and Human Memory, Melton, A. and Martin, E. (eds.), V.H. Winston & Sons, Washington, D.C., 1973, pp 335-372.

9. Rumelhart, D, Lindsay, P, Norman D (1972) A process model for long term memory. In Organization of Memory, Tulving, E. and Donaldson, W. (eds.), Academic Press, New York, 1972, pp 197-246.

10. Schank, R (1973) The fourteen primitive actions and their inferences. Stanford A.I. Laboratory Memo AIM-183, 1973.

11. Simmons, R (1973) Semantic networks: Their computation and use in understanding English sentences. In Computer Models of Thought and Language, Schank, R. and Colby, K. (eds.), W. H. Freeman & Co., San Francisco, 1973, pp 63-113.

12. Winograd, T (1972) Understanding Natural Language. Academic Press, New York, 1972.