@InProceedings{Niyogi1995, author = {Niyogi, Sourabh A.}, title = {Deducing Visual Relationships from Attentional Representations}, booktitle = {American Association for Artificial Intelligence Fall Symposium}, year = {1995}, pages = {63--69} }
Abstract:
This paper proposes an architecture in which a visual scene is encoded as an attentional representation, from which visual, spatial, and spatiotemporal relationships can be deduced. An attentional window is defined by the position, motion, and scale of whatever is currently being attended to, while an attentional state is the state of that window at a particular time. The paper describes a systems-level theory of daemons, proposes first-order models of those daemons, and shows how these models can be used to answer queries by deducing visual relationships.
1 Attentional Representations:
The basic theory is that attentional representations allow an observer to encode an observed scene into memory, retrieve it later, and then deduce visual relations via daemons that observe the attentional representation.
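To make this concrete, here is a minimal data-structure sketch in Python. The names and fields (AttentionalState, AttentionalRepresentation, the choice of position/velocity/scale/target) are illustrative assumptions about the representation, not definitions taken from the paper.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class AttentionalState:
        """State of the attentional window at one instant (sketch)."""
        t: float                       # time of the observation
        position: Tuple[float, float]  # where the window is centered
        velocity: Tuple[float, float]  # motion of the attended content
        scale: float                   # size of the attentional window
        target: str                    # label of the attended object, e.g. "X"

    @dataclass
    class AttentionalRepresentation:
        """A scene encoded as a time-ordered trace of attentional states."""
        states: List[AttentionalState] = field(default_factory=list)

        def record(self, state: AttentionalState) -> None:
            self.states.append(state)

        def at(self, t: float) -> List[AttentionalState]:
            """All states recorded at (approximately) time t."""
            return [s for s in self.states if abs(s.t - t) < 1e-6]

The later sketches in this summary build on these two classes.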
In order to test this architecture, Niyogi used bottom-up and top-down methodologies as a means of controlling attentional states. In the bottom-up methodology, a transient event, e.g., a flashing light, attracts the observer's attention to a particular location. In the top-down methodology, the observer is directed to pay attention to an object in the scene via a query or instruction.
2 Language queries and attentional representations:
Attentional representations, together with the daemons observing them, can be used to answer queries about relationships present in the observation, e.g., one object moving towards another. In a directed query, the observer receives the query first and is then allowed to encode the scene. In an undirected query, the observer first observes the scene and is queried about it afterwards.
2.1 Directed queries:
There are three stages to a directed query as envisioned by Niyogi: compilation, observation, and daemon readout.
Compilation translates the query into a sequence of visual routines before the scene is observed. These routines direct the attentional window to keep scanning until all of the objects required to answer the query have been found.
During the observation stage, the scene is presented to the observer and the compiled visual routines are run.
During the daemon-readout stage, the daemons monitor the attentional window to determine whether the objects have exhibited the target relationship, i.e., to answer the query.
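A minimal sketch of this three-stage pipeline, reusing the representation sketch from section 1. compile_query is left as a stub, and each daemon is assumed to expose a fires(rep) test and a relation label (anticipating section 3); all of these names are assumptions, not the paper's API.

    from typing import Callable, List, Optional

    Scene = object  # placeholder for whatever the observer is shown

    # A visual routine steers the attentional window over the live scene,
    # recording attentional states into the representation as it goes.
    VisualRoutine = Callable[[Scene, AttentionalRepresentation], None]

    def compile_query(query: str) -> List[VisualRoutine]:
        """Stage 1 (compilation): translate the query into routines that
        keep scanning until every object the query mentions is found."""
        raise NotImplementedError  # query parsing is out of scope here

    def answer_directed_query(query: str, scene: Scene, daemons) -> Optional[str]:
        routines = compile_query(query)      # 1. compilation, before viewing
        rep = AttentionalRepresentation()
        for routine in routines:             # 2. observation: routines run live
            routine(scene, rep)
        for daemon in daemons:               # 3. daemon readout
            if daemon.fires(rep):
                return daemon.relation       # the relationship answers the query
        return None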
2.2 Undirected queries:
There are six stages to an undirected query as envisioned by Niyogi: observation, compilation, memory search, decoding, encoding, and daemon readout.
Observation processes and stores the scene using a default visual routine to encode the attentional states into memory.
Compilation translates the query into visual memory routines, following a similar methodology to a directed query's compilation but operating on encoded memories instead of active observations.
Memory search scans the attentional representations stored in memory for the requested objects at a specified place and time.
Decoding takes the stored attentional representations and processes them according to the compiled visual memory routines. If a daemon is able to provide an accurate answer to the query at this point, no further processing is required.
Encoding takes the processed attentional representations and observes them as if they were being seen again. This process creates new attentional representations.
Daemon readout then occurs on this re-observation of the scene, allowing the observer to determine whether the query has been answered.
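The six stages can be sketched as a pipeline over stored representations in the same style as the directed case. Every helper below (default_scan, compile_memory_query, search_memory, decode, re_encode) is a hypothetical stub standing in for machinery the paper leaves abstract.

    from typing import List, Optional

    # Hypothetical stubs; each would be substantial in a real system.
    def default_scan(scene, rep): raise NotImplementedError
    def compile_memory_query(query): raise NotImplementedError
    def search_memory(memory, query): raise NotImplementedError
    def decode(stored, routines): raise NotImplementedError
    def re_encode(decoded): raise NotImplementedError

    def readout(daemons, rep) -> Optional[str]:
        for daemon in daemons:
            if daemon.fires(rep):
                return daemon.relation
        return None

    def answer_undirected_query(query, scene, memory, daemons) -> Optional[str]:
        # 1. observation: encode the scene with a default visual routine
        rep = AttentionalRepresentation()
        default_scan(scene, rep)
        memory.append(rep)

        # 2. compilation: the query arrives only now, so it is compiled
        #    into visual *memory* routines over stored traces
        memory_routines = compile_memory_query(query)

        # 3. memory search: stored representations holding the requested
        #    objects at the specified place and time
        for stored in search_memory(memory, query):
            # 4. decoding: replay the stored trace through the routines
            decoded = decode(stored, memory_routines)
            answer = readout(daemons, decoded)
            if answer is not None:
                return answer            # a daemon answered early; stop

            # 5. encoding: re-observe the decoded trace as if seen anew
            fresh = re_encode(decoded)

            # 6. daemon readout on the re-observation
            answer = readout(daemons, fresh)
            if answer is not None:
                return answer
        return None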
3 Daemon construction:
This section provides a brief overview of some of the constructed daemons that process attentional representations in order to deduce visual relations. Although the list is neither comprehensive nor believed to be correct in detail, its purpose is to show how simple constructions can account for most of our intuitions.
3.1 Model-observing daemons:
These daemons compare objects at a single time, that is, X is faster/larger/darker/etc. than Y at a specific time t.
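A first-order sketch of such a daemon, building on the AttentionalState sketch from section 1. The prop callable (speed, size, darkness, ...) and the field names are assumptions for illustration, not the paper's construction.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ModelObservingDaemon:
        """Fires when prop(X) > prop(Y) at the same time t (sketch)."""
        x: str                                     # label of the first object
        y: str                                     # label of the second object
        prop: Callable[[AttentionalState], float]  # e.g. speed, size, darkness
        relation: str = "x-exceeds-y"

        def fires(self, rep: AttentionalRepresentation) -> bool:
            for sx in rep.states:
                if sx.target != self.x:
                    continue
                for sy in rep.at(sx.t):
                    if sy.target == self.y and self.prop(sx) > self.prop(sy):
                        return True
            return False

For instance, prop=lambda s: (s.velocity[0]**2 + s.velocity[1]**2) ** 0.5 would test "X is faster than Y at some time t".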
3.2 Scale-observing daemons:
These daemons compare objects across time. For example, as time passes X may become faster/larger/darker/etc. than Y, which was not the case in an earlier time period.
3.3 Self-relationship observing daemons:
These daemons compare an object with itself across time, i.e., as time passes X may become faster/larger/darker/etc. than X was in an earlier time period.
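Sections 3.2 and 3.3 differ from 3.1 only in which states are paired: across time rather than at a single time. A sketch of the self-relationship case, under the same illustrative assumptions as above:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class SelfRelationshipDaemon:
        """Fires when an object's property grows over time (sketch)."""
        x: str
        prop: Callable[[AttentionalState], float]
        relation: str = "x-became-greater"

        def fires(self, rep: AttentionalRepresentation) -> bool:
            trace = sorted((s for s in rep.states if s.target == self.x),
                           key=lambda s: s.t)
            # true if the property at some later time exceeds the value
            # at the immediately preceding recorded time
            return any(self.prop(later) > self.prop(earlier)
                       for earlier, later in zip(trace, trace[1:]))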
3.4 Position-observing daemons:
These daemons compare spatial relationships between objects at a single time, that is, X is to the left of/behind/below/etc. Y at a specific time t.
3.5 Circuitry-observing daemons:
These daemons allow more complex comparisons than the simple state variables can support. They require some information about the state of the attentional routing circuitry at a single time. For example, at a specific time t, X is inside/outside/against/etc. Y.
3.6 Position and motion observing daemons:
These daemons compare spatial relations between objects across time: as time passes, X may move towards/away from/past/etc. Y.
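One deliberately crude first-order reading of "X moves towards Y": the distance between their attended positions shrinks over successive shared times. This decision rule is an assumption for illustration, not the paper's model.

    import math
    from dataclasses import dataclass

    @dataclass
    class MovingTowardsDaemon:
        """Fires when the X-to-Y distance shrinks over time (sketch)."""
        x: str
        y: str
        relation: str = "x-moving-towards-y"

        def fires(self, rep: AttentionalRepresentation) -> bool:
            dists = []
            for sx in sorted((s for s in rep.states if s.target == self.x),
                             key=lambda s: s.t):
                partners = [s for s in rep.at(sx.t) if s.target == self.y]
                if partners:
                    dists.append(math.dist(sx.position, partners[0].position))
            # monotonically decreasing distance over at least two samples
            return len(dists) >= 2 and all(b < a for a, b in zip(dists, dists[1:]))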
3.7 Daemon-observing daemons:
These daemons detect spatiotemporal relationships that are decomposable into more primitive relations; they activate only when the required subset of lower-level daemons all activate.
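Reading "the required subset of lower-level daemons all activate" as a conjunction gives a very small composition sketch; the AND semantics is an assumption consistent with that sentence, not a detail the paper fixes.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DaemonObservingDaemon:
        """Fires only when every one of its sub-daemons fires (sketch)."""
        children: List[object]  # any objects exposing fires(rep)
        relation: str = "composite-relation"

        def fires(self, rep: AttentionalRepresentation) -> bool:
            return all(child.fires(rep) for child in self.children)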
4 Discussion:
The paper proposes daemons observing attentional states as a means of deducing visual relations between objects. Although Niyogi acknowledges that the details of these daemons and their construction are almost certainly wrong, he uses them to illustrate how simple constructions might give rise to our intuitions about spatiotemporal relations. The paper also proposes an architecture for how observers deduce visual relationships. Niyogi argues that before trying to create a detailed method, we need to develop a systems-level architecture powerful enough to account for the ability to make visual judgments.