@inproceedings{CoyneSproat2001,
  author    = {Coyne, Bob and Sproat, Richard},
  title     = {WordsEye: an automatic text-to-scene conversion system},
  booktitle = {SIGGRAPH '01: Proceedings of the 28th annual conference on Computer graphics and interactive techniques},
  year      = {2001},
  pages     = {487--496},
  publisher = {ACM},
  address   = {New York, NY}
}
The actual paper can be found at http://portal.acm.org/ft_gateway.cfm?id=383316&type=pdf
Natural language is an effective medium for describing visual ideas and mental images.
Language-based 3D scene generation is emerging, allowing anyone to quickly generate 3D scenes without knowledge of specialized software.[p01]
WordsEye, which relies on a large database of 3D models, is a system for automatically converting text into representative 3D scenes. Objects can have shape displacements, spatial tags, and functional properties.
There is a need for a new paradigm in which the creation of 3D scenes is immediate and easy through the use of natural language.
A previous system was the 'Put' system, but it was limited to spatial arrangements of existing objects and a limited subset of English.
WordsEye attempts to provide an empty canvas on which users can paint a picture with words, depicting spatial relations and actions performed by objects, from a wide range of text.[p02]
How it works: [p02]
Example: "John said that the cat was on the table. The animal was next to a bowl of apples", the software would construct a picture of a human character with a cartoon speech bubble coming out of its mouth. In the speech bubble would be a picture of a cat on a table with a bowl of apples next to it.
Limitations:
Unpredictability of the graphical result
Text is tagged and parsed using a part-of-speech tagger and a statistical parser, which generates a parse tree representing the structure of the sentence. The parse tree is then converted into a dependency representation: a list of the words in the sentence showing, for each word, the words it depends on and the words that depend on it.[p02] The dependency representation allows for easier semantic analysis.
```
said --> John
     --> that --> was --> the
                      --> cat
                      --> on --> the
                             --> table
```
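As a hedged aside, a modern dependency parser produces the same kind of head/dependent structure. WordsEye's own pipeline predates spaCy, so the sketch below is purely illustrative and is not the paper's implementation:

```python
# Illustrative only: spaCy stands in for WordsEye's tagger/parser pipeline to
# show the head/dependent structure a dependency analysis yields.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("John said that the cat was on the table.")

for token in doc:
    # each word, its dependency label, and the word it depends on (its head)
    print(f"{token.text:>6} --{token.dep_}--> {token.head.text}")
```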
Next, the dependency structure is converted into a semantic representation, which is a description of the entities to be depicted in the scene and the relationships between them. Each element in the list is a representation fragment corresponding to a specific node of the dependency structure. Each node will be depicted by a given 3D object.
The appropriate semantic interpretation is found by a table lookup on the word in question, differing based upon what kind of thing the word denotes. WordNet is used to provide semantic hypernym and hyponym relations. Prepositions are handled by semantic functions, and verbs are handled by semantic frames. WordsEye has semantic entries for 1,300 English nouns and 2,300 verbs.
WordsEye also interprets anaphoric or coreferring expressions (pronominals like 'he' or 'she') and nouns through associations in the WordNet hierarchy.
Example: the denotations of 'cat' are a subset of the denotations of 'animal', so 'the animal' can corefer with a previously mentioned cat.
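A minimal sketch of that subset check using NLTK's WordNet interface (not part of WordsEye itself; the synset names are standard WordNet identifiers):

```python
# Sketch of the hypernym check behind "the animal" coreferring with "cat".
# Requires nltk with the wordnet corpus downloaded: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

cat = wn.synset("cat.n.01")
animal = wn.synset("animal.n.01")

# Collect every hypernym reachable from 'cat' and test membership.
hypernyms = set(cat.closure(lambda s: s.hypernyms()))
print(animal in hypernyms)  # True: 'cat' denotes a subset of 'animal'
```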
Scenes are defined in terms of low-level graphical specifications called depictors, which control 3D object visibility, size, position, orientation, colour, and transparency. They are also used to specify human poses and inverse kinematics, and to modify vertex displacements for facial expressions.[p03]
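A hypothetical sketch of a depictor record with fields matching the controls listed above; the paper gives no concrete data layout, so all field names here are assumptions:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Depictor:
    """Assumed shape of a low-level depictor; field names are illustrative."""
    target: str                                                 # the 3D object being controlled
    visible: bool = True
    size: float = 1.0                                           # scale relative to default size
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    orientation: Tuple[float, float, float] = (0.0, 0.0, 0.0)   # Euler angles
    colour: Optional[str] = None
    transparency: float = 0.0                                   # 0 = opaque, 1 = invisible
```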
WordsEye associates additional information with each 3D model:

| Property | Description |
| --- | --- |
| Skeletons | Skeletal control structures |
| Shape displacements | Associated with the object; used to depict emotions |
| Parts | Polygonal faces representing significant areas of the surface |
| Colour parts | Set of parts to be coloured when the text specifies the object as having a particular colour |
| Opacity parts | Parts that are made transparent |
| Default size | Each object is given a default size, expressed in feet |
| Functional properties | Used to depict how an object can be used |
| Spatial tags | The shape of the object must be known to resolve spatial relations |
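To make the table concrete, here is an assumed per-model record for a cat model; every value is invented for illustration and none comes from the paper:

```python
# Hypothetical metadata record mirroring the table above; all values invented.
cat_model = {
    "name": "cat",
    "skeleton": "quadruped_rig",                 # skeletal control structure
    "shape_displacements": ["hiss", "relaxed"],  # e.g. to depict emotions
    "parts": ["head", "back", "tail", "paws"],
    "colour_parts": ["fur"],                     # coloured for "the red cat"
    "opacity_parts": [],
    "default_size_ft": 1.5,                      # default size, in feet
    "functional_properties": ["can_be_held"],
    "spatial_tags": {"top_surface": "back", "base": "paws"},
}
```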
Spatial relations define the basic layout of the scene. Relative positions such as 'next to', 'behind', and 'facing' are frequently an implicit part of actions and compound objects.
Example: "The bird is on the cat", find a top surface tag for the cat (on its back) and a base tag for the bird (under its feet). The bird is then repositioned so that its feet are on the cat's back.
WordsEye distinguishes four kinds of poses:

| Pose type | Description |
| --- | --- |
| Standalone poses | A character in a particular position |
| Specialized usage | A character using a specific instrument |
| Generic usage | A character interacting with a generic stand-in object |
| Grip poses | A character holding a specific object in a certain way |
WordsEye's depiction module translates the high-level semantic representation produced by the linguistic analysis into low-level depictors.[p06]
Process:
The main semantic element types are ENTITY (nouns), ACTION (verbs), ATTRIBUTE (adjectives), and RELATION (prepositions). Additional, more specialized types are PATHS, TIMESPEC, CONJUNCTION, POSSESSIVE, NEGATION, and CARDINALITY.
In order to depict a sentence, the semantic elements must be made graphically realizable. This is done by applying a set of depiction rules, which are tested for applicability and then applied to translate semantic elements into graphical depictors.
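A sketch of the test-then-apply pattern this describes; the rule and element structures are assumptions, not the paper's actual implementation:

```python
# Each depiction rule pairs an applicability test with an action that emits
# depictors; depict() runs the first rule whose test matches the element.
class DepictionRule:
    def __init__(self, test, apply):
        self.test = test      # semantic element -> bool
        self.apply = apply    # semantic element -> list of depictors

def depict(element, rules):
    for rule in rules:
        if rule.test(element):
            return rule.apply(element)
    return []                 # no rule applies: element is not depictable as-is

# Usage: a toy rule for the RELATION "on".
on_rule = DepictionRule(
    test=lambda e: e.get("relation") == "on",
    apply=lambda e: [("position", e["figure"], "on_top_surface_of", e["ground"])],
)
print(depict({"relation": "on", "figure": "bird", "ground": "cat"}, [on_rule]))
```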
Example: "The cowboy rode the red bicycle to the store"[p06]
A set of transduction rules invokes depictors for constraints that are based on common-sense knowledge and are not explicitly stated in the text. For example: if X is next to Y and X is not already on a surface, put X on the same surface as Y.
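The same 'next to' rule as a hedged sketch; the scene representation below is an assumption made for illustration:

```python
# Sketch of the implicit-constraint rule above: if X is next to Y and X is not
# already on a surface, X inherits Y's surface. The scene dict is invented.
def apply_next_to_rule(scene, x, y):
    if ("next_to", x, y) in scene["relations"] and x not in scene["on_surface"]:
        scene["on_surface"][x] = scene["on_surface"].get(y, "ground")

scene = {
    "relations": {("next_to", "bowl", "cat")},
    "on_surface": {"cat": "table"},
}
apply_next_to_rule(scene, "bowl", "cat")
print(scene["on_surface"])  # {'cat': 'table', 'bowl': 'table'}
```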
Depiction specifications sometimes conflict with one another; such conflicts are resolved by a transduction rule that removes the tentative depictor.
Constraints are applied in a prioritized manner.
For text to be depicted, it must first be interpreted. This is done by relying on the functional properties of objects, and the interpretation will depend on the setting of the scene or the activity being performed.[p07]
Sentences can include abstractions or non-physical properties and relations which cannot be directly depicted. This can be resolved using the following techniques:[p08]
| Technique | Description |
| --- | --- |
| Textualization | Generate 3D extruded text of the word |
| Emblematization | When an entity cannot be depicted directly, some 3D object is used as an emblem for it |
| Characterization | A special type of emblematization for human characters in their roles, achieved with clothing or by having the character hold an emblem |
| Conventional icons | Thought bubbles, a red circle with a slash |
| Literalization | Figurative or metaphorical meanings depicted in a literal manner |
| Personification | A metaphorical statement depicted in a human role |
| Degeneralization | General categorical terms cannot be depicted directly; this is solved by picking a specific object instance of the same class |
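These techniques suggest a fallback chain for entities that lack a direct 3D model; the sketch below is an invented illustration of that idea, not WordsEye's actual control flow:

```python
# Invented fallback chain over the techniques above: prefer a direct 3D model,
# then an emblem, then 3D extruded text (textualization).
def choose_depiction(entity, models, emblems):
    if entity in models:
        return ("model", models[entity])
    if entity in emblems:
        return ("emblem", emblems[entity])
    return ("extruded_text", entity)

models = {"cat": "cat.3ds"}            # hypothetical model database
emblems = {"justice": "scales.3ds"}    # hypothetical emblem table
print(choose_depiction("justice", models, emblems))  # ('emblem', 'scales.3ds')
print(choose_depiction("freedom", models, emblems))  # ('extruded_text', 'freedom')
```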