@inproceedings{CoyneSproat2001,
  author    = {Coyne, Bob and Sproat, Richard},
  title     = {WordsEye: an automatic text-to-scene conversion system},
  booktitle = {SIGGRAPH '01: Proceedings of the 28th annual conference on Computer graphics and interactive techniques},
  year      = {2001},
  pages     = {487--496},
  publisher = {ACM},
  address   = {New York, NY}
}
The actual paper can be found at http://portal.acm.org/ft_gateway.cfm?id=383316&type=pdf
Natural language is an effective medium for describing visual ideas and mental images.
Language-based 3D scene generation is emerging, allowing anyone to quickly generate 3D scenes without knowledge of specialized software.[p01]
WordsEye, which relies on a large database of 3D models, is a system for automatically converting text into representative 3D scenes. Objects can have shape displacements, spatial tags, and functional properties.
There is a need for a new paradigm in which the creation of 3D scenes is immediate and easy through the use of natural language.
A previous system was the 'Put' system, but it was limited to spatial arrangements of existing objects and a limited subset of English.
WordsEye attempts to provide an empty canvas on which users can paint a picture with words, depicting spatial relations and actions performed by objects, from a wide range of text.[p02]
How it works: [p02]
Example: "John said that the cat was on the table. The animal was next to a bowl of apples", the software would construct a picture of a human character with a cartoon speech bubble coming out of its mouth. In the speech bubble would be a picture of a cat on a table with a bowl of apples next to it.
Limitations:
Unpredictability of the graphical result
Text is tagged and parsed using a part-of-speech tagger and a statistical parser, which generates a parse tree representing the structure of the sentence. The parse tree is then converted into a dependency representation: a list of the words in the sentence showing, for each word, the words it depends on and the words that depend on it.[p02] The dependency representation allows for easier semantic analysis.
```
said --> John
     --> that --> was --> the
                      --> cat
                      --> on --> the
                             --> table
```
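As a hedged aside, a modern dependency parser produces the same kind of head/dependent structure. WordsEye's own pipeline predates spaCy, so the sketch below is purely illustrative and is not the paper's implementation:

```python
# Illustrative only: spaCy stands in for WordsEye's tagger/parser pipeline to
# show the head/dependent structure a dependency analysis yields.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("John said that the cat was on the table.")

for token in doc:
    # each word, its dependency label, and the word it depends on (its head)
    print(f"{token.text:>6} --{token.dep_}--> {token.head.text}")
```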
Next, the dependency structure is converted into a semantic representation, which is a description of the entities to be depicted in the scene and the relationships between them. Each element in the list is a representation fragment corresponding to a specific node of the dependency structure. Each node will be depicted by a given 3D object.
The appropriate semantic interpretation is found by a table lookup on the word in question, differing based upon what kind of thing the word denotes. WordNet is used to provide semantic hypernym and hyponym relations. Prepositions are handled by semantic functions, and verbs are handled by semantic frames. WordsEye has semantic entries for 1,300 English nouns and 2,300 verbs.
WordsEye also interprets anaphoric or coreferring expressions (pronominals like 'he' or 'she') and nouns through associations in the WordNet hierarchy.
Example: the denotations of 'cat' are a subset of the denotations of 'animal', so 'the animal' can corefer with a previously mentioned cat.
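A minimal sketch of that subset check using NLTK's WordNet interface (not part of WordsEye itself; the synset names are standard WordNet identifiers):

```python
# Sketch of the hypernym check behind "the animal" coreferring with "cat".
# Requires nltk with the wordnet corpus downloaded: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

cat = wn.synset("cat.n.01")
animal = wn.synset("animal.n.01")

# Collect every hypernym reachable from 'cat' and test membership.
hypernyms = set(cat.closure(lambda s: s.hypernyms()))
print(animal in hypernyms)  # True: 'cat' denotes a subset of 'animal'
```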
Scenes are defined in terms of low-level graphical specifications called depictors, which control 3D object visibility, size, position, orientation, colour, and transparency. They are also used to specify human poses and inverse kinematics, and to modify vertex displacements for facial expressions.[p03]
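A hypothetical sketch of a depictor record with fields matching the controls listed above; the paper gives no concrete data layout, so all field names here are assumptions:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Depictor:
    """Assumed shape of a low-level depictor; field names are illustrative."""
    target: str                                                 # the 3D object being controlled
    visible: bool = True
    size: float = 1.0                                           # scale relative to default size
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    orientation: Tuple[float, float, float] = (0.0, 0.0, 0.0)   # Euler angles
    colour: Optional[str] = None
    transparency: float = 0.0                                   # 0 = opaque, 1 = invisible
```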
WordsEye associates additional information with each 3D model:

| Property | Description |
| --- | --- |
| Skeletons | Skeletal control structures |
| Shape displacements | Associated with the object; used to depict emotions |
| Parts | Polygonal faces representing significant areas of the surface |
| Colour parts | Set of parts to be coloured when the text specifies the object as having a particular colour |
| Opacity parts | Parts that are made transparent |
| Default size | Each object is given a default size, expressed in feet |
| Functional properties | Used to depict how an object can be used |
| Spatial tags | The shape of the object must be known to resolve spatial relations |
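To make the table concrete, here is an assumed per-model record for a cat model; every value is invented for illustration and none comes from the paper:

```python
# Hypothetical metadata record mirroring the table above; all values invented.
cat_model = {
    "name": "cat",
    "skeleton": "quadruped_rig",                 # skeletal control structure
    "shape_displacements": ["hiss", "relaxed"],  # e.g. to depict emotions
    "parts": ["head", "back", "tail", "paws"],
    "colour_parts": ["fur"],                     # coloured for "the red cat"
    "opacity_parts": [],
    "default_size_ft": 1.5,                      # default size, in feet
    "functional_properties": ["can_be_held"],
    "spatial_tags": {"top_surface": "back", "base": "paws"},
}
```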
Spatial relations define the basic layout of the scene. Relative positions such as 'next to', 'behind', and 'facing' are frequently an implicit part of actions and compound objects.
Example: "The bird is on the cat", find a top surface tag for the cat (on its back) and a base tag for the bird (under its feet). The bird is then repositioned so that its feet are on the cat's back.
WordsEye distinguishes four kinds of poses:

| Pose type | Description |
| --- | --- |
| Standalone poses | A character in a particular position |
| Specialized usage | A character using a specific instrument |
| Generic usage | A character interacting with a generic stand-in object |
| Grip poses | A character holding a specific object in a certain way |
WordsEye's depiction module translates the high-level semantic representation produced by the linguistic analysis into low-level depictors.[p06]
Process:
The main semantic element types are ENTITY (nouns), ACTION (verbs), ATTRIBUTE (adjectives), and RELATION (prepositions). Additional, more specialized types are PATHS, TIMESPEC, CONJUNCTION, POSSESSIVE, NEGATION, and CARDINALITY.
In order to depict a sentence, the semantic elements must be made graphically realizable. This is done by applying a set of depiction rules, which are tested for applicability and then applied to translate semantic elements into graphical depictors.
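A sketch of the test-then-apply pattern this describes; the rule and element structures are assumptions, not the paper's actual implementation:

```python
# Each depiction rule pairs an applicability test with an action that emits
# depictors; depict() runs the first rule whose test matches the element.
class DepictionRule:
    def __init__(self, test, apply):
        self.test = test      # semantic element -> bool
        self.apply = apply    # semantic element -> list of depictors

def depict(element, rules):
    for rule in rules:
        if rule.test(element):
            return rule.apply(element)
    return []                 # no rule applies: element is not depictable as-is

# Usage: a toy rule for the RELATION "on".
on_rule = DepictionRule(
    test=lambda e: e.get("relation") == "on",
    apply=lambda e: [("position", e["figure"], "on_top_surface_of", e["ground"])],
)
print(depict({"relation": "on", "figure": "bird", "ground": "cat"}, [on_rule]))
```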
Example: "The cowboy rode the red bicycle to the store"[p06]
A set of transduction rules invokes depictors for constraints that are based on common-sense knowledge and are not explicitly stated in the text. For example: if X is next to Y and X is not already on a surface, put X on the same surface as Y.
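The same 'next to' rule as a hedged sketch; the scene representation below is an assumption made for illustration:

```python
# Sketch of the implicit-constraint rule above: if X is next to Y and X is not
# already on a surface, X inherits Y's surface. The scene dict is invented.
def apply_next_to_rule(scene, x, y):
    if ("next_to", x, y) in scene["relations"] and x not in scene["on_surface"]:
        scene["on_surface"][x] = scene["on_surface"].get(y, "ground")

scene = {
    "relations": {("next_to", "bowl", "cat")},
    "on_surface": {"cat": "table"},
}
apply_next_to_rule(scene, "bowl", "cat")
print(scene["on_surface"])  # {'cat': 'table', 'bowl': 'table'}
```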
Depiction specifications sometimes conflict with one another; such conflicts are resolved by a transduction rule that removes the tentative depictor.
Constraints are applied in a prioritized manner.
For text to be depicted, it must first be interpreted. This is done by relying on the functional properties of objects, and the interpretation will depend on the setting of the scene or the activity being performed.[p07]
Sentences can include abstractions or non-physical properties and relations which cannot be directly depicted. This can be resolved using the following techniques:[p08]
| Technique | Description |
| --- | --- |
| Textualization | Generate 3D extruded text of the word |
| Emblematization | When an entity cannot be depicted directly, some 3D object is used as an emblem for it |
| Characterization | A special type of emblematization for human characters in their roles, achieved with clothing or by having the character hold an emblem |
| Conventional icons | Thought bubbles, a red circle with a slash |
| Literalization | Figurative or metaphorical meanings depicted in a literal manner |
| Personification | A metaphorical statement depicted in a human role |
| Degeneralization | General categorical terms cannot be depicted directly; this is solved by picking a specific object instance of the same class |
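These techniques suggest a fallback chain for entities that lack a direct 3D model; the sketch below is an invented illustration of that idea, not WordsEye's actual control flow:

```python
# Invented fallback chain over the techniques above: prefer a direct 3D model,
# then an emblem, then 3D extruded text (textualization).
def choose_depiction(entity, models, emblems):
    if entity in models:
        return ("model", models[entity])
    if entity in emblems:
        return ("emblem", emblems[entity])
    return ("extruded_text", entity)

models = {"cat": "cat.3ds"}            # hypothetical model database
emblems = {"justice": "scales.3ds"}    # hypothetical emblem table
print(choose_depiction("justice", models, emblems))  # ('emblem', 'scales.3ds')
print(choose_depiction("freedom", models, emblems))  # ('extruded_text', 'freedom')
```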