In my theory the target problem has only a single knowledge state: the ``goal state'' is not represented at the start. Instead, the ``solution'' is represented in terms of criteria that determine whether the problem is solved. I am dealing with insight problems, for which I assume that a clear picture of the goal state is practically all one needs to solve the problem; most of the work lies in finding out what the goal state is, as opposed to how to get there.
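As an illustration, the following Python sketch represents the solution as a conjunction of criteria over a knowledge state rather than as an explicit goal state to be matched. The predicates, field names, and thresholds are my own illustrative assumptions, not part of the theory's formalism.

\begin{verbatim}
# Illustrative sketch: the "solution" is a set of criteria (predicates)
# over a knowledge state, not a target state to be matched.
# All names and thresholds below are assumptions for illustration.

def tumor_destroyed(state):
    return state["tumor_intensity"] >= state["destruction_threshold"]

def tissue_unharmed(state):
    return state["max_tissue_intensity"] < state["damage_threshold"]

GOAL_CRITERIA = [tumor_destroyed, tissue_unharmed]

def is_solved(state):
    # The problem counts as solved when every criterion holds.
    return all(criterion(state) for criterion in GOAL_CRITERIA)
\end{verbatim}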
An agent cannot tell whether the problem is solved by examining the visual representation alone. It must translate the visual knowledge state back into a non-visual representation and run a simulation to determine the effectiveness of the manipulations it has made; by simulation I mean predicting the behavior of a system given knowledge of how it works. For example, redirecting the weaker rays so that they converge on the tumor cannot be identified as an adequate solution unless the agent understands that the result would be the destruction of the tumor while the healthy tissue is left unharmed; only then can the goal criteria be applied. To make this translation possible, the agent maintains correspondences between elements of the knowledge states throughout the transfer process, and it uses this information to turn the final solution state back into a non-visual representation whose workings can be simulated and tested against the goal criteria.
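A hedged sketch of this evaluation loop for the ray problem follows, reusing \texttt{is\_solved} from the previous sketch. The correspondence table, the state format, and the simulation rule (intensities of converging rays add at the tumor) are assumptions made for illustration, not a claim about the actual model.

\begin{verbatim}
# Illustrative evaluation loop: use the maintained correspondences to
# translate the manipulated visual state back into a non-visual one,
# simulate its behavior, and apply the goal criteria.
# All structures and numbers are assumptions for illustration.

correspondences = {"arrow_1": "ray_1", "arrow_2": "ray_2",
                   "arrow_3": "ray_3", "circle": "tumor"}

# Final visual state: three weak arrows now point at the circle.
visual_state = {
    "arrow_1": {"points_at": "circle", "strength": 0.4},
    "arrow_2": {"points_at": "circle", "strength": 0.4},
    "arrow_3": {"points_at": "circle", "strength": 0.4},
}

def to_nonvisual(visual, mapping):
    # Map each visual element back onto its domain counterpart.
    return {mapping[e]: {"aimed_at": mapping[p["points_at"]],
                         "intensity": p["strength"]}
            for e, p in visual.items()}

def simulate(rays):
    # Causal prediction: intensities of rays converging on the tumor
    # add up there, while healthy tissue receives only one ray's worth.
    at_tumor = sum(r["intensity"] for r in rays.values()
                   if r["aimed_at"] == "tumor")
    per_ray = max(r["intensity"] for r in rays.values())
    return {"tumor_intensity": at_tumor, "max_tissue_intensity": per_ray,
            "destruction_threshold": 1.0, "damage_threshold": 0.5}

outcome = simulate(to_nonvisual(visual_state, correspondences))
print(is_solved(outcome))  # True under these assumed thresholds
\end{verbatim}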
To simulate, then, the agent needs causal knowledge. Neither this causal knowledge nor the goal criteria can be represented with visual information alone, since causality consists of more than visual relationships between things.
By causal knowledge I mean knowledge of how things in a system change as they interact. Pre- and post-conditions are a straightforward way to represent this, but it is difficult to imagine what ``visual'' pre- and post-conditions might look like. For the reasons above, I hypothesize that visual representations alone cannot enable evaluation of the solution.
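To suggest what non-visual pre- and post-conditions could look like, here is a minimal sketch in the style of a STRIPS operator; the rule and the numbers are assumptions for illustration, not a claim about how such causal knowledge is actually encoded.

\begin{verbatim}
# Minimal sketch of causal knowledge as a pre-/post-condition rule,
# in the style of a STRIPS operator. The rule is an assumption
# made for illustration.

def converge_rays(state):
    # Pre-condition: there are rays and all of them aim at the tumor.
    if not state["rays"] or any(r["aimed_at"] != "tumor"
                                for r in state["rays"]):
        return state  # pre-condition fails; the state is unchanged
    # Post-condition: the rays' intensities sum where they converge.
    new_state = dict(state)
    new_state["delivered_intensity"] = sum(r["intensity"]
                                           for r in state["rays"])
    return new_state

state = {"rays": [{"aimed_at": "tumor", "intensity": 0.4}] * 3,
         "delivered_intensity": 0.0}
print(converge_rays(state)["delivered_intensity"])  # about 1.2
\end{verbatim}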