@Article{LeherteGlasgowBaxterSteegFortier1997, author = {Leherte, L. and Glasgow, J. and Baxter, K. and Steeg, E. and Fortier, S.}, title = {Analysis of three-dimensional protein images}, journal = {Journal of Artificial Intelligence Research}, year = {1997}, volume = {7}, pages = {125--159} }
Molecular scene analysis: the processes of reconstruction, classification and understanding of molecular images. [125] It uses rules of biochemistry and structural templates to interpret images from crystallization experiments. A protein crystal is a substance in which the protein regularly repeats. Any one repeating block of the crystal is termed a "unit cell." [126] A unit cell is used to create an "electron density map" (EDM), which is "a 3d array of real values that estimate the electron density at given locations in the unit cell." That is, regions with a lot of electrons indicate atoms, so the rough shape of the molecules in the unit cell can be seen in the EDM. These EDMs are noisy because of the phase problem (the diffraction experiment records intensities but not the phases needed to reconstruct the map exactly). Interpretation of EDMs involves a biologist segmenting the image into features, which are then compared with anticipated structural motifs. These guesses yield information that allows the EDM to be refined further. The process is very slow and requires an expert, who must recognize motifs in the 3d representation. The eventual goal of the research this paper describes is to automate this process.
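To make the "3d array of real values" idea concrete, here is a minimal sketch that builds a toy density grid by placing a Gaussian blob at each atom. The function name, the Gaussian model, and the atom list are all assumptions for illustration; real EDMs come from diffraction data, not from known atom positions.

```python
import math

def toy_density_map(atoms, size, sigma=1.0):
    """Return a size x size x size nested list of density estimates.

    atoms: list of (x, y, z, electrons) tuples in grid coordinates.
    Each atom contributes a Gaussian blob weighted by its electron count
    (an illustrative model, not how a real EDM is computed).
    """
    grid = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    for i in range(size):
        for j in range(size):
            for k in range(size):
                for (x, y, z, n_electrons) in atoms:
                    d2 = (i - x) ** 2 + (j - y) ** 2 + (k - z) ** 2
                    grid[i][j][k] += n_electrons * math.exp(-d2 / (2 * sigma ** 2))
    return grid

# Hypothetical atoms: a carbon (6 electrons) and an oxygen (8 electrons).
atoms = [(2, 2, 2, 6), (5, 5, 5, 8)]
edm = toy_density_map(atoms, size=8)
```

Voxels near an atom hold large values and voxels far from every atom hold values close to zero, which is why the rough molecular shape is visible in the map.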
Ideally, we would be able to predict global structure from the amino acid sequence, which is relatively easy to get. Because structure prediction is a difficult and unsolved problem, x-ray crystallography and nuclear magnetic resonance (NMR) are the only realistic ways to determine structure. [128]
The 3d annotated graph
The simple representation this work uses preserves relevant shape,
connectivity, and distance information in 3d annotated graphs, which
trace the main and side chains of the protein. (A protein is made of a
chain of amino acids, the main chain, with side chains attached.) The
graph nodes are amino acid residues and the edges are bond interactions.
This graph can be used to determine secondary structure motifs in the
protein. (Secondary structures are alpha helices and beta sheets. Most
proteins are made of these structures and "loops," which connect
them.)
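A minimal sketch of such a graph, assuming residues are stored with 3d coordinates and each edge records a bond kind and the distance between its endpoints. The class names, field names, and example coordinates are hypothetical and not taken from the paper.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Residue:
    name: str        # e.g. "ALA"
    position: tuple  # (x, y, z) in angstroms
    annotations: dict = field(default_factory=dict)

@dataclass
class AnnotatedGraph:
    nodes: dict = field(default_factory=dict)  # residue id -> Residue
    edges: dict = field(default_factory=dict)  # frozenset({a, b}) -> attributes

    def add_residue(self, rid, residue):
        self.nodes[rid] = residue

    def connect(self, a, b, kind):
        # Store the bond kind and the Euclidean distance as edge annotations.
        dist = math.dist(self.nodes[a].position, self.nodes[b].position)
        self.edges[frozenset((a, b))] = {"kind": kind, "distance": dist}

    def neighbors(self, rid):
        # Each edge is a two-element set; the neighbor is the other element.
        return sorted(next(iter(e - {rid})) for e in self.edges if rid in e)

# Hypothetical three-residue main chain.
g = AnnotatedGraph()
g.add_residue(1, Residue("GLY", (0.0, 0.0, 0.0)))
g.add_residue(2, Residue("ALA", (3.8, 0.0, 0.0)))
g.add_residue(3, Residue("SER", (3.8, 3.8, 0.0)))
g.connect(1, 2, "main-chain")
g.connect(2, 3, "main-chain")
```

Keeping distances on the edges and free-form annotations on the nodes is one way to preserve the shape, connectivity, and distance information the notes mention.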
Section 2 of this paper describes the basic molecular biology of proteins, which I will not summarize.
Much of vision research involves constructing a 3d model from 2d images. In contrast, crystallographic data is already in 3d voxels. It is noisy and incomplete, but shadows, shading, and occlusion are not problems. [130]
From the initial low-rez EDM, the first goal is to locate where the
protein is and distinguish it from water (the "solvent").[131]
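One simple way to picture this first goal is a density threshold that labels each voxel as protein or solvent. This is only an illustrative assumption: these notes do not say how the paper separates protein from solvent, and the function names, threshold, and grid values below are made up.

```python
def protein_mask(edm, threshold):
    """Mark voxels whose density exceeds the threshold as protein (True)."""
    return [[[value > threshold for value in row] for row in plane] for plane in edm]

def solvent_fraction(mask):
    """Fraction of voxels labeled as solvent (False)."""
    voxels = [v for plane in mask for row in plane for v in row]
    return voxels.count(False) / len(voxels)

# Hypothetical 2 x 2 x 2 low-resolution map.
edm = [
    [[0.1, 0.2], [0.1, 0.9]],
    [[0.8, 1.2], [0.2, 0.1]],
]
mask = protein_mask(edm, threshold=0.5)
fraction = solvent_fraction(mask)
```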
From the medium-rez EDM, the goal is to identify amino acid residues
and the secondary structures (alpha helices and beta sheets). At
high-rez the goal is to identify specific residues and perhaps the
locations of individual atoms.
The first step of scene analysis is to partition the image into regions, where each region ideally corresponds to some meaningful part. These parts are used as input for a classifier. Critical points define a skeleton of the density (obtained without thinning). The Protein Data Bank (PDB), a database of protein structures, supplies the data for pattern recognition. Substructures recur across protein shapes.
The paper argues for the feasibility of topological approaches to low- and medium-rez EDMs. [133]
There are peaks and passes along the chain of amino acids in the EDM. The peaks are generally associated with amino acid residues, and the passes with the bonds that link them. [137] Where there is ambiguity, the authors plan to have the system evaluate competing hypotheses.
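The peak/pass idea can be sketched in one dimension, assuming the density has been sampled at equal steps along a traced chain. In the 3d map a pass is a saddle point, and along the path joining two peaks it shows up as a minimum, which is the simplification used here. The function name and the sample profile are hypothetical.

```python
def peaks_and_passes(profile):
    """Label interior samples of a density profile as peaks or passes.

    profile: density values sampled at equal steps along a traced chain.
    A peak is a strict local maximum; a pass is a strict local minimum
    lying between two peaks.
    """
    peaks, minima = [], []
    for i in range(1, len(profile) - 1):
        if profile[i] > profile[i - 1] and profile[i] > profile[i + 1]:
            peaks.append(i)
        elif profile[i] < profile[i - 1] and profile[i] < profile[i + 1]:
            minima.append(i)
    # Keep only minima that sit between the first and last peak.
    passes = [m for m in minima if peaks and peaks[0] < m < peaks[-1]]
    return peaks, passes

# Hypothetical profile: three residues linked by two bonds.
profile = [0.2, 0.9, 0.5, 1.1, 0.4, 1.0, 0.3]
peaks, passes = peaks_and_passes(profile)
```

Here the peaks land at indices 1, 3, and 5 and the passes at indices 2 and 4, so peaks and passes alternate the way residues and the bonds between them do.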
At low resolution, linear sequences of critical points correspond to secondary structures.
SYSTEM: ORCRIT
The peaks and passes at 5 angstroms have a hierarchical relationship with the more detailed peaks and passes at 3 angstroms.[140]
Statistics were gathered from 63 protein structures. f(ssm|g) is the probability distribution of a secondary structure motif ssm given the geometrical constraints g. These distributions were computed for alpha helices, beta sheets, and turns (loops). [142]
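A distribution like f(ssm|g) can be estimated as relative frequencies, assuming the geometric descriptor g has been discretized into bins. The sketch below shows that idea only; the bin labels, the observations, and the function name are invented, and the paper's actual estimation procedure is not described in these notes.

```python
from collections import Counter, defaultdict

def estimate_conditional(observations):
    """Estimate f(ssm | g) as relative frequencies.

    observations: iterable of (g, ssm) pairs, where g is a discretized
    geometric descriptor and ssm is a motif label.
    """
    counts = defaultdict(Counter)
    for g, ssm in observations:
        counts[g][ssm] += 1
    return {
        g: {ssm: n / sum(c.values()) for ssm, n in c.items()}
        for g, c in counts.items()
    }

# Hypothetical observations: g is a binned distance between critical points.
observations = [
    ("short", "helix"), ("short", "helix"), ("short", "turn"),
    ("long", "strand"), ("long", "strand"), ("long", "strand"), ("long", "turn"),
]
f = estimate_conditional(observations)
```

With these made-up counts, f["long"]["strand"] is 0.75, meaning that three of the four "long" observations were strands.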
Both a Bayesian/Minimum Message Length (MML) approach and a MYCIN-like approach were used to try to identify secondary structure in ideal data. Neither the first Bayes method nor the MYCIN approach was uniformly better: which one did better depended on whether an alpha helix, beta strand, or turn was being identified. [148] The second Bayes approach produced many false positives. [149]
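For readers unfamiliar with MYCIN, its evidence handling is built on certainty factors in the range -1 to 1. The sketch below shows the classic MYCIN combination rule as an illustration only; these notes do not say how the paper's MYCIN-like method combines evidence, and the example values are hypothetical.

```python
def combine_cf(cf1, cf2):
    """Combine two certainty factors in [-1, 1] using the classic MYCIN rule.

    The mixed-sign case is undefined when one factor is 1 and the other -1
    (division by zero), so callers should avoid total contradiction.
    """
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 * (1 + cf1)
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))

# Hypothetical evidence for "this run of critical points is a helix".
supporting = combine_cf(0.6, 0.5)          # two rules agree
conflicted = combine_cf(supporting, -0.3)  # a third rule disagrees
```

Agreeing evidence pushes the combined factor toward 1 without ever exceeding it, while conflicting evidence pulls it back toward 0.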
The experiment was also run on experimental data, using a post-processed version of the ORCRIT output. No method was much better than any other. All were "relatively successful" at identifying secondary structures.