In the set of training molecules, structures such as nicotine or coniine were deliberately included several times, although the copies differ only in their orientation in space or in minor conformational details. These structures provided suitable test cases for the procedure worked out in this thesis, which must of course recognize structures that are identical or that differ only by minor conformational changes. Nicotine has a pyridine ring, which the program recognizes first as a ring with double bonds and second as an aromatic ring. By removing the double bonds, the more important feature 62 (aromatic six-membered ring) remains. In general, a priority list can be applied to give preference to important features such as aromatic rings. In this work, the ligands were selected in such a way that such a list was not needed. This could probably have been handled in a more elegant way, but because of the limited time available for this work, the manual correction was considered an acceptable alternative. For the encoding of the 56 structures given above the following list of structural features was set up.
 1) Benzatropine                "bensatropin"                          original
 2) Budesonide                  "budesonid"                            original
 3) Bupivacaine                 "bupivakain"                           original
 4) 2-Chloro-4-hydroxy-6-amino- "2-chloro-4-hydroxy-6-1,3,5-triazine"  PDB set
 5) 2-Hydroxyglutarate          "2-hydroxyglutarate"                   PDB set
 6) 2-Chloroethanol             "2-chloroethanol"                      PDB set
 7) 2-Hydroxy-4,6-diamino-      "2-chloro-4-hydroxy-6-1,3,5-triazine"  PDB set
 8) Cocaine                     "kokain"                               original
 9) 4-Methylheptan-3-ol         "4-Methylheptan-3-ol"                  PDB set
10) Lofepramine                 "lofepramin"                           original
11) Amphetamine                 "Amphetamine"                          PDB set
12) Acetophenone                "ar0016"                               PDB set
13) Meclizine                   "meklozin"                             original
14) Methadone                   "metadon"                              original
15) Morphine                    "morfin"                               Nicotine set
16) Noscapine                   "noskapin"                             original
17) Oxotremorine                "oxotremorin"                          original
18) Atrazine                    "atrazine"                             PDB set
19) Benzyl alcohol              "Benalcohol"                           PDB set
20) Procaine                    "prokain"                              original
21) Simanneal                   "simanneal"                            original
22) Terfenadine                 "terfenadin"                           original
23) Tiazotienol                 "tiazotienol"                          original
24) Trimipramine                "trimipramin"                          original
25) Meropenem                   "meropenem"                            original
26) Benzaldehyde                "Benzaldehyde"                         PDB set
27) 2,4-Dihydroxy-6-amino-      "2,4-Dihydroxy-6-amino-1,3,5-triazine" PDB set
28) 2-Chloro-4,6-diamino-       "2-chloro-4-hydroxy-6-1,3,5-triazine"  PDB set
29) Spirodioxaundecane          "spirodioxaundecane"                   PDB set
30) Santene                     "santene"                              PDB set
31) S-7-Methyl-3-nonanone       "S-6-Methyl"                           PDB set
32) R-Sulcatol                  "R-Sulcatol"                           PDB set
33) Phenol                      "Phenol"                               PDB set
34) R-Seudenol                  "R-Seudenol"                           PDB set
35) Piperonal                   "Piperonal"                            PDB set
36) Pentachlorophenol           "Pentachlorophenol"                    PDB set
37) Nicotine                    "Nicotine"                             original + mc
38) Nicotine                    "Nicotine2"                            original + mc
39) Nicotine                    "nicotine"                             original + mc
40) myo-Inositol                "myo-inositol"                         PDB set
41) Mescaline                   "Mescaline"                            PDB set
42) m-Cresol                    "m-Cresol"                             PDB set
43) Linoleate                   "linoleate"                            PDB set
44) Isopropylammelide           "isopropylammelide"                    PDB set
45) Ibuprofen                   "Ibuprofen"                            PDB set
46) Frontalin                   "Frontalin"                            PDB set
47) exo-Brevicomin              "exo-Brevicomin"                       PDB set
48) Epinephrine                 "Epinaphrine"                          PDB set
49) Dopamine                    "Dopamine"                             PDB set
50) d-3-Hydroxyproline          "D-3-hydroxyproline"                   PDB set
51) d-Tartrate                  "D-tartarate"                          PDB set
52) Coniine                     "Coniine"                              PDB set
53) Coniine                     "coniine"                              PDB set
54) Disulfiram                  "Disulfiram"                           PDB set
55) β-D-Galactose               "beta-D-Gal"                           PDB set
56) Benzoic acid                "Benzoic-acid"                         PDB set
It must also be mentioned that a series of molecules had to be excluded from the training library because of encoding problems. The molecules that were excluded because of incorrect interpretation are:
Code number   Structural feature element
1             carbonyl
2             not in use
3             ether
4             alcohol
5             ester
6             carboxylic acid
7             primary amine
8             secondary amine
9             tertiary amine
10            quaternary amine
11            primary amide
12            secondary amide
13            tertiary amide
14            not in use
15            primary imine
16            secondary imine
17            amide of the type R(O=)C-N=C-R
18, 19        not in use
20            doubly bonded atom
21            triply bonded atom
22-29         not in use
30            non-hydrogenated carbon
31-39         not in use
40            divalent S
41            SO group
42            SO2 group
43, 44        not in use
45            F atom
46            Cl atom
47            Br atom
48            I atom
49            not in use
50            CH3 group
51            C neighbour for CH3
52-60         not in use
61            aromatic 5-membered ring center
62            aromatic 6-membered ring center
63            not in use
64            center of a 4-membered ring
65            center of a 5-membered ring
66            center of a 6-membered ring
67            center of a 7-membered ring
All information is stored in two fields: the integer vector "atomsort" of length N, which contains the atomic number of each of the N atoms, and the (3,N) matrix "posxyz", which contains the x, y, z coordinates of each atom. There is another matrix called "platsixyz" corresponding to "posxyz", which has the following form:

ID   AN   x   y   z
1
2
3
.
.
In this setup, px is connected to and sorted along with x, py with y, and pz with z. If the elements of column vector px are sorted according to their position in the x direction, the elements of column vector py according to their position in the y direction, and those of vector pz according to their position in the z direction, each independently of the others, the information about which atom a certain px, py, or pz value belongs to is lost unless each px, py and pz is associated with the appropriate ID reference. Taking this into account, each set of position variables x, y, z is sorted according to all interatomic distances from large to small. All x-values are sorted by placing the highest x-value of an atom first in the x-column, the second highest in position 2, and so on. All y-values are sorted separately, again with the highest y-value placed first in the y-column. The z-values are ordered in the same way. The two-dimensional integer field platsixyz of dimension (3,N) ensures that each x, y, or z value is associated with the ID of the atom it belongs to. If, for example, the x value of the 5th atom with ID=5 ends up in position 10 after the sorting, then element platsixyz(1,10) = 5; if the y value of atom 5 ends up in position 20, then element platsixyz(2,20) = 5; and if the z value of atom 5 is the smallest found, then element platsixyz(3,1) = 5, etc. So the correct expression to retrieve the y value for atom 5 after the sorting is posxyz(2,platsixyz(2,5)).

As it turned out, this approach may not be the most efficient for handling the actual problems of this work. However, its advantage is that sorting operations replace more expensive algebraic calculations, which would be necessary, for example, if the connectivity were determined via the distance matrix, where the Euclidean distance (square root of a sum of squares) has to be calculated for each pair of atoms.
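The relation between the coordinates and the ID index can be illustrated with a minimal Python sketch. It is not taken from kol.f90: the function names `build_sorted_index` and `rank_of` are hypothetical, positions are 0-based (Python) rather than 1-based (Fortran), and the sketch assumes that the coordinates stay in atom-ID order while only the index field is sorted, highest value first.

```python
def build_sorted_index(posxyz):
    """posxyz: list of (x, y, z) tuples; atom IDs are 1-based list positions.

    Returns platsixyz as three lists, one per axis. platsixyz[d][k] is the
    ID of the atom whose coordinate d is the (k+1)-th highest.
    """
    n = len(posxyz)
    platsixyz = []
    for d in range(3):
        # Sort the IDs, not the coordinates, so the atom association is kept.
        ids = sorted(range(1, n + 1),
                     key=lambda i, d=d: posxyz[i - 1][d],
                     reverse=True)
        platsixyz.append(ids)
    return platsixyz


def rank_of(platsixyz, d, atom_id):
    """Inverse lookup: sorted position (0-based) of atom_id along axis d."""
    return platsixyz[d].index(atom_id)


# Three atoms with made-up coordinates (illustration only).
posxyz = [(0.0, 1.0, 2.0), (1.5, -0.5, 0.0), (-1.0, 0.2, 3.1)]
platsixyz = build_sorted_index(posxyz)
print(platsixyz)                         # [[2, 1, 3], [1, 3, 2], [3, 1, 2]]
print(posxyz[platsixyz[1][0] - 1][1])    # highest y value: 1.0
```

Keeping the coordinates in ID order makes the lookup in both directions explicit: `platsixyz[d][k]` answers "which atom is at sorted position k", and `rank_of` answers "where did a given atom end up".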
With the procedure developed in this thesis it is possible to determine the connectivity between the atoms in the molecule by choosing an ID and checking for close values in x, y, z of other atoms. When proceeding stepwise through a sorted matrix, many values can be skipped. If a value exceeds a certain threshold (e.g. 1.44 Å for aromatic CC bonds) in the x, y or z direction, no further steps in the corresponding direction are needed. Such a search algorithm could be of particular interest in molecular dynamics simulations. Sorting a vector that is already pre-sorted takes less computing power than sorting a completely disordered vector. During successive loops within an optimization, the x, y, z values of other atoms close to a single, targeted atom are often only refined, which favours the sorting approach. In program kol.f90, no distance matrix is calculated. However, program kol.f90 checks all distances between non-hydrogen atoms in parallel with possible secondary connections. Two atoms that are covalently bonded to a common neighbour atom are viewed as having a secondary connection to each other. In the search for atom bonds and secondary connections, all atoms are first considered as base atoms. For each base atom, neighbors within the range of atom-atom bonds and secondary connections are found by searching in the corresponding pre-sorted x, y and z vectors. The two atoms of each pair of secondary neighbors are considered as corner points of a triangle, which implies the possibility of a ring, where the base atom is the third corner point. Atom candidates for the latter are then compared with each other with respect to distance. A constellation of three atoms and three appropriate distances (which does not exclude the possibility that the atoms are incorporated in a common ring) gives a potential center defined by the average xa, ya, za of all involved positions.
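The stepwise search with early termination can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the routine used in kol.f90: the names `presort`, `axis_candidates` and `box_neighbours` are hypothetical, the vectors are sorted in ascending order, and the result is the set of atoms lying within the threshold of the base atom along every axis, i.e. a candidate set to which any finer distance criterion could then be restricted.

```python
def presort(posxyz):
    """posxyz: dict atom_id -> (x, y, z). Returns, per axis, the IDs sorted
    by that coordinate (ascending)."""
    return [sorted(posxyz, key=lambda i, d=d: posxyz[i][d]) for d in range(3)]


def axis_candidates(order, posxyz, d, base_id, threshold):
    """Step away from the base atom in one pre-sorted vector and stop in each
    direction as soon as the coordinate gap exceeds the threshold."""
    pos = order.index(base_id)
    base = posxyz[base_id][d]
    found = set()
    k = pos + 1
    while k < len(order) and abs(posxyz[order[k]][d] - base) <= threshold:
        found.add(order[k])
        k += 1
    k = pos - 1
    while k >= 0 and abs(posxyz[order[k]][d] - base) <= threshold:
        found.add(order[k])
        k -= 1
    return found


def box_neighbours(posxyz, orders, base_id, threshold):
    """Atoms within `threshold` of the base atom along x, y and z."""
    result = None
    for d in range(3):
        cand = axis_candidates(orders[d], posxyz, d, base_id, threshold)
        result = cand if result is None else result & cand
    return result


# Made-up coordinates in Angstrom (illustration only).
atoms = {1: (0.0, 0.0, 0.0), 2: (1.4, 0.0, 0.0), 3: (0.0, 1.2, 0.3),
         4: (3.0, 0.0, 0.0), 5: (0.5, 0.5, 2.0)}
orders = presort(atoms)
print(box_neighbours(atoms, orders, 1, 1.44))   # {2, 3}
```

Atom 4 is rejected after a single step along x and atom 5 after a single step along z, without any square root being evaluated.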
The number of atoms involved in the ring can be calculated from the distance between the two atoms related to the base atom by the secondary connectivity. Potential centers that are located close to atom neighbors of the base atom from which they are derived, or close to other potential centers that originate from the same base atom, are deleted. Definite ring centers are obtained through a confirmation process in which centers lying close together in space are flocked into groups. Flocking means that several units float together with respect to a certain aspect and are either treated as a new unit or at least become less easy to distinguish as individual units. If the number of potential centers in a group is high enough, a definite center is taken from the average x, y, z values of all participating potential centers. The flocking process is performed in the same fashion as the search for atoms related to each other in space, that is, by a sort process and stepwise consideration of the different dimensions. The direct connectivity is used when information about functional relations between atoms is stored and transferred to determine functional complexity, e.g. the characterization of functional groups. An ester, for instance, is determined by detecting the typical C=O distance and transferring this information from C to the -O- atom, in combination with the -O- atom's own base atom setting. All features except rings are defined with respect to measurements given by direct connectivity, i.e. covalent bonds. But to detect functional groups that include more than three atoms and two covalent bonds, it is necessary to rely on some kind of transfer of information beyond the limits of a base atom and its closest neighbors. This applies to features like esters vs carbonyls and primary vs secondary amides, and to more complex groups in general.
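The confirmation of ring centers by flocking can be sketched in a few lines of Python. The grouping rule used here (a greedy pass over x-sorted centers, with group membership decided per coordinate), the function name `flock_centers`, and the parameters `radius` and `min_count` are assumptions made for illustration; the thesis does not state the values or the exact rule used in kol.f90.

```python
def flock_centers(potential, radius, min_count):
    """Group potential ring centers that lie within `radius` of a seed center
    along every axis and average each group with at least `min_count` members.

    potential: list of (x, y, z) tuples. Returns a list of definite centers.
    """
    remaining = sorted(potential)          # sorted by x first
    definite = []
    while remaining:
        seed = remaining[0]                # smallest x among the remaining
        group, rest = [], []
        for i, p in enumerate(remaining):
            if p[0] - seed[0] > radius:
                # Sorted in x: everything further on is too far away as well.
                rest.extend(remaining[i:])
                break
            if all(abs(p[d] - seed[d]) <= radius for d in range(3)):
                group.append(p)
            else:
                rest.append(p)
        remaining = rest
        if len(group) >= min_count:
            n = len(group)
            definite.append(tuple(sum(p[d] for p in group) / n
                                  for d in range(3)))
    return definite


# Three nearby suggestions and one isolated outlier (made-up values).
suggested = [(0.0, 0.0, 0.0), (0.1, 0.0, -0.1), (-0.1, 0.1, 0.0),
             (5.0, 5.0, 5.0)]
print(flock_centers(suggested, radius=0.3, min_count=3))
```

In this example the three nearby suggestions are merged into one averaged center, while the isolated suggestion is discarded because it is not confirmed by any other.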
However, the coverage of complex groups in general is difficult in the framework of an F90 program and is therefore beyond the scope of this thesis. (The JavaScript programs in the appendix are, in contrast, easily adaptable to new, complex groups of any reasonable size.) All typical structural features of a given molecule used in this work, such as rings, amines, amides, ethers and so on, are listed in the output file. The format of the output files is similar to that of the input file:

px   py   pz   x   y   z
1    1    1
2    2    2
3    3    3
.
.
F   x   y   z
This setting is used to stress the biological activity, because the biological activity of the ligand is possible only if a structural feature connected with bioactivity is found in both the ligand and the reference molecule. If, however, similarity as such is stressed, a second setting is more appropriate:

1 .AND. 1 -> 1
0 .AND. 0 -> 0
1 .AND. 0 -> 0
0 .AND. 1 -> 0
The latter settings must be used if, for instance, only one distance and two structural features are counted. The settings used are suitable when looking for a special part of a pattern and when every occurrence represents a 50% probability.

1 .AND. 1 -> 1
0 .AND. 0 -> 1
1 .AND. 0 -> 0
0 .AND. 1 -> 0
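The two truth tables can be compared side by side in a short Python sketch. It reads the table whose only 1 is in the first row as a logical AND, and the table in which 0 and 0 also gives 1 as a test for equality. The representation of ligand and reference molecule as lists of 0/1 feature flags, and the function names, are assumptions made for illustration.

```python
def match_both_present(ligand, reference):
    """1 only where the feature is present in both molecules
    (1,1 -> 1; every other combination -> 0)."""
    return [a & b for a, b in zip(ligand, reference)]


def match_equal(ligand, reference):
    """1 where both molecules agree, whether the feature is present or
    absent (1,1 -> 1 and 0,0 -> 1; mixed combinations -> 0)."""
    return [1 if a == b else 0 for a, b in zip(ligand, reference)]


# Made-up feature flags for four features (illustration only).
ligand    = [1, 0, 1, 0]
reference = [1, 0, 0, 1]
print(match_both_present(ligand, reference))   # [1, 0, 0, 0]
print(match_equal(ligand, reference))          # [1, 1, 0, 0]
```

The second feature shows the difference: it is absent in both molecules, so it contributes nothing under the first table but counts as agreement under the second.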
Example bond lengths from molecules that were energetically minimized with the program Sybyl were taken as the basis for the expected variations. There are cases when this approach does not lead to reasonable values. In the ligand meropenem, there is a 4-membered ring with nitrogen as a neighbour to a double-bonded carbon in the next ring. The bond between the two atoms is shorter than the upper limit for double bonds, probably due to the unusually high ring strain. Resonance phenomena, except for aromatic ring systems, are also not included in the algorithm. The variations depending on the type of atom may be included in the overall structural variations. Intervals that cover secondary bond relation distances are of course less easy to set than primary bond relation intervals. In program kol.f90, all ring features are made up by considering secondary bond distances. To handle the problem of secondary bond distance variations, rings are first suggested from different data, which are brought together and confirmed by redundancy. A structure like ecgonine

H      0.37
C(s)   0.77;  C(d)   0.67;  C(t)   0.60
N(s)   0.74;  N(d)   0.65
O(s)   0.66;  O(d)   0.57
F      0.64
S(s)   1.04;  S(d)   0.95
Cl     0.99