| 1. Title: MUSK "Clean1" database
|
| 2. Sources:
|    (a) Creators: AI Group at Arris Pharmaceutical Corporation
|        contact: David Chapman or Ajay Jain
|        Arris Pharmaceutical Corporation
|        385 Oyster Point Blvd.
|        South San Francisco, CA 94080
|        415-737-8600
|        zvona@arris.com, jain@arris.com
|    (b) Donor: Tom Dietterich
|        Department of Computer Science
|        Oregon State University
|        Corvallis, OR 97331
|        503-737-5559
|        tgd@cs.orst.edu
|    (c) Date received: September 12, 1994
|
| 3. Past Usage:
|    Dietterich, T. G., Lathrop, R. H., Lozano-Perez, T. (submitted)
|    Solving the multiple-instance problem with axis-parallel rectangles.
|    Submitted to Artificial Intelligence.
|
|    This paper compares several axis-parallel rectangle algorithms and
|    includes the following table:
|
|    Algorithm              TP   FN   FP   TN  errs  %correct  [CI]
|    iterated-discrim APR   42    5    2   43     7      92.4  [87.0--97.8]
|    GFS elim-kde APR       46    1    7   38     8      91.3  [85.5--97.1]
|    GFS elim-count APR     46    1    8   37     9      90.2  [84.2--96.3]
|    GFS all-positive APR   47    0   15   30    15      83.7  [76.2--91.2]
|    all-positive APR       36   11    7   38    18      80.4  [72.3--88.5]
|    backpropagation        45    2   21   24    23      75.0  [66.2--83.8]
|    C4.5 (pruned)          42    5   24   21    29      68.5  [40.9--61.3]
|
|    key: TP = true positives
|         FN = false negatives
|         FP = false positives
|         TN = true negatives
|         errs = errors = FN+FP
|         %correct = 10-fold cross-validation %correct.
|         CI = 95% confidence interval on proportion of correct
|              predictions.
|    For explanations of the various algorithms, see the
|    paper.
|
|    C4.5 and backprop were applied ignoring the multiple instance
|    problem (see below) during training, but obeying it during
|    testing.
|
|    This paper also gives more details on the construction of the
|    data set.
|
|    This paper also describes an artificial generator that can
|    generate data sets with statistics and properties similar to
|    this one.
|
| 4. Relevant Information:
|    This dataset describes a set of 92 molecules of which 47 are judged
|    by human experts to be musks and the remaining 45 molecules are
|    judged to be non-musks.  The goal is to learn to predict whether
|    new molecules will be musks or non-musks.  However, the 166 features
|    that describe these molecules depend upon the exact shape, or
|    conformation, of the molecule.  Because bonds can rotate, a single
|    molecule can adopt many different shapes.  To generate this data
|    set, the low-energy conformations of the molecules were generated
|    and then filtered to remove highly similar conformations.  This left
|    476 conformations.  Then, a feature vector was extracted that
|    describes each conformation.
|
|    This many-to-one relationship between feature vectors and molecules
|    is called the "multiple instance problem".  When learning a
|    classifier for this data, the classifier should classify a molecule
|    as "musk" if ANY of its conformations is classified as a musk.  A
|    molecule should be classified as "non-musk" if NONE of its
|    conformations is classified as a musk.
|
| 5. Number of Instances  476
|
| 6. Number of Attributes 168 plus the class.
|
| 7. For Each Attribute:
|
|    Attribute:           Description:
|    molecule_name:       Symbolic name of each molecule.  Musks have names such
|                         as MUSK-188.  Non-musks have names such as
|                         NON-MUSK-jp13.
|    conformation_name:   Symbolic name of each conformation.  These
|                         have the format MOL_ISO+CONF, where MOL is the
|                         molecule number, ISO is the stereoisomer
|                         number (usually 1), and CONF is the
|                         conformation number.
|    f1 through f162:     These are "distance features" along rays (see
|                         paper cited above).  The distances are
|                         measured in hundredths of Angstroms.  The
|                         distances may be negative or positive, since
|                         they are actually measured relative to an
|                         origin placed along each ray.  The origin was
|                         defined by a "consensus musk" surface that is
|                         no longer used.  Hence, any experiments with
|                         the data should treat these feature values as
|                         lying on an arbitrary continuous scale.  In
|                         particular, the algorithm should not make any
|                         use of the zero point or the sign of each
|                         feature value.
|    f163:                This is the distance of the oxygen atom in the
|                         molecule to a designated point in 3-space.
|                         This is also called OXY-DIS.
|    f164:                OXY-X: X-displacement from the designated
|                         point.
|    f165:                OXY-Y: Y-displacement from the designated
|                         point.
|    f166:                OXY-Z: Z-displacement from the designated
|                         point.
|    class:               0 => non-musk, 1 => musk
|
|    Please note that the molecule_name and conformation_name attributes
|    should not be used to predict the class.
|
| 8. Missing Attribute Values: none.
|
| 9. Class Distribution:
|    Musks:     47
|    Non-musks: 45
|

0,1.

molecule_name: MUSK-jf78,MUSK-jf67,MUSK-jf59,MUSK-jf58,MUSK-jf47,MUSK-jf46,MUSK-jf17,MUSK-j51,MUSK-j33,MUSK-f205,MUSK-f184,MUSK-f159,MUSK-f158,MUSK-f152,MUSK-344,MUSK-333,MUSK-331,MUSK-330,MUSK-323,MUSK-322,MUSK-321,MUSK-316,MUSK-315,MUSK-314,MUSK-311,MUSK-301,MUSK-293,MUSK-292,MUSK-285,MUSK-284,MUSK-273,MUSK-272,MUSK-256,MUSK-254,MUSK-246,MUSK-240,MUSK-238,MUSK-236,MUSK-228,MUSK-227,MUSK-224,MUSK-219,MUSK-213,MUSK-212,MUSK-211,MUSK-190,MUSK-188,NON-MUSK-jp13,NON-MUSK-jp10,NON-MUSK-j97,NON-MUSK-j96,NON-MUSK-j93,NON-MUSK-j90,NON-MUSK-j84,NON-MUSK-j83,NON-MUSK-j81,NON-MUSK-j148,NON-MUSK-j147,NON-MUSK-j146,NON-MUSK-j130,NON-MUSK-j129,NON-MUSK-j100,NON-MUSK-f209,NON-MUSK-f164,NON-MUSK-f161,NON-MUSK-f150,NON-MUSK-334,NON-MUSK-327,NON-MUSK-320,NON-MUSK-319,NON-MUSK-318,NON-MUSK-309,NON-MUSK-308,NON-MUSK-305,NON-MUSK-297,NON-MUSK-296,NON-MUSK-295,NON-MUSK-290,NON-MUSK-289,NON-MUSK-288,NON-MUSK-286,NON-MUSK-271,NON-MUSK-257,NON-MUSK-253,NON-MUSK-249,NON-MUSK-247,NON-MUSK-232,NON-MUSK-226,NON-MUSK-220,NON-MUSK-208,NON-MUSK-200,NON-MUSK-199.
conformation_name: 188_1+1,188_1+2,188_1+3,188_1+4,190_1+1,190_1+2,190_1+3,190_1+4,211_1+1,211_1+2,212_1+1,212_1+2,212_1+3,213_1+1,213_1+2,213_1+3,213_1+4,219_1+1,219_1+2,224_1+1,224_1+2,227_1+1,227_1+2,228_1+1,228_1+2,228_1+3,228_2+1,228_2+2,236_1+1,236_1+2,236_1+3,236_2+1,236_2+2,236_2+3,238_1+1,238_1+2,238_1+3,238_2+1,238_2+2,240_1+1,240_1+2,240_2+1,240_2+2,240_3+1,240_3+2,240_4+1,240_4+2,246_1+1,246_1+2,246_2+1,246_2+2,254_1+1,254_1+2,256_1+1,256_1+2,256_1+3,256_1+4,272_1+1,272_1+2,272_1+3,273_1+1,273_1+2,273_1+3,273_1+4,273_1+5,284_1+1,284_1+2,284_2+1,284_2+2,285_1+1,285_1+2,285_1+3,285_1+4,285_2+1,285_2+2,285_2+3,285_2+4,292_1+1,292_1+2,292_2+1,292_2+2,293_1+1,293_1+2,293_2+1,293_2+2,301_1+1,301_1+2,301_2+1,301_2+2,311_1+1,311_1+2,314_1+1,314_1+2,314_2+1,314_2+2,314_3+1,314_3+2,314_4+1,314_4+2,315_1+1,315_1+2,315_2+1,315_2+2,316_1+1,316_1+2,316_2+1,316_2+2,321_1+1,321_1+2,322_1+1,322_1+2,322_2+1,322_2+2,322_3+1,322_3+2,322_4+1,322_4+2,323_1+1,323_1+2,323_2+1,323_2+2,330_1+1,330_1+2,330_2+1,330_2+2,331_1+1,331_1+2,331_2+1,331_2+2,333_1+1,333_1+2,333_2+1,333_2+2,333_3+1,333_3+2,333_4+1,333_4+2,344_1+1,344_1+2,f152_1+1,f152_1+2,f152_1+3,f152_1+4,f158_1+1,f158_1+2,f158_1+3,f158_1+4,f159_1+1,f159_1+2,f184_1+1,f184_1+2,f184_1+3,f184_1+4,f184_2+1,f184_2+2,f184_2+3,f184_2+4,f205_1+1,f205_1+2,f205_1+3,f205_1+4,f205_2+1,f205_2+2,f205_2+3,f205_2+4,j33_1+1,j33_1+2,j51_1+1,j51_1+2,j51_2+1,j51_2+2,jf17_1+1,jf17_1+2,jf17_1+3,jf46_1+1,jf46_1+2,jf46_2+1,jf46_2+2,jf46_2+3,jf47_1+1,jf47_1+2,jf47_2+1,jf47_2+2,jf58_1+1,jf58_1+2,jf58_2+1,jf58_2+2,jf58_2+3,jf59_1+1,jf59_1+2,jf59_2+1,jf59_2+2,jf59_2+3,jf67_1+1,jf67_1+2,jf67_2+1,jf67_2+2,jf67_3+1,jf67_3+2,jf67_4+1,jf67_4+2,jf78_1+1,jf78_1+2,jf78_1+3,jf78_2+1,jf78_2+2,jf78_2+3,199_1+1,199_1+2,199_1+3,199_1+4,200_1+1,200_1+2,200_1+3,200_1+4,208_1+1,208_1+2,220_1+1,220_1+2,220_1+3,220_1+4,226_1+1,226_1+2,232_1+1,232_1+2,232_1+3,232_2+1,232_2+2,232_3+1,232_3+2,232_4+1,232_4+2,247_1+1,247_1+2,249_1+1,249_1+2,253_1+1,253_1+2,257_1+1,257_1+2
,257_1+3,257_1+4,271_1+1,271_1+2,286_1+1,286_1+2,286_1+3,286_1+4,286_2+1,286_2+2,286_2+3,286_2+4,286_2+5,288_1+1,288_1+2,288_1+3,288_1+4,288_1+5,288_1+6,288_1+7,288_1+8,288_2+1,288_2+2,288_2+3,288_2+4,288_2+5,288_2+6,288_2+7,288_2+8,288_3+1,288_3+2,288_3+3,288_3+4,288_3+5,288_3+6,288_3+7,288_3+8,288_4+1,288_4+2,288_4+3,288_4+4,288_4+5,288_4+6,288_4+7,288_4+8,289_1+1,289_1+2,289_1+3,289_1+4,290_1+1,290_1+2,295_1+1,295_1+2,296_1+1,296_1+2,296_2+1,296_2+2,297_1+1,297_1+2,297_2+1,297_2+2,305_1+1,305_1+2,308_1+1,308_1+2,309_1+1,309_1+2,318_1+1,318_1+2,319_1+1,319_1+2,319_2+1,319_2+2,320_1+1,320_1+2,327_1+1,327_1+2,327_2+1,327_2+2,334_1+1,334_1+2,f150_1+1,f150_1+2,f161_1+1,f161_1+2,f164_1+1,f164_1+2,f209_1+1,f209_1+2,f209_1+3,f209_1+4,f209_2+1,f209_2+2,f209_2+3,f209_2+4,j100_1+1,j100_1+2,j100_2+1,j100_2+2,j129_1+1,j129_1+2,j129_2+1,j129_2+2,j129_3+1,j129_3+2,j129_4+1,j129_4+2,j130_1+1,j130_1+2,j146_1+1,j146_1+10,j146_1+2,j146_1+3,j146_1+4,j146_1+5,j146_1+6,j146_1+7,j146_1+8,j146_1+9,j146_2+1,j146_2+10,j146_2+2,j146_2+3,j146_2+4,j146_2+5,j146_2+6,j146_2+7,j146_2+8,j146_2+9,j146_3+1,j146_3+10,j146_3+2,j146_3+3,j146_3+4,j146_3+5,j146_3+6,j146_3+7,j146_3+8,j146_3+9,j146_4+1,j146_4+10,j146_4+2,j146_4+3,j146_4+4,j146_4+5,j146_4+6,j146_4+7,j146_4+8,j146_4+9,j147_1+1,j147_1+10,j147_1+2,j147_1+3,j147_1+4,j147_1+5,j147_1+6,j147_1+7,j147_1+8,j147_1+9,j147_2+1,j147_2+10,j147_2+2,j147_2+3,j147_2+4,j147_2+5,j147_2+6,j147_2+7,j147_2+8,j147_2+9,j147_3+1,j147_3+10,j147_3+2,j147_3+3,j147_3+4,j147_3+5,j147_3+6,j147_3+7,j147_3+8,j147_3+9,j147_4+1,j147_4+10,j147_4+2,j147_4+3,j147_4+4,j147_4+5,j147_4+6,j147_4+7,j147_4+8,j147_4+9,j148_1+1,j148_1+2,j81_1+1,j81_1+2,j83_1+1,j83_1+2,j84_1+1,j84_1+2,j90_1+1,j90_1+2,j90_1+3,j90_1+4,j93_1+1,j93_1+2,j93_1+3,j93_1+4,j93_2+1,j93_2+2,j93_2+3,j93_2+4,j93_3+1,j93_3+2,j93_3+3,j93_3+4,j93_4+1,j93_4+2,j93_4+3,j93_4+4,j96_1+1,j96_1+2,j96_2+1,j96_2+2,j97_1+1,j97_1+2,j97_2+1,j97_2+2,jp10_1+1,jp10_1+2,jp10_1+3,jp13_1+1,jp13_1+2,jp13_1+3,jp13_1+4,jp13_2+1,jp13_2+2,
jp13_2+3,jp13_2+4.
f1: continuous.
f2: continuous.
f3: continuous.
f4: continuous.
f5: continuous.
f6: continuous.
f7: continuous.
f8: continuous.
f9: continuous.
f10: continuous.
f11: continuous.
f12: continuous.
f13: continuous.
f14: continuous.
f15: continuous.
f16: continuous.
f17: continuous.
f18: continuous.
f19: continuous.
f20: continuous.
f21: continuous.
f22: continuous.
f23: continuous.
f24: continuous.
f25: continuous.
f26: continuous.
f27: continuous.
f28: continuous.
f29: continuous.
f30: continuous.
f31: continuous.
f32: continuous.
f33: continuous.
f34: continuous.
f35: continuous.
f36: continuous.
f37: continuous.
f38: continuous.
f39: continuous.
f40: continuous.
f41: continuous.
f42: continuous.
f43: continuous.
f44: continuous.
f45: continuous.
f46: continuous.
f47: continuous.
f48: continuous.
f49: continuous.
f50: continuous.
f51: continuous.
f52: continuous.
f53: continuous.
f54: continuous.
f55: continuous.
f56: continuous.
f57: continuous.
f58: continuous.
f59: continuous.
f60: continuous.
f61: continuous.
f62: continuous.
f63: continuous.
f64: continuous.
f65: continuous.
f66: continuous.
f67: continuous.
f68: continuous.
f69: continuous.
f70: continuous.
f71: continuous.
f72: continuous.
f73: continuous.
f74: continuous.
f75: continuous.
f76: continuous.
f77: continuous.
f78: continuous.
f79: continuous.
f80: continuous.
f81: continuous.
f82: continuous.
f83: continuous.
f84: continuous.
f85: continuous.
f86: continuous.
f87: continuous.
f88: continuous.
f89: continuous.
f90: continuous.
f91: continuous.
f92: continuous.
f93: continuous.
f94: continuous.
f95: continuous.
f96: continuous.
f97: continuous.
f98: continuous.
f99: continuous.
f100: continuous.
f101: continuous.
f102: continuous.
f103: continuous.
f104: continuous.
f105: continuous.
f106: continuous.
f107: continuous.
f108: continuous.
f109: continuous.
f110: continuous.
f111: continuous.
f112: continuous.
f113: continuous.
f114: continuous.
f115: continuous.
f116: continuous.
f117: continuous.
f118: continuous.
f119: continuous.
f120: continuous.
f121: continuous.
f122: continuous.
f123: continuous.
f124: continuous.
f125: continuous.
f126: continuous.
f127: continuous.
f128: continuous.
f129: continuous.
f130: continuous.
f131: continuous.
f132: continuous.
f133: continuous.
f134: continuous.
f135: continuous.
f136: continuous.
f137: continuous.
f138: continuous.
f139: continuous.
f140: continuous.
f141: continuous.
f142: continuous.
f143: continuous.
f144: continuous.
f145: continuous.
f146: continuous.
f147: continuous.
f148: continuous.
f149: continuous.
f150: continuous.
f151: continuous.
f152: continuous.
f153: continuous.
f154: continuous.
f155: continuous.
f156: continuous.
f157: continuous.
f158: continuous.
f159: continuous.
f160: continuous.
f161: continuous.
f162: continuous.
f163: continuous.
f164: continuous.
f165: continuous.
f166: continuous.
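|
| Illustration (not part of the original description).  The rule stated
| under "Relevant Information", that a molecule is a musk if ANY of its
| conformations is classified as a musk and a non-musk if NONE is, can be
| sketched in Python as below.  The function names bag_predictions and
| summarize_counts are hypothetical, and the per-conformation
| predictions are made up for the example.  The second function only
| restates the key of the results table (errs = FN+FP, %correct as the
| proportion of correct predictions) and checks it against the
| iterated-discrim APR row.

```python
from collections import defaultdict


def bag_predictions(molecule_names, instance_preds):
    """Collapse per-conformation predictions (0 or 1) to one label per molecule."""
    grouped = defaultdict(list)
    for name, pred in zip(molecule_names, instance_preds):
        grouped[name].append(pred)
    # ANY rule: musk (1) if at least one conformation is predicted musk,
    # non-musk (0) if none is.
    return {name: int(any(preds)) for name, preds in grouped.items()}


def summarize_counts(tp, fn, fp, tn):
    """Return (errs, %correct) following the key of the results table."""
    errs = fn + fp
    pct_correct = round(100.0 * (tp + tn) / (tp + fn + fp + tn), 1)
    return errs, pct_correct


# Made-up per-conformation predictions for two molecules named in this file.
names = ["MUSK-211", "MUSK-211", "NON-MUSK-208", "NON-MUSK-208"]
preds = [0, 1, 0, 0]
molecule_labels = bag_predictions(names, preds)
# MUSK-211 -> 1 (one conformation fired), NON-MUSK-208 -> 0 (none fired)

# Counts from the iterated-discrim APR row: TP=42 FN=5 FP=2 TN=43.
errs, pct = summarize_counts(42, 5, 2, 43)
# errs == 7 and pct == 92.4, matching that row of the table
```

| Grouping is done on molecule_name only to aggregate predictions; as
| noted above, neither name attribute should be used as a predictive
| feature.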