Two data sets are compared to illustrate a basic problem frequently encountered in QSAR/QSPR studies. The first is a large sample with high diversity, covering a wide variety of molecular structures; the second is a small sample with low diversity, made up of structurally similar molecules selected by molecular structural similarity. For the first data set, because the data are widely scattered, a global model cannot fully describe the relationship between structural features and properties, and the regression results are poor (test-set correlation coefficient Q2 = 0.68, root-mean-square error of prediction RMSEP = 40.65). The recently proposed local lazy regression (LLR) method was tried as a remedy, but the local model actually performed worse (Q2 = 0.60, RMSEP = 45.05). When LLR was then used to build local models on the data set with low and relatively uniform diversity (structurally similar molecules), the prediction results (Q2 = 0.90, RMSEP = 24.66) were clearly better than those of the global model (Q2 = 0.86, RMSEP = 29.37).
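The contrast between a global model and local lazy regression can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the study: the abstract does not give the descriptors, the similarity measure, the neighbourhood size, or the exact Q2 formula. Here every function name is hypothetical, neighbours are chosen by Euclidean distance in descriptor space, the neighbourhood size `k` is an arbitrary choice, each local model is an ordinary linear fit, and Q2 is taken as one minus the ratio of the prediction error sum of squares to the test-set variance sum of squares.

```python
import numpy as np


def fit_linear(X, y, ridge=1e-8):
    """Least-squares fit with an intercept.

    A tiny ridge term on the slopes (assumed here for numerical
    stability) keeps the normal equations solvable for small
    neighbourhoods.
    """
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    reg = ridge * np.eye(A.shape[1])
    reg[0, 0] = 0.0  # do not penalise the intercept
    return np.linalg.solve(A.T @ A + reg, A.T @ y)


def predict_linear(coef, X):
    """Apply fitted coefficients (intercept first) to descriptor rows."""
    return coef[0] + X @ coef[1:]


def predict_global(X_train, y_train, X_test):
    """Global model: one linear fit on the whole training set."""
    return predict_linear(fit_linear(X_train, y_train), X_test)


def predict_llr(X_train, y_train, X_test, k=10):
    """Local lazy regression sketch: one linear fit per query compound,
    built only from its k nearest training compounds (Euclidean
    distance is an assumption, not taken from the source)."""
    preds = np.empty(X_test.shape[0])
    for i, x in enumerate(X_test):
        dist = np.linalg.norm(X_train - x, axis=1)
        idx = np.argsort(dist)[:k]
        coef = fit_linear(X_train[idx], y_train[idx])
        preds[i] = predict_linear(coef, x[None, :])[0]
    return preds


def rmsep(y_true, y_pred):
    """Root-mean-square error of prediction on a test set."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def q2(y_true, y_pred):
    """Q2 as 1 - PRESS / SS about the test-set mean (assumed definition)."""
    press = np.sum((y_true - y_pred) ** 2)
    ss = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - press / ss)


if __name__ == "__main__":
    # Synthetic, densely sampled data standing in for a low-diversity set.
    rng = np.random.default_rng(0)
    X_train = rng.uniform(0.0, 6.0, size=(200, 1))
    y_train = np.sin(X_train[:, 0])
    X_test = np.linspace(0.5, 5.5, 50)[:, None]
    y_test = np.sin(X_test[:, 0])
    for name, pred in [
        ("global", predict_global(X_train, y_train, X_test)),
        ("LLR", predict_llr(X_train, y_train, X_test, k=10)),
    ]:
        print(name, "Q2 =", q2(y_test, pred), "RMSEP =", rmsep(y_test, pred))
```

In this sketch the local fit only helps when each query has enough close neighbours, which is consistent with the abstract's observation that LLR improved on the global model for the structurally similar set but not for the highly diverse one. The synthetic data in the demo are purely illustrative and are unrelated to the reported Q2 and RMSEP values.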