Question
I'm trying to build a model that predicts the Caco-2 coefficient of a molecule from its SMILES string representation.
My solution is based on this example.
Since I need to predict a real value, I use a RandomForestRegressor.
With some molecules added to the code manually, everything works (although the predictions themselves are wildly wrong):
from rdkit import Chem, DataStructs #all the nice chemical stuff, ConvertToNumpyArray
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor #our regressor
from sklearn.model_selection import train_test_split
import numpy as np
# generate molecules
m1 = Chem.MolFromSmiles('Cc1ccc(NNC(=O)c2ccc(CN3C(=O)CCC3=O)cc2)cc1Cl')
m2 = Chem.MolFromSmiles('Nc1ccc(C(=O)N2CCN(c3cc[nH+]cc3)CC2)cc1[N+](=O)[O-]')
m3 = Chem.MolFromSmiles('CN(Cc1[nH+]ccn1C)C(=O)CCc1ccsc1')
m4 = Chem.MolFromSmiles('COc1ccc([N+](=O)[O-])cc1C(=O)NCCC[NH+]1CCCC1')
m5 = Chem.MolFromSmiles('C[NH+]1CCN(S(=O)(=O)c2ccc(NC(=O)Cc3ccc([N+](=O)[O-])cc3)cc2)CC1')
m6 = Chem.MolFromSmiles('CCc1ccc(S(=O)(=O)Nc2ccc(NC(C)=O)cc2)cc1')
m7 = Chem.MolFromSmiles('O=C(COC(=O)c1ccc(S(=O)(=O)N2CCCCC2)cc1)c1ccc(F)cc1')
m8 = Chem.MolFromSmiles('COC(=O)c1ccc(S(=O)(=O)NCc2csc3ccc(Cl)cc23)n1C')
m9 = Chem.MolFromSmiles('CCC(C)N1C(=O)C(=CNc2ccccc2C(=O)[O-])C(=O)NC1=S')
m10 = Chem.MolFromSmiles('Cn1c(CNC(=O)C(=O)Nc2cccc(Cl)c2Cl)nc2ccccc21')
mols = [m1, m2, m3, m4, m5 ,m6, m7, m8, m9, m10]
# generate fingerprints: Morgan fingerprint with radius 2
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]
# convert the RDKit explicit vectors into numpy arrays
np_fps = []
for fp in fps:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    np_fps.append(arr)
# get a random forest regressor with 100 trees
rndf_rgsr = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1, warm_start=False)
#train the random forest
#ys are the caco-2 coefficients we wish to predict
ys_fit = [379.724, 101.644, 3154.167, 97.437, 21.152, 569.981, 150.55, 690.843, 78.866, 984.371]
rndf_rgsr.fit(np_fps, ys_fit)
#use the random forest to predict a new molecule
m_new = Chem.MolFromSmiles('Cc1n[nH]c(Cc2ccc(-n3cnnc3)cc2)n1') #actual caco2 is 410.037
fp = np.zeros((1,))
DataStructs.ConvertToNumpyArray(AllChem.GetMorganFingerprintAsBitVect(m_new, 2), fp)
print(rndf_rgsr.predict((fp,)))
But when I try to work with a large number of molecules imported from a file, whose lines look like Cc1ccc(NNC(=O)c2ccc(CN3C(=O)CCC3=O)cc2)cc1Cl,379.724, using the following code:
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor #our regressors
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from pandas import DataFrame, read_csv
#import our data from file
df = pd.read_csv('test_db.csv', delimiter=',' ) #a pandas DataFrame
#get the values of variables and targets
X = df["smiles"].values
y = df["Caco2"].values
#split our data set into two parts
x_train, x_eval, y_train, y_eval = train_test_split(X, y, test_size = 0.2, random_state = 42)
#convert our smiles string into actual molecular graphs
mols_ready_train = [Chem.MolFromSmiles(x_train[i]) for i in range(len(x_train))]
mols_ready_eval = [Chem.MolFromSmiles(x_eval[i]) for i in range(len(x_eval))]
# generate fingerprints: Morgan fingerprint with radius 2
fing_prints_train = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols_ready_train]
fing_prints_eval = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols_ready_eval]
# convert the RDKit explicit vectors into numpy arrays
np_fps_train = []
for fp in fing_prints_train:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    np_fps_train.append(arr)
np_fps_eval = []
for fp in fing_prints_eval:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    np_fps_eval.append(arr)
# get a random forest regressor with 1000 trees
rndf_rgsr = RandomForestRegressor(n_estimators=1000, random_state=42, n_jobs=-1, warm_start=False)
#train our random forest regressor
rndf_rgsr.fit(np_fps_train, y_train)
# use the random forest to predict a new molecule
m_new = Chem.MolFromSmiles('Cc1n[nH]c(Cc2ccc(-n3cnnc3)cc2)n1')
fp = np.zeros((1,))
DataStructs.ConvertToNumpyArray(AllChem.GetMorganFingerprintAsBitVect(m_new, 2), fp)
print(rndf_rgsr.predict((fp,)))
it crashes with the following error:
  File "/home/me/predictor.py", line 55, in <module>
    rndf_rgsr.fit(np_fps_train, y_train)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/forest.py", line 248, in fit
    y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I've checked that no vectors I use contain nans or infs. The fingerprints used here are 2048 bits long, but I doubt they're the source of the problem.
Something is going wrong with validation, but I can't really see what.
Could you provide any hints?
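One way to narrow this down: the failing call in the traceback is check_array(y, ...), so the targets are worth checking on their own, separately from the fingerprints. The helper below is only a sketch; report_non_finite and the sample values are made up for illustration and are not part of my script.

```python
import numpy as np

def report_non_finite(values):
    """Return the positions of entries that are NaN or +/-inf."""
    arr = np.asarray(values, dtype=float)
    return np.flatnonzero(~np.isfinite(arr))

# Hypothetical targets; the NaN stands in for a missing coefficient.
y_example = np.array([379.724, 101.644, np.nan, 97.437])
bad_positions = report_non_finite(y_example)
print(bad_positions)
```

Calling report_non_finite(y_train) just before fit() would list the positions of any offending targets.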
ETA: test_db.csv has 50,000 lines. I created a tiny_db.csv with only 10 lines, and on it my model works (that is, its predictions are wrong, but it runs).
It also works with a 100-line file, but with a 1,000-line file it crashes with the above-mentioned error.
Further experiments show that 250 lines work, but 500 don't.
ETA: the first 250 lines work, but the next 250 lines (250 to 500) don't.
With more than 100 lines read, print(y_train.mean(), y_train.min(), y_train.max()) returns (nan,nan,nan).
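Since the statistics of y_train come back as nan, it may help to find which rows of the file produce a missing target. The following is a minimal sketch under assumptions: the column names smiles and Caco2 are the ones used above, but the CSV text is a made-up stand-in for test_db.csv, and the "unknown" entry is invented to show a non-numeric field.

```python
import io
import pandas as pd

# Hypothetical stand-in for test_db.csv: row 2 has an empty Caco2 field,
# row 3 has a non-numeric one.
csv_text = """smiles,Caco2
Cc1ccc(NNC(=O)c2ccc(CN3C(=O)CCC3=O)cc2)cc1Cl,379.724
CCc1ccc(S(=O)(=O)Nc2ccc(NC(C)=O)cc2)cc1,569.981
CN(Cc1[nH+]ccn1C)C(=O)CCc1ccsc1,
Cn1c(CNC(=O)C(=O)Nc2cccc(Cl)c2Cl)nc2ccccc21,unknown
"""
df = pd.read_csv(io.StringIO(csv_text))

# Coerce the target column to numbers; anything unparsable becomes NaN.
caco2 = pd.to_numeric(df["Caco2"], errors="coerce")

# Rows whose target is missing or could not be parsed.
bad_rows = df[caco2.isna()]
print(bad_rows)

# Keep only rows with a usable target, stored as float64.
clean = df[caco2.notna()].copy()
clean["Caco2"] = caco2[caco2.notna()]
```

If bad_rows is non-empty for the real file, those rows would explain the nan statistics without any arithmetic error being involved.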
All in all, I strongly suspect the issue comes from pandas.DataFrame.values, which upcasts my coefficients to float64, which leads to arithmetic errors, which in turn cause the validation procedures to crash.
I think I'd be better off using the Python csv module instead of pandas.read_csv in conjunction with DataFrame.values.
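A rough sketch of what the csv-module route could look like, assuming Python 3 (for math.isfinite) and a header row with the smiles and Caco2 columns. read_smiles_targets and the sample data are hypothetical, not code I have run against test_db.csv. Python's built-in float is itself a 64-bit double, so this alone would not avoid float64; what it adds is an explicit record of the lines whose target cannot be used.

```python
import csv
import io
import math

def read_smiles_targets(handle, smiles_col="smiles", target_col="Caco2"):
    """Read SMILES/target pairs, skipping rows with an unusable target.

    Returns (smiles, targets, skipped), where skipped holds the 1-based
    file line numbers of the rows that were dropped.
    """
    smiles, targets, skipped = [], [], []
    reader = csv.DictReader(handle)
    for row in reader:
        try:
            value = float(row[target_col])
        except (TypeError, ValueError):
            # Missing or non-numeric field.
            skipped.append(reader.line_num)
            continue
        if not math.isfinite(value):
            # Parsed, but as nan or inf.
            skipped.append(reader.line_num)
            continue
        smiles.append(row[smiles_col])
        targets.append(value)
    return smiles, targets, skipped

# Hypothetical sample: line 4 has an empty target, line 5 a literal "nan".
sample = io.StringIO(
    "smiles,Caco2\n"
    "Cc1ccc(NNC(=O)c2ccc(CN3C(=O)CCC3=O)cc2)cc1Cl,379.724\n"
    "CCc1ccc(S(=O)(=O)Nc2ccc(NC(C)=O)cc2)cc1,569.981\n"
    "CN(Cc1[nH+]ccn1C)C(=O)CCc1ccsc1,\n"
    "Cn1c(CNC(=O)C(=O)Nc2cccc(Cl)c2Cl)nc2ccccc21,nan\n"
)
smiles, targets, skipped = read_smiles_targets(sample)
print(len(smiles), targets, skipped)
```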
Source: https://stackoverflow.com/questions/44864729/valueerror-when-doing-validation-with-random-forests