NMR (Nuclear Magnetic Resonance) spectroscopy is one of the most popular tools to elucidate chemical structures of an unknown molecule in solution. In practical application, NMR can also be used to validate new structures because it provides information about scalar coupling, which is an indirect interaction of the nuclei of atoms in a magnetic field.
These scalar coupling, or J coupling constants obtained from an NMR spectrum contain information about relative bond distances and angles in a molecule. This is useful for determining the connectivity of atoms in a molecule.
For the seasoned chemist, confirming a basic chemical structure from the peaks on an NMR spectra alone is definitely possible, such as the 1H NMR spectrum for 1,1,2-trichloroethane below. The peaks below correspond to the hydrogens in the molecule and can easily be assigned. In this case, J coupling constants weren’t necessary to validate the structure.
But what happens when you are dealing with more complex molecules or are working to synthesize new ones where you can’t easily determine what splitting patterns mean? How can you validate that what you’ve isolated is what you think it is?
This is where J coupling constants come in.
If you know what J coupling constant you’re looking for prior to getting the NMR spectrum, you can use that number as a confirmation that you’ve isolated your target molecule, or at the very least, confirm that you’re on the right track. Incorrect structures can be reported if there is no way to validate the coupling constants with the results found.
Therefore, software tools that can predict these constants accurately will be useful in validation of structures in practice.
In a collaboration with Chance Dare, this project aims to predict scalar coupling constants using machine learning models given known properties of molecules so that they can be used in the application of research in chemistry.
We used this dataset that is part of the CHAMPS (Chemistry and Mathematics in Phase Space) Kaggle competition. The train dataset contained 4,658,147 scalar coupling observations of 85,003 unique molecules, and the test dataset contained 2,505,542 scalar coupling observations of 45,772 unique molecules. These molecules contained only the atoms: carbon (C), hydrogen (H), nitrogen (N), fluorine (F), and oxygen (O). There were 8 different types of scalar coupling: 1JHC, 1JHN, 2JHH, 2JHC, 2JHN, 3JHH, 3JHC, 3JHH. Fluorine coupling was not represented in this dataset.
Looking at the data, we can see that the train and test sets had relatively even distributions of scalar coupling type and of the number of atoms present in each dataset. This tells us that the train data is a good enough representation of the test data in order to create a model that predicts the scalar coupling constants.
What are the different types of scalar coupling constants?
J coupling is an indirect interaction between the nuclear spins of 2 atoms in a magnetic field. The number that comes before the J in the J coupling types (1J, 2J, 3J) denotes the number of bonds between the atoms that are coupling. So 1J, 2J, 3J coupling will have 1, 2, and 3 bonds between the atoms, respectively.
If we look at the distribution of the distance between atoms in the different types of coupling below, we can see that 1J has the lowest distance between atoms, and 3J has the highest distance between atoms, with 2J somewhere in the middle. The increase in bonds between atoms is observed as an increase in the distance between them. The distance feature contains information about the arrangement of atoms in space which could help a model predict the J coupling constant more accurately.
The distribution of the scalar coupling constant values isolated by type also reveals that there are clear differences in the ranges that these values appear in. This gives us the insight that different molecular properties affect each type of J coupling differently and unique models should be used for all 8 coupling types found in the dataset.
What factors affect the scalar coupling constant?
Understanding what properties of molecules affect the scalar coupling constant is the key to training a model that can accurately predict these values for future experiments.
Some properties that affect the scalar coupling constant are:
- dihedral angles — the angle between two intersecting planes
- substituent electronegativities — the tendency of an atom to attract a shared pair of electrons)
- hybridization of atoms — contains information about the number of atoms bonded and the coordination of the molecule
- ring strain — instability in a molecule due to abnormal bonding angles found in a ring
Since our dataset didn’t explicitly include this information, we needed to engineer features that would bring information about the factors impacting coupling values into the data set.
Our dataset was extremely limited in features, so our model was going to rely heavily on engineering a lot of new features. The important consideration here was understanding what could potentially be useful and working from principles. For instance, knowing that hybridization had an impact on the J coupling constant, we understood that the number of bonds on each atom would be an important feature to engineer for our model.
We engineered ~60 features — some based off of basic math and some with pretty dense calculations. A few of these features included:
- distance — the distance between the given cartesian points of each atom
- n_bonds — the number of bonds on a specific atom
- mu — the square root of the sum of the squared Cartesian values
- delta_en — the difference between the electronegativites of two atoms
J type subsetting + Train/Validation splitting
In order to build 8 models, first we created subsets of the data containing each J type: 1JHC, 1JHN, 2JHH, 2JHC, 2JHN, 3JHH, 3JHC, 3JHH.
Then, we further split each subset into a train/val set. Since our training dataset was so big (4M+ observations), we opted for a 75/25 train/val split instead of doing a cross validation to save time in the initial phases of building a model, even though a cross validation would likely produce better results (Note: This is an ongoing project and cross validation will be used to validate the model when it is closer to final).
The training data includes more than 1 observation under each
molecule_name. Because of this nature, we had to be considerate to not leak data from train molecules into the validation set. We did a
train_test_split() on the
molecule_name instead, and created train/val subsets with the data from each J type — now 16 subsets total!
Our hypothesis was that the coupling constant from each J type would be impacted differently by each feature, so we created 8 different models to get the most accurate predictions possible.
We first started with a simple Linear Regression for our baseline model, but needed something with decision trees for better accuracy. We also confirmed that using 8 models was significantly better than 1 by running a test model with all of the J types in the data for comparison.
Ultimately, we used a LightGBM Regressor model for all 8 J types. We then tuned the hyper parameters, by running the LGBM Regressor with
RandomizedSearchCV to give us the best score we could get.
Below you can find the summary of the modeling, which includes each model broken down by its respective validation score and permutation importances:
Validation + Score
Validation error was calculated based on the log of the mean absolute error (MAE). The function below takes all models into account and returns the average score:
groups = val['type'] def group_lmae(y_true, y_pred, groups, floor=1e-9): maes = (y_true - y_pred).abs().groupby(groups).mean() return np.log(maes.map(lambda x: max(x, floor))).mean()
We also calculated separate validation scores for each model to better understand how each J type model scored. If one model was scored significantly lower than the other, this would give us the insight to potentially explore other modeling options for that coupling type. To calculate the validation score for separate J types, we used:
np.log((val_pred — y_val).abs().mean())
For this metric, the MAE for any group has a floor of 1e-9, which means the best possible score is -20.732. The best publicly available score on the Kaggle competition leaderboard is currently -3.069.
Results + Discussion
Using data that included known molecular properties of molecules, we created machine learning models to predict the scalar coupling constant of a pair of atoms. 8 different LightGBM Regressor models were used to predict the target for each type of coupling: 1JHC, 1JHN, 2JHH, 2JHC, 2JHN, 3JHH, 3JHC, 3JHH, resulting in a final score of -0.64.
We engineered ~60 features to help train our models accurately and improve the validation scores. Some features were very helpful for our models, some not so much, and the ones that did not impact the score at all were discarded. We also found that over cluttering the data with bunch of features could also hurt the score, so a few of those features were removed as well.
The validation scores for each J coupling type were:
- 1JHC : 0.258
- 1JHN : -0.388
- 2JHH : -1.179
- 2JHC : -0.352
- 2JHN : -0.973
- 3JHH : -0.991
- 3JHC : -0.259
- 3JHN : -1.242
The model predicting the 3JHN coupling constants performed the best with a validation score of -1.242, and the model predicting the 1JHC coupling constants performed the worst with a validation score of 0.25.
The significant differences in validation scores between models is due to features that are unaccounted for in our dataset that likely have a large effect on that specific type of coupling. To improve those scores further, some features we can try engineering are ones that take dipole interactions, magnetic shielding, and potential energy into account.
All of the models had different permutation importances depending on the type of coupling. Some of the most common features with high importance for each model were the distance, mulliken_charges, and both of the bond_lengths (x and y).
Below are two shapely plots showing two different predictions. The one on top is a prediction for molecule
dsgdb9nsd_133884 with 1JHC, the worst model, and the one on the bottom is for molecule
dsgdb9nsd_105770 with 3JHN, the best model. The 1JHC model predicted 90.14 when the actual value was 99.69, resulting in an error of 9.55. The 3JHN model predicted 0.77, which was way closer to the actual value of 0.06, and only had an error of 0.71.
Data + Code
The data for this project was provided by CHAMPS (Chemistry and Mathematics in Phase Space) for a Kaggle competition. All code was written in python and visualizions were created using Matplotlib, ASE, Eli5, and Seaborn libraries. You find the notebooks to engineer features, create models, and visualizations in this github repo.
Since this project is part of an ongoing Kaggle competition we will continue to work toward improving the models until the competition deadline of August 28, 2019. Our team is currently in the top 50% on the public leaderboard.