Activity 1: Working with molecular data
In this activity, we will learn about the different representations of molecules.
Molecules as a graph
Let’s import rdkit
and set-up a few things to make structures look nice in notebooks
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.rdBase import BlockLogs
from rdkit.DataStructs.cDataStructs import TanimotoSimilarity
import numpy as np
# turn off rdkit warnings
block = BlockLogs()
# use SVGs and make molecules fill image rather than all be same scale
Draw.IPythonConsole.ipython_useSVG = True
Draw.IPythonConsole.drawOptions.drawMolsSameScale = False
This block allows us to search for molecules by name and get molecule record from pubchem
import requests
def get_cids(text):
"""
Search pubchem and return best matching record
"""
url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/cids/TXT'.format(text)
response = requests.get(url)
cids = response.text.split()
if len(cids) == 0:
return None
else:
return cids
def get_record(cid):
"""
Get pubchem record for a given cid and returns molecule as rdkit
"""
url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{}/record/SDF'.format(cid)
response = requests.get(url)
mol = Chem.MolFromMolBlock(response.text)
return mol
def get_molecule(text, n_results=1):
"""
Search pubchem and return best matching record
"""
cids = get_cids(text)
if cids is None:
return None
else:
if n_results == 1:
return get_record(cids[0])
else:
return [get_record(cid) for cid in cids[:n_results]]
get_molecule('buckyball')
get_molecule('octanol')
The code below shows how we can access atoms and covalent bonds – which will be useful to make graphs
mol = get_molecule('toluene')
mol
for a in mol.GetAtoms():
print(f'Atom {a.GetIdx()} with atomic number {a.GetAtomicNum()}')
neighbors = [n.GetIdx() for n in a.GetNeighbors()]
print(f'\tBonded to {" ".join([str(n) for n in neighbors])}')
Atom 0 with atomic number 6
Bonded to 1 2 3
Atom 1 with atomic number 6
Bonded to 0 4
Atom 2 with atomic number 6
Bonded to 0 5
Atom 3 with atomic number 6
Bonded to 0
Atom 4 with atomic number 6
Bonded to 1 6
Atom 5 with atomic number 6
Bonded to 2 6
Atom 6 with atomic number 6
Bonded to 4 5
Can you revise the code above to build an adjacency matrix and node feature vector?
Molecules as line notation
mol = get_molecule('methanol')
smiles = Chem.MolToSmiles(mol)
print('SMILES:', smiles)
SMILES: CO
import selfies as sf
selfies = sf.encoder(smiles)
print('SELFIES:', selfies)
SELFIES: [C][O]
inchi = Chem.MolToInchi(mol)
print('InChi:', inchi)
InChi: InChI=1S/CH4O/c1-2/h2H,1H3
Molecules as coordinates
mol = get_molecule('aspirin')
mol
Chem.AllChem.EmbedMolecule(mol)
0
c = mol.GetConformer()
c.GetPositions()
array([[ 1.26894284, 0.20805596, 0.74563655],
[ 0.24971495, 2.28300294, 0.05136469],
[-1.92553153, 2.15845285, -0.63762849],
[ 1.92208607, -0.23271318, -1.37290903],
[ 0.11362331, -0.5577966 , 0.50569099],
[-0.98861265, 0.13135426, 0.02525759],
[ 0.0754972 , -1.90284483, 0.73139822],
[-2.14498962, -0.5729312 , -0.22619491],
[-1.06902025, -2.61813247, 0.48577564],
[-2.15749664, -1.92553194, 0.00915115],
[-0.90866288, 1.57794548, -0.20318262],
[ 2.16660558, 0.32993954, -0.28867706],
[ 3.39784362, 1.12119919, -0.09723405]])
Comparing molecules
The most common way to compare molecules is Morgan Fingerprints — also known as Extended Connectivity FingerPrint (ECFP). These are vectors that indicate presence of specific substructures. Pairwise similarity is then computed with Tanimoto similarity.
m1 = get_molecule('aspirin')
m2 = get_molecule('benzene')
fp1 = AllChem.GetMorganFingerprint(m1, 2)
fp2 = AllChem.GetMorganFingerprint(m2, 2)
dist = TanimotoSimilarity(fp1, fp2)
print('similarity', dist)
---------------------------------------------------------------------------
ArgumentError Traceback (most recent call last)
Input In [13], in <cell line: 4>()
2 m2 = get_molecule('benzene')
3 fp1 = AllChem.GetMorganFingerprint(m1, 2)
----> 4 fp2 = AllChem.GetMorganFingerprint(m2, 2)
5 dist = TanimotoSimilarity(fp1, fp2)
6 print('similarity', dist)
ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors.GetMorganFingerprint(NoneType, int)
did not match C++ signature:
GetMorganFingerprint(RDKit::ROMol mol, unsigned int radius, boost::python::api::object invariants=[], boost::python::api::object fromAtoms=[], bool useChirality=False, bool useBondTypes=True, bool useFeatures=False, bool useCounts=True, boost::python::api::object bitInfo=None, bool includeRedundantEnvironments=False)
Finding nearby molecules
One of the common strategies in modern methods is to generate a local neighborhood around a specific structure. We can do this using exmol
via the STONED method.
import exmol
smiles = Chem.MolToSmiles(m1)
near_smi, near_tanimoto = exmol.run_stoned(smiles)
# sort them by tanimoto
descend_order = np.argsort(near_tanimoto)[::-1]
near_smi = [near_smi[i] for i in descend_order]
We can now draw a few of them to see what they look like.
near_mol = [Chem.MolFromSmiles(s) for s in near_smi]
Draw.MolsToGridImage(near_mol, molsPerRow=5)