Activity 1: Working with molecular data

treachery of images

In this activity, we will learn about the different representations of molecules.

Molecules as a graph

Let’s import rdkit and set-up a few things to make structures look nice in notebooks

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.rdBase import BlockLogs
from rdkit.DataStructs.cDataStructs import TanimotoSimilarity 
import numpy as np

# turn off rdkit warnings
block = BlockLogs()

# use SVGs and make molecules fill image rather than all be same scale
Draw.IPythonConsole.ipython_useSVG = True
Draw.IPythonConsole.drawOptions.drawMolsSameScale = False

This block allows us to search for molecules by name and get molecule record from pubchem

import requests

def get_cids(text):
    """
    Search pubchem and return best matching record
    """  
    url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/cids/TXT'.format(text)
    response = requests.get(url)
    cids = response.text.split()
    if len(cids) == 0:
        return None
    else:
        return cids
def get_record(cid):
    """
    Get pubchem record for a given cid and returns molecule as rdkit
    """
    url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{}/record/SDF'.format(cid)
    response = requests.get(url)
    mol = Chem.MolFromMolBlock(response.text)
    return mol

def get_molecule(text, n_results=1):
    """
    Search pubchem and return best matching record
    """  
    cids = get_cids(text)
    if cids is None:
        return None
    else:
        if n_results == 1:
            return get_record(cids[0])
        else:
            return [get_record(cid) for cid in cids[:n_results]]
get_molecule('buckyball')
get_molecule('octanol')

The code below shows how we can access atoms and covalent bonds – which will be useful to make graphs

mol = get_molecule('toluene')
mol
for a in mol.GetAtoms():
    print(f'Atom {a.GetIdx()} with atomic number {a.GetAtomicNum()}')
    neighbors = [n.GetIdx() for n in a.GetNeighbors()]
    
    print(f'\tBonded to {" ".join([str(n) for n in neighbors])}')
Atom 0 with atomic number 6
	Bonded to 1 2 3
Atom 1 with atomic number 6
	Bonded to 0 4
Atom 2 with atomic number 6
	Bonded to 0 5
Atom 3 with atomic number 6
	Bonded to 0
Atom 4 with atomic number 6
	Bonded to 1 6
Atom 5 with atomic number 6
	Bonded to 2 6
Atom 6 with atomic number 6
	Bonded to 4 5

Can you revise the code above to build an adjacency matrix and node feature vector?

Molecules as line notation

mol = get_molecule('methanol')
smiles = Chem.MolToSmiles(mol)
print('SMILES:', smiles)
SMILES: CO
import selfies as sf

selfies = sf.encoder(smiles)
print('SELFIES:', selfies)
SELFIES: [C][O]
inchi = Chem.MolToInchi(mol)
print('InChi:', inchi)
InChi: InChI=1S/CH4O/c1-2/h2H,1H3

Molecules as coordinates

mol = get_molecule('aspirin')
mol
Chem.AllChem.EmbedMolecule(mol)
0
c = mol.GetConformer()
c.GetPositions()
array([[ 1.26894284,  0.20805596,  0.74563655],
       [ 0.24971495,  2.28300294,  0.05136469],
       [-1.92553153,  2.15845285, -0.63762849],
       [ 1.92208607, -0.23271318, -1.37290903],
       [ 0.11362331, -0.5577966 ,  0.50569099],
       [-0.98861265,  0.13135426,  0.02525759],
       [ 0.0754972 , -1.90284483,  0.73139822],
       [-2.14498962, -0.5729312 , -0.22619491],
       [-1.06902025, -2.61813247,  0.48577564],
       [-2.15749664, -1.92553194,  0.00915115],
       [-0.90866288,  1.57794548, -0.20318262],
       [ 2.16660558,  0.32993954, -0.28867706],
       [ 3.39784362,  1.12119919, -0.09723405]])

Comparing molecules

The most common way to compare molecules is Morgan Fingerprints — also known as Extended Connectivity FingerPrint (ECFP). These are vectors that indicate presence of specific substructures. Pairwise similarity is then computed with Tanimoto similarity.

m1 = get_molecule('aspirin')
m2 = get_molecule('benzene')
fp1 =  AllChem.GetMorganFingerprint(m1, 2)
fp2 =  AllChem.GetMorganFingerprint(m2, 2)
dist = TanimotoSimilarity(fp1, fp2)
print('similarity', dist)
---------------------------------------------------------------------------
ArgumentError                             Traceback (most recent call last)
Input In [13], in <cell line: 4>()
      2 m2 = get_molecule('benzene')
      3 fp1 =  AllChem.GetMorganFingerprint(m1, 2)
----> 4 fp2 =  AllChem.GetMorganFingerprint(m2, 2)
      5 dist = TanimotoSimilarity(fp1, fp2)
      6 print('similarity', dist)

ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.GetMorganFingerprint(NoneType, int)
did not match C++ signature:
    GetMorganFingerprint(RDKit::ROMol mol, unsigned int radius, boost::python::api::object invariants=[], boost::python::api::object fromAtoms=[], bool useChirality=False, bool useBondTypes=True, bool useFeatures=False, bool useCounts=True, boost::python::api::object bitInfo=None, bool includeRedundantEnvironments=False)

Finding nearby molecules

One of the common strategies in modern methods is to generate a local neighborhood around a specific structure. We can do this using exmol via the STONED method.

import exmol

smiles = Chem.MolToSmiles(m1)
near_smi, near_tanimoto = exmol.run_stoned(smiles)
# sort them by tanimoto
descend_order = np.argsort(near_tanimoto)[::-1]
near_smi = [near_smi[i] for i in descend_order]

We can now draw a few of them to see what they look like.

near_mol = [Chem.MolFromSmiles(s) for s in near_smi]
Draw.MolsToGridImage(near_mol, molsPerRow=5)