Preprocessing

MolecularGraph.jl version: 0.17.1

This tutorial covers the following preprocessing strategies:

  • Remove hydrogen vertices

  • Extract molecules of interest

  • Standardize charges

  • Dealing with resonance structure

  • Customize property updater


Public databases (e.g. PubChem, ChEMBL) and flat-file formats (e.g. SDFile) follow different conventions, so their records cannot always be used for your analysis as is. For example,

  • whether hydrogens are explicitly written or omitted

  • whether salt and water molecules are included in the molecular graph or provided as metadata

  • representation of resonance structure (e.g. diazo group; [C-]-[N+]#N <-> C=[N+]=[N-])

  • charge states, which depend on the conditions (powder, in solution, or under physiological conditions)

using Graphs, MolecularGraph
"_data"
data_dir = let
    # Create data directory
    data_dir = "_data"
    isdir(data_dir) || mkdir(data_dir)
    data_dir
end
fetch_mol! (generic function with 1 method)
function fetch_mol!(cid, name, datadir)
    # Download the SDF record for a PubChem CID unless it is already cached in datadir
    url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/$(cid)/SDF"
    dest = joinpath(datadir, "$(name).mol")
    isfile(dest) || download(url, dest)
    return dest
end
"_data/Cefditoren Pivoxil.mol"
molfile = fetch_mol!("6437877", "Cefditoren Pivoxil", data_dir)

Remove hydrogen vertices

  • SDFiles downloaded from PubChem include hydrogen vertices. In practice, hydrogens that are not important are usually removed from molecular graphs for simplicity.

  • remove_hydrogens!(mol) removes hydrogen vertices that are not important (no charge, no unpaired electron, no specific isotope composition and not involved in stereochemistry).

  • remove_all_hydrogens!(mol) removes all hydrogen vertices.
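
The difference between the two functions can be sketched on a small molecule. This is a minimal example, not part of the original notebook: it assumes that the SMILES parser stores the isotope label of `[2H]` and that a labelled hydrogen counts as "important" in the sense described above. The variable names are illustrative only.

```julia
using Graphs, MolecularGraph

# Methanol with every hydrogen written as an explicit vertex;
# one of them is deuterium (assumption: the isotope label is kept by the parser).
mol = smilestomol("[2H]C([H])([H])O[H]")
n_before = nv(mol)

# Remove only hydrogens that are not important.
trimmed = copy(mol)
remove_hydrogens!(trimmed)

# Remove every hydrogen vertex, important or not.
stripped = copy(mol)
remove_all_hydrogens!(stripped)

println("vertices before:              ", n_before)
println("after remove_hydrogens!:      ", nv(trimmed))
println("after remove_all_hydrogens!:  ", nv(stripped))
```

If the assumption about isotope labels holds, the deuterium vertex should survive `remove_hydrogens!` but not `remove_all_hydrogens!`.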

mol = sdftomol(molfile)
let
    remove_hydrogens!(mol)
    mol
end

Extract molecules of interest

connected_components(mol) returns the connected components of the molecular graph object. Each component is the list of vertices that make up one individual molecule.

mol2 = smilestomol("CC(=O)OC1=CC=CC=C1C(=O)O.O.CCO"); nothing
html_fixed_size(mol2, 250, 250, atomindex=true)
connected_components(mol2)
  • To extract the molecule of interest, you can iterate over the connected components and apply induced_subgraph(mol, vertices) to extract the molecules and filter them one by one.

  • Alternatively, simply use extract_largest_component!(mol). This removes from the graph every vertex that does not belong to the largest component (the connected component with the largest number of vertices).
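
The first approach, iterating over the components and filtering them, might look like the sketch below. It is not taken from the notebook: the size threshold `min_atoms` is a hypothetical filter, and the code assumes that induced_subgraph on a molecule returns a (subgraph, vertex map) pair, as Graphs.induced_subgraph does for plain graphs.

```julia
using Graphs, MolecularGraph

# Aspirin, water and ethanol stored in a single molecular graph.
mol = smilestomol("CC(=O)OC1=CC=CC=C1C(=O)O.O.CCO")
components = connected_components(mol)

# Hypothetical filter: keep only components with more than `min_atoms` vertices.
min_atoms = 3
kept = []
for vertices in components
    length(vertices) > min_atoms || continue
    # Assumption: returns (subgraph, vertex_map), like Graphs.induced_subgraph.
    submol, vmap = induced_subgraph(mol, vertices)
    push!(kept, submol)
end

println("component sizes: ", length.(components))
println("molecules kept:  ", length(kept))
```

Any other predicate on the vertex set (for example, the presence of a particular element) could replace the size check.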

let
    mol = copy(mol2)
    extract_largest_component!(mol)
    mol
end

Standardize charges

  • protonate_acids!(mol) removes charges on oxo/thio acid anions

  • deprotonate_oniums!(mol) removes charges on ammonium/oxonium cations

charged = smilestomol("CCCC(=O)[O-].[N+]CCCC[N+]")
let
    mol = copy(charged)
    protonate_acids!(mol)
    deprotonate_oniums!(mol)
    mol
end

Dealing with resonance structure

  • The substructure match methods in this library compare atom symbols and the number of $\pi$ electrons, so in many cases you don't have to worry about variations in how a resonance structure is written.

quinoline1 = smilestomol("N1=CC=CC2=C1C=CC=C2")
quinoline2 = smilestomol("N=1C=CC=C2C1C=CC=C2")
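
One way to check this is to compare the two Kekulé forms directly. The sketch below is an illustration rather than the notebook's own cell: it assumes has_exact_match(mol1, mol2) and has_substruct_match(mol, query) are available with these argument orders, and the pyridine query is added here only as an example.

```julia
using Graphs, MolecularGraph

# Two Kekulé structures of quinoline that differ only in double bond placement.
q1 = smilestomol("N1=CC=CC2=C1C=CC=C2")
q2 = smilestomol("N=1C=CC=C2C1C=CC=C2")

# Matching on atom symbols and pi electron counts should treat them as the same molecule.
same = has_exact_match(q1, q2)
println("exact match between the two forms: ", same)

# Illustrative query (not in the original): pyridine as a substructure of either form.
pyridine = smilestomol("N1=CC=CC=C1")
println("pyridine found in form 1: ", has_substruct_match(q1, pyridine))
println("pyridine found in form 2: ", has_substruct_match(q2, pyridine))
```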