ChemArchitect

A program for machine learning data generation of chemical systems.


Tutorial

Introduction

ChemArchitect is a software package designed to make the generation of computational chemistry datasets easier for those wishing to apply machine learning to chemistry. It therefore includes programs to generate initial structures, create input files for ab initio software, subdivide your dataset, prepare files for submission and resubmission, and extract features and outputs from input and output files. All of these programs are outlined in this tutorial, along with their dependencies and potential improvements/limitations of the code. As with any codebase, development and collaboration are encouraged. If you wish to add to the ChemArchitect package, please use the standard GitHub features to do so.

Making Many Gaussian Input Files From a List of SMILES.

This section details how to take a list of SMILES strings and make Gaussian input files that are ready to be submitted. First and foremost, SMILES is a way to describe a molecule as a one-dimensional string. While the three-dimensional structures generated from a SMILES string may differ slightly depending on the table of bond lengths the 1D-to-3D generator uses, the SMILES format remains a compact and reliable way to communicate and generate molecular structure. An example of a SMILES string being converted to 3D is shown in Figure 1.

Figure 1. Depiction of a 1D SMILES string and its 3D counterpart.

SmileToMol.py Required Input:
<SMILES String><space><Molecular ID Integer>
<SMILES String><space><Molecular ID Integer>
                    .
                    .
                    .
<SMILES String><space><Molecular ID Integer>
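
For example, the first two lines of an input file could read as follows (hypothetical molecules and IDs):

  C(F)(Cl)(Br)I 1
  C(C)(C)(C)O 2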

If you do not yet have a list of SMILES-formatted molecules, you can obtain them from various repositories and papers. For this tutorial we use a dataset of 14,641 quaternary-carbon-type molecules substituted with the ligands [H,CH3,OCH3,SCH3,I,Br,Cl,F,C≡N,Si3,BO2]. These strings were generated by permuting all possible ligand substitutions on a central quaternary carbon. The molecules are all organic, but because of where some of the halides sit on the periodic table, the choice of basis set should be considered carefully to ensure that electrons are treated consistently as the elements become heavier. This can be a problem for Pople basis sets, which around Br stop treating all electrons (core and valence) equally in order to save computational cost [cite](). This matters for machine learning because, while the theory states that neural networks are universal function approximators that can, in the infinite limit of model size, fit any function, computational tractability of the model and the cost of dataset acquisition remain key obstacles that should be avoided where possible. Additionally, simple models, especially linear models, lend themselves to easier extraction of new chemical understanding because the trained coefficients are straightforward to analyze. Therefore, in all aspects of dataset generation and representation, the complexity that is inherently encoded into the data must be considered.

Make sure the dataset file smiles_gen_full_quaternaryCarbon.txt is in the same directory as SmileToMol.py, or that you know the path needed to access it. A smaller test file is also available under the name SmileToMol_tester.txt. Open a terminal and run:
python SmileToMol.py

You may need to give it execute permissions first by running the command:
chmod +x SmileToMol.py

Follow the prompts to enter the number of files you would like to convert, and the name of the file with extension.

Be aware that the requested number of files is taken in the order they appear in the list, so if you want a random selection from your full list you will first need to use RandInputSelect.py, which is explained shortly. The number of files also matters because the program warns you up front how much space you are about to use on your computer, which helps prevent bricking your machine (i.e., filling up your hard disk so that it cannot operate or boot). There are many unit-conversion calculators online if you are unfamiliar with the byte unit. If the size the program reports during its pause is too big, you can stop the program with ctrl+c.

Upon hitting enter, the program will convert all of your SMILES strings into .mol-formatted files. If you are aware that Open Babel also has this capability and are wondering why we do not use that utility for this step, please refer to the [justifications](https://jfreeze95.github.io/ChemArchitect/Justification) section. Note that if you run the program again, files with the same name will be overwritten.

Congratulations! You now have files in the mol specification!

Converting .mol to .gjf or .com with Open Babel


The next step is to convert your .mol files into .gjf or .com files so they can be read by Gaussian. This can be done very simply by executing the following command in a terminal opened in the same directory as your .mol files:
  obabel -m -imol *.mol -ogjf *.gjf

This command:
1. Calls the obabel program
2. Requests the operation be performed for multiple files with the -m flag
3. Sets the input format to mol (input formats follow the syntax -i<format>)
4. Applies this to all files in the directory with a .mol extension (* is a wildcard meaning "I don't care what comes before")
5. Sets the output format to gjf (for com use -ocom)
6. Has each converted file keep the name that came before the old extension and simply take on the new extension.

This command may take a while to run on all files. Watch for any warnings, as they may require troubleshooting; most often the cause is a malformed file resulting from a bad or poorly interpreted SMILES string. Standard output looks something like this:

Figure 2. Example output from running the obabel command.

Now you have successfully made files structured for Gaussian input.

Adding Route Card Specifications Programmatically

The next step is to add calculation specifications to the route card and to the end of the input file. This part currently requires a bit of hardcoding in FileCreator.py, as this was judged the easiest approach for users. The first step is therefore to make a copy of FileCreator.py with a slightly different name. In the new file, find the lines that set data[0] through data[4]; these correspond to the route card, the file descriptor line, and the charge and multiplicity line (an example is sketched after the list below). Adjust the values in these lines to your desired specifications. Note that \n is the Python newline specifier and %% writes a single percent sign.
data[0] contains processor and memory specs, checkpoint file name, and the route card.
data[1] is a necessary blank line
data[2] is the file descriptor which does not impact the running of the calculation
data[3] is a necessary blank line
data[4] contains the charge and multiplicity which are specified when the program is run through user input
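
As a concrete illustration, the header lines in a copy of FileCreator.py might be set along these lines (hypothetical values; adjust the resources, checkpoint name, and level of theory to your own needs, and note the %% escaping mentioned above):

  data[0] = "%%nprocshared=4\n%%mem=8GB\n%%chk=mol_0001.chk\n#p opt freq b3lyp/6-31g(d)\n"  # resources, checkpoint, route card (%% writes a literal %)
  data[1] = "\n"                                   # required blank line
  data[2] = "Quaternary carbon molecule 0001\n"    # file descriptor line
  data[3] = "\n"                                   # required blank line
  data[4] = "0 1\n"                                # charge and multiplicity (normally set from user input)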

If you would like to simplify your user input, you can toggle the commenting of the user-input lines and the hardcoded option lines for charge and multiplicity. Your user input will include any additional file descriptors you would like to add to the filename, whether your file extension is .gjf or .com, and the index in your filename that contains the file ID. As an example of what is meant by filename index, a filename such as chg_pos_1_mult_2_QuartDgdzvp_0012.gjf would have its identifier (0012) at index 6, counting from zero, because the program splits on underscores. The program will remove the extension if it is attached to the ID. Pseudopotentials can also be added to the file by uncommenting the pseudopotential section, or by following the format of that section to include your desired lines after the atomic coordinates.
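
The filename-index convention can be illustrated with a short Python snippet (zero-based indexing after splitting on underscores):

  name = "chg_pos_1_mult_2_QuartDgdzvp_0012.gjf"
  parts = name.split("_")            # ['chg', 'pos', '1', 'mult', '2', 'QuartDgdzvp', '0012.gjf']
  file_id = parts[6].split(".")[0]   # index 6 holds '0012.gjf'; stripping the extension leaves '0012'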

Be aware that this operation overwrites the previous .gjf or .com files, so mistakes may require regeneration of the base files. However, the file is overwritten on every run, so as long as the only mistakes were in the first five lines and not in the atomic coordinates, rerunning the program with the fixed specifications will correct them. Also, since the name of the file is changed, the filename index will change if you rerun the program on the same files. Each filename is printed out with a warning so that, if a file is too big to be held in memory, you know its identity and can either adjust the file or request a new way of handling such files from the ChemArchitect developers.

Congratulations! You now have files ready to submit to Gaussian!

Related Utilities

The next section details additional utilities that will assist in tasks related to preparing Gaussian input files from SMILES.

Select a random subset of SMILES from the list you have

The RandInputSelect.py program will select a random subset of SMILES from a list of SMILES based on a ratio given by the user. For example, if you would like 20% of the data in your new file, you would input 5, as in 1 in 5. If you submit a float or double, it will be truncated to an integer following Python integer-casting rules. This program will create a new file unless the same name is given as the old file.
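
As a rough sketch of the 1-in-N idea (not the actual RandInputSelect.py implementation; the output filename here is hypothetical):

  import random
  n = int(5.8)   # user input is truncated, so 5.8 becomes 5, i.e. keep roughly 1 line in 5
  with open("smiles_gen_full_quaternaryCarbon.txt") as fin, open("smiles_subset.txt", "w") as fout:
      for line in fin:
          if random.randrange(n) == 0:   # each line has a 1-in-n chance of being kept
              fout.write(line)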

Make train and test set files from a single dataset

The Separate_Train_Test.py program will split one file into two new files based on the percentage of the data you would like to set aside for training. This program is useful because it makes a permanent split of training and testing data, unlike the on-the-fly utilities in sklearn, which split a dataset impermanently and require additional code to save the split. A permanent split lets you analyze how the data were divided and determine whether there are any biases in the split. The program will request the filename of the file to be split and the percentage you would like to use for training, and it will output two files with the _train and _test suffixes added to the names. It works with files from any stage so long as each datapoint is specified on a single line; for example, lists of SMILES or lists of featurized datapoints will work, while Gaussian input files will not.
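
A minimal sketch of one way to make such a permanent split (not the actual Separate_Train_Test.py code; the input filename is hypothetical):

  import random
  with open("featurized_data.txt") as f:
      lines = f.readlines()                    # one datapoint per line
  random.shuffle(lines)
  cut = int(0.8 * len(lines))                  # e.g. 80% requested for training
  with open("featurized_data_train.txt", "w") as f:
      f.writelines(lines[:cut])
  with open("featurized_data_test.txt", "w") as f:
      f.writelines(lines[cut:])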

Change the Route Card details for already made Gaussian input files

The BasisChanger_FileCreator.py program is very similar to FileCreator.py but simpler in its design, as it only seeks to change the first five lines of the input file and add any specifiers to the end of the file. All route card details besides the extension are specified through hardcoding.

Create files with shifted atomic cartesian coordinates from existing atomic coordinates

The ShiftedInputCreator.py program generates input files whose atom positions are slightly and randomly shifted from those of the original input file. This is useful for creating non-optimized structures. Be aware that the program has no features to prevent atoms from overlapping; the main way this is avoided is implicitly, through the size of the shift the user requests. This does mean that "atoms too close" errors may arise when the new files are run by Gaussian. Just like the previous "Creator" files, the route card is specified via hardcoding, so making a copy of the file is recommended. After the usual "Creator" questions, the user is prompted for the percentage of atoms in a molecule they would like to see shifted and the maximum shift, in angstroms, they would like to occur. Files are not overwritten unless no additional identifier for the filename is given.
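
The shifting idea itself is simple; a minimal sketch (not the actual ShiftedInputCreator.py code) might look like:

  import random

  def shift_coords(coords, fraction=0.5, max_shift=0.1):
      """Randomly displace a fraction of atoms by up to max_shift angstroms per axis."""
      shifted = []
      for (x, y, z) in coords:
          if random.random() < fraction:       # only the requested percentage of atoms is moved
              x += random.uniform(-max_shift, max_shift)
              y += random.uniform(-max_shift, max_shift)
              z += random.uniform(-max_shift, max_shift)
          shifted.append((x, y, z))             # note: no check that shifted atoms overlap
      return shifted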

Utilities to Speed Up Gaussian Job Submission and Fixing

The next batch of programs are designed to make restarting failed Gaussian calculations faster.

Make new input files for files that underwent a TIMEOUT failure.

The TimeOutRestarter.py program makes new input files from optimization log files that ran out of cluster time, using the restart option in Gaussian. To get the list of timed-out files, follow the instructions at [this link](https://www.makeuseof.com/tag/use-downloaded-gmail-data/) after making a filter in your Gmail that sends all jobs with TIMEOUT to a single mail folder. Name your .mbox file TIMEOUT.mbox, or else change the name in the TimeOutRestarter.py program.

The input and output files need to be in the same folder for this program to work. Alternatively, and more reasonably, Open Babel can be used to make a new input file with the updated coordinates from your optimization using the following command:
  obabel -m -ilog *.log -ogjf *.gjf

Use this command carefully so that you do not overwrite your original input files. If you do not want to use the restart command in Gaussian, you can run the above Open Babel command with "r.gjf" instead of ".gjf" and then use BasisChanger_FileCreator.py to update the route card.

When the program is run, the user is asked for a string that exists in all of the filenames, to assist in finding the lines of the mbox file that contain the job titles. The additional filename identifier that you entered for the "Creator" files is an easy choice. The program then asks how many characters come before the identifier in the title. If your identifier is at the beginning of the filename, the answer will be 6, because every filename in the mbox has " Name=" in front of it. For example, " Name=[Ru]" has six characters of ' Name=' before the identifier [Ru].
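
In Python terms, that character count is simply used to slice the identifier out of the title line; for example:

  line = " Name=[Ru]"
  identifier = line[6:]   # skipping the 6 characters of ' Name=' leaves '[Ru]'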

The program will then search for files with those names. It puts the names of all files it finds in your current directory in one file, and the names of files it cannot find in another file. You can then run the program in other folders with the shortened list of not-yet-handled files to find the rest. New files are generated with an "r" added to the end of the filename. Update the route card specification in the program to your specifications, importantly leaving the "restart" keyword in the opt specifications.

TIMEOUT.mbox is provided for testing.

Make new input files for files that failed due to a bad angle.

The AngleRestarter.py program has requirements very similar to those of TimeOutRestarter.py. Write the mailbox file to FAILED.mbox. This program handles angle and dihedral errors, as well as "atoms too close" errors and miscellaneous other errors, by separating the lists of files into a different file for each type of error.

Making Input for Machine Learning

Making Encoding Dictionaries For Machine Learning Input

Dictionaries detail what each feature column of the encoding means.

Spherical Radii Dictionary

The DictionaryMaker_SR.py program makes the dictionary for Spherical Radii at a given granularity and for a set of elements. This encoding describes pairwise interactions between any two atoms of the given elements at distance ranges specified by the granularity and range. The file has three specifiers: the elements and the granularity, which are easiest to adjust in the hardcoding, and the name of the dictionary file, which is requested from the user upon running. The elements are specified in an array that can be updated. The granularity, in angstroms, is specified in a range-function command in which the start, end, and step size between those two points are given. The dictionary is formed such that no reversibly identical permutations are included in the list; i.e., H-C-1.8 and C-H-1.8 are the same, so only the entry whose first atom comes earlier in the element array is kept in the dictionary.
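
The structure of the resulting dictionary can be sketched as follows (hypothetical element list and distance bins, and not the exact DictionaryMaker_SR.py output format):

  elements = ["H", "C", "O"]
  bins = [round(0.8 + 0.2 * i, 1) for i in range(7)]   # 0.8 to 2.0 angstroms in 0.2 angstrom steps
  keys = []
  for i, a in enumerate(elements):
      for b in elements[i:]:            # skip reversed duplicates, e.g. keep H-C but never C-H
          for r in bins:
              keys.append(f"{a}-{b}-{r}")
  # 'H-C-1.8' is generated because H comes before C in the element array; 'C-H-1.8' is not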



Angular Arc Dictionary

The DictionaryMaker_AA.py program is just like the Spherical Radii dictionary maker, but for angles formed between three atoms. Angles are specified in degrees and kept between 0 and 180, since above 180 the measurement becomes a matter of convention and dihedrals handle chirality. Reversibly identical permutations are again limited to one case per dictionary. For example, C-H-B-20 and B-H-C-20 are the same, but C-B-H-20 is not.



Dihedral Arc Dictionary

The DictionaryMaker_DA.py program is just like the Spherical Radii dictionary maker, but for dihedrals formed between four atoms. Dihedrals are specified in degrees and kept between 0 and 180, since above 180 the measurement becomes a matter of convention. Reversibly identical permutations are again limited to one case per dictionary. For example, C-H-F-B-20 and B-F-H-C-20 are the same, but C-F-H-B-20 is not.

Extracting Encoded Interactions for Machine Learning Input

The Extract Interactions programs are designed to extract a wide array of interactions between atoms from coordinate input files. Because we often want optimized coordinates, these input files are best obtained by performing the Open Babel commands that convert log files to gjf or com files, as specified above in the TimeOutRestarter section. Many of these files have hardcoded specifications that are clearly outlined in comments in the ExtractAnglesAndInts_SepPt1_Portion.py file. All files extract the element occurrences, which are simply how many times an element appears in the molecule, in addition to the other interactions outlined by the dictionaries. The dictionaries can be updated from their hardcoded forms to whatever new dictionary you may make by simply putting the copied csv string between quotes in the appropriate line; the programs are designed to handle the fact that the dictionary-maker programs only keep unique permutations.

In addition to extracting the different interactions at a specified granularity, these files also offer the option of proportional or discrete occurrence encodings for each instance of an interaction. For clarity, occurrence encodings are different from one-hot encodings because they can contain numbers other than 1 and are based on the number of occurrences of each type of interaction. More about the encodings can be seen in the paper: GiFE: A Molecular-Size Agnostic and Understandable Gibbs Free Energy Function.
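
One possible reading of the difference between discrete and proportional occurrence encodings is sketched below (hypothetical keys and counts; see the GiFE paper for the exact definitions used):

  dictionary_keys = ["H-C-1.0", "H-C-1.2", "C-O-1.4"]
  counts = {"H-C-1.0": 4, "C-O-1.4": 1}                         # interaction counts for one molecule
  discrete = [counts.get(k, 0) for k in dictionary_keys]        # [4, 0, 1] raw occurrence counts
  total = sum(counts.values())
  proportional = [c / total for c in discrete]                  # [0.8, 0.0, 0.2] normalized per molecule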

Extraction of Energies and Computed Properties

The ExtractProperties.py program takes in Gaussian log files and outputs files with the extracted computed properties. This output file is useful as labels for training data. The list currently contains 37 different properties, with the full list outlined in the file. The program also checks whether all forces have converged, since files may reach normal termination without meeting this condition despite its importance for obtaining fully optimized structures. Currently, three hardcoded numbers in the file select the atom numbers whose Mulliken charges will be extracted and between which bond distances will be measured.

Analysis Tools for Already Trained Models

Predict the energies for a new dataset on a trained model

The Predictor.py program takes linear coefficients saved from an sklearn model trained with the no-intercept option and multiplies them with new datapoints to generate output from the trained model. The program also takes in the ground-truth values for these datapoints and appends them to the predictions for easy comparison. This works even for multiple different output values; 37 is currently a hardcoded number corresponding to the 37th feature column of an output file generated by ExtractProperties.py.
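
The prediction step itself amounts to a matrix-vector product; a minimal sketch (hypothetical file names and formats) is:

  import numpy as np
  coeffs = np.loadtxt("coefficients.txt")             # saved from an sklearn model with fit_intercept=False
  X = np.loadtxt("new_features.txt", delimiter=",")   # one featurized datapoint per row
  y_true = np.loadtxt("new_labels.txt")               # ground-truth values for the same datapoints
  y_pred = X @ coeffs                                 # no intercept term to add back
  np.savetxt("predictions_vs_truth.txt", np.column_stack([y_pred, y_true]))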

With that, you are ready to use all current features in the ChemArchitect program set. Good luck, and please let the developers know of any improvements you wish to contribute.