Skip to content

Conversation

@Boranadas
Copy link
Owner

@Boranadas Boranadas commented Aug 8, 2018

GSoC 2018 - New fingerprinting API

This PR is the outcome of the Google Summer of Code project "Create a generalized fingerprinting API"

Summary

The RDKit already supports a broad variety of fingerprinting functionality but is limited by the fact that the current fingerprint implementations are all independent and have slightly different APIs. This makes things unnecessarily complicated for users and requires people who are interested in exploring new types of fingerprints to write far more code than it seems like they should have to. In this project we have created a generic decoupled fingerprinting module with a consistent and flexible API.
The changes contained here introduces this new fingerprinting API that makes it possible to use different types of molecular fingerprinting methods with a similar interface. It allows the creation of fingerprint generator objects with the same structure hence enabling users to easily substitute generator objects with each other. It also allows implementation of new fingerprinting methods by providing an abstract structure to work on and offering common fingerprint functionality such as size-limiting, count simulation, etc. out of the box.
The new API also offers more flexibility in the fingerprinting process: different fingerprint methods/algorithms can now be combined like using the Morgan atom invariants in the AtomPair fingerprint.

Outcomes

  • Abstract classes for implementing new fingerprint methods for the new API
  • Implementation of AtomPairs, Morgan, Topological Torsion and the RDKit fingerprint using the new API
  • Convenience functions to easily generate common fingerprints with default arguments and to generate a vector of fingerprints for a list of molecules in one call
  • Python wrappers for the new API
  • Unit tests and documentation for all new components

Tutorial

A Jupyter notebook is added to the repository exemplifying the usage of the new API and showing the combination possibilities using the new API. The tutorial can be found in: rdkit/docs/Notebooks/FingerprintGenerators.ipynb

Tests

Unit tests for the existing fingerprinting functions are duplicated with necessary adjustments for the new API. Some additional tests have been added for newly introduced components and the test cases for the functionality that does not currently exist in the new API is skipped.

Todo

  • Java wrappers and PostgreSQL Cartridge implementations are currently missing and needs to be implemented in future.
  • Make additional output and InfoString available in Python

Future work

Apart from adding the mentioned missing parts there are also possible improvements that can be built on the new API.

  • Methods for enabling generating fingerprints in bulk in parallel
  • Adding additional fingerprinting methods like the Layered/Pattern fingerprint

Copy link

@NadineSchneider NadineSchneider left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of it is only minor changes or are suggestions for changing the order of arguments
I'll continue with the "python part" later

class ROMol;

struct AdditionalOutput {
// will review this structure once more fignerprint types are implemented

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, but this comment could be deleted

@@ -0,0 +1,272 @@
#ifndef RD_FINGERPRINTGEN_H_2018_05

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright text should be added here and to all other files, see for example; rdkit@cb81660

public:
FingerprintArguments(const bool countSimulation,
const std::vector<std::uint32_t> countBounds,
const std::uint32_t foldedSize);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should avoid the word 'folded' here to and just take 'size'?

*bitInfoMap;
// morgan fp
// maps bitId -> vector of (atomId, radius)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should have better names here for the bitInfoMap and bitInfo? Maybe just bitInfoMorgan and bitInfoRDKitFP?

const std::vector<std::uint32_t> *atomInvariants = nullptr,
const std::vector<std::uint32_t> *bondInvariants = nullptr,
const bool hashResults = false) const = 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should change the order of the arguments here to: mol, arguments, fromAtoms, addtionalOutput, atomInvariant, bondInvariant, ignoreAtoms, confId, hasResults
I would expect this rather reflects the "importance" of the arguments

const std::uint32_t foldedSize = 2048,
const std::vector<std::uint32_t> countBounds = {1, 2, 4, 8},
const bool ownsAtomInvGen = false, const bool ownsBondInvGen = false);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I would suggest to change the order of the arguments to: radius, foldedSize , includeChirality , useBondTypes, atomInvariantsGenerator, bondInvariantsGenerator, ownsAtomInvGen, ownsBondInvGen, countSimulation, onlyNonzeroInvariants, countBounds,

template <typename OutputType>
std::vector<AtomEnvironment<OutputType> *>
MorganEnvGenerator<OutputType>::getEnvironments(
const ROMol &mol, FingerprintArguments<OutputType> *arguments,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could part of this code also be moved to the FingerprintUtils file since most of the code is the same as for the 'old' version?

std::vector<AtomEnvironment<OutputType> *> getEnvironments(
const ROMol &mol, FingerprintArguments<OutputType> *arguments,
const std::vector<std::uint32_t> *fromAtoms,
const std::vector<std::uint32_t> *ignoreAtoms, const int confId,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest moving ignoreAtoms and confId at the end of the arguments

*/
template <typename OutputType>
FingerprintGenerator<OutputType> *getRDKitFPGenerator(
const unsigned int minPath = 1, const unsigned int maxPath = 7,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I also would move 'foldedSize' to position 3

AtomInvariantsGenerator *atomInvariantsGenerator = nullptr,
const bool countSimulation = true,
const std::vector<std::uint32_t> countBounds = {1, 2, 4, 8},
const std::uint32_t foldedSize = 2048, const bool ownsAtomInvGen = false);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here also, I would suggest moving foldedSize to position 3 and ownsAtomInvGen after the atomInvariantGenerator

@NadineSchneider
Copy link

Thanks for implementing the bulkFP generator. Looks good to me!

fp = g.GetSparseCountFingerprint(m)
nz = fp.GetNonzeroElements()
self.assertEqual(len(nz), 1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the additional output available in Python? If yes, could we add a test for that too. If not, could we add it to Python?

In general we should try to have a few more tests in Python, like the mixing of different FPs.

}

template <typename OutputType>
ExplicitBitVect *getFingerprint(const FingerprintGenerator<OutputType> *fpGen,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the additional output available in Python? It seems like it is missing, it would be important to add this to the Python interface.

}
}

void testAtomPairFoldedBitvector() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should change the name of this test.

}

// Old version test name testHashedAtomPairs
void testFoldedAtomPairs() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here also, 'folded' should be deleted or replaced.

BOOST_LOG(rdErrorLog) << " done" << std::endl;
}

void testChiralPairs() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to change the name of this test to something else like testAtomPairsIncludingChirality

delete m1;
delete generator;
}
// additional outputs are not fully functional yet

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment still true? If yes, we should try to have it functional.

"source": [
"# Fingerprint Generators"
]
},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit introduction to this notebook would be great. Something mentioning that this is a new implementation of four different FPs and that it will allow more flexibility.

@Boranadas Boranadas force-pushed the dev/GSOC2018_Generalized_Fingerprints_Final branch from 2969716 to f8cd911 Compare August 14, 2018 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants