
\documentclass[11pt,a4paper]{article}
\usepackage{times}
\usepackage{fullpage}
\setlength{\parindent}{0cm}


\begin{document}
\newcommand{\comment}[1]{\textit{``\ldots #1''}\par\vspace{0.5em}}
\newcommand{\response}[1]{#1\vspace{1em}}

We would like to thank the anonymous reviewers for their constructive
comments, which have served to significantly improve our manuscript.
We have updated the main text in several places and have
also included a supplemental document that provides a brief
description of terminology and a list of some historic milestones in
the field. The following sections provide a detailed response to
individual reviewer comments.
\vspace{2em}

\fbox{\textbf{Reviewer 1}}

\comment{when the authors state 'together with an even larger number
  of annotations', they shold explain what are annotations and how
  much is the large number;}
\response{We have updated the text on page 1 to explain explicitly
  what annotations are and to provide a concrete example of the
  sizes involved.}

\comment{For an overview I would expect a more comprehensive set of
  references.} 
 \response{We agree with the reviewer that a more
  comprehensive list of references would be appropriate and
  useful. However, due to format limitations of the article, we have
  been restricted to the use of 40 references. To alleviate this
  problem, we have included a supplement that provides a more
  extensive list of references, specifically including a number of
  texts covering cheminformatics broadly.}

\comment{1) The connection between sections could be improved,
  i.e. the authors should try to enhance the correlation that exists
  between the topics described in each section and map them to a
  global flow of cheminfomatics changes. 2) The authors state in the
  beginning that the paper will be around the concept of risk
  minimization in drug discovery, but I found that this topic is not
  sufficiently explored. In many sections is not straightforward teh
  connection between the challenges described and their impact on risk
  minimization in drug discovery; 3) The paper misses some important
  efforts and more detail on translational medicine and semantic web }
\response{The reviewer is correct in noting the disjointed character
  of the original article. We have updated the text throughout to flow
  smoothly. As part of this we have made more explicit how each topic
  that is discussed plays a role in de-risking various stages of the
  drug discovery process. Finally, we have updated the text on page 4
  to address the issue of linked data and the role that semantic
  technologies can play in this area. We have, however, chosen not to
  address the issue of translational medicine in the main article;
  instead, we have added a paragraph plus references to the historic
  supplement.}

%%%%%%%%%%%%%%%%%%%%%%%%%%%

\fbox{\textbf{Reviewer 2}}

\comment{The authors mention limitations of SMILES implementations,
  but do not mention canonicalization in this context, often using the
  Morgan algorithm published in 1965 in J. Chem. Doc.}
\response{We have updated the text on page 3 to note the Morgan
  algorithm in the context of chemical structure representation.}

\comment{The authors state that 3D information is lost when only
  considering the molecular graph. However, they do not make it clear
  that the 3D space is implicitly encoded in the molecular graph –
  this is essential to cover properly.}
\response{We have updated the text on page 2 to note that 3D
  information is implicit in 2D representations.}

\comment{ I entirely dispute that cheminformatics methods have been
  'closely guarded secrets of companies...' since many methods have been
  published in the public domain. The authors suggest that only in the
  last decade has the field gained access to freely available
  software, etc.}
\response{We agree that ``secret'' was probably not the best choice of
  words. However, we do believe that much cheminformatics software and
  data were proprietary until recently. We have updated the text on
  pages 7 and 8 to expand on this and provide a more detailed
  discussion.}

\comment{If this manuscript were to be published it would be
  completely rewriting the history of cheminformatics}
\response{While we agree with the reviewer that some key milestones in
  the history of cheminformatics were not included in the original
  article, we believe that the addition of the supplemental history
  addresses this problem. Furthermore, our original aim was \emph{not}
  to provide a complete historical overview of the field. Others have
  already provided such overviews. Our aim was to present a brief
  overview of \emph{current topics} in cheminformatics, selected because we
  believe that they represent ongoing challenges for the field and
  are thus amenable to cross-disciplinary efforts.}

%%%%%%%%%%%%%%%%%%%%%%%

\fbox{\textbf{Reviewer 3}}

\comment{The manuscript provides a very specific overview on the
  aspects of the open source community in cheminformatics. }
\response{As we have noted above, the current article is not a
  comprehensive historical review of the field. We have included a
  supplement that does provide a very brief overview of historic
  milestones. Furthermore, we have updated the text to note commercial
  vendors and solutions; but we still emphasise Open Source and Open
  Access solutions as these allow readers to explore the problems and
  topics we have discussed with a minimum of hindrances.}

\comment{The manuscript is poor in relevant citations with none of the
  classical references provided, such as Morgan 1965, Weininger 1988,
  Johnson and Maggiora 1990, Barnard 1993, Willett et al. 1998, etc.}
\response{The text has been updated to include the ``classic''
  references. In addition, the supplemental history document lists a
  number of milestones in cheminformatics and their relevant references.}

\comment{Molecular docking is also not mentioned, but is regarded as
  one of the key methods applied in cheminformatics. Docking should be
  included, along with other molecular modelling methods such as
  pharmacophores, shape searching, molecular dynamics, as a distinct
  section.}
\response{These methods are mentioned on page 5 (column
  1). However, we have chosen not to place a more in-depth discussion
  of these methods in their own section in the main document for two
  reasons. First, due to space limitations, this would require us to
  compress or drop other sections which we believe to be more
  relevant. Second, we are of the view that docking, for example, is
  not really a cheminformatics technique, though it makes extensive
  use of cheminformatics methods. Instead, we place it under molecular
  modeling. We realize that this is a somewhat subjective
  ``classification'' of methods. However, we do make explicit note of
  these methods in the supplemental document, where we make a
  distinction between cheminformatics and related subject areas such
  as quantum mechanics and molecular modeling.}

\comment{Proteochemometric methods are mentioned a couple of times but
  these are largely un-noticed by the community currently. I do not
  think that it is necessarily appropriate to include these as
  tried-and-tested methods in an over-arching review such as this.}
\response{It is true that proteochemometric methodologies
  might not be as well known as, say, docking or pharmacophore modeling.
  However, what is considered proteochemometric is known in docking as an
  ``interaction fingerprint'' (e.g., for kinase databases), and in the field of HIV the
  term ``phenotypic modeling'' has a long tradition.
  In line with multiple reviews and the books written by
  Qing Yan (Pharmacogenomics) and Kubinyi/M\"{u}ller (Chemogenomics),
  we consider the term proteochemometrics worth mentioning, since it unifies the
  (statistical) mining/QSAR techniques that explicitly include multiple ligand
  activities and multiple receptor features.\\
  Finally, we also agree with Reviewer 1 that proteochemometric modeling
  plays an important role in translational medicine, e.g. HIV-HAART therapy.
  A paragraph and references about translational medicine were added to the
  historic supplement.}

\comment{RDKit is not mentioned at all in this article and is widely
  used as a cheminformatics toolkit in industry and academia and free
  and open-source (although from commercial beginnings).}
\response{We agree that this toolkit should have been included. We
  have updated the text on page 7 as well as Table 2 to take note of RDKit, 
  along with the newer Indigo toolkit.}

\comment{The authors could more appropriately cover the graph theory
  aspects of molecular structures by correctly applying the
  terminology to which computer scientists and cheminformaticians will
  most certainly be aware}
\response{The text has been updated.}

\comment{The inclusion of ontologies is not really useful here as the
  conclusions drawn are vague and do not really inform the current
  state-of-the-art and usefulness of ontologies in cheminformatics.}
\response{The text has been updated on page 4 to make the discussion
  of ontologies more focused and the use cases clearer, specifically
  identifying challenges in cross-domain querying, classification
  and integration within the context of whole-systems biology, and the
  increasingly essential role that ontologies and semantic technologies are
  playing in this matter.}

\comment{Section 2.2. Structure Enumeration: it is not clear what this
  section is trying to convey to the reader \ldots}
\response{We have rewritten this section to simplify the language and
  to be more explicit about the applications of, and challenges in, this
  technique. We have also provided relevant references to key topics in
  this area.}


\comment{p2, c1: I think the authors could benefit from a re-write of
  the section detailing molecular structure representations. \ldots
  Figure 1: what is meant by 'closer to reality' This needs to be
  clarified. This figure and its caption requires a little more
  consideration to appropriately convey the message, }
\response{We have updated the caption to remove the phrase ``closer to
  reality'' since this is not the main point. Instead we have noted
  that the representations contain differing amounts and types of
  information that are equally valid, but suited for different
  purposes. The text on page 2 has been updated to better represent
  this view.}

\comment{p1, c2, l31: it is not clear to this reviewer what software
  and algorithms have only recently been made available. Certainly,
  the referenced book to which they refer contains a large number of
  algorithms that have already been published in the primary
  literature and frequently many decades ago. If the open source
  community has provided true novelty then please articulate this,
  otherwise cite the primary references that report these
  discoveries.}
\response{The text has been updated to note that most of the
  algorithms in cheminformatics have been known since the mid-20th
  century, but that freely accessible toolkits implementing them, as
  well as public and open benchmark data sets, are relatively recent.}

\comment{The paragraph in the second column discussing descriptors and
  the proper description requires rewriting too since it provides a
  very vague overview that is not particularly informative.}
\response{We have updated the discussion of descriptors to be more
  focused.}

\comment{page 1, column 1, line 24: the authors need to provide a
  reference for the assertion that cheminformatics is an older field,
  particularly given the comments later in this manuscript.}
\response{We have removed the assertion that cheminformatics is older
  than bioinformatics, primarily because there is no obvious
  reference that claims this, but also because, depending on what one
  considers bioinformatics and cheminformatics, either field can be
  identified as the older one. We feel that
  claims about age do not strengthen the paper.}

\comment{p1, c1, l29-37: I would dispute that until recently the
  cheminformatics techniques have been closely guarded
  secrets. J. Chem. Doc. was founded in 1961 and other techniques even
  older, such as the Wiener index (1947), Wiswesser Line Notation
  (1949), etc. Not to mention the actual foundations of what we call
  cheminformatics back to the atomistic theory of the 19th century.}
\response{We have updated the text on page 1 to attenuate this
  statement, stressing the fact that much of the data in
  cheminformatics is proprietary, whereas only some of the techniques (such
  as SMILES canonicalization) are. Regarding the statement that atomistic
  theory laid the foundations of cheminformatics: this is certainly
  true in a very broad sense. For that matter, the atomistic theory
  laid the ground for computational chemistry in general. In the
  supplemental history document, we make a distinction between
  computational chemistry and cheminformatics.}

\comment{p1, c1, l32: I think the phrase miracle molecule could be
  more usefully replaced with new drugs or new small molecule
  therapeutics.}
\response{The text has been updated to use the term ``therapeutic
  molecule''.}

\comment{p2, c2, l57: it might be worthwhile here clarifying in the text
precisely the properties that require satisfaction to deliver a small
molecule therapeutic, not `promising'. A drug must be safe and
efficacious ultimately; perhaps this should be mentioned first
followed by the typical pitfalls and how they are assessed?}
\response{The text has been updated to provide some examples of
  properties that would characterize a therapeutically useful small molecule.}

\comment{P3, c1, l41: the authors need to cover the Morgan algorithm
  (published in 1965) here and explain the canonicalization
  process. The issues mentioned in different canonicalization
  implementations providing different SMILES strings is also a
  challenge with InChI codes with different softwares giving different
  representation. Therefore, this is still not a solved problem as
  suggested here. Could the authors clarify this in the text?}
\response{We have updated the text to reference the Morgan
  algorithm. Regarding the issue of InChI codes, we believe that this
  is not the case. Currently, there is only one implementation of the
  InChI algorithm and even if alternative implementations were to be
  developed, the InChI specification is publicly available. As a
  result, there should be no differing InChI representations for a
  given input structure.}

\comment{P3, c2: structure-based fingerprints referred to here sound
  like Daylight-style fingerprints to this reviewer. Is this the case?
  I would emphasise the distinction of structure-key and hash-key
  fingerprints. My understanding is that Daylight-style fingerprints
  enumerate paths of 7 in length, not 8, could the authors provide a
  reference for this?}
\response{The section on fingerprints has been rewritten to be
  clearer and to distinguish between the two broad classes of
  fingerprints. We have also mentioned the use of fingerprints in
  similarity searches and the Tanimoto coefficient.}
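
\response{For reference, the Tanimoto coefficient is commonly
  written as follows for two binary fingerprints $A$ and $B$ (the
  symbols $a$, $b$ and $c$ are notation introduced for this letter only
  and are not taken from the manuscript):
  \[
    T(A,B) = \frac{c}{a + b - c},
  \]
  where $a$ and $b$ are the numbers of bits set in $A$ and $B$,
  respectively, and $c$ is the number of bits set in both. As an
  illustrative instance with made-up counts, $a = 8$, $b = 6$ and
  $c = 4$ give $T = 4/(8 + 6 - 4) = 0.4$.}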

\comment{P3, c2, l42: is this true given the lower complexity of
  molecular graphs compared with other much larger and denser graphs?
  I think a reference to a known paper that states the problem clearly
  would be of benefit here.}
\response{While it is true that isomorphism testing on the small graphs
  characteristic of small molecules is, in an absolute sense, not very
  slow, there are a number of cases (polycyclic hydrocarbons, steroids,
  etc.) that can take significantly longer. More generally, the fact
  that subgraph isomorphism is NP-complete means that no polynomial-time
  worst-case guarantee is known. Thus, when performing this on \emph{millions} of
  molecules, as in a database search, we may not be able to complete
  the operation in reasonable time. In practice this worst case is rare, but
  isomorphism algorithms still take much longer than fingerprint
  screening. Hence, performing isomorphism tests on every molecule in a large
  database is often infeasible in practice. We have included a reference that
  explicitly discusses this problem.}

\comment{P4, c2, l17: citation needed on the size of chemistry space.}
\response{The appropriate reference has been added.}

\comment{P4, c2, l23: the isomorphism problem has already been
  mentioned previously in this manuscript.}
\response{The text has been updated and simplified.}

\comment{P4, c2, l41: the GDB-13 database contains 970 million
  molecules, which is nearly a billion, not a trillion.}
\response{The text has been updated to use the correct number.}


\comment{P5, c1, l29: the normal phrase used to describe this concept
  is the similar property principle. The authors should also provide a
  reference.}
\response{This portion of the text has been restructured to refer to
  the similar property principle and to include the relevant
  reference.}

\comment{P5, c1, l40: QSAR should be referenced to Hansch et al. Also
  perhaps some discussion on what the authors mean by referring to
  these approaches as traditional.}
\response{We have updated the text to include references to Hansch and
  Free \& Wilson. Regarding the use of ``traditional'', we believe the
  text explicitly explains why: QSAR as originally
  defined only considers ligand features. However, we have rephrased it to use
  ``traditionally'', since one can argue that methods such as docking
  and pharmacophores are also QSAR methods, but consider both ligand
  and receptor.}

\comment{P5, c1, l41: I would dispute the assertion that these methods
  "ignore reception interactions" since the aim is to identify a
  correlation of biological response (e.g. pIC50) with chemical
  structure. The biological response is an explicit measurement of
  receptor interactions on the protein.}
\response{We have updated
  the text to explicitly note that QSAR models do not usually take
  into account \emph{receptor features}. However, we note that while
  receptor information is implicit in the IC$_{50}$, the value also
  includes other non-receptor related features such as
  permeability. Furthermore, lack of receptor information in QSAR
  models has been noted as the origin of activity cliffs (Guha and Van
  Drie, \textit{J.~Chem.~Inf.~Model.}, \textbf{2008}, \textit{48},
  1716--1728), thus supporting the statement that traditional QSAR
  models do indeed miss important information on receptor-ligand
  interactions.}

\comment{P5, c2, l22: I would mention naive Bayesian classifiers as well as
this is perhaps the most widely applied method in the field.}
\response{We now mention na\"{i}ve Bayes classifiers as well.}

\comment{P5, c2, l42: formal citation for one of Hopfinger's papers is
  required here.}
\response{The relevant reference has been added.}

\comment{P5, c2, l58: this assertion is made with no evidence to back
  it up. Why should multi-target models be more reliable? It is not
  clear to me that they should.}
\response{We have rephrased this statement to be less dogmatic. While
  it is true that for the case of linear models, a multivariate
  multiple regression is, in general, equivalent to multiple individual
  regression models, one could argue that taking into account the
  correlation structure between the multiple y's could lead to improved
  models in scenarios where a molecule has affinities for multiple
  related targets. We admit that the answer is not clear at this point
  and hence suggest that this could be a topic for research.}
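
\response{To make the linear-model equivalence concrete, the following
  is a brief sketch under assumptions of our own: ordinary least
  squares, a full-rank descriptor matrix $X$, and a response matrix
  $Y = [y_1, \ldots, y_m]$ with one column per target. This notation is
  introduced for this letter only and does not appear in the manuscript.
  \[
    \hat{B} = (X^{\top}X)^{-1}X^{\top}Y
    \quad\Longrightarrow\quad
    \hat{b}_j = (X^{\top}X)^{-1}X^{\top}y_j, \qquad j = 1, \ldots, m,
  \]
  so each column $\hat{b}_j$ of $\hat{B}$ coincides with the coefficient
  vector obtained by regressing $y_j$ on $X$ alone. Under these
  assumptions, any benefit from joint modeling would have to come from
  methods that use the correlation among the $y_j$ explicitly, rather
  than from ordinary least squares itself.}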

\comment{P6, c1, l7: I think it is important for the authors to cover
  ELNs by using actual recorded data. Do we know that many scientists
  changed to ELNs `many years ago' as stated? Is there a reference for
  this or data to back it up?}
\response{We have expanded our original wording, and we now cite two
  references about the use and growth of ELNs in chemistry and
  elsewhere.}

\comment{P6, c1, l7: the authors do not mention the leading players in
  ELNs (e.g. CambridgeSoft, Accelrys, etc.) \ldots I think the
  key software houses that develop and supply these ELN systems should
  be mentioned. Furthermore, systems such as Reaxys should be
  mentioned \ldots} 
 \response{We now mention CambridgeSoft, Accelrys and
  Reaxys in the article, and provide a citation to a review of 35 ELNs
  that was published in November 2011.}

\comment{P7, c2, l25: the authors should go back even further in the
  history of cheminformatics here before DENDRAL. While this is an
  excellent example of crossover, it is not by any means the founding
  of our field. Aspects of mathematical chemistry should also be
  mentioned in this article, which date back even further, such as
  1894 with the publication of `The Principles of Mathematical
  Chemistry' published by Helm.}
\response{As noted previously, we have included a supplemental
  document that lists some of the milestones in the history of
  cheminformatics. However, we note that the suggested reference by
  Helm is not specifically related to cheminformatics. Rather, it is a
  mathematical treatment of physical chemistry. Given that DENDRAL is
  referenced in the context of structure enumeration, we feel that
  citing Helm at that point in the main text would be out of place. We have
  instead included this reference in the supplemental history.}

\comment{P7, c2, l51: why is chemically in inverted commas and not
  algorithmically? I would prefer inverted commas for neither.}
\response{Quotes have been removed.}

\comment{P8, c1, l10: ELN's should be ELNs.}
\response{Corrected.}

\comment{P8, c1, l5: the users emphasise open source, open data but
  this should be an article on all of cheminformatics not just a
  particular subset.}
\response{We have restructured the section on
  open source and open data. It is true that we still focus on open
  source/open data, because this ensures that a reader will be able to
  explore cheminformatics problems with a minimum of hindrances.}

\comment{P7, c1, l9: the authors make an interesting point here
  regarding why cheminformatics software has historically been from
  commercial suppliers. Given the scope of this manuscript it would be
  interesting to expand on this discussion.} 
\response{We have
  restructured the Conclusions section to expand on the role of
  cheminformatics in industry versus bioinformatics, and on why many
  cheminformatics tools and much of the data have been commercial, in contrast
  to the freely available tools and data in bioinformatics.}

\end{document}