From 0836e76a5a6ca22011d0f1dd8c25080caf2c6253 Mon Sep 17 00:00:00 2001
From: rbruggner
Date: Fri, 2 Aug 2013 15:42:42 -0700
Subject: [PATCH] Updating the package to be current. Also turning into an Rstudio project.

---
 .Rbuildignore           |  2 ++
 .gitignore              |  1 +
 DESCRIPTION             | 17 +++++++------
 Rclusterpp.Rproj        | 16 ++++++++++++
 inst/doc/Rclusterpp.Rnw | 56 ++++++++++++++++++++---------------------
 src/hclust.cpp          |  5 ++--
 6 files changed, 59 insertions(+), 38 deletions(-)
 create mode 100644 .Rbuildignore
 create mode 100644 Rclusterpp.Rproj

diff --git a/.Rbuildignore b/.Rbuildignore
new file mode 100644
index 0000000..91114bf
--- /dev/null
+++ b/.Rbuildignore
@@ -0,0 +1,2 @@
+^.*\.Rproj$
+^\.Rproj\.user$
diff --git a/.gitignore b/.gitignore
index 1e52b44..25da9d8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,3 +9,4 @@ config.status
 *.o
 inst/unit_tests/report*
 autom4te.cache
+.Rproj.user
diff --git a/DESCRIPTION b/DESCRIPTION
index 413410c..8d87fc7 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,15 +1,18 @@
 Package: Rclusterpp
 Type: Package
-Title: Linkable C++ clustering
-Version: 0.2.0
+Title: Linkable C++ clustering
+Version: 0.2.1
 Date: 2011-08-24
-Author: Michael Linderman
-Maintainer: Michael Linderman
-Description: Provide flexible native clustering routines that can be linked against in downstream packages.
+Author: Michael Linderman
+Maintainer: Michael Linderman
+Description: Provide flexible native clustering routines that can be
+ linked against in downstream packages.
 License: MIT License
 LazyLoad: yes
 Depends: R (>= 2.12.0), Rcpp (>= 0.9.6), RcppEigen (>= 0.1.2)
 Suggests: RUnit, rbenchmark, fastcluster, inline
 LinkingTo: Rcpp, RcppEigen
-Packaged: 2011-08-25 01:17:08 UTC; mlinderm
-URL: https://github.com/nolanlab/Rclusterpp
+Packaged: 2012-12-09 18:25:28 UTC; mlinderm
+URL: https://bitbucket.org/mlinderm/rclusterpp
+Repository: CRAN
+Date/Publication: 2012-12-15 07:31:47
diff --git a/Rclusterpp.Rproj b/Rclusterpp.Rproj
new file mode 100644
index 0000000..96bdc3d
--- /dev/null
+++ b/Rclusterpp.Rproj
@@ -0,0 +1,16 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: Sweave
+LaTeX: pdfLaTeX
+
+BuildType: Package
+PackageInstallArgs: --no-multiarch
diff --git a/inst/doc/Rclusterpp.Rnw b/inst/doc/Rclusterpp.Rnw
index 5e0e923..e2b46f2 100644
--- a/inst/doc/Rclusterpp.Rnw
+++ b/inst/doc/Rclusterpp.Rnw
@@ -52,24 +52,25 @@ Rclusterpp.setThreads(1)
 Hierarchical clustering is a fundamental data analysis tool. However, the
 $O(n^2)$ memory footprint of commonly available implementations, such as
 \code{stats::hclust}, which maintain the dissimilarity matrix in memory
-(stored-distance) limit these implementations to tens of thousands of
+(colloquially stored-distance) limit these implementations to tens of thousands of
 observations or less. In the motivating domain for this work, flow cytometry,
 datasets are hundreds of thousands or even millions of observations in size
-(but with low dimensionality, e.g., less than 30). In this and other similar
+(but with comparatively low dimensionality, e.g., less than 30). In this and other similar
 contexts building out the complete distance matrix is not possible and
 alternative implementations with $O(n)$ memory footprint are needed.
 
 The memory requirements of hierarchical clustering have motivated the
 development of alternative clustering algorithms that do not require the full
 dissimilarity matrix. Such algorithms are not the focus of \pkg{Rclusterpp}.
-Often, a much larger application is built and validated around a standard
-hierarchical clustering algorithm, e.g. average-link, and only later scaled to large
-datasets. In these cases, we wish to maintain the same algorithm
-but scale efficiently. Thus the goal for \pkg{Rclusterpp} is to provide
-efficient ``stored data'' implementations for common hierarchical clustering
-routines, e.g., single, complex, average and Ward's linkage, that scale to
-hundreds of thousands of observations while delivering identical results as the
-``stock'' \code{stats::hclust} implementation.
+Instead we focus on the common situation wherein a complex data analysis
+pipeline, which includes hierarchical clustering, is first designed and
+validated on smaller datasets, and only later scaled to larger in inputs. In
+these cases, we wish to maintain the same functionality, an if possible the
+same results, but scale efficiently. Thus the goal for \pkg{Rclusterpp} is to
+provide efficient ``stored data'' implementations for common hierarchical
+clustering routines, e.g., single, complex, average and Ward's linkage, that
+scale to hundreds of thousands of observations while delivering
+results identical to the ``stock'' \code{stats::hclust} implementation.
 
 As an example, the following two statements produce identical results:
 <>=
@@ -78,16 +79,13 @@ r <- Rclusterpp.hclust(USArrests, method="average", distance="euclidean")
 # Check equality of the dedrogram tree and agglomeration heights
 identical(h$merge, r$merge) && all.equal(h$height, r$height)
 @
-however, in the latter, the memory footprint is on the order of
-$O(n)$ as opposed to $O(n^2)$, for $n$ observations (ignoring the footprint of
-the data itself). The trade-off can but does not have to be increased
-computation. In such cases, \pkg{Rclusterpp} purposely trades time for space.
-Fortunately there exist algorithms for some commonly used clustering methods,
-specifically Ward's minimum variance method and single-link, that achieve
-optimal time complexity with only $O(n)$ space. \pkg{Rclusterpp} implements
-those more efficient algorithms when possible. Section~\ref{sec:data} includes
-a summary of the complexity of each linkage method as implemented.
-
+however, in the latter, the memory footprint is on the order of $O(n)$ as
+opposed to $O(n^2)$, for $n$ observations (ignoring the footprint of the data
+itself). When required, such as in the example above, \pkg{Rclusterpp}
+purposely trades time for space to maintain a $O(n)$ memory footprint.
+Section~\ref{sec:data} includes a summary of the complexity of each linkage
+method as implemented.
+
 The computational demanding components of \pkg{Rclusterpp} are implemented in
 \proglang{C++} using OpenMP\footnote{OpenMP is only enabled on Linux and OSX due
 to issues with the pthreads compatibility DLL on Windows} to take advantage
@@ -126,8 +124,8 @@ fastcluster & 2.6246 & \\
 \end{tabular}
 \end{table}
 
-In some applications, such as the WGNCA~\cite{Zhang2005} algorithm that
-also motivated this work, the distance matrix is already computed in a previous
+In some applications, such as the WGCNA~\cite{Zhang2005} algorithm that
+also motivated this work, the dissimilarity matrix is already computed in a previous
 stage of the workflow and thus there is no advantage to be gained with
 stored-data approaches. However, memory footprint is still a concern. Those
 individuals who have attempted to cluster more than 46340 observations have
@@ -157,11 +155,11 @@ can implemented exactly using the {\it recursive nearest neighbor (RNN)}
 algorithm~\cite{Murtagh1983}.
 
 Table~\ref{tab:complexity} shows the estimated worst-case time and space
-complexities~\cite{Murtagh1984} for the algorithms used in \pkg{Rclusterpp}. As
-described previously, Ward's and single-link are implemented with optimal time
-and space using RNN and SLINK~\cite{Sibson1973} algorithms respectively. While
-average and complete-link trade increased time bounds, in exchange for reducing
-the memory footprint to $O(n)$ from $O(n^2)$.
+complexities~\cite{Murtagh1984} for the algorithms used in \pkg{Rclusterpp}.
+Ward's and single-link are implemented with optimal time and space using the RNN
+and SLINK~\cite{Sibson1973} algorithms respectively; while average and
+complete-link trade increased time bounds, in exchange for reducing the memory
+footprint to $O(n)$ from $O(n^2)$.
 
 \begin{table}
 \centering
@@ -242,12 +240,12 @@ Implementation & Exec. Time (s) & $n$ \\
 that are setup to link against the \pkg{Rclusterpp} library. Alternately one can
 use the \pkg{inline} package to compile C++ code from within R. The
 \pkg{Rclusterpp} package includes an example ``inline" function, shown below,
-which we will use that as our working example in this document.
+which we will use as our working example in this document.
 <>=
 cat(readLines(system.file("examples","clustering.R",package="Rclusterpp")),sep="\n")
 @
 
-Rclusterpp makes extensive use of \pkg{Rcpp} to build the interface between
+\pkg{Rclusterpp} makes extensive use of \pkg{Rcpp} to build the interface between
 \proglang{R} and \proglang{C++}, and the \pkg{Eigen} library (via
 \pkg{RcppEigen}) for matrix and vector operations. A working knowledge of both
 libraries will be needed to effectively use \pkg{Rclusterpp} as this lower
diff --git a/src/hclust.cpp b/src/hclust.cpp
index 44b7920..60d4007 100644
--- a/src/hclust.cpp
+++ b/src/hclust.cpp
@@ -156,8 +156,9 @@ namespace {
 if (TYPEOF(data) != RTYPE)
 throw std::invalid_argument("Wrong R type for mapped vector");
-typedef ::Rcpp::traits::storage_type::type STORAGE;
- double *d_start = ::Rcpp::internal::r_vector_start(data);
+ //typedef ::Rcpp::traits::storage_type::type STORAGE;
+ //double *d_start = ::Rcpp::internal::r_vector_start(data);
+ double *d_start = REAL(data);
 for (ssize_t c=0; c(d_start,N-(c+1),1);