Updating the package to be current. Also turning into an Rstudio project.
nolanlab · Aug 2, 2013 · 0836e76
1 parent 325a85c
Showing 6 changed files with 59 additions and 38 deletions.
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -0,0 +1,2 @@
+^.*\.Rproj$
+^\.Rproj\.user$
diff --git a/.gitignore b/.gitignore
@@ -9,3 +9,4 @@ config.status
 *.o
 inst/unit_tests/report*
 autom4te.cache
+.Rproj.user
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,15 +1,18 @@
 Package: Rclusterpp
 Type: Package
-Title: Linkable C++ clustering 
-Version: 0.2.0
+Title: Linkable C++ clustering
+Version: 0.2.1
 Date: 2011-08-24
-Author: Michael Linderman 
-Maintainer: Michael Linderman <[email protected]> 
-Description: Provide flexible native clustering routines that can be linked against in downstream packages. 
+Author: Michael Linderman
+Maintainer: Michael Linderman <[email protected]>
+Description: Provide flexible native clustering routines that can be
+        linked against in downstream packages.
 License: MIT License
 LazyLoad: yes
 Depends: R (>= 2.12.0), Rcpp (>= 0.9.6), RcppEigen (>= 0.1.2)
 Suggests: RUnit, rbenchmark, fastcluster, inline
 LinkingTo: Rcpp, RcppEigen
-Packaged: 2011-08-25 01:17:08 UTC; mlinderm
-URL: https://github.com/nolanlab/Rclusterpp 
+Packaged: 2012-12-09 18:25:28 UTC; mlinderm
+URL: https://bitbucket.org/mlinderm/rclusterpp
+Repository: CRAN
+Date/Publication: 2012-12-15 07:31:47
diff --git a/Rclusterpp.Rproj b/Rclusterpp.Rproj
@@ -0,0 +1,16 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: Sweave
+LaTeX: pdfLaTeX
+
+BuildType: Package
+PackageInstallArgs: --no-multiarch
diff --git a/inst/doc/Rclusterpp.Rnw b/inst/doc/Rclusterpp.Rnw
@@ -52,24 +52,25 @@ Rclusterpp.setThreads(1)
 Hierarchical clustering is a fundamental data analysis tool. However, the
 $O(n^2)$ memory footprint of commonly available implementations, such as
 \code{stats::hclust}, which maintain the dissimilarity matrix in memory
-(stored-distance) limit these implementations to tens of thousands of
+(colloquially stored-distance) limit these implementations to tens of thousands of
 observations or less. In the motivating domain for this work, flow cytometry,
 datasets are hundreds of thousands or even millions of observations in size
-(but with low dimensionality, e.g., less than 30). In this and other similar
+(but with comparatively low dimensionality, e.g., less than 30). In this and other similar
 contexts building out the complete distance matrix is not possible and
 alternative implementations with $O(n)$ memory footprint are needed. 
 
 The memory requirements of hierarchical clustering have motivated the
 development of alternative clustering algorithms that do not require the full
 dissimilarity matrix. Such algorithms are not the focus of \pkg{Rclusterpp}.
-Often, a much larger application is built and validated around a standard
-hierarchical clustering algorithm, e.g. average-link, and only later scaled to large
-datasets. In these cases, we wish to maintain the same algorithm
-but scale efficiently. Thus the goal for \pkg{Rclusterpp} is to provide
-efficient ``stored data'' implementations for common hierarchical clustering
-routines, e.g., single, complex, average and Ward's linkage, that scale to
-hundreds of thousands of observations while delivering identical results as the
-``stock'' \code{stats::hclust} implementation. 
+Instead we focus on the common situation wherein a complex data analysis
+pipeline, which includes hierarchical clustering, is first designed and
+validated on smaller datasets, and only later scaled to larger in inputs. In
+these cases, we wish to maintain the same functionality, an if possible the
+same results, but scale efficiently. Thus the goal for \pkg{Rclusterpp} is to
+provide efficient ``stored data'' implementations for common hierarchical
+clustering routines, e.g., single, complex, average and Ward's linkage, that
+scale to hundreds of thousands of observations while delivering 
+results identical to the ``stock'' \code{stats::hclust} implementation. 
 
 As an example, the following two statements produce identical results:
 <<simple>>=
@@ -78,16 +79,13 @@ r <- Rclusterpp.hclust(USArrests, method="average", distance="euclidean")
 # Check equality of the dedrogram tree and agglomeration heights
 identical(h$merge, r$merge) && all.equal(h$height, r$height)
 @
-however, in the latter, the memory footprint is on the order of
-$O(n)$ as opposed to $O(n^2)$, for $n$ observations (ignoring the footprint of
-the data itself). The trade-off can but does not have to be increased
-computation. In such cases, \pkg{Rclusterpp} purposely trades time for space.
-Fortunately there exist algorithms for some commonly used clustering methods,
-specifically Ward's minimum variance method and single-link, that achieve
-optimal time complexity with only $O(n)$ space. \pkg{Rclusterpp} implements
-those more efficient algorithms when possible. Section~\ref{sec:data} includes
-a summary of the complexity of each linkage method as implemented.
-
+however, in the latter, the memory footprint is on the order of $O(n)$ as
+opposed to $O(n^2)$, for $n$ observations (ignoring the footprint of the data
+itself). When required, such as in the example above, \pkg{Rclusterpp}
+purposely trades time for space to maintain a $O(n)$ memory footprint.
+Section~\ref{sec:data} includes a summary of the complexity of each linkage
+method as implemented.
+
 The computational demanding components of \pkg{Rclusterpp} are implemented in
 \proglang{C++} using OpenMP\footnote{OpenMP is only enabled on Linux and OSX
 due to issues with the pthreads compatibility DLL on Windows} to take advantage
@@ -126,8 +124,8 @@ fastcluster &   2.6246 & \\
 \end{tabular}
 \end{table}
 
-In some applications, such as the WGNCA~\cite{Zhang2005} algorithm that
-also motivated this work, the distance matrix is already computed in a previous
+In some applications, such as the WGCNA~\cite{Zhang2005} algorithm that
+also motivated this work, the dissimilarity matrix is already computed in a previous
 stage of the workflow and thus there is no advantage to be gained with
 stored-data approaches. However, memory footprint is still a concern. Those
 individuals who have attempted to cluster more than 46340 observations have
@@ -157,11 +155,11 @@ can implemented exactly using the {\it recursive nearest neighbor (RNN)}
 algorithm~\cite{Murtagh1983}.
 
 Table~\ref{tab:complexity} shows the estimated worst-case time and space
-complexities~\cite{Murtagh1984} for the algorithms used in \pkg{Rclusterpp}. As
-described previously, Ward's and single-link are implemented with optimal time
-and space using RNN and SLINK~\cite{Sibson1973} algorithms respectively. While
-average and complete-link trade increased time bounds, in exchange for reducing
-the memory footprint to $O(n)$ from $O(n^2)$.
+complexities~\cite{Murtagh1984} for the algorithms used in \pkg{Rclusterpp}.
+Ward's and single-link are implemented with optimal time and space using the RNN
+and SLINK~\cite{Sibson1973} algorithms respectively; while average and
+complete-link trade increased time bounds, in exchange for reducing the memory
+footprint to $O(n)$ from $O(n^2)$.
 
 \begin{table}
 	\centering
@@ -242,12 +240,12 @@ Implementation & Exec. Time (s) & $n$ \\
 that are setup to link against the \pkg{Rclusterpp} library. Alternately one
 can use the \pkg{inline} package to compile C++ code from within R. The
 \pkg{Rclusterpp} package includes an example ``inline" function, shown below,
-which we will use that as our working example in this document. 
+which we will use as our working example in this document. 
 <<example>>=
 cat(readLines(system.file("examples","clustering.R",package="Rclusterpp")),sep="\n")
 @
 
-Rclusterpp makes extensive use of \pkg{Rcpp} to build the interface between
+\pkg{Rclusterpp} makes extensive use of \pkg{Rcpp} to build the interface between
 \proglang{R} and \proglang{C++}, and the \pkg{Eigen} library (via
 \pkg{RcppEigen}) for matrix and vector operations. A working knowledge of both
 libraries will be needed to effectively use \pkg{Rclusterpp} as this lower

diff --git a/src/hclust.cpp b/src/hclust.cpp
@@ -156,8 +156,9 @@ namespace {
 		if (TYPEOF(data) != RTYPE)
 			throw std::invalid_argument("Wrong R type for mapped vector");
 
-		typedef ::Rcpp::traits::storage_type<RTYPE>::type STORAGE;
-		double *d_start = ::Rcpp::internal::r_vector_start<RTYPE,STORAGE>(data);
+		//typedef ::Rcpp::traits::storage_type<RTYPE>::type STORAGE;
+		//double *d_start = ::Rcpp::internal::r_vector_start<RTYPE,STORAGE>(data);
+		double *d_start = REAL(data);
 
 		for (ssize_t c=0; c<N-1; c++) {
 			m.block(c+1 /* starting row */, c, N-(c+1) /* numer of rows*/, 1) = Eigen::Map<Eigen::NumericMatrix>(d_start,N-(c+1),1);