diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..9d77925
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1,25 @@
+# Set the default behavior (used when a rule below doesn't match)
+* text=auto
+*.R text
+*.Rd text
+*.Rmd text
+*.*proj text
+*.targets text
+*.settings text
+*.vssettings text
+
+*.dll -text
+*.lib -text
+*.sln -text
+*.ico -text
+*.bmp -text
+*.png -text
+*.snk -text
+*.mht -text
+*.pickle -text
+*.Rdata -text
+*.Rhistory -text
+
+# Some Windows-specific files should always be CRLF
+*.bat eol=crlf
+*.cmd eol=crlf
diff --git a/DESCRIPTION b/DESCRIPTION
index 6126c99..69fa197 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,15 +1,19 @@
 Package: dplyrXdf
-Title: Interface to xdf files for the dplyr package
-Version: 0.10.0
+Title: Tools for working with Microsoft R Server Xdf files and the dplyr package
+Version: 1.0.0
 Authors@R: c(
     person("Hong", "Ooi", , "hongooi@microsoft.com", role = c("aut", "cre")),
     person("Microsoft", role="cph"),
     person("Hadley", "Wickham", role = "ctb", comment = "Some functions based on code in dplyr and httr"),
-    person("Ali-Kazim", "Zaidi", role = "ctb", comment = "Invaluable assistance on Spark")
+    person("Ali-Kazim", "Zaidi", role = "ctb", comment = "Invaluable assistance on Spark"),
+    person("Mario", "Inchiosa", role = "ctb", comment = "Invaluable assistance on Spark")
     )
-Description: Interface to xdf files for the dplyr package
+Description: A suite of tools for working with Microsoft R Server. Its most
+    visible feature is a dplyr interface for the Xdf file format and other MRS
+    data sources. It supports Hadoop and Spark clusters, as well as in-database
+    processing with Microsoft SQL Server.
 Depends:
-    R (>= 3.3.2),
+    R (>= 3.3),
     dplyr (>= 0.7),
     RevoScaleR (>= 8.0)
 Imports:
diff --git a/R/dplyr_pkg.R b/R/dplyr_pkg.R
index f48dda9..39df3f5 100644
--- a/R/dplyr_pkg.R
+++ b/R/dplyr_pkg.R
@@ -1,8 +1,8 @@
 #' dplyr backend for Xdf files
 #'
-#' The dplyrXdf package is a suite of tools to facilitate working with Microsoft R Server (MRS). Its features include:
+#' The dplyrXdf package is a suite of tools to facilitate working with \href{https://www.microsoft.com/en-au/cloud-platform/r-}{Microsoft Machine Learning Server}, previously known as Microsoft R Server (MRS). Its features include:
 #' \itemize{
-#' \item A backend to the popular \href{https://cran.r-project.org/web/packages/dplyr/index.html}{dplyr package} for the Xdf file format. Xdf files are a technology provided by MRS to break R's memory barrier: instead of keeping data in-memory in data frames, it is saved on disk. The data is then processed in chunks, so that you only need enough memory to handle each chunk.
+#' \item A backend to the popular \href{http://dplyr.tidyverse.org}{dplyr package} for the Xdf file format. Xdf files are a technology provided by MRS to break R's memory barrier: instead of keeping data in-memory in data frames, it is saved on disk. The data is then processed in chunks, so that you only need enough memory to handle each chunk.
 #' \item Interfaces to Microsoft SQL Server and HDInsight Hadoop and Spark clusters. dplyrXdf, in conjunction with dplyr, provides the ability to execute pipelines natively in-database and in-cluster, which for large datasets can be much more efficient than executing them locally.
 #' \item Several functions to ease working with Xdf files, including functions for file management and for transferring data to and from remote backends.
 #' \item Workarounds for various glitches and unexpected behaviour in MRS.
diff --git a/README.md b/README.md
index 6bd410b..7568785 100644
--- a/README.md
+++ b/README.md
@@ -1,32 +1,35 @@
 # dplyrXdf

-The [dplyr package](https://cran.r-project.org/package=dplyr) is a toolkit for data transformation and manipulation. Since its introduction, dplyr has become very popular in the R community, for the way in which it streamlines and simplifies many common data manipulation tasks.
+The dplyrXdf package is a suite of tools to facilitate working with [Microsoft Machine Learning Server](https://www.microsoft.com/en-au/cloud-platform/r-server), previously known as Microsoft R Server (MRS). Its features include:

-The dplyrXdf package implements a dplyr backend for [Microsoft R Server](https://www.microsoft.com/en-au/cloud-platform/r-server) (MRS). A key feature of MRS is that it allows you to break R's memory barrier. Instead of storing data in memory as data frames, it is stored on disk, in a file format identifiable by the `.xdf` extension. The data is then processed in chunks, so that you only need enough memory to store each chunk. This allows you to work with datasets of potentially unlimited size.
+- A backend to the popular [dplyr package](http://dplyr.tidyverse.org) for the Xdf file format. Xdf files are a technology provided by MRS to break R's memory barrier: instead of keeping data in-memory in data frames, it is saved on disk. The data is then processed in chunks, so that you only need enough memory to handle each chunk.
+- Interfaces to Microsoft SQL Server and HDInsight Hadoop and Spark clusters. dplyrXdf, in conjunction with dplyr, provides the ability to execute pipelines natively in-database and in-cluster, which for large datasets can be much more efficient than executing them locally.
+- Several functions to ease working with Xdf files, including functions for file management and for transferring data to and from remote backends.
+- Workarounds for various glitches and unexpected behaviour in MRS and dplyr.
-MRS includes a suite of data transformation and modelling functions in the RevoScaleR package that can handle xdf files. These functions are highly optimised and efficient, but their user interface can be complex. dplyrXdf allows you to work with xdf files within the framework supplied by dplyr, which reduces the learning curve and allows you to become productive more quickly. It works with data in the native filesystem and in HDFS, and can take advantage of a Spark or Hadoop cluster.
-
-_Note that dplyrXdf is a shell on top of the existing functions provided by Microsoft R Server, which is a commercial distribution of R. You must have MRS installed to make use of dplyrXdf. In particular, Microsoft R Open does not include support for xdf files._

 ## Obtaining dplyrXdf

-The current version of dplyrXdf is **0.10.0 beta**. You can download and install dplyrXdf from within R via the devtools package:
+The current version of dplyrXdf is **1.0.0**. You can download and install dplyrXdf from within R via the devtools package:

 ```r
 install.packages("devtools")
 devtools::install_github("RevolutionAnalytics/dplyrXdf")
 ```

-dplyrXdf 0.10 requires dplyr 0.7 and Microsoft R Server release 8.0 or higher. If you are on an earlier release of MRS and/or dplyr, you can install dplyrXdf 0.9.2 instead: `install_github("RevolutionAnalytics/dplyrXdf@v0.9.2")`.
-
-## Obtaining dplyr
+dplyrXdf requires Microsoft R Server release 8.0 or later, and dplyr 0.7 or later. If you want to use sparklyr and SQL Server integration, you will also have to install the dbplyr, sparklyr and odbc packages (and their dependencies).

-At the moment, dplyr 0.7 is not in the MRAN snapshot that is the default repo for MRS users. You can install it from CRAN instead:
+If you are using MRS 9.1 or earlier, the necessary packages will not be in the MRAN snapshot that is your default repo. You can install them from CRAN instead:

 ```r
-install.packages("dplyr", repos="https://cloud.r-project.org")
+install.packages(c("dplyr", "dbplyr", "sparklyr", "odbc"), repos="https://cloud.r-project.org")
 ```

-Make sure you install dplyr 0.7 before you install dplyrXdf.
+Make sure you install dplyr 0.7 _before_ you install dplyrXdf.
+
+## Earlier versions
+
+The previous version of dplyrXdf, 0.9.2, is also available. You can install this with `install_github("RevolutionAnalytics/dplyrXdf@v0.9.2")`. This version requires dplyr 0.5 or earlier; it may run into problems with dplyr 0.7.
+


diff --git a/man/dplyrXdf.Rd b/man/dplyrXdf.Rd
index 738acda..5519c2e 100644
--- a/man/dplyrXdf.Rd
+++ b/man/dplyrXdf.Rd
@@ -7,9 +7,9 @@
 \alias{dplyrXdf-package}
 \title{dplyr backend for Xdf files}
 \description{
-The dplyrXdf package is a suite of tools to facilitate working with Microsoft R Server (MRS). Its features include:
+The dplyrXdf package is a suite of tools to facilitate working with \href{https://www.microsoft.com/en-au/cloud-platform/r-}{Microsoft Machine Learning Server}, previously known as Microsoft R Server (MRS). Its features include:
 \itemize{
-  \item A backend to the popular \href{https://cran.r-project.org/web/packages/dplyr/index.html}{dplyr package} for the Xdf file format. Xdf files are a technology provided by MRS to break R's memory barrier: instead of keeping data in-memory in data frames, it is saved on disk. The data is then processed in chunks, so that you only need enough memory to handle each chunk.
+  \item A backend to the popular \href{http://dplyr.tidyverse.org}{dplyr package} for the Xdf file format. Xdf files are a technology provided by MRS to break R's memory barrier: instead of keeping data in-memory in data frames, it is saved on disk. The data is then processed in chunks, so that you only need enough memory to handle each chunk.
   \item Interfaces to Microsoft SQL Server and HDInsight Hadoop and Spark clusters. dplyrXdf, in conjunction with dplyr, provides the ability to execute pipelines natively in-database and in-cluster, which for large datasets can be much more efficient than executing them locally.
   \item Several functions to ease working with Xdf files, including functions for file management and for transferring data to and from remote backends.
   \item Workarounds for various glitches and unexpected behaviour in MRS.