update pkg description
hongooi73 committed Sep 28, 2017
1 parent 5a7312c commit cb9e001
Showing 5 changed files with 53 additions and 21 deletions.
25 changes: 25 additions & 0 deletions .gitattributes
@@ -0,0 +1,25 @@
# Set the default behavior (used when a rule below doesn't match)
* text=auto
*.R text
*.Rd text
*.Rmd text
*.*proj text
*.targets text
*.settings text
*.vssettings text

*.dll -text
*.lib -text
*.sln -text
*.ico -text
*.bmp -text
*.png -text
*.snk -text
*.mht -text
*.pickle -text
*.Rdata -text
*.Rhistory -text

# Some Windows-specific files should always be CRLF
*.bat eol=crlf
*.cmd eol=crlf
14 changes: 9 additions & 5 deletions DESCRIPTION
@@ -1,15 +1,19 @@
 Package: dplyrXdf
-Title: Interface to xdf files for the dplyr package
-Version: 0.10.0
+Title: Tools for working with Microsoft R Server Xdf files and the dplyr package
+Version: 1.0.0
 Authors@R: c(
 person("Hong", "Ooi", , "[email protected]", role = c("aut", "cre")),
 person("Microsoft", role="cph"),
 person("Hadley", "Wickham", role = "ctb", comment = "Some functions based on code in dplyr and httr"),
-person("Ali-Kazim", "Zaidi", role = "ctb", comment = "Invaluable assistance on Spark")
+person("Ali-Kazim", "Zaidi", role = "ctb", comment = "Invaluable assistance on Spark"),
+person("Mario", "Inchiosa", role = "ctb", comment = "Invaluable assistance on Spark")
 )
-Description: Interface to xdf files for the dplyr package
+Description: A suite of tools for working with Microsoft R Server. Its most
+visible feature is a dplyr interface for the Xdf file format and other MRS
+data sources. It supports Hadoop and Spark clusters, as well as in-database
+processing with Microsoft SQL Server.
 Depends:
-R (>= 3.3.2),
+R (>= 3.3),
 dplyr (>= 0.7),
 RevoScaleR (>= 8.0)
 Imports:
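
The `Depends` field in this hunk sets the minimum versions the package needs. As a rough illustration only (it is not part of this commit), the sketch below parses those constraints with base R's `read.dcf`. The field values are copied from the updated DESCRIPTION; the variable names (`desc_text`, `dep_names`, `dep_min`, `r_ok`) are illustrative assumptions.

```r
# Fields copied from the updated DESCRIPTION in this commit.
desc_text <- c(
  "Package: dplyrXdf",
  "Version: 1.0.0",
  "Depends:",
  "    R (>= 3.3),",
  "    dplyr (>= 0.7),",
  "    RevoScaleR (>= 8.0)"
)

# read.dcf() accepts a connection, so no file on disk is needed.
con <- textConnection(desc_text)
desc <- read.dcf(con)
close(con)

# Split the Depends field into one entry per dependency.
deps <- trimws(strsplit(desc[1, "Depends"], ",")[[1]])

# Separate the package name from its minimum version, e.g. "dplyr (>= 0.7)".
dep_names <- sub(" .*$", "", deps)
dep_min   <- sub("^.*>= *([0-9.]+)\\)$", "\\1", deps)

# Example check: does the running R satisfy the declared minimum?
r_ok <- getRversion() >= dep_min[dep_names == "R"]
```

Relaxing the constraint from `R (>= 3.3.2)` to `R (>= 3.3)` means any 3.3.x release passes a check like `r_ok`.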
4 changes: 2 additions & 2 deletions R/dplyr_pkg.R
@@ -1,8 +1,8 @@
 #' dplyr backend for Xdf files
 #'
-#' The dplyrXdf package is a suite of tools to facilitate working with Microsoft R Server (MRS). Its features include:
+#' The dplyrXdf package is a suite of tools to facilitate working with \href{https://www.microsoft.com/en-au/cloud-platform/r-}{Microsoft Machine Learning Server}, previously known as Microsoft R Server (MRS). Its features include:
 #' \itemize{
-#' \item A backend to the popular \href{https://cran.r-project.org/web/packages/dplyr/index.html}{dplyr package} for the Xdf file format. Xdf files are a technology provided by MRS to break R's memory barrier: instead of keeping data in-memory in data frames, it is saved on disk. The data is then processed in chunks, so that you only need enough memory to handle each chunk.
+#' \item A backend to the popular \href{http://dplyr.tidyverse.org}{dplyr package} for the Xdf file format. Xdf files are a technology provided by MRS to break R's memory barrier: instead of keeping data in-memory in data frames, it is saved on disk. The data is then processed in chunks, so that you only need enough memory to handle each chunk.
 #' \item Interfaces to Microsoft SQL Server and HDInsight Hadoop and Spark clusters. dplyrXdf, in conjunction with dplyr, provides the ability to execute pipelines natively in-database and in-cluster, which for large datasets can be much more efficient than executing them locally.
 #' \item Several functions to ease working with Xdf files, including functions for file management and for transferring data to and from remote backends.
 #' \item Workarounds for various glitches and unexpected behaviour in MRS.
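
The package documentation describes data being kept on disk and processed in chunks, so that only one chunk needs to fit in memory at a time. The sketch below illustrates that general idea with base R and a small CSV file. It is a conceptual stand-in only: it does not use the Xdf format, RevoScaleR or dplyrXdf, and `chunk_mean` is a hypothetical helper written for this example.

```r
# Write a small CSV to a temporary file to stand in for on-disk data.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10), path, row.names = FALSE)

# Compute the mean of one column while holding only `chunk_size` rows
# in memory at any time.
chunk_mean <- function(path, column, chunk_size = 4) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first line holds the column names (write.csv quotes them).
  header <- strsplit(readLines(con, n = 1), ",")[[1]]
  header <- gsub('"', "", header)

  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break   # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

result <- chunk_mean(path, "x")
```

Only running totals are carried between chunks, which is why memory use depends on the chunk size rather than on the size of the file.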
27 changes: 15 additions & 12 deletions README.md
@@ -1,32 +1,35 @@
# dplyrXdf

The [dplyr package](https://cran.r-project.org/package=dplyr) is a toolkit for data transformation and manipulation. Since its introduction, dplyr has become very popular in the R community, for the way in which it streamlines and simplifies many common data manipulation tasks.
The dplyrXdf package is a suite of tools to facilitate working with [Microsoft Machine Learning Server](https://www.microsoft.com/en-au/cloud-platform/r-server), previously known as Microsoft R Server (MRS). Its features include:

The dplyrXdf package implements a dplyr backend for [Microsoft R Server](https://www.microsoft.com/en-au/cloud-platform/r-server) (MRS). A key feature of MRS is that it allows you to break R's memory barrier. Instead of storing data in memory as data frames, it is stored on disk, in a file format identifiable by the `.xdf` extension. The data is then processed in chunks, so that you only need enough memory to store each chunk. This allows you to work with datasets of potentially unlimited size.
- A backend to the popular [dplyr package](http://dplyr.tidyverse.org) for the Xdf file format. Xdf files are a technology provided by MRS to break R's memory barrier: instead of keeping data in-memory in data frames, it is saved on disk. The data is then processed in chunks, so that you only need enough memory to handle each chunk.
- Interfaces to Microsoft SQL Server and HDInsight Hadoop and Spark clusters. dplyrXdf, in conjunction with dplyr, provides the ability to execute pipelines natively in-database and in-cluster, which for large datasets can be much more efficient than executing them locally.
- Several functions to ease working with Xdf files, including functions for file management and for transferring data to and from remote backends.
- Workarounds for various glitches and unexpected behaviour in MRS and dplyr.

MRS includes a suite of data transformation and modelling functions in the RevoScaleR package that can handle xdf files. These functions are highly optimised and efficient, but their user interface can be complex. dplyrXdf allows you to work with xdf files within the framework supplied by dplyr, which reduces the learning curve and allows you to become productive more quickly. It works with data in the native filesystem and in HDFS, and can take advantage of a Spark or Hadoop cluster.

_Note that dplyrXdf is a shell on top of the existing functions provided by Microsoft R Server, which is a commercial distribution of R. You must have MRS installed to make use of dplyrXdf. In particular, Microsoft R Open does not include support for xdf files._

## Obtaining dplyrXdf

The current version of dplyrXdf is **0.10.0 beta**. You can download and install dplyrXdf from within R via the devtools package:
The current version of dplyrXdf is **1.0.0**. You can download and install dplyrXdf from within R via the devtools package:

```r
install.packages("devtools")
devtools::install_github("RevolutionAnalytics/dplyrXdf")
```

dplyrXdf 0.10 requires dplyr 0.7 and Microsoft R Server release 8.0 or higher. If you are on an earlier release of MRS and/or dplyr, you can install dplyrXdf 0.9.2 instead: `install_github("RevolutionAnalytics/[email protected]")`.

## Obtaining dplyr
dplyrXdf requires Microsoft R Server release 8.0 or later, and dplyr 0.7 or later. If you want to use sparklyr and SQL Server integration, you will also have to install the dbplyr, sparklyr and odbc packages (and their dependencies).

At the moment, dplyr 0.7 is not in the MRAN snapshot that is the default repo for MRS users. You can install it from CRAN instead:
If you are using MRS 9.1 or earlier, the necessary packages will not be in the MRAN snapshot that is your default repo. You can install them from CRAN instead:

```r
install.packages("dplyr", repos="https://cloud.r-project.org")
install.packages(c("dplyr", "dbplyr", "sparklyr", "odbc"), repos="https://cloud.r-project.org")
```

Make sure you install dplyr 0.7 before you install dplyrXdf.
Make sure you install dplyr 0.7 _before_ you install dplyrXdf.

## Earlier versions

The previous version of dplyrXdf, 0.9.2, is also available. You can install this with `install_github("RevolutionAnalytics/[email protected]")`. This version requires dplyr 0.5 or earlier; it may run into problems with dplyr 0.7.
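
The version requirements in this README (dplyr 0.7 or later for dplyrXdf 1.0.0, installed before dplyrXdf itself) can be checked from R before running `install_github`. The following is a minimal sketch using only base R; `has_min_version` is a hypothetical helper name, not a function provided by dplyrXdf or devtools.

```r
# Returns TRUE if `pkg` is installed at version `min_version` or later.
has_min_version <- function(pkg, min_version) {
  # system.file() returns "" when the package is not installed.
  installed <- nzchar(system.file(package = pkg))
  installed && utils::packageVersion(pkg) >= min_version
}

if (!has_min_version("dplyr", "0.7")) {
  message("Install dplyr 0.7 or later before installing dplyrXdf.")
}
```

The same helper could be pointed at dbplyr, sparklyr or odbc if the Spark and SQL Server integration is needed.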



4 changes: 2 additions & 2 deletions man/dplyrXdf.Rd

Some generated files are not rendered by default.
