[jvm-packages] [breaking] rework xgboost4j-spark and xgboost4j-spark-gpu (#10639)

- Introduce an abstract XGBoost Estimator
- Update to the latest XGBoost parameters
  - Add all XGBoost parameters supported in XGBoost4j-spark.
  - Add setter and getter for these parameters.
  - Remove the deprecated parameters
- Address the missing value handling
- Remove any ETL operations in XGBoost
- Rework the GPU plugin
- Expand sanity tests for CPU and GPU consistency
wbo4958 authored Sep 11, 2024
1 parent d94f667 commit 67c8c96
Showing 75 changed files with 4,545 additions and 7,564 deletions.
1 change: 1 addition & 0 deletions doc/jvm/index.rst
@@ -38,6 +38,7 @@ Contents
XGBoost4J-Spark-GPU Tutorial <xgboost4j_spark_gpu_tutorial>
Code Examples <https://github.com/dmlc/xgboost/tree/master/jvm-packages/xgboost4j-example>
API docs <api>
How to migrate to XGBoost-Spark jvm 3.x <xgboost_spark_migration>

.. note::

162 changes: 162 additions & 0 deletions doc/jvm/xgboost_spark_migration.rst
@@ -0,0 +1,162 @@
########################################################
Migration Guide: How to migrate to XGBoost-Spark jvm 3.x
########################################################

The XGBoost-Spark JVM packages underwent significant changes in version 3.0,
which may cause compatibility issues with existing user code.

This guide will walk you through the process of updating your code to ensure
it's compatible with XGBoost-Spark 3.0 and later versions.

**********************
XGBoost Spark Packages
**********************

XGBoost-Spark 3.0 introduced a single uber package named xgboost-spark_2.12-3.0.0.jar, which bundles
both xgboost4j and xgboost4j-spark. This means you can now simply depend on `xgboost-spark` in your application.

* For CPU

.. code-block:: xml

    <dependency>
        <groupId>ml.dmlc</groupId>
        <artifactId>xgboost-spark_${scala.binary.version}</artifactId>
        <version>3.0.0</version>
    </dependency>

* For GPU

.. code-block:: xml

    <dependency>
        <groupId>ml.dmlc</groupId>
        <artifactId>xgboost-spark-gpu_${scala.binary.version}</artifactId>
        <version>3.0.0</version>
    </dependency>

When submitting the XGBoost application to the Spark cluster, you only need to specify the single `xgboost-spark` package.

* For CPU

.. code-block:: bash

    spark-submit \
      --jars xgboost-spark_2.12-3.0.0.jar \
      ...

* For GPU

.. code-block:: bash

    spark-submit \
      --jars xgboost-spark-gpu_2.12-3.0.0.jar \
      ...

***************
XGBoost Ranking
***************

Learning to rank using XGBoostRegressor has been replaced by a dedicated `XGBoostRanker`, which is specifically designed
to support ranking algorithms.

.. code-block:: scala

    // before 3.0
    val regressor = new XGBoostRegressor().setObjective("rank:ndcg")

    // after 3.0
    val ranker = new XGBoostRanker()

******************************
XGBoost Constructor Parameters
******************************

XGBoost-Spark now categorizes parameters into two groups: XGBoost-Spark parameters and XGBoost parameters.
When constructing an XGBoost estimator, only XGBoost parameters are permitted. XGBoost-Spark specific
parameters must be configured through the estimator's setter methods. Note that
`XGBoost Parameters <https://xgboost.readthedocs.io/en/stable/parameter.html>`_
can be set both during construction and through the estimator's setter methods.

.. code-block:: scala

    // before 3.0
    val xgboost_paras = Map(
      "eta" -> "1",
      "max_depth" -> "6",
      "objective" -> "binary:logistic",
      "num_round" -> 5,
      "num_workers" -> 1,
      "features" -> "feature_column",
      "label" -> "label_column"
    )
    val classifier = new XGBoostClassifier(xgboost_paras)

    // after 3.0
    val xgboost_paras = Map(
      "eta" -> "1",
      "max_depth" -> "6",
      "objective" -> "binary:logistic"
    )
    val classifier = new XGBoostClassifier(xgboost_paras)
      .setNumRound(5)
      .setNumWorkers(1)
      .setFeaturesCol("feature_column")
      .setLabelCol("label_column")

    // Or you can use setters to set all parameters
    val classifier = new XGBoostClassifier()
      .setNumRound(5)
      .setNumWorkers(1)
      .setFeaturesCol("feature_column")
      .setLabelCol("label_column")
      .setEta(1)
      .setMaxDepth(6)
      .setObjective("binary:logistic")

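
For code bases that carry large legacy parameter maps, the split between constructor-friendly XGBoost parameters and setter-only Spark-level keys can be done mechanically before migrating. The sketch below uses plain Scala collections; the `sparkLevelKeys` list is an illustrative assumption drawn from this guide, not an official API:

```scala
// Keys that, per this guide, moved out of the constructor map and now
// require setter methods. Illustrative, not exhaustive.
val sparkLevelKeys = Set("num_round", "num_workers", "features", "label")

// Split a legacy pre-3.0 parameter map into the part that may still be
// passed to the constructor and the part that must go through setters.
def splitParams(legacy: Map[String, Any]): (Map[String, Any], Map[String, Any]) =
  legacy.partition { case (key, _) => !sparkLevelKeys.contains(key) }

val legacy = Map(
  "eta" -> "1",
  "max_depth" -> "6",
  "objective" -> "binary:logistic",
  "num_round" -> 5,
  "num_workers" -> 1
)
val (constructorParams, setterParams) = splitParams(legacy)
```

`constructorParams` can then be passed to the estimator's constructor, while each entry of `setterParams` is translated to its setter by hand.
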
******************
Removed Parameters
******************

Starting from 3.0, the following parameters have been removed.

- cacheTrainingSet

If you wish to cache the training dataset, cache it in your own code before
fitting it to an estimator.

.. code-block:: scala

    val df = input.cache()
    val model = new XGBoostClassifier().fit(df)

- trainTestRatio

Instead, split the dataset yourself and pass the held-out portion as the evaluation dataset.

.. code-block:: scala

    val Array(train, eval) = trainDf.randomSplit(Array(0.7, 0.3))
    val classifier = new XGBoostClassifier().setEvalDataset(eval)
    val model = classifier.fit(train)

- tracker_conf

The following method can be used to configure RabitTracker.

.. code-block:: scala

    val classifier = new XGBoostClassifier()
      .setRabitTrackerTimeout(100)
      .setRabitTrackerHostIp("192.168.0.2")
      .setRabitTrackerPort(19203)

- rabitRingReduceThreshold
- rabitTimeout
- rabitConnectRetry
- singlePrecisionHistogram
- lambdaBias
- objectiveType
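
When auditing legacy code for the parameters above, a lookup table keyed by the removed name can drive a quick check. The replacement hints below merely restate this guide; entries marked `None` have no direct 3.x equivalent mentioned here, so treat the table as a migration aid to verify against the release notes, not an official mapping:

```scala
// Removed pre-3.0 parameter -> replacement hint from this guide
// (None = no direct equivalent mentioned).
val removedParams: Map[String, Option[String]] = Map(
  "cacheTrainingSet"         -> Some("cache the DataFrame before fit()"),
  "trainTestRatio"           -> Some("randomSplit + setEvalDataset"),
  "tracker_conf"             -> Some("setRabitTrackerTimeout / setRabitTrackerHostIp / setRabitTrackerPort"),
  "rabitRingReduceThreshold" -> None,
  "rabitTimeout"             -> None,
  "rabitConnectRetry"        -> None,
  "singlePrecisionHistogram" -> None,
  "lambdaBias"               -> None,
  "objectiveType"            -> None
)

// Report which removed parameters a legacy configuration still uses.
def flagRemoved(legacyKeys: Set[String]): Set[String] =
  legacyKeys.intersect(removedParams.keySet)
```
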
2 changes: 1 addition & 1 deletion jvm-packages/pom.xml
@@ -46,7 +46,7 @@
<use.cuda>OFF</use.cuda>
<cudf.version>24.06.0</cudf.version>
<spark.rapids.version>24.06.0</spark.rapids.version>
<cudf.classifier>cuda12</cudf.classifier>
<spark.rapids.classifier>cuda12</spark.rapids.classifier>
<scalatest.version>3.2.19</scalatest.version>
<scala-collection-compat.version>2.12.0</scala-collection-compat.version>
<skip.native.build>false</skip.native.build>
Expand Down
1 change: 1 addition & 0 deletions jvm-packages/xgboost4j-spark-gpu/pom.xml
@@ -54,6 +54,7 @@
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark_${scala.binary.version}</artifactId>
<version>${spark.rapids.version}</version>
<classifier>${spark.rapids.classifier}</classifier>
<scope>provided</scope>
</dependency>
<dependency>
@@ -55,9 +55,9 @@ public static CudfColumn from(ColumnVector cv) {
DType dType = cv.getType();
String typeStr = "";
if (dType == DType.FLOAT32 || dType == DType.FLOAT64 ||
dType == DType.TIMESTAMP_DAYS || dType == DType.TIMESTAMP_MICROSECONDS ||
dType == DType.TIMESTAMP_MILLISECONDS || dType == DType.TIMESTAMP_NANOSECONDS ||
dType == DType.TIMESTAMP_SECONDS) {
dType == DType.TIMESTAMP_DAYS || dType == DType.TIMESTAMP_MICROSECONDS ||
dType == DType.TIMESTAMP_MILLISECONDS || dType == DType.TIMESTAMP_NANOSECONDS ||
dType == DType.TIMESTAMP_SECONDS) {
typeStr = "<f" + dType.getSizeInBytes();
} else if (dType == DType.BOOL8 || dType == DType.INT8 || dType == DType.INT16 ||
dType == DType.INT32 || dType == DType.INT64) {
@@ -35,11 +35,39 @@ public QuantileDMatrix(
float missing,
int maxBin,
int nthread) throws XGBoostError {
this(iter, null, missing, maxBin, nthread);
}

/**
* Create QuantileDMatrix from iterator based on the cuda array interface
*
* @param iter the XGBoost ColumnBatch batch to provide the corresponding cuda array
* interface
* @param refDMatrix The reference QuantileDMatrix that provides quantile information, needed
* when creating validation/test dataset with QuantileDMatrix. Supplying the
* training DMatrix as a reference means that the same quantisation
* applied to the training data is applied to the validation/test data
* @param missing the missing value
* @param maxBin the max bin
* @param nthread the parallelism
* @throws XGBoostError
*/
public QuantileDMatrix(
Iterator<ColumnBatch> iter,
QuantileDMatrix refDMatrix,
float missing,
int maxBin,
int nthread) throws XGBoostError {
super(0);
long[] out = new long[1];
String conf = getConfig(missing, maxBin, nthread);
long[] ref = null;
if (refDMatrix != null) {
ref = new long[1];
ref[0] = refDMatrix.getHandle();
}
XGBoostJNI.checkCall(XGBoostJNI.XGQuantileDMatrixCreateFromCallback(
iter, null, conf, out));
iter, ref, conf, out));
handle = out[0];
}

@@ -85,6 +113,7 @@ public void setGroup(int[] group) throws XGBoostError {

private String getConfig(float missing, int maxBin, int nthread) {
return String.format("{\"missing\":%f,\"max_bin\":%d,\"nthread\":%d}",
missing, maxBin, nthread);
missing, maxBin, nthread);
}

}
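
The new `refDMatrix` argument above ensures a validation or test `QuantileDMatrix` reuses the histogram cut points computed on the training data instead of recomputing its own. The following library-free Scala sketch illustrates that shared-binning idea; it is a conceptual model, not the native implementation:

```scala
// maxBin-quantile cut points computed from TRAINING data only.
def quantileCuts(train: Seq[Double], maxBin: Int): Seq[Double] = {
  val sorted = train.sorted
  (1 until maxBin)
    .map(i => sorted(((i.toDouble / maxBin) * (sorted.size - 1)).round.toInt))
    .distinct
}

// Bin a value with a fixed set of cuts: index = number of cuts <= value.
def bin(value: Double, cuts: Seq[Double]): Int = cuts.count(_ <= value)

val trainData = Seq(1.0, 2.0, 3.0, 4.0)
val cuts = quantileCuts(trainData, maxBin = 2)

// Validation data is binned with the cuts learned from training data --
// this is what passing the training QuantileDMatrix as refDMatrix achieves.
val validationBins = Seq(0.5, 2.5, 3.5).map(bin(_, cuts))
```

Binning `validationBins` with `cuts` derived only from `trainData` mirrors passing the training `QuantileDMatrix` as the reference.
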


@@ -0,0 +1 @@
ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin