update
wbo4958 committed Jun 11, 2024
1 parent f5e1d0e commit 1382f17
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions doc/tutorials/spark_estimator.rst
@@ -283,12 +283,14 @@ Advanced Usage
XGBoost needs to repartition the input dataset to the num_workers to ensure there will be
num_workers training tasks running at the same time. However, repartition is a costly operation.

-To avoid the need for repartitioning, users can set the Spark configuration parameters
-``spark.sql.files.maxPartitionNum`` and ``spark.sql.files.minPartitionNum`` to num_workers.
-This tells Spark to automatically partition the dataset into the desired number of partitions.
+In the scenario where the data is read from the source and fitted to XGBoost directly,
+without introducing a shuffle stage, users can avoid the need for repartitioning by setting
+the Spark configuration parameters ``spark.sql.files.maxPartitionNum`` and
+``spark.sql.files.minPartitionNum`` to num_workers. This tells Spark to automatically
+partition the dataset into the desired number of partitions.
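The read-then-fit path described above can be sketched as follows. This is a minimal sketch, assuming Spark 3.5+ (for ``spark.sql.files.maxPartitionNum``) and the ``xgboost.spark`` estimator API; the parquet path and ``label`` column name are placeholders.

```python
def partition_confs(num_workers):
    """Spark SQL confs that ask file reads to yield exactly num_workers partitions."""
    return {
        "spark.sql.files.maxPartitionNum": str(num_workers),
        "spark.sql.files.minPartitionNum": str(num_workers),
    }


def train(num_workers=4):
    # Imports kept local so the helper above stays usable without a Spark install.
    from pyspark.sql import SparkSession
    from xgboost.spark import SparkXGBClassifier

    builder = SparkSession.builder
    for key, value in partition_confs(num_workers).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Read and fit directly, with no shuffle in between, so the file-read
    # partitioning (num_workers partitions) carries through to training.
    df = spark.read.parquet("/path/to/train")  # placeholder path
    classifier = SparkXGBClassifier(num_workers=num_workers, label_col="label")
    return classifier.fit(df)
```

Because the read produces exactly num_workers partitions and nothing reshuffles them before ``fit``, XGBoost can skip its own repartition step.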

However, if the input dataset is skewed (i.e. the data is not evenly distributed), setting
-the partition number to num_workers may not be sufficient. In this case, users can set
+the partition number to num_workers may not be efficient. In this case, users can set
the ``force_repartition=true`` option to explicitly force XGBoost to repartition the dataset,
even if the partition number is already equal to num_workers. This ensures the data is evenly
distributed across the workers.
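For a skewed input, the option above can be passed straight to the estimator. A minimal sketch, assuming the ``xgboost.spark`` estimator API; dataset loading is omitted and the ``label`` column name is a placeholder.

```python
def make_estimator(num_workers):
    # Import kept local; requires xgboost with PySpark support installed.
    from xgboost.spark import SparkXGBClassifier

    return SparkXGBClassifier(
        num_workers=num_workers,
        # Repartition even when the input already has num_workers partitions,
        # so skewed partitions get rebalanced evenly across the workers.
        force_repartition=True,
        label_col="label",  # placeholder column name
    )
```

The trade-off is the extra shuffle cost of the forced repartition, paid in exchange for balanced per-worker training data.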
