update
wbo4958 committed Jun 11, 2024
1 parent f5e1d0e commit 1382f17
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions doc/tutorials/spark_estimator.rst
@@ -283,12 +283,14 @@ Advanced Usage
XGBoost needs to repartition the input dataset to the num_workers to ensure there will be
num_workers training tasks running at the same time. However, repartition is a costly operation.

-To avoid the need for repartitioning, users can set the Spark configuration parameters
-``spark.sql.files.maxPartitionNum`` and ``spark.sql.files.minPartitionNum`` to num_workers.
-This tells Spark to automatically partition the dataset into the desired number of partitions.
+In the scenario where the data is read from the source and fitted to XGBoost directly,
+without introducing a shuffle stage, users can avoid the need for repartitioning by setting
+the Spark configuration parameters ``spark.sql.files.maxPartitionNum`` and
+``spark.sql.files.minPartitionNum`` to num_workers. This tells Spark to automatically
+partition the dataset into the desired number of partitions.
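The read-then-fit path described above can be sketched as follows. This is a minimal sketch, assuming Spark 3.5+ (for ``spark.sql.files.maxPartitionNum``) and the ``xgboost.spark`` estimator API; the parquet path and ``label`` column name are placeholders.

```python
def partition_confs(num_workers):
    """Spark SQL confs that ask file reads to yield exactly num_workers partitions."""
    return {
        "spark.sql.files.maxPartitionNum": str(num_workers),
        "spark.sql.files.minPartitionNum": str(num_workers),
    }


def train(num_workers=4):
    # Imports kept local so the helper above stays usable without a Spark install.
    from pyspark.sql import SparkSession
    from xgboost.spark import SparkXGBClassifier

    builder = SparkSession.builder
    for key, value in partition_confs(num_workers).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Read and fit directly, with no shuffle in between, so the file-read
    # partitioning (num_workers partitions) carries through to training.
    df = spark.read.parquet("/path/to/train")  # placeholder path
    classifier = SparkXGBClassifier(num_workers=num_workers, label_col="label")
    return classifier.fit(df)
```

Because the read produces exactly num_workers partitions and nothing reshuffles them before ``fit``, XGBoost can skip its own repartition step.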

However, if the input dataset is skewed (i.e. the data is not evenly distributed), setting
-the partition number to num_workers may not be sufficient. In this case, users can set
+the partition number to num_workers may not be efficient. In this case, users can set
the ``force_repartition=true`` option to explicitly force XGBoost to repartition the dataset,
even if the partition number is already equal to num_workers. This ensures the data is evenly
distributed across the workers.
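For a skewed input, the option above can be passed straight to the estimator. A minimal sketch, assuming the ``xgboost.spark`` estimator API; dataset loading is omitted and the ``label`` column name is a placeholder.

```python
def make_estimator(num_workers):
    # Import kept local; requires xgboost with PySpark support installed.
    from xgboost.spark import SparkXGBClassifier

    return SparkXGBClassifier(
        num_workers=num_workers,
        # Repartition even when the input already has num_workers partitions,
        # so skewed partitions get rebalanced evenly across the workers.
        force_repartition=True,
        label_col="label",  # placeholder column name
    )
```

The trade-off is the extra shuffle cost of the forced repartition, paid in exchange for balanced per-worker training data.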
