Trouble Scaling XGBoost beyond in-memory training on databricks #10853
Comments
Thank you for sharing your experience. Based on your description, you are trying to find some "best practices" for distributed training. I will try to do some more experiments and come up with something more comprehensive like a blog post. But for now, let's start with some checks:
Maybe start with observing the CPU/memory usage across workers; there's a "Workers" tab in the dask dashboard. I can't provide a definite answer on why a specific run is slow without running it myself, but in general it's a data balance issue. On GPU, sometimes I just repartition the data and the problem can be mitigated. The XGBoost train function in the dashboard task view is actually a lambda function, one per worker; do you see dask waiting for them to finish? I'm also working on improving the scaling at the moment and will add some logging facilities to XGBoost to help debug.
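(For reference, a minimal sketch of the kind of repartitioning mentioned above; the path and partition count below are placeholders, not values from this thread:)

```python
import dask.dataframe as dd

# Hypothetical path; read the training data as a dask dataframe.
ddf = dd.read_parquet("s3://bucketname/train.pq")

# Aim for similarly sized partitions, a few per worker core, so that no
# worker is starved of data while others are overloaded.
ddf = ddf.repartition(npartitions=256)

# Persist, then check the "Workers" tab of the dask dashboard to confirm
# that CPU and memory usage are roughly even across workers.
ddf = ddf.persist()
```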
This is a combination of dask data balancing issues, dask memory usage and data spilling issues (the read/partition), XGBoost training performance issues, and optimization in Databricks. Let's get some of the easy issues resolved first.
I would suggest that the first thing that needs to be done is to ensure the data is well-balanced based on these hints from the screenshots:
Excellent progress! No, it's not required to have the same number of partitions. Preferably both of them have partitions on all workers (no worker is starved for either dataset). XGBoost takes what it's given; it doesn't move data or anything. Internally, it just iterates over partitions for each dataset independently. As long as partitions within each dataset are aligned (they come from the same dataframe, for instance), then it's fine.
So I have managed to scale to 2 billion rows, but this does seem to be the point where it started to struggle: Dask started to complain about the size of the graph. Any ideas how to mitigate this? [edit: one idea I had to mitigate this was to write the model to disk and then load it back in, as that might break the graph. And it did, in the sense that the warning now said my graph was only 13 MB compared to 54 MB previously (and it kicked in after more loops of the training procedure), but it doesn't seem to have solved the problem, in that I'm still getting a warning about large dask graphs.] Second edit: interestingly, increasing the max_depth seems to significantly increase the size of the graph in the warning.
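(A minimal sketch of that save-and-reload round-trip, assuming `output` is the dict returned by `xgb.dask.train` and using an arbitrary file name:)

```python
import xgboost as xgb

# `output` is the result of a previous xgb.dask.train(...) call.
booster = output["booster"]

# Round-trip the booster through disk so the next training call starts from
# a freshly loaded model rather than one still referenced by the old graph.
booster.save_model("checkpoint.json")

fresh_booster = xgb.Booster()
fresh_booster.load_model("checkpoint.json")

# Pass `fresh_booster` as xgb_model= in the next training call.
```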
My first guess is the
Code is unchanged from original post, other than that I've created a loop to increase the learning rate:
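(The snippet itself isn't reproduced here; a rough sketch of that kind of loop, with placeholder parameter values rather than the exact ones from these runs, and with `client`, `dtrain`, and `dvalid` assumed from the original setup:)

```python
import xgboost as xgb

booster = None  # carries the trees built so far between stages

# Placeholder schedule: bump the learning rate every 100 boosting rounds.
for eta in [0.1, 0.2, 0.4, 0.8]:
    params = {
        "objective": "binary:logistic",
        "tree_method": "hist",
        "max_depth": 12,
        "eta": eta,
    }
    output = xgb.dask.train(
        client,                  # existing dask client
        params,
        dtrain,                  # DaskDMatrix built from the training dataframe
        num_boost_round=100,
        evals=[(dvalid, "eval")],
        early_stopping_rounds=20,
        xgb_model=booster,       # resume from the previous stage's model
    )
    booster = output["booster"]
```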
(I added a few more bells and whistles to make sure that if it early-stops for one of the learning rates before 0.8, the next boosting round starts from the optimal model rather than the final one, but I don't think that should affect anything.) Broadly speaking this won't train for more than 400 rounds for the data I have, and I'm exploring max_depth values from ~8-16. So 400 trees of depth 16 are not trivial in terms of memory consumption, but still a fraction of the data volume being handled on each worker. I'm happy to accept this as harmless if you don't think it's a problem. I haven't had any more problems with training failing or being erratic.
Thank you for sharing! The code looks fine.
Dask doesn't usually send large objects across workers, since that can hurt performance due to network constraints. But gathering a single booster to the client process should be fine. Feel free to close the issue if you have no further questions. ;-)
I'm wondering if we can have a doc for running xgboost with spark/dask on different cloud environments? @trivialfis
Small update on this (not sure it's worth re-opening the issue over): I am struggling when the tree max_depths are increased. The aforementioned warning about graph sizes is exacerbated by a higher max_depth, which might be related (potentially the trees themselves taking up more room in memory?).

The problem is that I would anticipate that the way you get more performance out of scaling training to bigger datasets is precisely that optimal performance happens at higher max_depths, i.e. with more data you can fit more complex models before you overfit. (I'm also seeing this empirically: the deepest max_depth I've managed to train successfully has the best performance on a test set.)

I tried to reconfigure my cluster to have roughly the same amount of overall memory (an amount which comfortably allowed me to train xgboost with a max_depth of, say, 8 without being anywhere near the memory limit according to the dask dashboard, roughly 50% utilisation across the board), but with fewer, larger workers (in itself not cost-ideal; this increased my DBU/hr in databricks by about 20%), and still ran into the error.

Furthermore, it happens deep into training, often after ~3 hours, so it's not really feasible from a time or cost perspective to experiment with lots of little hyperparameter tweaks (e.g. introduce some gamma regularisation so that not all trees bottom out, perhaps up the learning rate a little so we have fewer trees, start training with high depths and then bring down the max_depth and up the learning rate for later training rounds, etc.). I've also not been able to get verbose training output to work in dask-databricks (annoyingly, it works in vanilla dask, but I found vanilla dask on databricks to be extremely unreliable), and verbose output would help a little in understanding quickly whether a set of hyperparameters is going to give acceptable performance from an ML-metrics perspective and is thus worth pursuing.
XGBoost pyspark in Databricks is quite stable; maybe you can try xgboost pyspark. It's also quite fast when using GPUs.
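(For reference, a minimal sketch of the pyspark estimator route; the dataframe, column names, and worker count are placeholders:)

```python
from xgboost.spark import SparkXGBClassifier

# `train_df` is assumed to be a Spark DataFrame with a vector "features"
# column and a numeric "label" column.
clf = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=8,   # number of Spark tasks used for distributed training
    max_depth=12,
)
model = clf.fit(train_df)
predictions = model.transform(train_df)
```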
For the graph size warning, I suggest looking for issues like whether the partition size is too small; that generates a larger operation graph and slower performance. On the other hand, if it's caused by the XGBoost booster model, which happens only at the end of the training, then please ignore it. Thus far, the only critical thing for XGBoost to achieve good distributed training performance is data balancing. It can be mad slow if the data size is skewed; the latency caused by waiting accumulates, instead of the run simply being bottlenecked by the slowest worker. The training function and the estimators accept a callback function, so you can define your own logger using callbacks. Please find an example in the
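(A minimal sketch of such a logging callback, using the public `TrainingCallback` interface; where the messages end up is up to you:)

```python
import xgboost as xgb

class EvalLogger(xgb.callback.TrainingCallback):
    """Log the latest value of each evaluation metric after every round."""

    def after_iteration(self, model, epoch, evals_log):
        for data_name, metrics in evals_log.items():
            for metric_name, history in metrics.items():
                print(f"[{epoch}] {data_name}-{metric_name}: {history[-1]}")
        return False  # returning True would stop training early

# Pass an instance via the `callbacks` argument of the train function or
# estimator, e.g. callbacks=[EvalLogger()].
```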
@wbo4958: I believe you're talking about the

@trivialfis: I can take another look at the partitioning strategy, but the issue here is that it trains just fine on the data I've prepared, on the cluster I set up (never getting north of ~50% memory utilisation on any worker), until I start pushing up the max_depths. So this suggests something to do with tree size, but I also find it rather dubious that the issue would be the ensemble of trees itself taking up too much memory, because I'm being relatively aggressive about upping the learning rate, so we're never getting to more than, say, 500 trees, and 500 trees of max_depth 14... back of the envelope says that's going to be a small fraction of the size of the full training data (I'm not sure whether the full ensemble is being copied to all workers, but even then, I'd be surprised if it took up more than a single-digit % of each worker's RAM).

Thanks for the tip on logging, I'll have a look.
It makes sense. Dask prefers small data transfer, which is unrelated to the total size of the data. It raises a warning if it needs to send large objects or graphs across the network.
It's not about taking up too much memory; you can push the memory usage to its limit without seeing the dask warning. It's that dask considers sending large objects across the network, or using complex graphs, inefficient, and it warns users to look for potential causes and optimizations. It's a performance warning. Having said that, I will ping @rjzamora for better insight; I'm not familiar with the dynamics inside dask.
I'm not sure if this is helpful at all. In practice, the large-graph warning usually means that you are constructing a graph on the client that contains data it probably shouldn't contain. For example, if each of your tasks will be operating on a distinct partition of Array data, you wouldn't want to pass that data to the workers through the graph. Rather, you would want your tasks to include the necessary logic to read that data from disk when they execute on the worker. In the case of XGB, it does seem possible that dask is just complaining about the size of the booster when you send it to the cluster. @trivialfis is correct that the model needs to get to the workers somehow. As long as you aren't explicitly passing a copy of the booster in every training task within the graph, then there is probably not a "better" option.
Not sure what is meant by "explicitly passing a copy of the booster in every training task within the graph"? The only perhaps non-standard thing I'm doing is dynamically updating the learning rate using code very similar to the above. Is it possible that this is causing dask to copy not just the latest version of the booster (containing all the trees) but also previous versions? When you say dask might be complaining about the size of the booster (I believe it is, btw, because I can see that if I up the max depth, the graph size in the warning message increases), is there some reason why a large graph might cause problems long before I physically run out of memory? I was using a cluster of
I don't think you are doing this. My intention was more so to explain "who" this warning is intended for. My general impression is that the large-graph warning itself shouldn't be a concern to you. With that said, I am concerned if the size of the graph grows with every iteration and you plan on doing this for many iterations.
I honestly don't know off hand - Are you planning to run many iterations?
I'm definitely not 100% sure about this, but my impression is that the graph is large enough to meet Dask's simple heuristic for warning the user, but not actually large enough to cause problems. A large graph can (temporarily) slow down the scheduler if there is a huge number of tasks or it is overwhelmed with serialization/deserialization.
My understanding is that dask simply considers the graph size abnormal, which is unrelated to the total amount of memory available in the system.
By iterations, do you mean separate training runs with new learning rates? If so, not really. I think I ran 100 boosting rounds at a learning rate of 0.1, another 100 at a learning rate of 0.2, another 100 at a learning rate of 0.4, and then finally set the learning rate to 0.85 and let it train until it early-stopped. My best estimate is that the whole ensemble would come in at south of 1000 trees, and as the max_depth increases, the optimal stopping point will come forward and we'll end up with fewer trees, but presumably more terminal nodes overall/a booster that consumes more memory.

I could of course just not do this, but in practice I've found that you need a decently low learning rate to get good performance, yet low learning rates lead to extremely slow convergence, and you need to up the learning rate after a while if you don't want it to train for thousands of rounds. If we think the iterative training is likely to be the issue, I suppose one mitigation would be to just train for 200 boosting rounds with a learning rate of 0.1, and then set the learning rate to maybe 0.6 and train until we early-stop. This would likely end up with more trees in the ensemble overall, and perhaps marginally worse generalisation, but nothing too catastrophic. If even that is likely to cause problems and I need to train in one shot until we early-stop, I think it could get quite tricky to choose a learning rate that would generalise well but would not take unacceptably long to train. I guess an option of desperation (the thing I always tell people not to do) would be to just estimate what a good number of boosting rounds is and do away with early stopping (I could then repurpose the eval set for more training data).
The booster is transferred once per training session. If you run multiple training sessions, it gets transferred multiple times. My suggestion is still to just ignore the graph size warning: there's no effective solution to it yet and it doesn't affect your model training.
Yeah, sorry, my bad. I looked back through the messages and I didn't make this clear when I said "I'm struggling". The issue is not just the graph warning: training is actually failing when I increase the max_depth. It's also taking a long time before it fails (>1 hour), so it's quite difficult to iterate. With the same train/eval set, on a given cluster configuration, I'm able to train with max_depth<=10, but it fails for max_depth=12. Annoyingly I didn't make a note of the traceback at the time, but I convinced myself it seemed memory-related. I can burn some more compute hours and reproduce it if useful.
Seeing as it took a few hours before it failed, I wasn't watching the dask dashboard at the time of failure, but ~20 minutes into training all workers were at circa 50% memory usage, which is why I was saying previously that it seemed incredibly unlikely that a large learner size was pushing it over the edge. I'll re-run and post a detailed error message.
Hmm, thank you for sharing. I see a severe imbalance there, with a few workers having significantly less data than others.
Yeah, let me try and get a screengrab of what it looks like early on in training when things are healthy.
We will try to work on automated data balancing. Without it, it isn't easy to control the memory usage.
@gdubs89 Could you please share the following information:
Please share the versions of the above that are closest to the configuration that causes the issue. I will try to reproduce and profile next week.
An extra datapoint to add to this: I did manage to train a model with max_depth=14 when I did it in one training call with early stopping, rather than sequentially training and turning up the learning rate. Unfortunately, even with a relatively high learning rate of 0.3, this took ~8 hours until it early-stopped (as opposed to more like 3-4 hours at max_depth=12 with iteratively increasing learning rates). So while training in steps is causing some issues, I would regard it as more than just a nice-to-have (especially if I wanted to increase the training data size/cluster by an order of magnitude).
Thank you for the detailed information. I'm working on the dask interface now. This PR should help with the issue of retrieving evaluation logs from databricks: #10942. Will look into memory usage.
I'm currently training a binary classifier using a tiny sample of a dataset. The dataset is approx 50bn rows per day, and we persist the data for ~60 days, so in theory I could be training on up to ~3 trillion rows of data. Of course that's probably a little excessive, but currently I'm training on a 0.1% sample of a day's data, i.e. approx 50 million rows.
I do this by doing
```python
df = spark.read.parquet('s3://bucketname/data.pq').sample(fraction=0.001).toPandas()
```
I can play with this fraction a little bit; I've pushed it as far as 100 million rows and might be able to push it a bit further, but fundamentally the approach of pulling everything into a massive driver node and training in memory is not scalable, and it's never going to allow me to train on 1 billion rows, or 10 billion rows, or more.
To that end, I've been looking for the canonical way to scale xgboost, i.e. do distributed training on databricks. I'm open to doing GPU training, but my strong suspicion is that I'm far more memory-limited than compute-limited (when training on 50 million rows on a single EC2 machine, once the data has been read in and converted to dmatrices, the actual training is a breeze and takes 10-15 minutes), so my instinct is to try distributed CPU training.
Also, I'm using the following bells & whistles which I'll need any distributed training to support
For the sake of benchmarking, I've prepared the following 4 datasets:
(in each case there's a train set, the sizes above give the size of the train set, and then there's a corresponding eval set approx 20% of the size)
I first tried to do this using xgboost-dask. This is the solution I landed on:
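(The original snippet isn't reproduced here; a minimal sketch of a comparable xgboost-dask setup, with placeholder paths, a hypothetical "label" column, and illustrative parameters:)

```python
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client()  # placeholder: a local cluster on the driver

train = dd.read_parquet("s3://bucketname/train.pq")
valid = dd.read_parquet("s3://bucketname/eval.pq")

dtrain = xgb.dask.DaskDMatrix(client, train.drop(columns="label"), train["label"])
dvalid = xgb.dask.DaskDMatrix(client, valid.drop(columns="label"), valid["label"])

output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist", "max_depth": 8},
    dtrain,
    num_boost_round=500,
    evals=[(dvalid, "eval")],
    early_stopping_rounds=20,
)
booster = output["booster"]
```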
This "worked" when I used dataset 3 described above, but failed when I used dataset 2. I.e. 50 million rows and about ~20 columns worked but 50 million rows and ~50 columns was too much. I was also a little suspicious that dask wasn't utilising the worker nodes. I can't connect to the dask dashboard, I think it's something I'd need to talk to our databricks admin about (I tried to SSH into the driver but my connection timed out, to my best understanding, we'd need to unblock some port), but the databricks cluster dashboard only ever showed the driver node being engaged (in retrospect, it could also possibly have been just one worker being engaged, if this is deemed relevant I can re-run and check). Note that when I do
print(client)
, it's telling me I have 128 threads (8*16, i.e. the number of worker cores) and ~500gb of RAM, but they don't seem to be being engaged by the training process.If only one machine is being engaged, each of these machines has significantly less memory than the machine I used to train on the 50 million row dataset in memory, so it's not entirely surprising that this fell over at the point where it did. I tested this by firing up a "wonky" cluster, comprised of two
rd5.16xlarge
workers and a driver of the same type. This worked, but again only one machine was being engaged, so we've not gained anything over just training on a single large machine.So my suspicion here is that raw dask doesn't play very well with databricks/spark, so instead I decided to try
dask-databricks
. So basically in the above code, replacewith
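(For reference, a minimal sketch of that swap, assuming the `dask-databricks` package and its cluster init script are installed:)

```python
import dask_databricks

# Connect to the Dask scheduler that dask-databricks starts on the
# Databricks driver; the rest of the training code stays unchanged.
client = dask_databricks.get_client()
```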
Same deal: when I `print(client)`, I see the number of threads/amount of memory I expect. However, when running on a cluster of 8 `i3.4xlarge` workers, I have the same scaling issues as previously: I can run on the 50 million row dataset with ~20 columns, but when I try the set with ~50 columns, it falls over.

I'm now running a cluster of 12 `r5d.8xlarge` machines (I should have used `r5d.16xlarge` like I did before, for reproducibility), and the training run for the 50 million row dataset with 50 columns hasn't technically crashed, but it's been running for 50 minutes now (which, given how big this cluster is compared to the single machine I can train this on in memory in ~10-15 minutes, is bad). When using dask-databricks, I can access the dask dashboard, and while I'm no expert on how to read it, it looks like all workers are being used, but only about 1.5 of 32 cores per worker. This is in line with what the databricks cluster's dashboard is telling me. I also get a warning which I don't fully know what to do with.
The cluster I'm currently using has at least 3x more RAM and 4x more cores than the largest single EC2 machine, the one I've been using to train on 50 million rows / 50 columns (and that I've shown can be pushed a little further, at least to 100 million rows, maybe to 150m, probably not as far as 200m). I would also have hoped that when doing distributed training in dask, you'd get much more memory-efficient handling of the data than when pulling everything into pandas. And yet I'm not even getting close to replicating the performance I get with a single EC2 instance, which does not bode well for scaling up to 500 million rows and beyond.
Help, either with this or with other ways to scale XGBoost beyond in-memory training, would be greatly appreciated. I was hoping there would be an accepted way to do distributed xgboost training, but alas, there doesn't seem to be an established consensus on how to do it.
Other notes: