Optimizations for CPU #4529
hcho3 merged 38 commits into dmlc:master from SmirnovEgorRu:optimizations_fork
Conversation

Thanks for the huge effort! A few questions in general.

@chenqin Hi!
Here I mostly mean vector instruction sets. Currently XGBoost uses SSE2 only, but AVX512 could be used as well, for example for gradient/hessian computations.
One problem here is that the compiler in some cases can't auto-vectorize the code; intrinsic functions help a lot. Also, if we want to support different instruction sets (for example AVX512 and SSE2) in one library file, we need separate code paths for them. It can be done, but it requires some changes to the build system. If the topic of gradient boosting optimizations is of interest, we could write an article for the web site (as was done for the GPU version). From my side, I can provide materials on this topic. @chenqin, @hcho3, what do you think?
trivialfis
left a comment
Thanks for the hard work, this is incredible. ;-) I'm going to work on this along with DMatrix soon. So I will approve as long as all tests pass (as in the current case). @hcho3

Hi @hcho3,

Reviewing now

@hcho3 I'm only approving the PR because the tests passed, and I think optimizing the histogram build will be beneficial to future improvements. But if we merge this, someone needs to do a big refactor of quantile_hist and clarify how all these pieces fit together. Personally, I prefer to optimize this by thinking about the whole structure. I think there are lots of optimization opportunities if we look at how the whole procedure runs: how many of the steps are necessary if we do some preprocessing of the data, like calculating the number of non-zero columns; implementing the sparse histogram you mentioned; grouping blocks before constructing quantile cuts; or implementing a prediction cache for the multi-class objective so that the CSR format may not be needed at all.

BTW, @RAMitchell, when I mentioned "column matrix", I meant this class: https://github.com/dmlc/xgboost/blob/master/src/common/column_matrix.h

Instruction-level optimization is cool, but to me quantile hist has a lot more room for improvement in other places, and vectorization might not be the critical point. I don't have the time to do it myself right now; these are just my opinions.
hcho3
left a comment
@SmirnovEgorRu Thanks a lot for preparing the pull request. The code is better organized and more comprehensible than #4433, and benchmark results are nice. I also like how now all operations are batched over multiple nodes. Overall, I'm inclined to approve this pull request.
@trivialfis I agree that more clarification of what each component does is in order.
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

@hcho3, thanks for the review. I have applied all the comments, and CI is green.

Looks like the R project website (https://cloud.r-project.org) is currently down, causing Windows R tests to fail. For now, we will ignore the failed tests.

@hcho3, thank you!

@SmirnovEgorRu It’s merged now. Thanks!

Thanks for including these changes! Is there an ETA as to when a release can be expected?

Are the results (model, predictions) reproducible for the same machine, same random state, and same input?

@arnocandel I tested this on multiple data sets; it was reproducible even on multicore systems.

Hi @hcho3, @trivialfis,

See #4680
It is the same as PR #4433, but I have merged all the commits into one and applied the comments from the review.
My next steps here: