/**
* @copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
* See file LICENSE for terms.
*/
## Current
## New Features and Enhancements
### CL/HIER
- Disable onesided alltoallv {PR #875}
### TL/CUDA
- Initialize remote CUDA scratch to NULL {PR #911}
### TL/UCP
- Enable hybrid alltoallv {PR #781}
- Avoid copy in knomial scatter {PR #771}
- Enable rank reordering for reduce_scatter, knomial allreduce, and ring allgather(v) {PR #819}
- Remove memcpy in last SRA step {PR #743}
- Fix sparse pack in hybrid a2av {PR #825}
- Fix recycle in hybrid a2av {PR #827}
- Reorder ranks for SRA {PR #834}
- Use ring allgather when reordering needed {PR #879}
- Use pipelining in SRA allreduce for CUDA {PR #873}
- Poll for onesided alltoall completion {PR #876}
- Add support for non-host buffers in bruck alltoall {PR #852}
- Added Neighbor Exchange Allgather {PR #822}
### TL/SHARP
- Enable bcast for any predefined dt {PR #774}
- Don't print team create error {PR #777}
- Check datasize supported {PR #776}
- Fix sharp context cleanup {PR #843}
### API
- Remove duplicate get_version_string {PR #933}
### TL/NCCL
- Make team init non-blocking {PR #772}
- Add CUDA managed to score {PR #793}
- Make ncclGroupEnd nb {PR #798}
- Lazy init nccl comm {PR #851}
### TL/MLX5
- Share ib_ctx and pd {PR #749}
- Rcache {PR #753}
- Device memory and topo init {PR #780}
- Adding mcast interface {PR #784}
- A2A part 1 -- coll init {PR #790}
- A2A part 2 -- full collective {PR #802}
- Revisit team and ctx init {PR #815}
- Fix context create hang {PR #887}
- Add librdmacm linkage {PR #910}
### CORE
- Fix score update when only score given {PR #779}
- Coverity fixes {PR #809}
- Additional Coverity fixes {PR #813}
- Fix error handling for ctx create epilog {PR #818}
- Skip zero size collectives {PR #787}
### DOCS
- Updating NEWS for v1.2 {PR #791}
- Updating NEWS for v1.3 {PR #937}
### BUILD and TEST
- Updated build system to enable UCC with ROCm 6.x {PR #906 and #917}
- Check op and dt compatibility {PR #773}
- Fix barrier test {PR #799}
- Propagate HIP_CXXFLAGS to gtest and mpi {PR #803}
## 1.2.0 (June 6th, 2023)
## New Features and Enhancements
### CL/HIER
- Fixed single proc on node issue in alltoall ([#658](https://github.com/openucx/ucc/pull/658))
- Implemented allreduce rab pipelined ([#608](https://github.com/openucx/ucc/pull/608))
- Added bcast 2step algorithm ([#620](https://github.com/openucx/ucc/pull/620))
- Fixed allreduce rab pipeline ([#759](https://github.com/openucx/ucc/pull/759))
### TL/CUDA
- Support for CUDA 12
- Fixed cache unmap issue ([#642](https://github.com/openucx/ucc/pull/642))
- Implemented reduce scatter linear ([#669](https://github.com/openucx/ucc/pull/669))
- Added algorithm selection based on topology ([#688](https://github.com/openucx/ucc/pull/688))
- Fixed linear algorithms ([#751](https://github.com/openucx/ucc/pull/751))
- Fixed pipelining in linear rs ([#770](https://github.com/openucx/ucc/pull/770))
### TL/UCP
- Added special service worker ([#560](https://github.com/openucx/ucc/pull/560))
- Added scatterv ([#663](https://github.com/openucx/ucc/pull/663))
- Added gatherv ([#664](https://github.com/openucx/ucc/pull/664))
- Fixed running with npolls 0 ([#695](https://github.com/openucx/ucc/pull/695))
- Added knomial allgather ([#729](https://github.com/openucx/ucc/pull/729))
- Fixed bug for triggered colls ([#757](https://github.com/openucx/ucc/pull/757))
- Added bruck alltoall ([#756](https://github.com/openucx/ucc/pull/756))
- Added SLOAV alltoallv ([#687](https://github.com/openucx/ucc/pull/687))
- Large message broadcast optimizations ([#738](https://github.com/openucx/ucc/pull/738))
- Rank reordering in ring allgather for better locality ([#698](https://github.com/openucx/ucc/pull/698))
### TL/SHARP
- Fixed memory type check in allreduce ([#662](https://github.com/openucx/ucc/pull/662))
- Added support for sharpv3 dt ([#661](https://github.com/openucx/ucc/pull/661))
- Fixed assert check ([#686](https://github.com/openucx/ucc/pull/686))
- Implemented SHARP OOB fixes ([#746](https://github.com/openucx/ucc/pull/746))
- Fixed local rank when NODE SBGP not enabled ([#760](https://github.com/openucx/ucc/pull/760))
- Prevented sharp team with team max ppn > 1 ([#761](https://github.com/openucx/ucc/pull/761))
### CORE
- Fixed memory type score update ([#650](https://github.com/openucx/ucc/pull/650))
- Fixed ucc parser build ([#666](https://github.com/openucx/ucc/pull/666))
- Implemented ucc_pipeline_params ([#675](https://github.com/openucx/ucc/pull/675))
- Changed log level of config_modify ([#667](https://github.com/openucx/ucc/pull/667))
- Fixed timeout handle for triggered post ([#679](https://github.com/openucx/ucc/pull/679))
### DOCS
- Added User Guide ([#720](https://github.com/openucx/ucc/pull/720))
## 1.1.0 (October 7th, 2022)
## Features
## API
- Added float128 and complex float32, float64, and float128 data types
- Added Active Sets based collectives to support dynamic groups as well as
point-to-point messaging
- Added ucc_team_get_attr interface
## Core
- Config file support
- Fixed component search
## CL
- Added split rail allreduce collective implementation
- Enabled hierarchical alltoallv and barrier
- Fixed cleanup bugs
## TL
- Added SELF TL supporting team size one
### UCP
- Added service broadcast
- Added reduce_scatterv ring algorithm
- Added k-nomial based gather collective implementation
- Added one-sided get based algorithms
### SHARP
- Fixed SHARP OOB
- Added SHARP broadcast
### GPU Collectives (CUDA, NCCL TL and RCCL TL)
- Added support for CUDA TL (intranode collectives for NVIDIA GPUs)
- Added multiring allgatherv, alltoall, reduce-scatter, and reduce-scatterv
  in CUDA TL
- Added topo based ring construction in CUDA TL to maximize bandwidth
- Added NCCL gather, scatter, and their vector variants
- Enabled using multiple streams for collectives
- Added support for RCCL gather (v), scatter (v), broadcast, allgather (v),
  barrier, alltoall (v), and allreduce collectives
- Added ROCm memory component
- Adapted all GPU collectives to executor design
### Tests
- Added tests for triggered collectives in perftests
- Fixed bugs in multi-threading tests
### Utils
- Added CPU model and vendor detection
- Several bug fixes in all components
## 1.0.0 (April 19th, 2022)
### Features
#### API
- Added Avg reduce operation
- Added nonblocking team destroy option
- Added user-defined datatype definitions
- Added Bfloat16 type
- Clarified semantics of core abstractions including teams and context
- Added timeout option
#### Core
- Added coll scoring and selection support
- Added support for Triggered collectives
- Added support for timeouts in collectives
- Added support for team create without ep in post
- Added support for multithreaded context progress
- Added support for nonblocking team destroy
#### CL
- Added support for hierarchical collectives
- Added support for hierarchical allreduce collective operation
- Added support for collectives based on one-sided communication routines
#### TL
- Added SHARP TL
##### UCP
- Added Bcast SAG algorithm for large messages
- Added Knomial based reduce algorithm
- Made allgather and alltoall agree with the API
- Added SRA knomial allreduce algorithm
- Added pairwise alltoall and alltoallv algorithms
- Added allgather and allgatherv ring algorithms
- Added support for collective operations based on one-sided semantics
- Added support for alltoall with one-sided transfer semantics
- Bug fixes
##### SHARP
- Added support for switch based hardware collectives (SHARP)
##### NCCL
- Added support for NCCL allreduce, alltoall, alltoallv, barrier, reduce,
  reduce-scatter, bcast, allgather, and allgatherv
#### Tests
- Updated tests to test the newly added algorithms and operations
## 0.1.0 (TBD)
### Features
#### API
- UCC API to support library, contexts, teams, collective operations, execution
engine, memory types, and triggered operations
#### Core
- Added implementation for UCC abstractions - library, context, team,
collective operations, execution engine, memory types, and triggered
operations
- Added support for CUDA and CPU memory types
- Added support for configuring UCC library and contexts
#### CL
- Added support for collectives where the source and destination buffers are
  in either CPU or device (GPU) memory
- Added support for UCC_THREAD_MULTIPLE
- Added support for CUDA stream-based collectives
#### TL
- Added support for send/receive based collectives using UCX/UCP as a transport
layer
- Added support in the UCP TL for basic collective types including barrier,
  alltoall, alltoallv, broadcast, allgather, allgatherv, and allreduce
- Added support using NCCL as a transport layer
- Added support for collective types including alltoall, alltoallv, allgather,
  allgatherv, allreduce, and broadcast
#### Tests
- Added support for unit testing (gtest) infrastructure
- Added support for MPI tests