Adapt libfabric dataplane of SST to Cray CXI provider #3672

Merged: 7 commits, Dec 6, 2023

franzpoeschel
Contributor

@franzpoeschel franzpoeschel commented Jun 20, 2023

This PR is a first step towards trying to make use of libfabric+CXI on Frontier for SST streaming.

It successfully connects to the provider in init_fabric() in my tests, and the hello_sst example finishes without error, but for some reason without loading any data. I will provide a detailed run log later (I lost my earlier logs somehow and a new job is currently enqueued), but essentially fi_cq_sread returns one finished task from the queue that reports a successful remote read of zero bytes.

Changing the data size in the helloSst example produces different errors depending on the chosen size: permission violations or, if I remember correctly, error code 90 (message too long).

As you can see in the diff, the configuration of libfabric is somewhat different from the one used in SST so far.
The most important differences are:

  • Use of FI_MR_ENDPOINT is mandatory. I have adapted all calls to fi_mr_reg(), but maybe this flag has more implications.
  • The old implementation uses FI_MR_BASIC, which implies FI_MR_LOCAL. I initially did not set this flag and was not sure whether it is necessary. Update: setting FI_MR_LOCAL does not make a difference; I pushed a commit containing it.
  • Use of FI_PROGRESS_MANUAL is mandatory (contradicting the libfabric documentation that says that AUTO should always be supported). I don't really know the implications of this. From the libfabric documentation of this parameter that I found, this reads more like a performance implication, turning fi_cq_sread() into a blocking/synchronous operation (it is called without a timeout in ADIOS2).

Note that I did not test the latest two commits (cleanup and documentation) yet, but their changes are not major.

Most things in this PR are currently hardcoded specifically for the requirements of the CXI provider.

Once I get a job on the system again, I will try to be a bit more specific in some things and answer some questions above, and provide a logfile.

cc @pnorbert @eisenhauer

I used this submission script in my tests:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH -t 1:00
#SBATCH -A <project id>
#SBATCH -o out.txt
#SBATCH -e err.txt
#SBATCH --network=single_node_vni,job_vni

export FABRIC_IFACE=cxi2
export SstVerbose=5
echo Allocating jobs
srun --network=single_node_vni,job_vni -N 1 --ntasks=1 ../build/bin/hello_sstWriter > sst1.txt 2>&1 &
echo Allocated job 1
srun --network=single_node_vni,job_vni -N 1 --ntasks=1 ../build/bin/hello_sstReader > sst2.txt 2>&1 &
echo Allocated job 2
wait
echo Completed

Current diff: release_29...franzpoeschel:ADIOS2:libfabric-cray

@eisenhauer
Member

Thanks Franz, happy to help look at this too. Might not have time until later this week, but I'll be watching in case you get more info before then.

@franzpoeschel
Contributor Author

Thanks Greg. I am also willing to put time into this, but this concerns several systems that I don't really know about, so I'm rather limited in what I can do, and I'm currently somewhat stuck on this issue of "loading zero data".

The output of the writer is the following:

srun: warning: can't run 1 processes on 2 nodes, setting nnodes to 1
Writer 0 (0x3078d0): Sst set to use sockets as a Control Transport
Provider: 'cxi', domain: 'cxi0'
Provider: 'cxi', domain: 'cxi1'
Provider: 'cxi', domain: 'cxi2'
DP Writer 0 (0x3078d0): RDMA Dataplane found the requested interface cxi2, provider type cxi.
DP Writer 0 (0x3078d0): RDMA Dataplane evaluating viability, returning priority 100
DP Writer 0 (0x3078d0): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x3078d0): Considering DataPlane "rdma" for possible use, priority is 100
DP Writer 0 (0x3078d0): Selecting DataPlane "rdma", priority 100 for use
DP Writer 0 (0x3078d0): ignoring fabric cxi because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to cxi0, but it may not be stable or performant.
DP Writer 0 (0x3078d0): ignoring fabric cxi because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to cxi1, but it may not be stable or performant.
DP Writer 0 (0x3078d0): using interface set by FABRIC_IFACE.
DP Writer 0 (0x3078d0): Fabric parameters to use at fabric initialization: fi_info:
    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_SHARED_AV ]
    mode: [  ]
    addr_format: FI_ADDR_CXI_COMPAT
    src_addrlen: 4
    dest_addrlen: 0
    src_addr: fi_addr_cxi://0x002205ff
    dest_addr: (null)
    handle: (nil)
    fi_tx_attr:
        caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_SEND, FI_TRIGGER, FI_SHARED_AV ]
        mode: [  ]
        op_flags: [  ]
        msg_order: [  ]
        comp_order: [ FI_ORDER_NONE ]
        inject_size: 192
        size: 256
        iov_limit: 1
        rma_iov_limit: 1
        tclass: 0x0
    fi_rx_attr:
        caps: [ FI_MSG, FI_RMA, FI_RECV, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_SHARED_AV ]
        mode: [  ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_WAW, FI_ORDER_SAS, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_NONE ]
        total_buffered_recv: 0
        size: 1024
        iov_limit: 1
    fi_ep_attr:
        type: FI_EP_RDM
        protocol: FI_PROTO_CXI_COMPAT
        protocol_version: 1
        max_msg_size: 1073741824
        msg_prefix_size: 0
        max_order_raw_size: 18446744073709551615
        max_order_war_size: 18446744073709551615
        max_order_waw_size: 18446744073709551615
        mem_tag_format: 0x0000aaaaaaaaaaaa
        tx_ctx_cnt: 0
        rx_ctx_cnt: 0
        auth_key_size: 8
    fi_domain_attr:
        domain: 0x0
        name: cxi2
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_MANUAL
        data_progress: FI_PROGRESS_MANUAL
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_VIRT_ADDR, FI_MR_ALLOCATED, FI_MR_PROV_KEY, FI_MR_ENDPOINT ]
        mr_key_size: 4
        cq_data_size: 8
        cq_cnt: 32
        ep_cnt: 128
        tx_ctx_cnt: 256
        rx_ctx_cnt: 256
        max_ep_tx_ctx: 256
        max_ep_rx_ctx: 256
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 0
        cntr_cnt: 16
        mr_iov_limit: 1
        caps: [  ]
        mode: [  ]
        auth_key_size: 8
        max_err_data: 0
        mr_cnt: 100
        tclass: 0x0
    fi_fabric_attr:
        name: cxi
        prov_name: cxi
        prov_version: 0.0
        api_version: 1.11
    nic:
        fi_device_attr:
            name: cxi2
            device_id: 0x501
            device_version: 2
            vendor_id: 0x17db
            driver: cxi_core
            firmware: (null)
        fi_bus_attr:
            bus_type: FI_BUS_PCI
            fi_pci_attr:
                domain_id: 0
                bus_id: 213
                device_id: 0
                function_id: 0
        fi_link_attr:
            address: 0x1102
            mtu: 2112
            speed: 200000000000
            state: FI_LINK_UP
            network_type: HPC Ethernet

Using second vni.
Writer found CXI auth key: 10059 5
DP Writer 0 (0x3078d0): Fabric Parameters:
fi_info:
    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_SHARED_AV ]
    mode: [  ]
    addr_format: FI_ADDR_CXI_COMPAT
    src_addrlen: 4
    dest_addrlen: 0
    src_addr: fi_addr_cxi://0x002205ff
    dest_addr: (null)
    handle: (nil)
    fi_tx_attr:
        caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_SEND, FI_TRIGGER, FI_SHARED_AV ]
        mode: [  ]
        op_flags: [  ]
        msg_order: [  ]
        comp_order: [ FI_ORDER_NONE ]
        inject_size: 192
        size: 256
        iov_limit: 1
        rma_iov_limit: 1
        tclass: 0x0
    fi_rx_attr:
        caps: [ FI_MSG, FI_RMA, FI_RECV, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_SHARED_AV ]
        mode: [  ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_WAW, FI_ORDER_SAS, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_NONE ]
        total_buffered_recv: 0
        size: 1024
        iov_limit: 1
    fi_ep_attr:
        type: FI_EP_RDM
        protocol: FI_PROTO_CXI_COMPAT
        protocol_version: 1
        max_msg_size: 1073741824
        msg_prefix_size: 0
        max_order_raw_size: 18446744073709551615
        max_order_war_size: 18446744073709551615
        max_order_waw_size: 18446744073709551615
        mem_tag_format: 0x0000aaaaaaaaaaaa
        tx_ctx_cnt: 0
        rx_ctx_cnt: 0
        auth_key_size: 8
    fi_domain_attr:
        domain: 0x0
        name: cxi2
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_MANUAL
        data_progress: FI_PROGRESS_MANUAL
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_VIRT_ADDR, FI_MR_ALLOCATED, FI_MR_PROV_KEY, FI_MR_ENDPOINT ]
        mr_key_size: 4
        cq_data_size: 8
        cq_cnt: 32
        ep_cnt: 128
        tx_ctx_cnt: 256
        rx_ctx_cnt: 256
        max_ep_tx_ctx: 256
        max_ep_rx_ctx: 256
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 0
        cntr_cnt: 16
        mr_iov_limit: 1
        caps: [  ]
        mode: [  ]
        auth_key_size: 8
        max_err_data: 0
        mr_cnt: 100
        tclass: 0x0
    fi_fabric_attr:
        name: cxi
        prov_name: cxi
        prov_version: 0.0
        api_version: 1.11
    nic:
        fi_device_attr:
            name: cxi2
            device_id: 0x501
            device_version: 2
            vendor_id: 0x17db
            driver: cxi_core
            firmware: (null)
        fi_bus_attr:
            bus_type: FI_BUS_PCI
            fi_pci_attr:
                domain_id: 0
                bus_id: 213
                device_id: 0
                function_id: 0
        fi_link_attr:
            address: 0x1102
            mtu: 2112
            speed: 200000000000
            state: FI_LINK_UP
            network_type: HPC Ethernet

Writer 0 (0x3078d0): Opening Stream "helloSst"
Writer 0 (0x3078d0): Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   StepDistributionMode=StepsAllToAll
Param -   DataTransport=rdma
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP5
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer 0 (0x3078d0): Stream "helloSst" waiting for 1 readers
Writer 0 (0x3078d0): Beginning writer-side reader open protocol
DP Writer 0 (0x3078d0): Received contact info for RS_Stream 0x3adaa0, WSR Rank 0
Writer 0 (0x3078d0): Setting SpeculativePreload ON for new reader
Writer 0 (0x3078d0): My oldest timestep was 0, global oldest timestep was 0
Writer 0 (0x3078d0): Finish writer-side reader open protocol for reader 0x3f9a70, reader ready response pending
Writer 0 (0x3078d0): (PID abb4, TID 7fffd3e6af80) Waiting for Reader ready on WSR 0x3f9a70.
Writer 0 (0x3078d0): Reader Activate message received for Stream 0x3f9a70.  Setting state to Established.
Writer 0 (0x3078d0): Parent stream reader count is now 1.
Writer 0 (0x3078d0): Reader ready on WSR 0x3f9a70, Stream established, Starting 0 LastProvided 0.
Writer 0 (0x3078d0): Finish opening Stream "helloSst"
DP Writer 0 (0x3078d0): Providing timestep data with block 0x446550 and access key -50386303
Writer 0 (0x3078d0): Reader 0 status Established has last released 4294967295, last sent 0
Writer 0 (0x3078d0): QueueMaintenance, smallest last released = -1, count = 1
Writer 0 (0x3078d0): Removing dead entries
Writer 0 (0x3078d0): QueueMaintenance complete
Writer 0 (0x3078d0): Sending TimestepMetadata for timestep 0 (ref count 1), one to each reader
Writer 0 (0x3078d0): Sent timestep 0 to reader cohort 0
Writer 0 (0x3078d0): ADDING timestep 0 to sent list for reader cohort 0, READER 0x3f9a70, reference count is now 2
Writer 0 (0x3078d0): PRELOADMODE for timestep 0 non-default for reader , active at timestep 0, mode 1
Writer 0 (0x3078d0): Sending a message to reader 0 (0x307490)
Writer 0 (0x3078d0): SubRef : Writer-side Timestep 0 now has reference count 1, expired 0, precious 0
Writer 0 (0x3078d0): Reader 0 status Established has last released 4294967295, last sent 0
Writer 0 (0x3078d0): QueueMaintenance, smallest last released = -1, count = 1
Writer 0 (0x3078d0): Removing dead entries
Writer 0 (0x3078d0): QueueMaintenance complete
Writer 0 (0x3078d0): SstWriterClose, Sending Close at Timestep 0, one to each reader
Writer 0 (0x3078d0): Working on reader cohort 0
Writer 0 (0x3078d0): Sending a message to reader 0 (0x307490)
Writer 0 (0x3078d0): Reader 0 status Established has last released 4294967295, last sent 0
Writer 0 (0x3078d0): QueueMaintenance, smallest last released = -1, count = 1
Writer 0 (0x3078d0): Removing dead entries
Writer 0 (0x3078d0): QueueMaintenance complete
Writer 0 (0x3078d0): Waiting for timesteps to be released in WriterClose
Writer 0 (0x3078d0): IN TS WAIT, ENTRIES are Timestep 0 (exp 0, Prec 0, Ref 1), Count now 1
Writer 0 (0x3078d0): The timesteps still queued are: 0 
Writer 0 (0x3078d0): Reader Count is 1
Writer 0 (0x3078d0): Reader [0] status is Established
Writer 0 (0x3078d0): Received a release timestep message for timestep 0 from reader cohort 0
Writer 0 (0x3078d0): Got the lock in release timestep
Writer 0 (0x3078d0): Doing dereference sent
Writer 0 (0x3078d0): Reader sent timestep list 0x446330, trying to release 0
Writer 0 (0x3078d0): Reader considering sent timestep 0,trying to release 0
Writer 0 (0x3078d0): SubRef : Writer-side Timestep 0 now has reference count 0, expired 0, precious 0
Writer 0 (0x3078d0): Doing QueueMaint
Writer 0 (0x3078d0): Reader 0 status Established has last released 0, last sent 0
Writer 0 (0x3078d0): QueueMaintenance, smallest last released = 0, count = 1
Writer 0 (0x3078d0): Writer tagging timestep 0 as expired
DP Writer 0 (0x3078d0): Releasing timestep 0
Writer 0 (0x3078d0): Removing dead entries
Writer 0 (0x3078d0): Remove queue Entries removing Timestep 0 (exp 1, Prec 0, Ref 0), Count now 0
Writer 0 (0x3078d0): QueueMaintenance complete
Writer 0 (0x3078d0): Releasing the lock in release timestep
Writer 0 (0x3078d0): Reader Close message received for stream 0x3f9a70.  Setting state to PeerClosed and releasing timesteps.
Writer 0 (0x3078d0): In PeerFailCloseWSReader, releasing sent timesteps
Writer 0 (0x3078d0): Dereferencing all timesteps sent to reader 0x3f9a70
Writer 0 (0x3078d0): DONE DEREFERENCING
Writer 0 (0x3078d0): Moving Reader stream 0x3f9a70 to status PeerClosed
Writer 0 (0x3078d0): Reader 0 status PeerClosed has last released 0, last sent 0
Writer 0 (0x3078d0): QueueMaintenance, smallest last released = LONG_MAX, count = 0
Writer 0 (0x3078d0): Removing dead entries
Writer 0 (0x3078d0): QueueMaintenance complete
Writer 0 (0x3078d0): 
Stream "helloSst" (0x3078d0) summary info:
Writer 0 (0x3078d0): 	Duration (secs) = 0.100589
Writer 0 (0x3078d0): 	Timesteps Created = 1
Writer 0 (0x3078d0): 	Timesteps Delivered = 1
Writer 0 (0x3078d0): 
Writer 0 (0x3078d0): All timesteps are released in WriterClose
Writer 0 (0x3078d0): Destroying stream 0x3078d0, name helloSst
DP Writer 0 (0x3078d0): Releasing reader-specific state for remaining readers.
DP Writer 0 (0x3078d0): Releasing remaining timesteps.
DP Writer 0 (0x3078d0): Tearing down RDMA state on writer.
Writer 0 (0x3078d0): Reference count now zero, Destroying process SST info cache
Writer 0 (0x3078d0): Freeing LastCallList
Writer 0 (0x7fffffff2f18): SstStreamDestroy successful, returning

The reader:

srun: warning: can't run 1 processes on 2 nodes, setting nnodes to 1
Reader 0 (0x307490): Sst set to use sockets as a Control Transport
Reader 0 (0x307490): Looking for writer contact in file helloSst.sst, with timeout 60 secs
Reader 0 (0x307490): Waiting for writer DPResponse message in SstReadOpen("helloSst")
Reader 0 (0x307490): finished wait writer DPresponse message in read_open, WRITER is using "rdma" DataPlane
Provider: 'cxi', domain: 'cxi0'
Provider: 'cxi', domain: 'cxi1'
Provider: 'cxi', domain: 'cxi2'
DP Reader 0 (0x307490): RDMA Dataplane found the requested interface cxi2, provider type cxi.
DP Reader 0 (0x307490): RDMA Dataplane evaluating viability, returning priority 100
DP Reader 0 (0x307490): Prefered dataplane name is "rdma"
DP Reader 0 (0x307490): Considering DataPlane "evpath" for possible use, priority is 1
DP Reader 0 (0x307490): Considering DataPlane "rdma" for possible use, priority is 100
DP Reader 0 (0x307490): Selecting DataPlane "rdma" (preferred) for use
DP Reader 0 (0x307490): ignoring fabric cxi because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to cxi0, but it may not be stable or performant.
DP Reader 0 (0x307490): ignoring fabric cxi because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to cxi1, but it may not be stable or performant.
DP Reader 0 (0x307490): using interface set by FABRIC_IFACE.
DP Reader 0 (0x307490): Fabric parameters to use at fabric initialization: fi_info:
    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_SHARED_AV ]
    mode: [  ]
    addr_format: FI_ADDR_CXI_COMPAT
    src_addrlen: 4
    dest_addrlen: 0
    src_addr: fi_addr_cxi://0x002325ff
    dest_addr: (null)
    handle: (nil)
    fi_tx_attr:
        caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_SEND, FI_TRIGGER, FI_SHARED_AV ]
        mode: [  ]
        op_flags: [  ]
        msg_order: [  ]
        comp_order: [ FI_ORDER_NONE ]
        inject_size: 192
        size: 256
        iov_limit: 1
        rma_iov_limit: 1
        tclass: 0x0
    fi_rx_attr:
        caps: [ FI_MSG, FI_RMA, FI_RECV, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_SHARED_AV ]
        mode: [  ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_WAW, FI_ORDER_SAS, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_NONE ]
        total_buffered_recv: 0
        size: 1024
        iov_limit: 1
    fi_ep_attr:
        type: FI_EP_RDM
        protocol: FI_PROTO_CXI_COMPAT
        protocol_version: 1
        max_msg_size: 1073741824
        msg_prefix_size: 0
        max_order_raw_size: 18446744073709551615
        max_order_war_size: 18446744073709551615
        max_order_waw_size: 18446744073709551615
        mem_tag_format: 0x0000aaaaaaaaaaaa
        tx_ctx_cnt: 0
        rx_ctx_cnt: 0
        auth_key_size: 8
    fi_domain_attr:
        domain: 0x0
        name: cxi2
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_MANUAL
        data_progress: FI_PROGRESS_MANUAL
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_VIRT_ADDR, FI_MR_ALLOCATED, FI_MR_PROV_KEY, FI_MR_ENDPOINT ]
        mr_key_size: 4
        cq_data_size: 8
        cq_cnt: 32
        ep_cnt: 128
        tx_ctx_cnt: 256
        rx_ctx_cnt: 256
        max_ep_tx_ctx: 256
        max_ep_rx_ctx: 256
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 0
        cntr_cnt: 16
        mr_iov_limit: 1
        caps: [  ]
        mode: [  ]
        auth_key_size: 8
        max_err_data: 0
        mr_cnt: 100
        tclass: 0x0
    fi_fabric_attr:
        name: cxi
        prov_name: cxi
        prov_version: 0.0
        api_version: 1.11
    nic:
        fi_device_attr:
            name: cxi2
            device_id: 0x501
            device_version: 2
            vendor_id: 0x17db
            driver: cxi_core
            firmware: (null)
        fi_bus_attr:
            bus_type: FI_BUS_PCI
            fi_pci_attr:
                domain_id: 0
                bus_id: 213
                device_id: 0
                function_id: 0
        fi_link_attr:
            address: 0x1192
            mtu: 2112
            speed: 200000000000
            state: FI_LINK_UP
            network_type: HPC Ethernet

Reader found CXI auth key: 10059 5
Reader 0 (0x307490): Waiting for writer response message in SstReadOpen("helloSst")
Reader 0 (0x307490): finished wait writer response message in read_open
Reader 0 (0x307490): Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   StepDistributionMode=StepsAllToAll
Param -   DataTransport=rdma
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP5
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x307490): Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=rdma
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x307490): Writer is using Minimum Connection Communication pattern (min)
DP Reader 0 (0x307490): Received contact info for WS_stream 0x3faef0, WSR Rank 0
Reader 0 (0x307490): Sending Reader Activate messages to writer
Reader 0 (0x307490): Finish opening Stream "helloSst", starting with Step number 0
Reader 0 (0x307490): Wait for next metadata after last timestep -1
Reader 0 (0x307490): Waiting for metadata for a Timestep later than TS -1
Reader 0 (0x307490): (PID 1205a, TID 7fffd3e6af80) Stream status is Established
Reader 0 (0x307490): Received a Timestep metadata message for timestep 0, signaling condition
Reader 0 (0x307490): Received a writer close message. Timestep 0 was the final timestep.
Reader 0 (0x307490): Examining metadata for Timestep 0
Reader 0 (0x307490): Returning metadata for Timestep 0
Reader 0 (0x307490): Setting TSmsg to Rootentry value
DP Reader 0 (0x307490): RdmaTimestepArrived with Timestep = 0, PreloadMode = 1
Reader 0 (0x307490): SstAdvanceStep returning Success on timestep 0
DP Reader 0 (0x307490): Performing remote read of Writer Rank 0 at step 0
DP Reader 0 (0x307490): Block address is 0x446550, with a key of -50386303
DP Reader 0 (0x307490): Remote read target is Rank 0 (Offset = 0, Length = 40)
DP Reader 0 (0x307490): Posted RDMA get for Writer Rank 0 for handle 0x441590
DP Reader 0 (0x307490): Rank 0, RdmaWaitForCompletion
DP Reader 0 (0x307490): got completion for request with handle 0x441590 (flags 260).
Incoming variable is of size 10
Reader rank 0 reading 10 floats starting at element 0
Reader 0 (0x307490): Sending ReleaseTimestep message for timestep 0, one to each writer
Reader 0 (0x307490): 
Stream "helloSst" (0x307490) summary info:
Reader 0 (0x307490): 	Duration (secs) = 0.001023
Reader 0 (0x307490): 	Timestep Metadata Received = 1
Reader 0 (0x307490): 	Timesteps Consumed = 1
Reader 0 (0x307490): 	MetadataBytesReceived = 176 (176 bytes)
Reader 0 (0x307490): 	DataBytesReceived = 0 (0 bytes)
Reader 0 (0x307490): 	PreloadBytesReceived = 0 (0 bytes)
Reader 0 (0x307490): 	PreloadTimestepsReceived = 0
Reader 0 (0x307490): 	AverageReadRankFanIn = 1.0
Reader 0 (0x307490): 
Reader 0 (0x307490): Destroying stream 0x307490, name helloSst
DP Reader 0 (0x307490): Tearing down RDMA state on reader.
Reader 0 (0x307490): Reader-side close handler invoked
Reader 0 (0x307490): Reference count now zero, Destroying process SST info cache
Reader 0 (0x307490): Freeing LastCallList
Reader 0 (0x7fffffff2f08): SstStreamDestroy successful, returning

@eisenhauer
Member

Are you just concerned about the zero bytes in the "summary info" at the end? If so, that's more of an informational thing and it appears that it wasn't actually implemented in the original RDMA data plane, so that value never gets updated and you can ignore it. (I can sort out actually implementing it, but perhaps not until Friday because I have ORNL visitors here today and tomorrow.) So the question is whether or not you're getting the right data in the read buffer.

@franzpoeschel
Contributor Author

franzpoeschel commented Jun 21, 2023

Are you just concerned about the zero bytes in the "summary info" at the end?

It's not only that, no. In the call to WaitForAnyPull that corresponds with the data load request, the CQEntry (to my understanding) reports zero loaded data:

static int WaitForAnyPull(CP_Services Svcs, Rdma_RS_Stream Stream)
{
    FabricState Fabric = Stream->Fabric;
    RdmaCompletionHandle Handle_t;
    struct fi_cq_data_entry CQEntry = {0};

    ssize_t rc;
    rc = fi_cq_sread(Fabric->cq_signal, (void *)(&CQEntry), 1, NULL, -1);
    if (rc < 1)
    {
        struct fi_cq_err_entry error;
        fi_cq_readerr(Fabric->cq_signal, &error, 0);
        Svcs->verbose(Stream->CP_Stream, DPCriticalVerbose,
                      "failure while waiting for completions WaitForAnyPull "
                      "(%d (%s - %s)).\n",
                      rc, fi_strerror(error.err),
                      fi_cq_strerror(Fabric->cq_signal, error.err,
                                     error.err_data, NULL, error.len));
        return 0;
    }
    else
    {
        Svcs->verbose(
            Stream->CP_Stream, DPTraceVerbose,
            "got completion for request with handle %p (flags %li).\n",
            CQEntry.op_context, CQEntry.flags);
        Handle_t = (RdmaCompletionHandle)CQEntry.op_context;
        Handle_t->Pending--;
        Stream->PendingReads--;

        // TODO: maybe reuse this memory registration
        if (Fabric->local_mr_req)
        {
            fi_close((struct fid *)Handle_t->LocalMR);
        }
    }
    return 1;
}

Latching onto this with GDB:

(gdb) p CQEntry
$2 = {op_context = 0x441d40, flags = 260, len = 0, buf = 0x0, data = 0}
(gdb) p *Handle_t
$4 = {LocalMR = 0x0, CPStream = 0x3adaf0, Buffer = 0x440a50, Length = 40, Rank = 0, Pending = 1, PreloadBuffer = 0x0}
(gdb) p *(float*)Handle_t->Buffer
$6 = 0
(gdb) p ((float*)Handle_t->Buffer)[1]
$8 = 0 // should be 1

Length = 40 should be the load call that attempted to read 10 floating point values. 260 is FI_READ | FI_RMA.
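
For anyone who wants to double-check the flag arithmetic, here is a minimal sketch that decodes a completion `flags` value into names. The bit positions are written down from my reading of `rdma/fabric.h` and should be treated as assumptions; with the libfabric headers available, the real `FI_*` macros are the better source. `describe_cq_flags` and `known_flags` are hypothetical names and not part of libfabric or ADIOS2.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed bit positions (believed to match rdma/fabric.h; verify locally). */
struct flag_name { uint64_t bit; const char *name; };

static const struct flag_name known_flags[] = {
    {1ULL << 1,  "FI_MSG"},
    {1ULL << 2,  "FI_RMA"},
    {1ULL << 3,  "FI_TAGGED"},
    {1ULL << 4,  "FI_ATOMIC"},
    {1ULL << 8,  "FI_READ"},
    {1ULL << 9,  "FI_WRITE"},
    {1ULL << 10, "FI_RECV"},
    {1ULL << 11, "FI_SEND"},
    {1ULL << 12, "FI_REMOTE_READ"},
    {1ULL << 13, "FI_REMOTE_WRITE"},
};

/* Writes a " | "-separated list of the known bits set in `flags` into `out`.
 * Returns the bits that were not written (unknown bits, or bits dropped
 * because `out` was too small). */
static uint64_t describe_cq_flags(uint64_t flags, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    uint64_t remaining = flags;
    size_t used = 0;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof(known_flags) / sizeof(known_flags[0]); i++) {
        if (!(flags & known_flags[i].bit))
            continue;
        int n = snprintf(out + used, outlen - used, "%s%s",
                         used ? " | " : "", known_flags[i].name);
        if (n < 0 || (size_t)n >= outlen - used) {
            out[used] = '\0'; /* cut off the partially written name */
            break;
        }
        used += (size_t)n;
        remaining &= ~known_flags[i].bit;
    }
    return remaining;
}
```

With a 64-byte buffer, `describe_cq_flags(260, buf, sizeof buf)` fills `buf` with `FI_RMA | FI_READ` and returns 0, which matches the reading of the GDB output above.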

@eisenhauer
Member

Ah, OK, puzzling. I have limited time today and tomorrow, but I'll look when I can and let you know if I have some kind of insight...

@eisenhauer
Member

OK, I'm trying to run this, but I'm getting somewhat different output on the reader (sst2.txt):
DP Reader 0 (0x2327a0): Posted RDMA get for Writer Rank 0 for handle 0x8ccfe0
DP Reader 0 (0x2327a0): Rank 0, RdmaWaitForCompletion
DP Reader 0 (0x2327a0): failure while waiting for completions WaitForAnyPull (-259 (Input/output error - PERM_VIOLATION)).
Incoming variable is of size 10

That there's a failure in WaitForAnyPull is consistent with not getting any data, but I haven't sorted out a possible cause yet. Also trying to sort FI_LOG_LEVEL to get internal info from libfabric. I thought I had output from that, but then I switched to starting from your script and I don't seem to be getting anything. Will keep poking.
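
As a side note on the `-259` in that log line: as far as I can tell it is `-FI_EAVAIL`, which only says that an error entry is waiting on the completion queue, and the actual cause (here PERM_VIOLATION) comes from `fi_cq_readerr()`. A minimal sketch of how the return value of `fi_cq_sread()` could be classified before deciding whether to read the error entry follows. The value 259 and the mapping of "no completion yet" to `-EAGAIN` are assumptions taken from my reading of `rdma/fi_errno.h`; `classify_cq_read`, `cq_outcome` and `SKETCH_FI_EAVAIL` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Assumption: mirrors FI_EAVAIL from rdma/fi_errno.h; use the real macro
 * when the libfabric headers are available. */
#define SKETCH_FI_EAVAIL 259

enum cq_outcome {
    CQ_COMPLETIONS, /* rc > 0: that many completion entries were read   */
    CQ_NOTHING_YET, /* timeout or empty queue, retrying is reasonable   */
    CQ_ERROR_ENTRY, /* an error entry is queued, read it for the cause  */
    CQ_CALL_FAILED  /* any other negative errno: the call itself failed */
};

/* Classifies the return value of a completion-queue read. */
static enum cq_outcome classify_cq_read(ssize_t rc)
{
    if (rc > 0)
        return CQ_COMPLETIONS;
    if (rc == 0 || rc == -EAGAIN)
        return CQ_NOTHING_YET;
    if (rc == -SKETCH_FI_EAVAIL)
        return CQ_ERROR_ENTRY;
    return CQ_CALL_FAILED;
}

/* Exercises the cases that show up in this thread. */
void classify_cq_read_examples(void)
{
    assert(classify_cq_read(1) == CQ_COMPLETIONS);      /* the zero-byte completion */
    assert(classify_cq_read(-259) == CQ_ERROR_ENTRY);   /* the PERM_VIOLATION run   */
    assert(classify_cq_read(-EAGAIN) == CQ_NOTHING_YET);
}
```

Under these assumptions, only the `CQ_ERROR_ENTRY` case would call `fi_cq_readerr()`, so the fields of the error struct are never printed when no error entry was actually queued.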

@franzpoeschel
Contributor Author

Thank you for trying this out

OK, I'm trying to run this, but I'm getting somewhat different output on the reader (sst2.txt):
DP Reader 0 (0x2327a0): Posted RDMA get for Writer Rank 0 for handle 0x8ccfe0
DP Reader 0 (0x2327a0): Rank 0, RdmaWaitForCompletion
DP Reader 0 (0x2327a0): failure while waiting for completions WaitForAnyPull (-259 (Input/output error - PERM_VIOLATION)).
Incoming variable is of size 10

I did get this PERM_VIOLATION error in two different situations:

  1. Without using fi_mr_bind() and fi_mr_enable() (as required by the FI_MR_ENDPOINT flag), I always got this error.
  2. With the implementation as it currently is, I get the error when increasing the vector size from 10 to something like 10000.

My guess: That this runs without an apparent error at a low vector length for me is probably a fluke?

That there's a failure in WaitForAnyPull is consistent with not getting any data, but I haven't sorted out a possible cause yet. Also trying to sort FI_LOG_LEVEL to get internal info from libfabric. I thought I had output from that, but then I switched to starting from your script and I don't seem to be getting anything. Will keep poking.

In my tests, the FI_LOG_LEVEL=Debug output was pretty much useless except when connecting to the network. Maybe there is some secret Cray debugging flag that I don't know, but their implementation seems to mostly ignore FI_LOG_LEVEL.

@franzpoeschel
Contributor Author

What just came to my mind: The MPI dataplane only works when launching at least two processes each for writer and reader. I will try out next week if this might make a difference here.

Also, I generally played a lot with LD_PRELOAD and seeing what MPI does on the system in order to figure things out. What I noticed is that MPICH does not seem to use fi_mr_reg() at all, but something else. I should maybe take a look at the libfabric implementation in MPICH next week again and see if I can figure out anything from there (though this seems more like a fool's hope..).

@franzpoeschel
Contributor Author

Using two processes per writer and per reader did not make a difference.
Also I was hoping that there might be some fi_*() call returning an error status that was not checked. I added some error checks (last commit) but none of them reports an error.

@eisenhauer
Member

Well, worth a shot. Unfortunately I'm on travel this week and not likely to get back to this until I return.

@franzpoeschel
Contributor Author

I have isolated a minimal example of libfabric as it is used in SST here: https://codebase.helmholtz.cloud/poesch58/libfabric-minimal-example
I will try adding the CXI adaptations there as well. I expect that this will either result in the same issue, giving us a good base for taking this back to ORNL and HPE support, or that it might even work and give us an idea of what's going wrong in the "big" codebase.

@franzpoeschel
Contributor Author

franzpoeschel commented Jul 7, 2023

Alright, the minimal example does reproduce the exact same issue that I see in this PR.

I have two tags in the Git repo:

  1. Tag sockets-working: Communication via the Sockets provider works. On my local machine:

    > export FABRIC_IFACE=eno1
    > ./write &
    ...
    > ./read
    ...
    Received 128 bytes of data at address 0x7fffffff17a0
    Received remote message:
    Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut lLorem ipsum dolor sit amet,
    
  2. Tag cxi-empty-message. The adaptations made for the CXI provider are seen in this diff.

    > salloc -N2 -n2 -c1 --ntasks-per-node=1 -ACSC380 -t 1:00:00 --network=single_node_vni,job_vni
    > export FABRIC_IFACE=cxi2
    > srun -N1 -n1 ./writer &
    ...
    > srun -N1 -n1 ./reader
    ...
    Received 0 bytes of data at address (nil)
    Received remote message:
    w?
    

There is no direct failure in the CXI run and no error message. There is even a completion event from fi_cq_sread(), but the event claims that zero bytes were read.

@franzpoeschel
Contributor Author

I think I have good news: I received feedback from HPE on the minimal reproducer and it seems like it was not too far from a working implementation.
They suggested that the CXI provider does not support FI_MR_VIRT_ADDR which ADIOS2 currently uses. Adapting to this is quite easy since it implies that the virtual memory base address of read requests needs to be replaced with a zero.

A first attempt shows this result from the minimal example on Crusher:

Using second vni.
Address len: 4l
Received 0 bytes of data at address (nil)
Received remote message:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut lLorem ipsum dolor sit amet,

(The length of received data is apparently just not reported in this call)
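
To make the addressing change concrete, here is a minimal sketch of how the remote address of a read request could be chosen depending on whether the provider uses FI_MR_VIRT_ADDR. The helper name remote_read_addr and its parameters are hypothetical and are not taken from rdma_dp.c; in real code the result would be passed as the remote address argument of an RMA read such as fi_read().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from rdma_dp.c): pick the value to use
 * as the remote address of an RMA read.
 *
 *   uses_virt_addr != 0: the provider addresses remote memory by the
 *                        peer's virtual address (FI_MR_VIRT_ADDR).
 *   uses_virt_addr == 0: the provider expects an offset from the start
 *                        of the registered region, so the base maps to 0.
 */
static uint64_t remote_read_addr(uint64_t region_base, uint64_t target_vaddr,
                                 int uses_virt_addr)
{
    /* The target must lie inside the region, i.e. at or after its base. */
    assert(target_vaddr >= region_base);
    return uses_virt_addr ? target_vaddr : target_vaddr - region_base;
}
```

With offset-based addressing, a read that starts at the beginning of the registered buffer uses address 0, which matches the "replace the base address with a zero" description above.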

@eisenhauer
Member

Well, at least somewhat promising. Next steps? I can poke at this at some point, but things are getting complex with next week being a holiday week.

@franzpoeschel
Contributor Author

franzpoeschel commented Nov 17, 2023

Well, at least somewhat promising. Next steps? I can poke at this at some point, but things are getting complex with next week being a holiday week.

I tried to integrate the minimal example thing back into ADIOS2 just now and had the first functioning libfabric-based SST run on Frontier (I used Frontier since the Crusher queue is currently full). So it seems like this is the right approach and the fix suggested by Cray/HPE was the right one.

I pushed my changes (and also rebased this PR onto the release_29 branch since that contains the most recent fixes for SST. Can rebase it back onto master after the release).

Next steps?

  • Scaling tests, I only did a serial test from one node to another so far
  • Implement the fix for Speculative Preload Mode, I skipped this for now since I wanted to see if this works at all and the preload logic is more involved than the normal one
  • Remove the hardcoded changes to bring this into a mergeable state
  • Remove other instances of "shortcut" logic (skipped "free()", selecting the right keys from the environment (they are currently all the same, but this might change))
  • As a follow-up: Maybe integrate the minimal example for libfabric into ADIOS2. @pnorbert once mentioned that there is potential interest in a tool that can be used to determine which libfabric provider configuration might work for SST. I think that this minimal example is a good start for adding something like that.
  • Try this on other Cray systems such as Perlmutter

but things are getting complex with next week being a holiday week.

This is fine. I will continue working on this PR and on the points mentioned above, but this is not super urgent for me and the code is currently far from being mergeable anyway.
The important thing for now is that SST+libfabric seems to be (minimally) functional with this on Frontier.

@vicentebolea
Collaborator

@franzpoeschel thanks for your contribution, I will review this next week (running out of cycles for adios2 this week). In the meanwhile please fix the git conflicts.

@franzpoeschel
Contributor Author

franzpoeschel commented Nov 17, 2023

@franzpoeschel thanks for your contribution, I will review this next week (running out of cycles for adios2 this week). In the meanwhile please fix the git conflicts.

This PR is not yet really in a reviewable state.
The current update is that SST+libfabric now has basic functionality with the Cray CXI provider, but the CXI provider is currently essentially hardcoded. There is still some way to go before merging this.

EDIT: I just saw that something I did triggered a review request somehow. No idea how that happened, you can ignore this for now.

@eisenhauer
Member

@franzpoeschel Happy to lend a hand with integration too (after the US holiday).

@franzpoeschel
Contributor Author

I have pushed a commit that ideally adds those fixes for speculative preload mode as well.
Has anyone recently tested if speculative preload mode still works? Setting sstIO.SetParameter("SpeculativePreloadMode", "OFF"); in helloSstReader.cpp has no effect in my tests.

@eisenhauer
Member

I have pushed a commit that ideally adds those fixes for speculative preload mode as well. Has anyone recently tested if speculative preload mode still works? Setting sstIO.SetParameter("SpeculativePreloadMode", "OFF"); in helloSstReader.cpp has no effect in my tests.

While the Preload modes had confirmed positive effects on the sockets-based data plane (big improvements on simple bandwidth tests), I was never as clear that they would be a win in an RDMA situation. But as long as the RDMA parts work, we can work on Preload at our leisure...

@franzpoeschel
Contributor Author

I have pushed a commit that ideally adds those fixes for speculative preload mode as well. Has anyone recently tested if speculative preload mode still works? Setting sstIO.SetParameter("SpeculativePreloadMode", "OFF"); in helloSstReader.cpp has no effect in my tests.

While the Preload modes had confirmed positive effects on the sockets-based data plane (big improvements on simple bandwidth tests), I was never as clear that they would be a win in an RDMA situation. But as long as the RDMA parts work, we can work on Preload at our leisure...

I was just wondering since the RDMA/libfabric implementation has a rather complex preload logic which I was not actually able to activate.
But I'm fine with focusing on pull-based reading for now.

@franzpoeschel
Contributor Author

I successfully ran a 128-node setup with this using PIConGPU on Frontier, so this is beginning to look like a breakthrough.

Do you know if there is any logic in SST to evenly distribute the network cards found on a node among jobs? On Frontier, there are generally four per node (cxi0, cxi1, cxi2, cxi3) and it would probably be good to use them evenly.

@eisenhauer
Member

So far as I know, no, there is no SST logic to select network interfaces. We have relied upon the ability to set the FABRIC_IFACE variable when necessary, but that is only sufficient to specify a functional interface, not to distribute load between multiple interfaces. So if that's necessary, we'll have to add something.

@franzpoeschel
Contributor Author

Ok, that's what I suspected. Choices that I see for this:

  • Just continue using environment variables. Users can write a little wrapper script srun select_interface.sh bin/code_with_adios_stream. Pro: It's flexible, users can use their knowledge on how the application is scheduled to make a good distribution of network cards. Con: No one is actually going to do it.
  • If the FABRIC_IFACE is not set, just pick one at random. Pro: Still better than all instances using the same card, can be used in conjunction with the above approach (as a fallback). Con: Might break code that relies on the legacy fabric selection logic, so maybe implement this only for CXI?
  • Collectively gather all hostnames, figure out what MPI ranks are running on the same host as the current rank, select a card depending on that. Pro: Good performance by default. Con: Still not the best performance (ignores internal layout of nodes), most complex logic.
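
As a rough illustration of the third option, the sketch below maps a node-local rank onto one of the cards in round-robin fashion. The helper select_cxi_iface is hypothetical and not part of SST; it assumes the devices are named cxi0, cxi1, ... as on Frontier and that the node-local rank has already been determined (for example from the SLURM_LOCALID environment variable, or by splitting the communicator with MPI_COMM_TYPE_SHARED).

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the interface name for a node-local rank into
 * `out`, cycling through `num_cards` devices named cxi0, cxi1, ...
 * Returns the index of the chosen card.
 */
static int select_cxi_iface(int local_rank, int num_cards, char *out,
                            size_t out_len)
{
    assert(local_rank >= 0 && num_cards > 0);
    int card = local_rank % num_cards; /* round-robin over the cards */
    snprintf(out, out_len, "cxi%d", card);
    return card;
}
```

The resulting name could then be exported as FABRIC_IFACE before the dataplane is initialized. Like the option it illustrates, this ignores the internal layout of the node, so it is only a starting point.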

I think that the more complex options can be done in a later PR while this PR should focus on getting CXI runnable.

@franzpoeschel
Contributor Author

franzpoeschel commented Nov 23, 2023

I successfully ran a benchmark at 4096 nodes now. The performance of libfabric-based SST seems to be slightly better than that of MPI-based SST.

What I'm measuring is the "perceived" throughput, i.e. the throughput based on the time from load request to load completion on the reader side. This figure is skewed by communication overhead.
With libfabric I get roughly 3.6~4.7 GiB per second and node (14~19 TiB per second in total). With MPI, that's 2.7~3.7 GiB per second and node (11~15 TiB per second in total).
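
The totals follow from the per-node figures if one assumes that all 4096 nodes count towards the aggregate (an assumption on my part, the run layout is not spelled out here). A small sketch of the conversion, with a hypothetical helper name:

```c
#include <assert.h>

/*
 * Hypothetical helper: aggregate throughput in TiB/s, given a per-node
 * rate in GiB/s and the number of nodes.
 */
static double total_tib_per_s(double gib_per_s_per_node, int nodes)
{
    assert(nodes > 0);
    return gib_per_s_per_node * nodes / 1024.0; /* 1 TiB = 1024 GiB */
}
```

For example, 3.6 GiB/s per node on 4096 nodes gives 3.6 * 4096 / 1024 = 14.4 TiB/s, and 4.7 GiB/s gives 18.8 TiB/s, which matches the rounded 14~19 TiB/s range quoted for libfabric.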

These figures are from a single benchmark each for now, but the results are roughly what I expected.

@eisenhauer
Member

Fantastic! I was paging through the changes to rdma_dp and it shouldn't be hard to integrate in such a way that we support the prior providers as well. Quick question, I notice that you're using manual progress rather than auto (which I recall was an earlier problem with cxi). Do you have to have a provision to check the completion queue in the background so that transfers continue to progress even if ADIOS is busy doing something other than rdma calls?

@franzpoeschel
Contributor Author

franzpoeschel commented Nov 23, 2023

Quick question, I notice that you're using manual progress rather than auto (which I recall was an earlier problem with cxi). Do you have to have a provision to check the completion queue in the background so that transfers continue to progress even if ADIOS is busy doing something other than rdma calls?

CXI only supports manual progress; specifying automatic progress will not let you select the CXI provider. From how I understand the SST implementation in ADIOS2, automatic progress is not needed anyway: On the writer side, SST runs in its own thread and uses blocking I/O within that thread; on the reader side, loading data is written in a blocking way as well (ref. fi_cq_sread() inside PullSelection() with timeout=-1).
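
For comparison, if background polling were ever wanted, a non-blocking drain loop might look roughly like the sketch below. This is not what SST currently does. The callback type cq_poll_fn is a stand-in for a non-blocking completion-queue read such as fi_cq_read(), which reports an empty queue with -FI_EAGAIN; the names drain_completions and fake_poll are hypothetical, and fake_poll only simulates a queue so that the sketch stays self-contained.

```c
#include <assert.h>
#include <errno.h>

/*
 * Stand-in for a non-blocking completion-queue read: returns the number of
 * completions consumed, or -EAGAIN if the queue is currently empty. Any
 * other negative value is treated as an error.
 */
typedef long (*cq_poll_fn)(void *ctx);

/*
 * Hypothetical helper: poll until the queue reports empty or max_polls is
 * reached. Returns the number of completions drained, or a negative error.
 */
static long drain_completions(cq_poll_fn poll, void *ctx, int max_polls)
{
    long drained = 0;
    assert(poll != 0 && max_polls > 0);
    for (int i = 0; i < max_polls; ++i) {
        long rc = poll(ctx);
        if (rc == -EAGAIN)
            break;     /* nothing pending right now */
        if (rc < 0)
            return rc; /* real error: hand it to the caller */
        drained += rc;
    }
    return drained;
}

/* Simulated queue for demonstration: ctx points to a pending-event count. */
static long fake_poll(void *ctx)
{
    int *pending = ctx;
    if (*pending == 0)
        return -EAGAIN;
    --*pending;
    return 1;
}
```

Calling such a loop periodically from a helper thread would drive progress while the application is busy elsewhere; the cap on the number of polls keeps a single call from spinning indefinitely.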

I was paging through the changes to rdma_dp and it shouldn't be hard to integrate in such a way that we support the prior providers as well.

The biggest challenge is probably to not accidentally break libfabric for other systems because some configuration detail changed. These things are hard to test automatically.
Given that results are looking good for now, I think that I can start working on integrating this next week.

Contributor Author

@franzpoeschel franzpoeschel left a comment

This is now enabled automatically via CMake by checking if the CXI header can be included.
I've done a small test on my local computer, this now works with the sockets provider as well as with the CXI provider.

Remaining todo:

  • Rebase back onto master
  • Test on Summit, maybe do another somewhat larger test on Frontier if this is still working properly (though I don't see why not)
  • Remove changes from SST example again?

Before I rebase this back onto master, this diff is more precise than the one shown in the PR: release_29...franzpoeschel:ADIOS2:libfabric-cray

Todo after this PR:

  • Some form of polling for better supporting FI_PROGRESS_MANUAL
  • Maybe add an environment variable that lets you select the provider directly. Testing SST against the sockets provider was a bit annoying, since the FABRIC_IFACE=eno1 was matched by multiple providers and the tcp provider (selected by the dataplane) did not work. I can do that in a new PR, it should be a simple enough change.

@franzpoeschel franzpoeschel force-pushed the libfabric-cray branch 2 times, most recently from 6518dad to 7428848 Compare December 1, 2023 11:15
@franzpoeschel franzpoeschel marked this pull request as ready for review December 1, 2023 11:15
@franzpoeschel franzpoeschel changed the title [WIP] Adapt libfabric dataplane of SST to Cray CXI provider Adapt libfabric dataplane of SST to Cray CXI provider Dec 1, 2023
@eisenhauer
Member

Looks like clang-format wants to reorder your includes:

#ifdef SST_HAVE_CRAY_CXI
// Needs to be included before rdma/fi_cxi_ext.h
-#include <stdbool.h>
#include <rdma/fi_cxi_ext.h>
+#include <stdbool.h>
#endif

If stdbool.h is required for rdma/fi_cxi_ext.h (and so must come first), you can either move it before the #ifdef (shouldn't hurt for other includes), or put a blank line between the two (so clang-format won't mess with it).

@franzpoeschel
Contributor Author

If stdbool.h is required for rdma/fi_cxi_ext.h (and so must come first), you can either move it before the #ifdef (shouldn't hurt for other includes), or put a blank line between the two (so clang-format won't mess with it).

It is required for that header, yes. I'll add a little comment in between, that usually keeps clang-format from touching the include order.

@eisenhauer
Member

OK, @vicentebolea might have opinions about the CMake modes, but if they work, they're a start. @ax3l, you've tested on Frontier and Perlmutter? @pnorbert, before merging we probably need to test everywhere we can that uses RDMA to make sure we don't have a regression. I can do Summit, but I don't have as much access to other platforms as you probably do...

@franzpoeschel
Contributor Author

There does seem to be a regression that affects Summit. I'll try to figure out where it goes wrong.

@franzpoeschel
Contributor Author

Fixed now and tested on Summit.

@eisenhauer
Member

@franzpoeschel If you can fix the formatting and rebase this on current master, we can probably merge this.

@franzpoeschel
Contributor Author

@franzpoeschel If you can fix the formatting and rebase this on current master, we can probably merge this.

done

@eisenhauer eisenhauer enabled auto-merge (squash) December 6, 2023 12:49
@eisenhauer eisenhauer merged commit 42e062b into ornladios:master Dec 6, 2023
34 checks passed
@franzpoeschel
Contributor Author

franzpoeschel commented Dec 13, 2023

This seems to hang on full-scale Frontier (half-scale works fine). My current suspicion is that this might have to do with the manual data progress; maybe #3964 can help there.
The MPI dataplane works fine at full scale on Frontier.
