Add draft launching script #70
base: main
Conversation
Works for TP==1

This works if TP size for context is 1:
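The kind of request used for such a check could look like the sketch below; the port, path, and payload are assumptions about the HTTP frontend, and only the model checkpoint name comes from the launch command further down:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 16}'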
The curl returns a successful result.

Fails for TP>1

When context TP is set above 1, the logs contain an error about connecting to port 36183:
Reproduction
log:
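One way to narrow this down (a sketch, not part of the PR; <CONTEXT_NODE> is a placeholder hostname) is to check whether anything is listening on port 36183 on the node that should own it, and whether that port is reachable from the other node:

# on the node that should host the port
ss -tlnp | grep 36183 || echo "nothing listening on 36183"
# from the other node
timeout 3 bash -c "cat < /dev/null > /dev/tcp/<CONTEXT_NODE>/36183" && echo reachable || echo unreachable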
Execution outside the launching script succeeds with the config below:
You can test the launching script on a single 8xH100 machine:
It should print genai-perf output tables with the performance results and save them in the artifact directory.
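For reference, a single-node invocation could look like the sketch below; the flags are copied from the two-node command further down, and the context TP=4 DP=1 / generate TP=4 DP=1 split is only an assumption chosen to fit one worker of each kind on 8 GPUs:

python3 launch_workers.py \
    --log-level INFO \
    --model-name llama \
    --model-ckpt neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --context-workers 1 \
    --context-tp-size 4 \
    --generate-workers 1 \
    --generate-tp-size 4 \
    --benchmark \
    --benchmark-timeout 800 \
    --isl-cached 0 \
    --isl-uncached 3000 \
    --osl 150 \
    --load-type concurrency \
    --load-value 1 1 32 \
    --min-request-count 20 \
    --request-count-per-load-value 10 \
    --artifact-dir <ARTIFACTS>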
I executed the sbatch launch script on a Slurm cluster with two nodes:

#!/bin/bash
#SBATCH --partition=<PARTITION>
#SBATCH --account=<ACCOUNT>
#SBATCH --job-name=<JOB>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=2:00:00
#SBATCH --mem=0
#SBATCH --no-kill
#SBATCH --exclusive
#SBATCH --output=<ARTIFACTS>/%x_%j_%n_%N.out ### Slurm Output file, %x is job name, %j is job id
#SBATCH --error=<ARTIFACTS>/%x_%j_%n_%N.err ### Slurm Error file, %x is job name, %j is job id
set -e
set -x
export HF_TOKEN=<TOKEN>
export JOB_DIR=<ARTIFACTS>
export LOGDIR=${JOB_DIR}/logs
export PROFILESDIR=${JOB_DIR}/profiles
export SCHEDULER_FILE=$LOGDIR/scheduler.json
export SCHEDULER_LOG=$LOGDIR/scheduler.log
export DONE_MARKER=$LOGDIR/done.txt
export DEVICE="gpu"
export INTERFACE="eth3"
export PROTOCOL="tcp"
export CPU_WORKER_MEMORY_LIMIT="14GB"
export RAPIDS_NO_INITIALIZE="1"
export CUDF_SPILL="1"
export RMM_SCHEDULER_POOL_SIZE="1GB"
export RMM_WORKER_POOL_SIZE="72GiB"
export LIBCUDF_CUFILE_POLICY=OFF
export DASK_DATAFRAME__QUERY_PLANNING=False
mkdir -p $LOGDIR
mkdir -p $PROFILESDIR
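# Run launch_workers.py on every allocated node (one task per node) inside the container; <CONTAINER> is a placeholder image.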
srun --container-mounts=/lustre/fsw/:/lustre/fsw/ --container-image=<CONTAINER> bash -c "python3 launch_workers.py \
--log-level INFO \
--model-name llama \
--model-ckpt neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--context-workers 2 \
--context-tp-size 2 \
--generate-workers 1 \
--generate-tp-size 4 \
--benchmark \
--benchmark-timeout 800 \
--isl-cached 0 \
--isl-uncached 3000 \
--osl 150 \
--load-type concurrency \
--load-value 1 1 32 \
--min-request-count 20 \
--request-count-per-load-value 10 \
--artifact-dir <ARTIFACTS>"

The output in the logs suggests that the NATS.io host is not being propagated correctly to the NATS.io-related components.
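To see what each node actually receives, one option is to print the environment from every rank with the same srun/container setup (a sketch; the variable names grepped for are assumptions about what the launcher reads):

srun --container-mounts=/lustre/fsw/:/lustre/fsw/ --container-image=<CONTAINER> \
  bash -c 'echo "== $(hostname) =="; env | grep -iE "nats|etcd" || echo "(no NATS/etcd related variables set)"'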
Benchmark command for context TP=2 DP=2, generate TP=4 DP=1:
Benchmark command for context TP=4 DP=1, generate TP=4 DP=1: