[tracking] E2EShark Model Tests Onnx Mode #566

Open
3 of 6 tasks
saienduri opened this issue Mar 28, 2024 · 11 comments
@saienduri
Contributor

saienduri commented Mar 28, 2024

Below is the list of issues we are hitting when running vision int8 models end to end using onnx mode (onnx export/import -> torch-mlir -> iree-compile -> iree-runtime). You can find the models that lead to each issue in the issue description.

To reproduce an error, please set up SHARK-TestSuite and then run the run.py file with the respective command line flags (more instructions can be found here).

To fix an issue, you need to either modify the OnnxToTorch lowering of the corresponding op or add the missing support in the TorchToLinalg lowering. You can find more information in either the model-run.log or the iree-compile.log after running the test. These logs can help you create a smaller repro, fix that, and then check whether the fix also resolves the model failure.

You can find the specific logs on what is failing in these locations for <model_name>, where SHARK-TestSuite/e2eshark/test-onnx is the test run directory:

$ SHARK-TestSuite/e2eshark/test-onnx/pytorch/models/<model_name>/model-run.log
$ SHARK-TestSuite/e2eshark/test-onnx/pytorch/models/<model_name>/iree-compile.log
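
As a convenience, the two log locations above can be assembled and scanned with a few lines of Python. This is only a sketch: the helper names `log_paths` and `first_error_lines` are hypothetical, the `framework` default simply mirrors the `pytorch` segment in the paths above, and searching for the word "error" is an assumption about what the logs contain.

```python
from pathlib import Path


def log_paths(run_dir, model_name, framework="pytorch"):
    """Return (model-run.log, iree-compile.log) for one test in a run directory."""
    model_dir = Path(run_dir) / framework / "models" / model_name
    return model_dir / "model-run.log", model_dir / "iree-compile.log"


def first_error_lines(log_file, limit=5):
    """Return up to `limit` lines that mention 'error' (case-insensitive)."""
    path = Path(log_file)
    if not path.exists():
        return []
    hits = []
    with path.open(errors="replace") as handle:
        for line in handle:
            if "error" in line.lower():
                hits.append(line.rstrip("\n"))
                if len(hits) == limit:
                    break
    return hits


run_log, compile_log = log_paths("SHARK-TestSuite/e2eshark/test-onnx", "resnet50")
print(run_log)
print(compile_log)
for hit in first_error_lines(compile_log):
    print(hit)
```

The first few matching lines are usually a reasonable starting point for cutting the failure down to a smaller repro.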

Issues:

torch-to-linalg

iree:

Onnx VAIQ Models


To run all tests:
python run.py --torchmlirbuild /path_to/torch-mlir/build --ireebuild /path-to/iree-build --cachedir /path-to/model-cache-dir -r test-onnx --tolerance .001 .001 --mode onnx --report -f onnx -g models

To run a specific test (e.g. onnx/models/AlexNet_vaiq_int8):
python run.py --torchmlirbuild /path_to/torch-mlir/build --ireebuild /path-to/iree-build --cachedir /path-to/model-cache-dir -r test-onnx --tolerance .001 .001 --mode onnx --report --tests onnx/models/AlexNet_vaiq_int8
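
If you launch these runs from a script, the two commands above can be assembled from one helper. The sketch below uses only the flags that appear in the commands above. The function name `build_run_command` is hypothetical, and the paths are the same placeholders used above.

```python
import shlex


def build_run_command(torch_mlir_build, iree_build, cache_dir, test=None):
    """Build the run.py argument list for all onnx model tests or one test."""
    cmd = [
        "python", "run.py",
        "--torchmlirbuild", torch_mlir_build,
        "--ireebuild", iree_build,
        "--cachedir", cache_dir,
        "-r", "test-onnx",
        "--tolerance", ".001", ".001",
        "--mode", "onnx",
        "--report",
    ]
    if test is None:
        # Same selection as the "run all tests" command above.
        cmd += ["-f", "onnx", "-g", "models"]
    else:
        cmd += ["--tests", test]
    return cmd


all_cmd = build_run_command(
    "/path_to/torch-mlir/build", "/path-to/iree-build", "/path-to/model-cache-dir"
)
one_cmd = build_run_command(
    "/path_to/torch-mlir/build", "/path-to/iree-build", "/path-to/model-cache-dir",
    test="onnx/models/AlexNet_vaiq_int8",
)
print(shlex.join(all_cmd))
print(shlex.join(one_cmd))
```

Passing the list to `subprocess.run(cmd, cwd=...)` from the directory that contains run.py avoids shell quoting problems.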


Versions:
torch-mlir - main - a7302a68
iree - main - 40f25334d2

Status:
Check the latest run report in e2eshark-reports : e2eshark-reports/<DATE>/onnx_reports/statusreport.md

onnx models, Pass (with --torchtolinalg): 28/34, Day: 08/08

| tests | torch-mlir | iree-compile | inference | Comments |
|---|---|---|---|---|
| onnx/models/opt-125M-awq | passed | failed | notrun | move to privatestorage |
| onnx/models/retinanet_resnet50_fpn_vaiq_int8 | passed | notrun | notrun | onnx.if #696 |
| onnx/models/KeypointRCNN_vaiq_int8 | passed | notrun | notrun | onnx.if #696 |
| onnx/models/RAFT_vaiq_int8 | passed | failed | notrun | large vector size |
| onnx/models/Inception_v4_vaiq_int8 | passed | passed | failed | inference failed: 32 outputs specified but the provided variant list only has 1 elements |

pytorch models, Pass (with --torchtolinalg): 4/17/28, Day: 08/08

| tests | torch-mlir | iree-compile | inference | Comments |
|---|---|---|---|---|
| pytorch/models/opt-1.3b | passed | failed | notrun | onnx.Add torchtolinalg |
| pytorch/models/bart-large | passed | failed | notrun | Crash & "Number of dims and results of reindexed AffineMap doesn't match on Vectorization" iree-org/iree#17591 |
| pytorch/models/llama2-7b-hf | notrun | notrun | notrun | too big to run locally |
| pytorch/models/vicuna-13b-v1.3 | notrun | notrun | notrun | too big to run locally, runs out of 126G memory |
| pytorch/models/dlrm | notrun | notrun | notrun | onnx import: incompatible function arguments. The following argument types are supported: 1. (arg0: str, arg1: str, arg2: str) -> str |
| pytorch/models/gpt2-xl | notrun | notrun | notrun | onnx import: incompatible function arguments |
| pytorch/models/llama2-7b-GPTQ | notrun | notrun | notrun | onnx import: incompatible function arguments |
| pytorch/models/phi-1_5 | notrun | notrun | notrun | onnx import: incompatible function arguments |
| pytorch/models/phi-2 | notrun | notrun | notrun | onnx import: incompatible function arguments |
| pytorch/models/stablelm-3b-4e1t | notrun | notrun | notrun | onnx import: incompatible function arguments |
| pytorch/models/t5-large | notrun | notrun | notrun | onnx import: incompatible function arguments |

@schnkmwt

schnkmwt commented Apr 1, 2024

Please add the path to the logs directory to make it clear where to look for them. For model <model_name>, they are located here, assuming SHARK-TestSuite/e2eshark/test-onnx is the test run directory:

$ SHARK-TestSuite/e2eshark/test-onnx/pytorch/models/<model_name>/model-run.log
$ SHARK-TestSuite/e2eshark/test-onnx/pytorch/models/<model_name>/iree-compile.log

@schnkmwt

schnkmwt commented Apr 1, 2024

Working on the "Add" issue. Please assign: #586

@AmosLewis
Contributor

AmosLewis commented May 1, 2024

A regression on 2024-04-30 https://github.com/nod-ai/e2eshark-reports/blob/main/2024-04-30/onnx_reports/statusreport.md
2024-04-18
| pytorch/models/gpt2-xl | passed | passed | passed | passed | passed |
| pytorch/models/resnet50 | passed | passed | passed | passed | passed |
2024-04-30
| pytorch/models/gpt2-xl | passed | passed | notrun | failed | notrun |
| pytorch/models/resnet50 | failed | notrun | notrun | notrun | notrun |
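
Regressions like this can be spotted mechanically by comparing the rows of two status reports. The sketch below works on the pipe-delimited rows quoted above. The helper names are hypothetical, and because the rows are quoted without a header, the status columns are compared by position only, without assuming what each column means.

```python
def parse_status_rows(markdown):
    """Map test name -> list of status cells for each pipe-delimited row."""
    rows = {}
    for line in markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip blank lines and markdown separator rows such as |---|---|.
        if len(cells) < 2 or not cells[0] or set(cells[0]) <= set("-: "):
            continue
        rows[cells[0]] = cells[1:]
    return rows


def find_regressions(before, after):
    """Return tests that were all 'passed' before and differ afterwards."""
    old, new = parse_status_rows(before), parse_status_rows(after)
    regressed = {}
    for name, statuses in new.items():
        was_green = name in old and all(s == "passed" for s in old[name])
        if was_green and statuses != old[name]:
            regressed[name] = statuses
    return regressed


REPORT_0418 = """\
| pytorch/models/gpt2-xl | passed | passed | passed | passed | passed |
| pytorch/models/resnet50 | passed | passed | passed | passed | passed |
"""
REPORT_0430 = """\
| pytorch/models/gpt2-xl | passed | passed | notrun | failed | notrun |
| pytorch/models/resnet50 | failed | notrun | notrun | notrun | notrun |
"""

regressions = find_regressions(REPORT_0418, REPORT_0430)
for name, statuses in regressions.items():
    print(name, statuses)
```

Both models are reported here, since each was fully passing on 2024-04-18 and has a changed row on 2024-04-30.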

@zjgarvey
Collaborator

zjgarvey commented May 13, 2024

I'm not sure what is causing the discrepancy with the current list of issues, but with an up-to-date torch-mlir plus a few minor edits to the recent work on fusing quantized ops, here's a triage list for the torch-mlir failures when running:

python run.py --cachedir="/home/zjgar/.cache/" --torchtolinalg -c "/home/zjgar/code/torch-mlir/build/" --mode=onnx --groups=models --framework=onnx

list of failures (and a brief triage):

Test onnx/models/VideoResNet_vaiq_int8 failed [torch-mlir]
    onnx.constant??
Test onnx/models/MobileNetV3_small_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/RegNet_y_8gf_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/Inception_v4_vaiq_int8 failed [torch-mlir]
    average Pool
Test onnx/models/pytorch-3dunet_vaiq_int8 failed [torch-mlir]
    resize
Test onnx/models/ShuffleNet_v2_x2_0_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/MNASNet_1_3_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/LRASPP_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/RRDB_ESRGAN_vaiq_int8 failed [torch-mlir]
    resize
Test onnx/models/KeypointRCNN_vaiq_int8 failed [torch-mlir]
    onnx if
Test onnx/models/EfficientNet_v2_s_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/retinanet_resnet50_fpn_vaiq_int8 failed [torch-mlir]
    onnx if
Test onnx/models/ConvNeXt_vaiq_int8 failed [torch-mlir]
    grouped q convolution
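
To see which root cause unblocks the most models, the triage list above can be grouped by cause. This is a small sketch that takes the list verbatim. The function name `tally_causes` is hypothetical, and it assumes each "Test ... failed" line is followed by exactly one cause line, as in the list above.

```python
from collections import Counter

TRIAGE = """\
Test onnx/models/VideoResNet_vaiq_int8 failed [torch-mlir]
    onnx.constant??
Test onnx/models/MobileNetV3_small_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/RegNet_y_8gf_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/Inception_v4_vaiq_int8 failed [torch-mlir]
    average Pool
Test onnx/models/pytorch-3dunet_vaiq_int8 failed [torch-mlir]
    resize
Test onnx/models/ShuffleNet_v2_x2_0_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/MNASNet_1_3_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/LRASPP_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/RRDB_ESRGAN_vaiq_int8 failed [torch-mlir]
    resize
Test onnx/models/KeypointRCNN_vaiq_int8 failed [torch-mlir]
    onnx if
Test onnx/models/EfficientNet_v2_s_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/retinanet_resnet50_fpn_vaiq_int8 failed [torch-mlir]
    onnx if
Test onnx/models/ConvNeXt_vaiq_int8 failed [torch-mlir]
    grouped q convolution
"""


def tally_causes(text):
    """Count failures per triage note and remember which tests hit each one."""
    causes = Counter()
    models = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Test ") and " failed " in stripped:
            current = stripped.split()[1]
        elif stripped and current:
            cause = stripped.lower()
            causes[cause] += 1
            models.setdefault(cause, []).append(current)
            current = None
    return causes, models


causes, models = tally_causes(TRIAGE)
for cause, count in causes.most_common():
    print(f"{count:2d}  {cause}")
```

Grouped quantized convolution accounts for seven of the thirteen failures in this list, with resize and onnx if at two each.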

@renxida
Contributor

renxida commented May 14, 2024

Ty! On it. Will dive deeper tomorrow.

Also, when posting commands, I would love it if you could use home-relative paths, like this:

python run.py --cachedir="~/.cache/" --torchtolinalg -c "~/torch-mlir/build/" --mode=onnx --groups=models --framework=onnx

I think a sizeable chunk of us have each repo cloned directly into our home dir, so something like this would be directly runnable.
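
One caveat: a shell does not expand `~` inside double quotes, so a command written this way hands the script a literal `~/...` string. Whether run.py expands it on its own is not shown in this thread. The sketch below illustrates how a script could normalize such arguments with `os.path.expanduser`. The parser is a hypothetical stand-in that only borrows the `--cachedir` and `-c` flag names from the command above, not run.py's actual argument handling.

```python
import argparse
import os


def expand_path(value):
    """Expand a leading ~ and return an absolute path."""
    return os.path.abspath(os.path.expanduser(value))


# Hypothetical parser for illustration only.
parser = argparse.ArgumentParser()
parser.add_argument("--cachedir", type=expand_path)
parser.add_argument("-c", dest="build_dir", type=expand_path)

# These are the literal strings a shell passes when ~ is quoted.
args = parser.parse_args(["--cachedir=~/.cache/", "-c", "~/torch-mlir/build/"])
print(args.cachedir)
print(args.build_dir)
```

On the command line, writing `"$HOME/.cache/"` instead of `"~/.cache/"` gives the same portability without relying on the script to do the expansion.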

@AmosLewis
Contributor

Test onnx/models/pytorch-3dunet_vaiq_int8 failed [torch-mlir]
resize @aldesilv

Test onnx/models/RRDB_ESRGAN_vaiq_int8 failed [torch-mlir]
resize @zjgarvey

@aldesilv
Collaborator

aldesilv commented May 14, 2024

Test onnx/models/pytorch-3dunet_vaiq_int8 failed [torch-mlir] resize @aldesilv

The immediate issue is that the dynamic dims in the input torch.vtensor<[?,256,?,?,?],f32> are causing the compile error. Next would be the 3-d input.
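
To make the shape issue concrete, the type string can be taken apart to show which dimensions are dynamic. This is a minimal sketch: `parse_vtensor` is a hypothetical helper, and the regular expression only covers ranked types written in the form shown above.

```python
import re

VTENSOR_RE = re.compile(r"torch\.vtensor<\[([^\]]*)\],\s*([A-Za-z0-9]+)>")


def parse_vtensor(type_str):
    """Return (dims, dtype); dynamic dims written as '?' become None."""
    match = VTENSOR_RE.fullmatch(type_str.strip())
    if match is None:
        raise ValueError(f"not a ranked torch.vtensor type: {type_str!r}")
    dims_text, dtype = match.groups()
    dims = [
        None if d.strip() == "?" else int(d)
        for d in dims_text.split(",")
        if d.strip()
    ]
    return dims, dtype


dims, dtype = parse_vtensor("torch.vtensor<[?,256,?,?,?],f32>")
dynamic = [i for i, d in enumerate(dims) if d is None]
print(len(dims), dynamic, dtype)  # prints: 5 [0, 2, 3, 4] f32
```

Only the second dimension (256) is static. If the layout is batch, channels, then three spatial dims, which is an assumption here, the rank of 5 matches the "3-d input" mentioned above.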

rsuderman pushed a commit to llvm/torch-mlir that referenced this issue May 17, 2024
…ering (#3351)

Addresses [Shark-Turbine #196](nod-ai/SHARK-TestSuite#196)

Related tracker [Shark-Turbine #566](nod-ai/SHARK-ModelDev#566)

Related onnx.Resize issues [Shark-Turbine #616](nod-ai/SHARK-ModelDev#616)
BaneTrifa pushed a commit to BaneTrifa/torch-mlir that referenced this issue May 24, 2024
…ering (llvm#3351)

@AmosLewis
Contributor

3 pytorch models failed again between the 2024-05-29 and 2024-05-30 e2eshark-reports.

| pytorch/models/bert-large-uncased                | passed      | passed        | notrun       | failed         | notrun      |
| pytorch/models/bge-base-en-v1.5                  | passed      | passed        | notrun       | failed         | notrun      |
| pytorch/models/miniLM-L12-H384-uncased           | passed      | passed        | notrun       | failed         | notrun      |

@zjgarvey
Collaborator

3 pytorch models failed again between the 2024-05-29 and 2024-05-30 e2eshark-reports.

| pytorch/models/bert-large-uncased                | passed      | passed        | notrun       | failed         | notrun      |
| pytorch/models/bge-base-en-v1.5                  | passed      | passed        | notrun       | failed         | notrun      |
| pytorch/models/miniLM-L12-H384-uncased           | passed      | passed        | notrun       | failed         | notrun      |

Any idea what they are failing on?

@AmosLewis
Contributor

AmosLewis commented May 30, 2024

Any idea what they are failing on?

Not sure, working with @saienduri to figure it out. I just tested locally with 0530 torch-mlir d7b8f00 and iree candidate-20240530.909, and they passed. It's kind of weird. Sai thinks it might pass with the latest iree; let's see what happens with the 0531 report.

@saienduri
Contributor Author

saienduri commented May 31, 2024

Any idea what they are failing on?

Not sure, working with @saienduri to figure it out.

We have root-caused the 3-model regression (down to 40 passes in https://github.com/nod-ai/e2eshark-reports/tree/main/2024-05-31) to the convert-torch-onnx-to-torch pass being outdated in iree, which generates different MLIR compared to torch-mlir TOM. So, once torch-mlir gets bumped in iree, they should pass again :)

vivekkhandelwal1 pushed a commit to llvm/torch-mlir that referenced this issue Jun 3, 2024
This addresses 7 of the model failures I'm seeing in the test suite. See
[Shark-Turbine issue #566](nod-ai/SHARK-ModelDev#566).

Need the op `linalg.conv_2d_ngchw_gfchw_q` to be added upstream
before merging this. See [llvm-project PR #92136](llvm/llvm-project#92136).

A small additional expansion to operand quantization is included in this
patch to address a model failure that occurs when unblocking the
quantized group convolutions in one of these onnx models.
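
For readers unfamiliar with the op named above, here is a rough reference sketch of what a grouped quantized convolution computes. It rests on assumptions: the op name is read as input layout N, G, C, H, W and filter layout G, F, C, H, W, and the `_q` suffix as subtracting an input and a filter zero point before accumulating. Stride 1, no padding, and no dilation are chosen only to keep the example short, and the function is a plain-Python illustration rather than the upstream definition.

```python
def conv_2d_ngchw_gfchw_q(inp, filt, inp_zp, filt_zp):
    """Grouped quantized 2-D convolution on nested lists (stride 1, no padding).

    inp  is indexed [n][g][c][h][w]
    filt is indexed [g][f][c][kh][kw]
    The result is indexed [n][g][f][oh][ow].
    """
    batch, groups, chans = len(inp), len(inp[0]), len(inp[0][0])
    height, width = len(inp[0][0][0]), len(inp[0][0][0][0])
    feats = len(filt[0])
    k_h, k_w = len(filt[0][0][0]), len(filt[0][0][0][0])
    out_h, out_w = height - k_h + 1, width - k_w + 1
    out = [[[[[0] * out_w for _ in range(out_h)] for _ in range(feats)]
            for _ in range(groups)] for _ in range(batch)]
    for n in range(batch):
        for g in range(groups):
            for f in range(feats):
                for oh in range(out_h):
                    for ow in range(out_w):
                        acc = 0
                        # Each group only reads its own channels and filters.
                        for c in range(chans):
                            for kh in range(k_h):
                                for kw in range(k_w):
                                    x = inp[n][g][c][oh + kh][ow + kw] - inp_zp
                                    w = filt[g][f][c][kh][kw] - filt_zp
                                    acc += x * w
                        out[n][g][f][oh][ow] = acc
    return out


# One batch, two groups, one channel per group, 2x2 images, 1x1 filters.
inp = [[[[[1, 2], [3, 4]]], [[[5, 6], [7, 8]]]]]
filt = [[[[[2]]]], [[[[3]]]]]
out = conv_2d_ngchw_gfchw_q(inp, filt, inp_zp=1, filt_zp=1)
print(out)
```

Each group is convolved independently with its own filters, which is what separates this from an ordinary quantized convolution over all channels.
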
sjarus pushed a commit to sjarus/torch-mlir that referenced this issue Jun 6, 2024
…ering (llvm#3351)

sjarus pushed a commit to sjarus/torch-mlir that referenced this issue Jun 6, 2024