Add more thorough log checks to see if the model server failed deployment during validation #293

Open
KaiyiLiu1234 opened this issue Oct 8, 2024 · 3 comments

@KaiyiLiu1234
Collaborator

If the model server fails to access a model, it logs a "failed to" message, which the metal CI e2e run does not pick up. Resolve this by adding checks for such failure logs.
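A minimal sketch of such a check, assuming the e2e run can capture the model server log as plain text. The function names and the default keyword tuple are hypothetical; only the "failed to" phrase comes from this issue.

```python
# Hypothetical default; "failed to" is the only phrase named in this issue.
DEFAULT_FAILURE_KEYWORDS = ("failed to",)


def find_failure_lines(log_text, keywords=DEFAULT_FAILURE_KEYWORDS):
    """Return every log line that contains one of the keywords (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        line
        for line in log_text.splitlines()
        if any(keyword in line.lower() for keyword in lowered)
    ]


def assert_no_failures(log_text, keywords=DEFAULT_FAILURE_KEYWORDS):
    """Raise RuntimeError listing the offending lines, so the CI step fails loudly."""
    hits = find_failure_lines(log_text, keywords)
    if hits:
        raise RuntimeError("model server log contains failures:\n" + "\n".join(hits))
```

Raising instead of returning a flag means a validation step that forgets to inspect the result still fails.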

@KaiyiLiu1234 KaiyiLiu1234 self-assigned this Oct 8, 2024
@KaiyiLiu1234
Collaborator Author

Note: leverage the model server exporter that will soon be implemented in the model server.

@SamYuan1990
Collaborator

We can use failure keywords. Ref #288.

@KaiyiLiu1234
Collaborator Author

Referencing a major error: the e2e did not check whether the models, namely http://localhost:8080/AbsPower/BPFOnly/SGDRegressorTrainer_-1, actually existed. This name appears in the logs, but nothing checks whether the URL actually points to a model. Since the node index fix with hatch run, the name should end with _0 and not _-1. Multiple e2e runs failed to pick up this error.
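A rough sketch of how both problems could be caught, assuming model URLs in the log end in `_<node index>` as in the URL quoted above. The function names are hypothetical, and treating an HTTP 200 from a plain GET as "the model exists" is an assumption about the model server, not confirmed behavior.

```python
import re
import urllib.request

# Assumed shape: a URL whose last path component ends in "_<node index>".
_MODEL_URL = re.compile(r"https?://\S*_(-?\d+)\b")


def bad_node_index_urls(log_text: str) -> list[str]:
    """Return model URLs found in the log whose trailing node index is negative."""
    bad = []
    for match in _MODEL_URL.finditer(log_text):
        if int(match.group(1)) < 0:
            bad.append(match.group(0))
    return bad


def url_points_to_model(url: str, timeout: float = 5.0) -> bool:
    """Best-effort existence check (assumption: a GET returning 200 means the model is there)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False
```

The suffix check needs no running server, so it can run against saved logs. The reachability check only makes sense while the model server is still up.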
