Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add try/except to test-tokenizer-random.py #10276

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

rmusser01
Copy link

Added try/except to main block;
Added (optional) additional cli args:
--max-errors, default=10, "Maximum number of errors before stopping"
--iterations, default=100, help="Number of iterations for random generators"
--tokenizers, nargs="+", help="List of tokenizers to test", default=tokenizers

refactored find_first_mismatch function;
Added self.free() to the LibLlama class

@github-actions github-actions bot added testing Everything test related python python script changes labels Nov 13, 2024
Added blank lines for Lint test;
Added sequence import from typing
Removed 'free' call from Object

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 1 out of 1 changed files in this pull request and generated no suggestions.

# return min(len(ids1), len(ids2))
# Rewritten to use zip() and next() instead of for loop
def find_first_mismatch(ids1: Sequence[Any], ids2: Sequence[Any]) -> int:
return next((i for i, (a, b) in enumerate(zip(ids1, ids2)) if a != b), -1)
Copy link
Collaborator

@jaime-m-p jaime-m-p Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not equivalent with different list sizes.
Check if v3 is a good alternative:

def find_first_mismatch_v1(ids1, ids2):
    for i, (a, b) in enumerate(zip(ids1, ids2)):
        if a != b:
            return i
    if len(ids1) == len(ids2):
        return -1
    return min(len(ids1), len(ids2))

def find_first_mismatch_v2(ids1, ids2) -> int:
    return next((i for i, (a, b) in enumerate(zip(ids1, ids2)) if a != b), -1)

def find_first_mismatch_v3(ids1, ids2) -> int:
    index = next((i for i, (a, b) in enumerate(zip(ids1, ids2)) if a != b), -1)
    if index < 0 and len(ids1) != len(ids2):
        index = min(len(ids1), len(ids2))
    return index

print( find_first_mismatch_v1( [1,2,3], [1,2,3,4] ) )  # output: 3
print( find_first_mismatch_v2( [1,2,3], [1,2,3,4] ) )  # output: -1
print( find_first_mismatch_v3( [1,2,3], [1,2,3,4] ) )  # output: 3

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that/sorry for not catching it myself, will fix and push a change to my PR later today.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just replaced it with your modified method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. It was not my intention to suggest removing the types:

- def find_first_mismatch(ids1, ids2) -> int:
+ def find_first_mismatch(ids1: Sequence[Any], ids2: Sequence[Any]) -> int:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-added the types.

Updated `find_first_mismatch` from suggestion by jaime-m-p regarding incorrect checking.
@jaime-m-p
Copy link
Collaborator

What are you planning to do with --tokenizers, vocab_file and dir_tokenizer?

The main was intended to test one tokenizer. I was procrastinatingusing __main__ as an external script to loop over tokenizers.

Re-added type declarations
@rmusser01
Copy link
Author

Ah, I added them as I wasn't exactly sure of all intended use-cases, and figured adding them as optional args (with defaults that matched the initial settings) would allow for greater customizability down the line while not interfering with existing operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python python script changes testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants