
[Question]: Why is the performance of GPTTokenizer much worse than Tiktoken? #9999

Open
openvino-book opened this issue Mar 6, 2025 · 0 comments
Labels
question Further information is requested

Comments


Please describe your question

Run the benchmark code below:

from paddlenlp.transformers import GPTTokenizer
import tiktoken as tik
import time
import statistics

def benchmark_tokenizers(text, num_iterations=1000):
    # Initialize the tokenizers
    paddle_tokenizer = GPTTokenizer.from_pretrained(
        "gpt2-en",
        bos_token='<|endoftext|>'
    )
    tik_tokenizer = tik.get_encoding("gpt2")
    
    # Benchmark encoding performance
    paddle_encode_times = []
    tik_encode_times = []
    
    for _ in range(num_iterations):
        # PaddleNLP tokenizer
        start = time.perf_counter()
        paddle_tokens = paddle_tokenizer.encode(text)
        paddle_encode_times.append(time.perf_counter() - start)
        
        # Tiktoken
        start = time.perf_counter()
        tik_tokens = tik_tokenizer.encode(text)
        tik_encode_times.append(time.perf_counter() - start)
    
    # Benchmark decoding performance
    paddle_decode_times = []
    tik_decode_times = []
    
    for _ in range(num_iterations):
        # PaddleNLP tokenizer
        start = time.perf_counter()
        paddle_tokenizer.decode(paddle_tokens.input_ids)
        paddle_decode_times.append(time.perf_counter() - start)
        
        # Tiktoken
        start = time.perf_counter()
        tik_tokenizer.decode(tik_tokens)
        tik_decode_times.append(time.perf_counter() - start)
    
    return {
        "paddle_encode_avg": statistics.mean(paddle_encode_times) * 1000,  # convert to milliseconds
        "tik_encode_avg": statistics.mean(tik_encode_times) * 1000,
        "paddle_decode_avg": statistics.mean(paddle_decode_times) * 1000,
        "tik_decode_avg": statistics.mean(tik_decode_times) * 1000
    }

# Test text
text = """Hello, World! How are you doing today? What's the weather like in Beijing? 
This is a longer text sample to get more accurate benchmarking results. 
We need to process enough text to get meaningful performance measurements."""

# Run the benchmark
results = benchmark_tokenizers(text)

# Print the results
print(f"\nPerformance Comparison (average time in milliseconds):")
print(f"{'=' * 50}")
print(f"PaddleNLP Tokenizer:")
print(f"  Encoding: {results['paddle_encode_avg']:.4f} ms")
print(f"  Decoding: {results['paddle_decode_avg']:.4f} ms")
print(f"\nTiktoken:")
print(f"  Encoding: {results['tik_encode_avg']:.4f} ms")
print(f"  Decoding: {results['tik_decode_avg']:.4f} ms")
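As an aside, the per-call timing loop can be factored into a small helper so that both tokenizers are measured by identical code paths. The `time_call` function below is a hypothetical helper written for this issue, not part of PaddleNLP or tiktoken; it also reports the median alongside the mean, since the median is less sensitive to warm-up and GC outliers:

```python
import time
import statistics

def time_call(fn, num_iterations=1000):
    """Time fn() num_iterations times; return per-call mean and median in ms."""
    samples = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        fn()  # the callable under test, e.g. lambda: tokenizer.encode(text)
        samples.append(time.perf_counter() - start)
    return {
        "mean_ms": statistics.mean(samples) * 1000,
        "median_ms": statistics.median(samples) * 1000,
    }

# Usage with any zero-argument callable; bind the text via a lambda, e.g.:
#   time_call(lambda: paddle_tokenizer.encode(text))
#   time_call(lambda: tik_tokenizer.encode(text))
stats = time_call(lambda: sum(range(1000)), num_iterations=100)
```

If mean and median diverge sharply for one tokenizer, the slowdown may be dominated by a few outlier calls rather than steady per-call cost.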

Running this shows that both encoding and decoding with GPTTokenizer are significantly slower than with tiktoken:

[Screenshot: benchmark output]
