
Does the 2B model have a comprehension problem? Demo below #18

Open
summerHearts opened this issue Jan 23, 2025 · 5 comments
Comments

summerHearts commented Jan 23, 2025

import base64
from openai import OpenAI

deployment = "ollama"
instruction = "点击去出车按钮"  # i.e. "tap the 去出车 (start driving) button"
screenshot_path = "task2.jpeg"
assert deployment in ["ollama", "hf"]

if deployment == "ollama":
    client = OpenAI(
        base_url="http://127.0.0.1:11434/v1/",
        api_key="ollama",  # not used
    )
    # the model name created via the ollama CLI; check it with: ollama list
    model = "ui-tars:latest"
else:
    client = OpenAI(base_url="", api_key="")
    model = "tgi"

prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nYou are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.\n\n## Output Format\nAction_Summary: ...\nAction: ...\n\n## Action Space\nclick(start_box='<|box_start|>(x1,y1)<|box_end|>')\nlong_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')\ntype(content='')\nscroll(direction='down or up or right or left')\nopen_app(app_name='')\nnavigate_back()\nnavigate_home()\nWAIT()\nfinished() # Submit the task regardless of whether it succeeds or fails.\n\n## Note\n- Use English in Action_Summary part.\n\n## User Instruction\n"

with open(screenshot_path, "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt + instruction},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded_string}"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)

[Screenshot attached]

Every run produces a different result:
python ui_tars.py
信息位于该车辆的左上角,是一个矩形框内部含有黑色文字 [the information is at the top-left of the vehicle, a rectangular box containing black text]

Action_Sumary: click(start_box='<|box_start|>(745,265)<|box_end|>')
(agent) mars@marsyaodemini Workspace % python   ui_tars.py
_url_0.jpg

Action_Sumary: click(start_box='<|box_start|>(213,465)<|box_end|>')
(agent) mars@marsyaodemini Workspace % python   ui_tars.py

(agent) mars@marsyaodemini Workspace % python   ui_tars.py
>


Action_Summary: click(start_box='<|box_start|>(246,155)<|box_end|>')
(agent) mars@marsyaodemini Workspace % 
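For reference, the `click(...)` action string the model emits can be turned back into pixel coordinates with a small regex helper (a minimal sketch; the function name and pattern are my own, not part of the UI-TARS repo):

```python
import re

def parse_click(action: str):
    """Extract (x, y) from a UI-TARS click action such as
    click(start_box='<|box_start|>(745,265)<|box_end|>')."""
    m = re.search(r"\((\d+),\s*(\d+)\)", action)
    if m is None:
        return None  # not a coordinate-bearing action, e.g. finished()
    return int(m.group(1)), int(m.group(2))

print(parse_click("click(start_box='<|box_start|>(745,265)<|box_end|>')"))  # (745, 265)
```

With coordinates in hand you can draw a marker on the screenshot (e.g. with PIL's `ImageDraw`) to check visually whether the model clicked the right element.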
@JjjFangg

Could you try again with our latest prompt?


summerHearts commented Jan 23, 2025

> Could you try again with our latest prompt?

I've already updated it, but it is still very poor: not a single run succeeds. By contrast, https://huggingface.co/spaces/Aheader/gui_test_app , which uses the 7B model, gets it right every time. What is the cause of the problem?

Full code below:

import base64
import json
import ast
import os
import re
import io
import math
import gradio as gr
from openai import OpenAI
from datetime import datetime
from PIL import ImageDraw

deployment = "ollama"
instruction = "点击去出车"  # i.e. "tap 去出车 (start driving)"
screenshot_path = "screenshot.jpg"
assert deployment in ["ollama", "hf"]

if deployment == "ollama":
    client = OpenAI(
        base_url="http://10.30.76.32:11434/v1/",
        api_key="ollama",  # not used
    )
    # the model name created via ollama CLI, you can check it via command: `ollama list`
    model = "ui-tars:latest"
else:
    client = OpenAI(base_url="<endpoint url>", api_key="<huggingface access tokens>")
    model = "tgi"

## Below is the prompt for mobile
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 

## Output Format
```\nThought: ...
Action: ...\n```

## Action Space
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
press_home()
press_back()
finished(content='') # Submit the task regardless of whether it succeeds or fails.

## Note
- Use English in `Thought` part.

- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.

## User Instruction
"""

with open(screenshot_path, "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt + instruction},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded_string}"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
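Part of the run-to-run variance comes from default sampling. A hedged sketch of pinning the request to greedy decoding (the `temperature` and `seed` fields are standard OpenAI-compatible request options; the helper name is my own, and whether the ollama endpoint honors `seed` may vary):

```python
def build_request_kwargs(model: str, text: str, image_b64: str) -> dict:
    """Assemble chat-completion kwargs with greedy decoding so repeated
    runs on the same screenshot tend to give the same action string."""
    return {
        "model": model,
        "temperature": 0,  # greedy decoding: no sampling randomness
        "seed": 0,         # best-effort reproducibility, if the server supports it
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            },
        ],
    }

kwargs = build_request_kwargs("ui-tars:latest", "prompt+instruction", "...")
print(kwargs["temperature"], kwargs["seed"])
```

Usage with the script above would be `response = client.chat.completions.create(**build_request_kwargs(model, prompt + instruction, encoded_string))`. This removes sampling noise, though it will not fix errors that come from the model itself misreading the screenshot.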


JjjFangg commented Jan 23, 2025

The GGUF model has undergone quantization, but unfortunately, its performance cannot be guaranteed. As a result, we have decided to downgrade it.

@brucevoin

Indeed, ollama with the 2B model (https://huggingface.co/bytedance-research/UI-TARS-2B-gguf/tree/main) barely works at all.

@DecentMakeover

Can we run this using GPT-4o?
