
Does the 2B model have a comprehension problem? Demo below #18

Open
summerHearts opened this issue Jan 23, 2025 · 5 comments
Comments

summerHearts commented Jan 23, 2025

import base64
from openai import OpenAI

deployment = "ollama"
instruction = "点击去出车按钮"  # i.e. "tap the 去出车 (start driving) button"
screenshot_path = "task2.jpeg"
assert deployment in ["ollama", "hf"]

if deployment == "ollama":
    client = OpenAI(
        base_url="http://127.0.0.1:11434/v1/",
        api_key="ollama",  # not used
    )
    # the model name created via the ollama CLI; check it with: ollama list
    model = "ui-tars:latest"
else:
    client = OpenAI(base_url="", api_key="")
    model = "tgi"

prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nYou are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.\n\n## Output Format\nAction_Summary: ...\nAction: ...\n\n## Action Space\nclick(start_box='<|box_start|>(x1,y1)<|box_end|>')\nlong_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')\ntype(content='')\nscroll(direction='down or up or right or left')\nopen_app(app_name='')\nnavigate_back()\nnavigate_home()\nWAIT()\nfinished() # Submit the task regardless of whether it succeeds or fails.\n\n## Note\n- Use English in Action_Summary part.\n\n## User Instruction\n"

with open(screenshot_path, "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt + instruction},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded_string}"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)

[Screenshot attached]

Every run produces a different result:
python ui_tars.py
信息位于该车辆的左上角,是一个矩形框内部含有黑色文字 [the information is at the top-left of the vehicle, a rectangular box containing black text]

Action_Sumary: click(start_box='<|box_start|>(745,265)<|box_end|>')
(agent) mars@marsyaodemini Workspace % python   ui_tars.py
_url_0.jpg

Action_Sumary: click(start_box='<|box_start|>(213,465)<|box_end|>')
(agent) mars@marsyaodemini Workspace % python   ui_tars.py

(agent) mars@marsyaodemini Workspace % python   ui_tars.py
>


Action_Summary: click(start_box='<|box_start|>(246,155)<|box_end|>')
(agent) mars@marsyaodemini Workspace % 
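For reference, the `click(...)` action string the model emits can be turned back into pixel coordinates with a small regex helper (a minimal sketch; the function name and pattern are my own, not part of the UI-TARS repo):

```python
import re

def parse_click(action: str):
    """Extract (x, y) from a UI-TARS click action such as
    click(start_box='<|box_start|>(745,265)<|box_end|>')."""
    m = re.search(r"\((\d+),\s*(\d+)\)", action)
    if m is None:
        return None  # not a coordinate-bearing action, e.g. finished()
    return int(m.group(1)), int(m.group(2))

print(parse_click("click(start_box='<|box_start|>(745,265)<|box_end|>')"))  # (745, 265)
```

With coordinates in hand you can draw a marker on the screenshot (e.g. with PIL's `ImageDraw`) to check visually whether the model clicked the right element.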
@JjjFangg

Could you try again with our latest prompt?


summerHearts commented Jan 23, 2025

> Could you try again with our latest prompt?

I've already updated it, but it is still very poor: not a single run succeeds. By contrast, https://huggingface.co/spaces/Aheader/gui_test_app , which uses the 7B model, gets it right every time. What is the cause of the problem?

Full code below:

import base64
import json
import ast
import os
import re
import io
import math
import gradio as gr
from openai import OpenAI
from datetime import datetime
from PIL import ImageDraw

deployment = "ollama"
instruction = "点击去出车"  # i.e. "tap 去出车 (start driving)"
screenshot_path = "screenshot.jpg"
assert deployment in ["ollama", "hf"]

if deployment == "ollama":
    client = OpenAI(
        base_url="http://10.30.76.32:11434/v1/",
        api_key="ollama",  # not used
    )
    # the model name created via ollama CLI, you can check it via command: `ollama list`
    model = "ui-tars:latest"
else:
    client = OpenAI(base_url="<endpoint url>", api_key="<huggingface access tokens>")
    model = "tgi"

## Below is the prompt for mobile
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 

## Output Format
```\nThought: ...
Action: ...\n```

## Action Space
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
press_home()
press_back()
finished(content='') # Submit the task regardless of whether it succeeds or fails.

## Note
- Use English in `Thought` part.

- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.

## User Instruction
"""

with open(screenshot_path, "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt + instruction},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded_string}"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
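Part of the run-to-run variance comes from default sampling. A hedged sketch of pinning the request to greedy decoding (the `temperature` and `seed` fields are standard OpenAI-compatible request options; the helper name is my own, and whether the ollama endpoint honors `seed` may vary):

```python
def build_request_kwargs(model: str, text: str, image_b64: str) -> dict:
    """Assemble chat-completion kwargs with greedy decoding so repeated
    runs on the same screenshot tend to give the same action string."""
    return {
        "model": model,
        "temperature": 0,  # greedy decoding: no sampling randomness
        "seed": 0,         # best-effort reproducibility, if the server supports it
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            },
        ],
    }

kwargs = build_request_kwargs("ui-tars:latest", "prompt+instruction", "...")
print(kwargs["temperature"], kwargs["seed"])
```

Usage with the script above would be `response = client.chat.completions.create(**build_request_kwargs(model, prompt + instruction, encoded_string))`. This removes sampling noise, though it will not fix errors that come from the model itself misreading the screenshot.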


JjjFangg commented Jan 23, 2025

The GGUF model has undergone quantization, but unfortunately, its performance cannot be guaranteed. As a result, we have decided to downgrade it.

@brucevoin

Indeed, ollama with the 2B model (https://huggingface.co/bytedance-research/UI-TARS-2B-gguf/tree/main) barely works at all.

@DecentMakeover

Can we run this using GPT-4o?
