Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: simplify file type checking from MIME to extension #342

Merged

Conversation

kingdomad
Copy link
Contributor

The original upload_file method in GradioUI has several issues:

  • Using a list as a default parameter: This is unsafe because if the parameter is modified, it will affect the global state.
  • The logic for file type detection is quite problematic: It uses the mimetypes.guess_type method, which internally matches file types based on file extensions. After determining the file type with mimetypes.guess_type, it uses a reverse dictionary constructed from mimetypes.types_map to match the file type back to a file extension. This is redundant and overcomplicated. Additionally, the original code introduces bugs because multiple different file extensions can correspond to the same file type. This unintentionally modifies the original file extension.

My submitted code improves these two points by directly filtering valid files based on their extensions. This not only simplifies the logic but also makes the method more user-friendly.

Below is the problematic code from the original method:

 def upload_file(
        self,
        file,
        file_uploads_log,
        allowed_file_types=[
            "application/pdf",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "text/plain",
        ],
    ):
        """
        Handle file uploads, default allowed types are .pdf, .docx, and .txt
        """
        import gradio as gr

        if file is None:
            return gr.Textbox("No file uploaded", visible=True), file_uploads_log

        try:
            mime_type, _ = mimetypes.guess_type(file.name)
        except Exception as e:
            return gr.Textbox(f"Error: {e}", visible=True), file_uploads_log

        if mime_type not in allowed_file_types:
            return gr.Textbox("File type disallowed", visible=True), file_uploads_log

        # Sanitize file name
        original_name = os.path.basename(file.name)
        sanitized_name = re.sub(
            r"[^\w\-.]", "_", original_name
        )  # Replace any non-alphanumeric, non-dash, or non-dot characters with underscores

        type_to_ext = {}
        for ext, t in mimetypes.types_map.items():
            if t not in type_to_ext:
                type_to_ext[t] = ext

        # Ensure the extension correlates to the mime type
        sanitized_name = sanitized_name.split(".")[:-1]
        sanitized_name.append("" + type_to_ext[mime_type])
        sanitized_name = "".join(sanitized_name)

        # Save the uploaded file to the specified folder
        file_path = os.path.join(self.file_upload_folder, os.path.basename(sanitized_name))
        shutil.copy(file.name, file_path)

        return gr.Textbox(f"File uploaded: {file_path}", visible=True), file_uploads_log + [file_path]

@kingdomad
Copy link
Contributor Author

@aymeric-roucher Can you review my PR?

@kingdomad
Copy link
Contributor Author

kingdomad commented Feb 9, 2025

Closes #566

@sysradium
Copy link
Contributor

Built-in mimetypes has a weak security, but this makes it essentially non-existent. Would not a better lib like python-magic work better? E.g. #569

@kingdomad
Copy link
Contributor Author

Built-in mimetypes has a weak security, but this makes it essentially non-existent. Would not a better lib like python-magic work better? E.g. #569

No need to overcomplicate it, just determine based on the file extension.

@sysradium
Copy link
Contributor

sysradium commented Feb 9, 2025

Then if you expose your setup via GradIO with shared=True you may get unwanted consequences ...
Let's see what maintainers have to say. I assume they had a similar idea adding this check in the first place.

@kingdomad
Copy link
Contributor Author

Then if you expose your setup via GradIO with shared=True you may get unwanted consequences ... Let's see what maintainers have to say. I assume they had a similar idea adding this check in the first place.

Security is undoubtedly important, but the current version of the code has more critical issues beyond security. The mimetypes detection logic in the current code is overly redundant and contains logical errors. Let’s address these issues first. My pull request does not introduce new problems. On the contrary, it simplifies the logic and resolves bugs. The current version of GradioUI is not suitable for sharing and can only be used as a personal demonstration toy, due to many other issues, such as the agent being shared across all threads.

@aymeric-roucher
Copy link
Collaborator

@kingdomad thank you for the submission!
Thinking about this, maybe keeping mimetypes is better to handle unexpected but correct types: for instance, your solution would not allow ".doc" files anymore althought they should technically be accepted.

Could you go back to using mimetypes, while fixing the other problems you pointed?

@kingdomad
Copy link
Contributor Author

@kingdomad thank you for the submission! Thinking about this, maybe keeping mimetypes is better to handle unexpected but correct types: for instance, your solution would not allow ".doc" files anymore althought they should technically be accepted.

Could you go back to using mimetypes, while fixing the other problems you pointed?

My code maintains the original code’s allowed file types for uploading. When the original code mimetypes.guess_type(file.name) is executed, its underlying code searches a mapping table based on the file’s extension, as follows:

class MimeTypes:
    def guess_type(self, url, strict=True):
        ...
        if ext in types_map:
            return types_map[ext], encoding
        ...

The two entries in the aforementioned types_map are as follows:

{
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".doc": "application/msword"
}

Therefore, after the original code is executed, the .doc extension files are mapped to the mime_type “application/msword”, which is not included in the allowed_file_types list, as shown below:

class GradioUI:
    def upload_file(
        self,
        file,
        file_uploads_log,
        allowed_file_types=[
            "application/pdf",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "text/plain",
        ],
    ):
        """
        Handle file uploads, default allowed types are .pdf, .docx, and .txt
        """
        import gradio as gr

        if file is None:
            return gr.Textbox("No file uploaded", visible=True), file_uploads_log

        try:
            mime_type, _ = mimetypes.guess_type(file.name)
        except Exception as e:
            return gr.Textbox(f"Error: {e}", visible=True), file_uploads_log

        if mime_type not in allowed_file_types:
            return gr.Textbox("File type disallowed", visible=True), file_uploads_log

The core logic of mimetypes.guess_type(file.name) is essentially the following two lines, which match using the file extension:

        if ext in types_map:
            return types_map[ext], encoding

It would be more straightforward to directly filter by extension in the GradioUI method. If you want the program to allow uploading .doc files, you can simply add ‘.doc’ to the allowed_file_types parameter in my modified method; the logic is very clear. Exposing the file extension directly gives users better control, whereas with MIME types, I’m not exactly sure what specific file they correspond to.
The common file types are actually not that many. When you can list out the file extensions that your program allows, you can then make targeted optimizations.
The original code has another serious issue. When using mimetypes, I found it impossible to achieve the goal of only allowing files with the .txt extension, because the mimetype "text/plain" corresponds to several file extensions.
What do you think?

@aymeric-roucher
Copy link
Collaborator

@kingdomad thank you for the explanation! Being unfamiliar with the specific challenges that people will experience when using files with this functionality (because I didn't use it myself), I was thinking a "standard" solution like mimetypes would be more robust. But your explanations make total sense ! So let's merge this! 😃

@aymeric-roucher
Copy link
Collaborator

aymeric-roucher commented Feb 13, 2025

But before that, could you add a test for uploading files?

@sysradium
Copy link
Contributor

I would have kept proper validation instead, just picking a better validator. @aymeric-roucher I would have discussed it within the team

@kingdomad
Copy link
Contributor Author

But before that, could you add a test for uploading files?

Sure, I'll submit it shortly.

@kingdomad
Copy link
Contributor Author

@aymeric-roucher Done.

@aymeric-roucher
Copy link
Collaborator

@sysradium let's merge this for short term functionality, since public usage is not that wide-spread yet that security would be a huge concern for this functionality. But I see you've opened #569, it's great, checking type with filetype might be a good solution to add!

@aymeric-roucher
Copy link
Collaborator

@kingdomad you have to run ruff check and ruff format to pass tests.

@kingdomad
Copy link
Contributor Author

@kingdomad you have to run ruff check and ruff format to pass tests.

Done.

@aymeric-roucher aymeric-roucher merged commit a940f42 into huggingface:main Feb 14, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants