Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix file filtering logic to handle file name patterns like output.txt #64

Closed
wants to merge 2 commits into from

Conversation

slusset
Copy link

@slusset slusset commented Jan 18, 2025

Description:

This pull request resolves an issue where the --exclude flag did not correctly handle file name patterns such as output.txt. The bug caused unexpected behavior when attempting to exclude files based solely on their name, without using a glob pattern or a fully qualified path.

The issue also extends to relative patterns like ./output.txt. However, as output.txt provides a simpler and more conventional way to specify a file name, support for ./output.txt has been intentionally omitted to maintain clarity and reduce ambiguity.

Steps to Reproduce the Issue

  1. Clone the repository and build the project.
  2. Run the following command in a directory containing a file named output.txt: code2prompt --tokens --output=output.txt . --exclude="output.txt"
  3. Observe that the file output.txt is not excluded, even though it matches the --exclude pattern.

Root Cause

  1. The path parameter in should_include_file is canonicalized to an absolute path (e.g., /Users/test/code2prompt/output.txt).
  2. Patterns like output.txt were not properly matched against the file name or canonicalized path.
  3. Relative patterns like ./output.txt require resolving the current directory, adding unnecessary complexity. These patterns are not supported in this fix.

Solution

The filtering logic was updated to:

  1. Match file name patterns (output.txt) directly against the file name extracted from the path, ensuring they work as expected.
  2. Use a centralized helper function, matches_any_pattern, to handle both include_patterns and exclude_patterns, ensuring consistency.
  3. Canonicalize patterns only when appropriate, avoiding unnecessary std::io::Errors for non-existent files or glob patterns.

Tests Added

  1. The following automated tests were added to validate the fix:
  2. Exclude by file name: Ensures that output.txt is excluded when the --exclude flag specifies the file name.
  3. Exclude by glob pattern:Ensures that files matching **/output.txt are excluded regardless of their directory.
  4. Handle non-existent files:Verifies that the program gracefully handles non-existent files without crashing due to std::io::Error.
  5. Include and exclude conflict resolution: Validates that include_priority resolves conflicts when a file matches both include_patterns and exclude_patterns.

How to Test

Build the project from this branch.
Run the following test commands:
Exclude a file by name: code2prompt --tokens --output=output.txt . --exclude="output.txt"
Exclude a file using a glob pattern: code2prompt --tokens --output=output.txt . --exclude="**/output.txt"
Verify that the file output.txt is excluded in all cases.

Notes

The version was incremented to 2.0.1.
Relative patterns like ./output.txt are not supported in this fix, as they introduce unnecessary complexity when specifying file names.

Ted Slusser added 2 commits January 18, 2025 11:55
Bumped the version to 2.0.1 and added new utility tests for file inclusion/exclusion functionality. Refactored filtering logic by centralizing pattern matching into a helper function, improving code readability and robustness.
@ODAncona ODAncona self-requested a review January 25, 2025 11:07
@ODAncona
Copy link
Collaborator

Hey @slusset,

Thanks for the effort you put into addressing this issue. Here's my take on the proposed changes:

  • glob supports ** but not ./, which is a legit use case you're tackling, so 👍 for identifying the gap.
  • That said, imo, we should avoid custom ad hoc code. It adds complexity/maintenance overhead and duplicates what a lib could do better. Pattern matching should be fully delegated, rather than re-implemented here.
  • Tbh, I'm not 100% sure glob is the right fit anymore. globset might be a good fit, but it might be overkill as it uses regular expressions. I'm concerned about the performance loss.

Relevant discussions:

Open to your thoughts! 🚀

@slusset
Copy link
Author

slusset commented Feb 5, 2025

Hi @ODAncona!

Yes, if a 'globbing' library is more suitable to this use case and is already available I agree it should be favored over a custom globbing implementation. Full disclosure, this is my first introduction to Rust, but I do find this tool (code2prompt) very useful for prompting LLM's and am happy to assist with its development!

I will try globset on this use case and let you know what I find out.

👍🏼

src/filter.rs Show resolved Hide resolved
@ODAncona ODAncona mentioned this pull request Feb 10, 2025
@ODAncona ODAncona closed this Feb 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants