
Processing fixes #189

Merged
merged 4 commits into from
Nov 25, 2024

Conversation

@maehr (Member) commented Nov 22, 2024

Pull request

Proposed changes

Types of changes

  • New feature (non-breaking change that adds functionality)
  • Enhancement (non-breaking change that enhances functionality)
  • Bug fix (non-breaking change that fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the CONTRIBUTING document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Summary by CodeRabbit

  • New Features

    • Introduced a string normalization function to ensure consistent formatting of data.
    • Enhanced image handling based on item visibility, using placeholder images for non-public items.
  • Improvements

    • Streamlined data processing by combining item and media records for more efficient output.

@maehr maehr requested a review from Copilot November 22, 2024 13:07

changeset-bot bot commented Nov 22, 2024

⚠️ No Changeset found

Latest commit: d9822d6

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


coderabbitai bot commented Nov 22, 2024

Walkthrough

The changes in this pull request involve updates to the .github/workflows/process_data.py script. Key modifications include the addition of a unicodedata import for string normalization, updates to the extract_item_data and extract_media_data functions to manage image paths based on the visibility of items, and the introduction of a new normalize_record function. This function ensures all string fields are normalized to Unicode NFC form before saving processed data. The overall data processing flow is streamlined by consolidating item and media records into a single list for output.

Changes

File: .github/workflows/process_data.py

  • Added import for unicodedata.
  • Updated extract_item_data and extract_media_data to conditionally set local_image_path based on o:is_public.
  • Introduced normalize_record function for Unicode NFC normalization.
  • Adjusted main processing logic to accumulate items into items_processed for output.
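The Unicode issue behind these changes can be illustrated with a minimal sketch: a decomposed umlaut (base letter plus combining diaeresis, U+0308) renders identically to the precomposed character but compares unequal, which is what broke searching for German umlauts. NFC normalization collapses the pair:

```python
import unicodedata

# "Zürich" with a decomposed ü: "u" followed by U+0308 (combining diaeresis)
decomposed = "Zu\u0308rich"
composed = unicodedata.normalize("NFC", decomposed)

print(composed == "Z\u00fcrich")  # True: u + ¨ collapsed into precomposed ü
print(len(decomposed), len(composed))  # 7 6
```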

Possibly related PRs

  • Uv python #166: The changes in this PR involve modifications to the .github/workflows/process_data.py file, which is directly related to the main PR's updates to the same file, although the specific changes differ.

Suggested reviewers

  • koilebeit

Poem

In the garden where data flows,
A rabbit hops where the Unicode glows.
With images bright and strings so neat,
We gather our records, a joyful feat!
Normalize the paths, let consistency reign,
In our workflow's dance, we’ll never complain! 🐇✨


Copilot reviewed 1 out of 1 changed files in this pull request and generated no suggestions.

Comments skipped due to low confidence (1)

.github/workflows/process_data.py:185

  • Duplicated comment. Please remove the redundant part.
# Download the thumbnail image if available and valid    # Download the thumbnail image if available and valid
@maehr maehr requested a review from koilebeit November 22, 2024 13:07
@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
.github/workflows/process_data.py (1)

229-235: Consider handling nested data structures

The function effectively normalizes top-level string fields but doesn't handle nested structures (e.g., lists or dictionaries within the record).

Consider this enhanced implementation:

 def normalize_record(record):
     """Normalizes all string fields in a record to Unicode NFC form."""
+    def normalize_value(value):
+        if isinstance(value, str):
+            return unicodedata.normalize("NFC", value)
+        elif isinstance(value, list):
+            return [normalize_value(v) for v in value]
+        elif isinstance(value, dict):
+            return {k: normalize_value(v) for k, v in value.items()}
+        return value
+
     return {
-        key: unicodedata.normalize("NFC", value) if isinstance(value, str) else value
+        key: normalize_value(value)
         for key, value in record.items()
     }
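The suggested recursive variant can be exercised standalone; the record below is purely illustrative, not taken from the actual dataset:

```python
import unicodedata

def normalize_record(record):
    """Recursively normalize all string fields in a record to Unicode NFC form."""
    def normalize_value(value):
        if isinstance(value, str):
            return unicodedata.normalize("NFC", value)
        elif isinstance(value, list):
            return [normalize_value(v) for v in value]
        elif isinstance(value, dict):
            return {k: normalize_value(v) for k, v in value.items()}
        return value

    return {key: normalize_value(value) for key, value in record.items()}

# Strings with decomposed umlauts at the top level and inside a list:
record = {"title": "Mu\u0308nster", "tags": ["Bu\u0308cher"], "count": 3}
normalized = normalize_record(record)
print(normalized["title"])  # Münster, now in precomposed (NFC) form
```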
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7b31584 and 0eafb0d.

📒 Files selected for processing (1)
  • .github/workflows/process_data.py (4 hunks)
🔇 Additional comments (2)
.github/workflows/process_data.py (2)

4-4: LGTM: Standard library import properly placed

The addition of the unicodedata import is well-placed with other standard library imports and is necessary for Unicode normalization functionality.


147-150: Verify placeholder image existence

The code now correctly handles non-public items by using a placeholder image. However, we should verify that the placeholder image exists at the specified path.

✅ Verification successful

Placeholder image exists at the correct location

The placeholder image placeholder.svg exists at the specified path assets/img/placeholder.svg, so the code is correctly handling non-public items by using an existing placeholder image.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the placeholder image exists
if [ ! -f "assets/img/placeholder.svg" ]; then
    echo "Warning: Placeholder image not found at assets/img/placeholder.svg"
    # Look for the file in other common locations
    echo "Searching for placeholder.svg in repository:"
    fd placeholder.svg
fi

Length of output: 43


Script:

#!/bin/bash
# Let's try a more comprehensive search for any SVG files and potential placeholder images
echo "Searching for SVG files:"
fd -t f -e svg

echo -e "\nSearching for any files with 'placeholder' in the name:"
fd placeholder

echo -e "\nChecking if 'assets' directory exists and its structure:"
if [ -d "assets" ]; then
    ls -R assets/
fi

echo -e "\nChecking for any image directories in the repository:"
fd -t d "(images?|img)"

Length of output: 89353
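The visibility check discussed above can be sketched as follows. The field name o:is_public, the key thumbnail_display_urls, and the placeholder path are taken from the review comments; resolve_image_path itself is a hypothetical helper, not the actual function in process_data.py:

```python
PLACEHOLDER_IMAGE = "assets/img/placeholder.svg"

def resolve_image_path(item):
    """Return a local image path, falling back to a placeholder
    for items that are not publicly visible."""
    if not item.get("o:is_public", False):
        return PLACEHOLDER_IMAGE
    # Hypothetical lookup: derive the path from the item's thumbnail URLs.
    return item.get("thumbnail_display_urls", {}).get("large", PLACEHOLDER_IMAGE)

print(resolve_image_path({"o:is_public": False}))  # assets/img/placeholder.svg
```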

@coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0eafb0d and 4a647a8.

📒 Files selected for processing (1)
  • .github/workflows/process_data.py (4 hunks)
🧰 Additional context used
📓 Learnings (1)
.github/workflows/process_data.py (1)
Learnt from: maehr
PR: Stadt-Geschichte-Basel/forschung.stadtgeschichtebasel.ch#189
File: .github/workflows/process_data.py:243-255
Timestamp: 2024-11-22T13:11:34.574Z
Learning: Unicode normalization performed in the `normalize_record` function in `.github/workflows/process_data.py` is not error-prone and does not require additional error handling.
🔇 Additional comments (4)
.github/workflows/process_data.py (4)

4-4: LGTM: Standard library import properly placed

The addition of the unicodedata import is well-placed with other standard library imports and is necessary for the Unicode normalization functionality.


229-235: LGTM: Clean implementation of Unicode normalization

The normalize_record function provides a clean implementation for normalizing string fields to NFC form while properly handling non-string values.


243-255: LGTM: Streamlined processing flow with proper normalization

The updated processing flow effectively combines items and media into a single list while ensuring proper Unicode normalization.

Let's verify the data processing flow:

✅ Verification successful

Processing flow and normalization implementation verified

The code correctly:

  • Processes each item and its associated media into a single list
  • Applies Unicode NFC normalization to all string fields
  • Maintains proper relationship between items and their media through objectid
  • Handles the complete data flow from extraction to normalization before saving
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for proper handling of items and media records
rg -A 5 "items_processed\.append" | grep -v "test"

Length of output: 50


Script:

#!/bin/bash
# Let's try a different approach to find the data processing logic
# First, let's locate the file
fd "process_data.py"

# Then examine the content around the processing logic
rg -B 2 -A 5 "extract_item_data" .github/workflows/process_data.py

Length of output: 696


Script:

#!/bin/bash
# Let's verify the normalization step and media handling
# Check normalize_record implementation
rg -B 2 -A 5 "normalize_record" .github/workflows/process_data.py

# Check get_media implementation to understand media data handling
rg -B 2 -A 5 "get_media" .github/workflows/process_data.py

# Check extract_media_data implementation
rg -B 2 -A 5 "extract_media_data" .github/workflows/process_data.py

Length of output: 2011


147-150: Verify handling of private items

Good security improvement by using a placeholder for non-public items. The implementation correctly prevents exposure of private content.

Let's verify the handling of private items:

✅ Verification successful

Verified: Private content protection is properly implemented

The implementation correctly handles private content protection:

  • The placeholder image exists at the correct path assets/img/placeholder.svg
  • The code consistently checks o:is_public flag before exposing content:
    • In extract_item_data() for thumbnails
    • In extract_media_data() for original media URLs
  • The fallback to placeholder.svg is properly implemented when content is not public
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for consistent handling of private items
rg -l "o:is_public" | xargs rg "placeholder\.svg"

Length of output: 49


Script:

#!/bin/bash
# Let's first check where o:is_public is used
rg "o:is_public" -A 2 -B 2

# Then separately check for placeholder.svg references
rg "placeholder\.svg" -A 2 -B 2

# Also check if there are any other thumbnail handling patterns
rg "thumbnail_display_urls" -A 2 -B 2

Length of output: 98


Script:

#!/bin/bash
# Let's check the content of the process_data.py file
cat .github/workflows/process_data.py

# Also check for any asset directories that might contain the placeholder
fd placeholder.svg

Length of output: 10562

@maehr maehr requested a review from Copilot November 22, 2024 13:15


Copilot reviewed 1 out of 1 changed files in this pull request and generated no suggestions.

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
.github/workflows/process_data.py (1)

254-255: Add logging for processed items count

Consider adding logging to track the number of processed items for better monitoring.

     # Normalize all string fields in the records to avoid decomposed Unicode form Umlaute ¨ + o -> ö
     items_normalized = [normalize_record(record) for record in items_processed]
+    logging.info(f"Processed {len(items_normalized)} records ({len([r for r in items_normalized if not r.get('parentid')])} items, {len([r for r in items_normalized if r.get('parentid')])} media files)")
 
     # Save data to CSV and JSON formats
     save_to_files(items_normalized, CSV_PATH, JSON_PATH)

Also applies to: 258-258

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4a647a8 and d9822d6.

📒 Files selected for processing (1)
  • .github/workflows/process_data.py (4 hunks)
🧰 Additional context used
📓 Learnings (1)
.github/workflows/process_data.py (1)
Learnt from: maehr
PR: Stadt-Geschichte-Basel/forschung.stadtgeschichtebasel.ch#189
File: .github/workflows/process_data.py:243-255
Timestamp: 2024-11-22T13:11:34.574Z
Learning: Unicode normalization performed in the `normalize_record` function in `.github/workflows/process_data.py` is not error-prone and does not require additional error handling.
🔇 Additional comments (4)
.github/workflows/process_data.py (4)

4-4: LGTM! Good security improvement with visibility check

The addition of the unicodedata import and the visibility-based image path handling enhances security by properly restricting access to non-public items.

Also applies to: 149-150


189-191: Standardize fallback image handling

There's an inconsistency in fallback image paths:

  • Items use: assets/img/placeholder.svg
  • Media uses: assets/img/no-image.svg

229-235: LGTM! Clean implementation of Unicode normalization

The normalize_record function:

  • Correctly handles both string and non-string values
  • Uses NFC form which is appropriate for normalizing Umlauts
  • Maintains a clean dictionary comprehension structure

243-253: LGTM! Streamlined data processing flow

The simplified approach of using a single list for both items and media records improves code maintainability.
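The consolidated flow can be sketched as below. The collaborator names (extract_item_data, get_media, extract_media_data, normalize_record) follow the review excerpts; they are passed in as parameters here with stub implementations, since the real functions live in process_data.py:

```python
def process_records(items, get_media, extract_item_data, extract_media_data, normalize_record):
    """Accumulate item and media records into a single list, then normalize."""
    items_processed = []
    for item in items:
        items_processed.append(extract_item_data(item))
        for media in get_media(item):
            items_processed.append(extract_media_data(media))
    # Normalize every string field to NFC before saving to CSV/JSON.
    return [normalize_record(record) for record in items_processed]

# Stub collaborators to show the shape of the flow:
result = process_records(
    items=[{"id": 1}],
    get_media=lambda item: [{"media_id": 2}],
    extract_item_data=lambda i: {"objectid": i["id"]},
    extract_media_data=lambda m: {"parentid": m["media_id"]},
    normalize_record=lambda r: r,
)
print(result)  # [{'objectid': 1}, {'parentid': 2}]
```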

@maehr maehr merged commit 42dd6de into Stadt-Geschichte-Basel:main Nov 25, 2024
3 checks passed
Development

Successfully merging this pull request may close these issues:

German umlauts in search
2 participants