
Processing fixes #189

Merged
merged 4 commits into from
Nov 25, 2024

Conversation

@maehr (Member) commented Nov 22, 2024

Pull request

Proposed changes

Types of changes

  • New feature (non-breaking change that adds functionality)
  • Enhancement (non-breaking change that enhances functionality)
  • Bug fix (non-breaking change that fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the CONTRIBUTING document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Summary by CodeRabbit

  • New Features

    • Introduced a string normalization function to ensure consistent formatting of data.
    • Enhanced image handling based on item visibility, using placeholder images for non-public items.
  • Improvements

    • Streamlined data processing by combining item and media records for more efficient output.

@maehr maehr requested a review from Copilot November 22, 2024 13:07

changeset-bot bot commented Nov 22, 2024

⚠️ No Changeset found

Latest commit: d9822d6

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


coderabbitai bot commented Nov 22, 2024

Walkthrough

The changes in this pull request involve updates to the .github/workflows/process_data.py script. Key modifications include the addition of a unicodedata import for string normalization, updates to the extract_item_data and extract_media_data functions to manage image paths based on the visibility of items, and the introduction of a new normalize_record function. This function ensures all string fields are normalized to Unicode NFC form before saving processed data. The overall data processing flow is streamlined by consolidating item and media records into a single list for output.

Changes

File: .github/workflows/process_data.py

  • Added import for unicodedata.
  • Updated extract_item_data and extract_media_data to conditionally set local_image_path based on o:is_public.
  • Introduced normalize_record function for Unicode NFC normalization.
  • Adjusted main processing logic to accumulate items into items_processed for output.
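The Unicode issue behind these changes can be illustrated with a minimal sketch: a decomposed umlaut (base letter plus combining diaeresis, U+0308) renders identically to the precomposed character but compares unequal, which is what broke searching for German umlauts. NFC normalization collapses the pair:

```python
import unicodedata

# "Zürich" with a decomposed ü: "u" followed by U+0308 (combining diaeresis)
decomposed = "Zu\u0308rich"
composed = unicodedata.normalize("NFC", decomposed)

print(composed == "Z\u00fcrich")  # True: u + ¨ collapsed into precomposed ü
print(len(decomposed), len(composed))  # 7 6
```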

Possibly related PRs

  • Uv python #166: The changes in this PR involve modifications to the .github/workflows/process_data.py file, which is directly related to the main PR's updates to the same file, although the specific changes differ.

Suggested reviewers

  • koilebeit

Poem

In the garden where data flows,
A rabbit hops where the Unicode glows.
With images bright and strings so neat,
We gather our records, a joyful feat!
Normalize the paths, let consistency reign,
In our workflow's dance, we’ll never complain! 🐇✨


Copilot reviewed 1 out of 1 changed files in this pull request and generated no suggestions.

Comments skipped due to low confidence (1)

.github/workflows/process_data.py:185

  • Duplicated comment. Please remove the redundant part.
# Download the thumbnail image if available and valid    # Download the thumbnail image if available and valid
@maehr maehr requested a review from koilebeit November 22, 2024 13:07
@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
.github/workflows/process_data.py (1)

229-235: Consider handling nested data structures

The function effectively normalizes top-level string fields but doesn't handle nested structures (e.g., lists or dictionaries within the record).

Consider this enhanced implementation:

 def normalize_record(record):
     """Normalizes all string fields in a record to Unicode NFC form."""
+    def normalize_value(value):
+        if isinstance(value, str):
+            return unicodedata.normalize("NFC", value)
+        elif isinstance(value, list):
+            return [normalize_value(v) for v in value]
+        elif isinstance(value, dict):
+            return {k: normalize_value(v) for k, v in value.items()}
+        return value
+
     return {
-        key: unicodedata.normalize("NFC", value) if isinstance(value, str) else value
+        key: normalize_value(value)
         for key, value in record.items()
     }
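The suggested recursive variant can be exercised standalone; the record below is purely illustrative, not taken from the actual dataset:

```python
import unicodedata

def normalize_record(record):
    """Recursively normalize all string fields in a record to Unicode NFC form."""
    def normalize_value(value):
        if isinstance(value, str):
            return unicodedata.normalize("NFC", value)
        elif isinstance(value, list):
            return [normalize_value(v) for v in value]
        elif isinstance(value, dict):
            return {k: normalize_value(v) for k, v in value.items()}
        return value

    return {key: normalize_value(value) for key, value in record.items()}

# Strings with decomposed umlauts at the top level and inside a list:
record = {"title": "Mu\u0308nster", "tags": ["Bu\u0308cher"], "count": 3}
normalized = normalize_record(record)
print(normalized["title"])  # Münster, now in precomposed (NFC) form
```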
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7b31584 and 0eafb0d.

📒 Files selected for processing (1)
  • .github/workflows/process_data.py (4 hunks)
🔇 Additional comments (2)
.github/workflows/process_data.py (2)

4-4: LGTM: Standard library import properly placed

The addition of the unicodedata import is well-placed with other standard library imports and is necessary for Unicode normalization functionality.


147-150: Verify placeholder image existence

The code now correctly handles non-public items by using a placeholder image. However, we should verify that the placeholder image exists at the specified path.

✅ Verification successful

Placeholder image exists at the correct location

The placeholder image placeholder.svg exists at the specified path assets/img/placeholder.svg, so the code is correctly handling non-public items by using an existing placeholder image.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the placeholder image exists
if [ ! -f "assets/img/placeholder.svg" ]; then
    echo "Warning: Placeholder image not found at assets/img/placeholder.svg"
    # Look for the file in other common locations
    echo "Searching for placeholder.svg in repository:"
    fd placeholder.svg
fi

Length of output: 43


Script:

#!/bin/bash
# Let's try a more comprehensive search for any SVG files and potential placeholder images
echo "Searching for SVG files:"
fd -t f -e svg

echo -e "\nSearching for any files with 'placeholder' in the name:"
fd placeholder

echo -e "\nChecking if 'assets' directory exists and its structure:"
if [ -d "assets" ]; then
    ls -R assets/
fi

echo -e "\nChecking for any image directories in the repository:"
fd -t d "(images?|img)"

Length of output: 89353
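The visibility check discussed above can be sketched as follows. The field name o:is_public, the key thumbnail_display_urls, and the placeholder path are taken from the review comments; resolve_image_path itself is a hypothetical helper, not the actual function in process_data.py:

```python
PLACEHOLDER_IMAGE = "assets/img/placeholder.svg"

def resolve_image_path(item):
    """Return a local image path, falling back to a placeholder
    for items that are not publicly visible."""
    if not item.get("o:is_public", False):
        return PLACEHOLDER_IMAGE
    # Hypothetical lookup: derive the path from the item's thumbnail URLs.
    return item.get("thumbnail_display_urls", {}).get("large", PLACEHOLDER_IMAGE)

print(resolve_image_path({"o:is_public": False}))  # assets/img/placeholder.svg
```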

@coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0eafb0d and 4a647a8.

📒 Files selected for processing (1)
  • .github/workflows/process_data.py (4 hunks)
🧰 Additional context used
📓 Learnings (1)
.github/workflows/process_data.py (1)
Learnt from: maehr
PR: Stadt-Geschichte-Basel/forschung.stadtgeschichtebasel.ch#189
File: .github/workflows/process_data.py:243-255
Timestamp: 2024-11-22T13:11:34.574Z
Learning: Unicode normalization performed in the `normalize_record` function in `.github/workflows/process_data.py` is not error-prone and does not require additional error handling.
🔇 Additional comments (4)
.github/workflows/process_data.py (4)

4-4: LGTM: Standard library import properly placed

The addition of the unicodedata import is well-placed with other standard library imports and is necessary for the Unicode normalization functionality.


229-235: LGTM: Clean implementation of Unicode normalization

The normalize_record function provides a clean implementation for normalizing string fields to NFC form while properly handling non-string values.


243-255: LGTM: Streamlined processing flow with proper normalization

The updated processing flow effectively combines items and media into a single list while ensuring proper Unicode normalization.

Let's verify the data processing flow:

✅ Verification successful

Processing flow and normalization implementation verified

The code correctly:

  • Processes each item and its associated media into a single list
  • Applies Unicode NFC normalization to all string fields
  • Maintains proper relationship between items and their media through objectid
  • Handles the complete data flow from extraction to normalization before saving
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for proper handling of items and media records
rg -A 5 "items_processed\.append" | grep -v "test"

Length of output: 50


Script:

#!/bin/bash
# Let's try a different approach to find the data processing logic
# First, let's locate the file
fd "process_data.py"

# Then examine the content around the processing logic
rg -B 2 -A 5 "extract_item_data" .github/workflows/process_data.py

Length of output: 696


Script:

#!/bin/bash
# Let's verify the normalization step and media handling
# Check normalize_record implementation
rg -B 2 -A 5 "normalize_record" .github/workflows/process_data.py

# Check get_media implementation to understand media data handling
rg -B 2 -A 5 "get_media" .github/workflows/process_data.py

# Check extract_media_data implementation
rg -B 2 -A 5 "extract_media_data" .github/workflows/process_data.py

Length of output: 2011


147-150: Verify handling of private items

Good security improvement by using a placeholder for non-public items. The implementation correctly prevents exposure of private content.

Let's verify the handling of private items:

✅ Verification successful

Verified: Private content protection is properly implemented

The implementation correctly handles private content protection:

  • The placeholder image exists at the correct path assets/img/placeholder.svg
  • The code consistently checks o:is_public flag before exposing content:
    • In extract_item_data() for thumbnails
    • In extract_media_data() for original media URLs
  • The fallback to placeholder.svg is properly implemented when content is not public
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for consistent handling of private items
rg -l "o:is_public" | xargs rg "placeholder\.svg"

Length of output: 49


Script:

#!/bin/bash
# Let's first check where o:is_public is used
rg "o:is_public" -A 2 -B 2

# Then separately check for placeholder.svg references
rg "placeholder\.svg" -A 2 -B 2

# Also check if there are any other thumbnail handling patterns
rg "thumbnail_display_urls" -A 2 -B 2

Length of output: 98


Script:

#!/bin/bash
# Let's check the content of the process_data.py file
cat .github/workflows/process_data.py

# Also check for any asset directories that might contain the placeholder
fd placeholder.svg

Length of output: 10562

@maehr maehr requested a review from Copilot November 22, 2024 13:15


Copilot reviewed 1 out of 1 changed files in this pull request and generated no suggestions.

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
.github/workflows/process_data.py (1)

254-255: Add logging for processed items count

Consider adding logging to track the number of processed items for better monitoring.

     # Normalize all string fields in the records to avoid decomposed Unicode form Umlaute ¨ + o -> ö
     items_normalized = [normalize_record(record) for record in items_processed]
+    logging.info(f"Processed {len(items_normalized)} records ({len([r for r in items_normalized if not r.get('parentid')])} items, {len([r for r in items_normalized if r.get('parentid')])} media files)")
 
     # Save data to CSV and JSON formats
     save_to_files(items_normalized, CSV_PATH, JSON_PATH)

Also applies to: 258-258

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4a647a8 and d9822d6.

📒 Files selected for processing (1)
  • .github/workflows/process_data.py (4 hunks)
🧰 Additional context used
📓 Learnings (1)
.github/workflows/process_data.py (1)
Learnt from: maehr
PR: Stadt-Geschichte-Basel/forschung.stadtgeschichtebasel.ch#189
File: .github/workflows/process_data.py:243-255
Timestamp: 2024-11-22T13:11:34.574Z
Learning: Unicode normalization performed in the `normalize_record` function in `.github/workflows/process_data.py` is not error-prone and does not require additional error handling.
🔇 Additional comments (4)
.github/workflows/process_data.py (4)

4-4: LGTM! Good security improvement with visibility check

The addition of the unicodedata import and the visibility-based image path handling enhances security by properly restricting access to non-public items.

Also applies to: 149-150


189-191: Standardize fallback image handling

There's an inconsistency in fallback image paths:

  • Items use: assets/img/placeholder.svg
  • Media uses: assets/img/no-image.svg

229-235: LGTM! Clean implementation of Unicode normalization

The normalize_record function:

  • Correctly handles both string and non-string values
  • Uses NFC form which is appropriate for normalizing Umlauts
  • Maintains a clean dictionary comprehension structure

243-253: LGTM! Streamlined data processing flow

The simplified approach of using a single list for both items and media records improves code maintainability.
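The consolidated flow can be sketched as below. The collaborator names (extract_item_data, get_media, extract_media_data, normalize_record) follow the review excerpts; they are passed in as parameters here with stub implementations, since the real functions live in process_data.py:

```python
def process_records(items, get_media, extract_item_data, extract_media_data, normalize_record):
    """Accumulate item and media records into a single list, then normalize."""
    items_processed = []
    for item in items:
        items_processed.append(extract_item_data(item))
        for media in get_media(item):
            items_processed.append(extract_media_data(media))
    # Normalize every string field to NFC before saving to CSV/JSON.
    return [normalize_record(record) for record in items_processed]

# Stub collaborators to show the shape of the flow:
result = process_records(
    items=[{"id": 1}],
    get_media=lambda item: [{"media_id": 2}],
    extract_item_data=lambda i: {"objectid": i["id"]},
    extract_media_data=lambda m: {"parentid": m["media_id"]},
    normalize_record=lambda r: r,
)
print(result)  # [{'objectid': 1}, {'parentid': 2}]
```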

@maehr maehr merged commit 42dd6de into Stadt-Geschichte-Basel:main Nov 25, 2024
3 checks passed
Development

Successfully merging this pull request may close these issues:

German umlauts in search
2 participants