Commit

Merge pull request #68 from troykelly/67-fix-bad-lexicon
Making sure there are word boundaries
troykelly authored Jun 10, 2024
2 parents 1d0a212 + 8ac9b0a commit 506b1d8
Showing 4 changed files with 80 additions and 47 deletions.
13 changes: 7 additions & 6 deletions lexicon.json
@@ -10,7 +10,7 @@
"VIC": "Victoria",
"WA": "Western Australia",
"NT": "Northern Territory",
"Anthony Albanese": "Anthony /ɑɫbɑˈneɪzi/",
"Albanese": "/ɑɫbɑˈneɪzi/",
"WW2": "World War Two",
"WWII": "World War Two",
"COVID-19": "Coronavirus",
@@ -24,11 +24,7 @@
"HTML": "HyperText Markup Language",
"CSS": "Cascading Style Sheets",
"JS": "JavaScript",
"PHP": "Hypertext Preprocessor"
},
"direct_insensitive": {
"quasar": "ˈkweɪzɑɹ",
"quasars": "ˈkweɪzɑɹz",
"PHP": "Hypertext Preprocessor",
"SQL": "Structured Query Language",
"API": "Application Programming Interface",
"UX": "User Experience",
@@ -43,6 +39,11 @@
"ERP": "Enterprise Resource Planning",
"CRM": "Customer Relationship Management"
},
"direct_insensitive": {
"quasar": "ˈkweɪzɑɹ",
"quasars": "ˈkweɪzɑɹz",
"quasar's": "ˈkweɪzɑɹz"
},
"regex": {
"(?P<number>\\d+\\.?\\d*)°C": "\\g<number> degrees Celsius",
"(?P<number>\\d+\\.?\\d*)°F": "\\g<number> degrees Fahrenheit",
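
A minimal sketch of how the `regex` entries above behave once the JSON is decoded. The hunk does not show how `src/main.py` applies them, so the use of `re.sub` here is an assumption, and `apply_regex_entries` is a hypothetical helper name rather than a function from the repository.

```python
import json
import re

# Two entries copied from the "regex" section of lexicon.json.
# The doubled backslashes are JSON escaping; json.loads yields \d and \g.
LEXICON_JSON = r'''
{
  "regex": {
    "(?P<number>\\d+\\.?\\d*)°C": "\\g<number> degrees Celsius",
    "(?P<number>\\d+\\.?\\d*)°F": "\\g<number> degrees Fahrenheit"
  }
}
'''

def apply_regex_entries(text, entries):
    """Apply each pattern in turn (hypothetical helper, assumed re.sub usage)."""
    for pattern, translation in entries.items():
        # \g<number> in the replacement refers to the named group in the pattern.
        text = re.sub(pattern, translation, text)
    return text

entries = json.loads(LEXICON_JSON)["regex"]
print(apply_regex_entries("A low of 12°C and a high of 19.5°C.", entries))
# A low of 12 degrees Celsius and a high of 19.5 degrees Celsius.
```

The named group keeps the numeric part intact, so whole numbers and decimals are both carried through unchanged.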
26 changes: 19 additions & 7 deletions llm.sh
@@ -1,26 +1,38 @@
#!/usr/bin/env zsh

# Directory containing Python files
SRC_DIR="src"
# Base directories and specific files to include
INCLUDE_DIRS=("src" ".devcontainer" ".github")
INCLUDE_FILES=("Dockerfile" "lexicon.json" "prompt.md" "demo.xml")

# Output markdown file
OUTPUT_FILE="llm.md"

# Create or clear the output file
echo "# Files" > $OUTPUT_FILE

# Function to process each Python file
# Function to process each file
process_file() {
local file_path=$1
local file_extension="${file_path##*.}"
echo "\n## ${file_path}\n" >> $OUTPUT_FILE
echo "\`\`\`python" >> $OUTPUT_FILE
echo "\`\`\`${file_extension}" >> $OUTPUT_FILE
# Add the content of the file and ensure there is a trailing newline
awk '{print} END {if (NR > 0 && substr($0, length($0), 1) != "\n") print ""}' $file_path >> $OUTPUT_FILE
echo "\`\`\`\n" >> $OUTPUT_FILE
}

# Find all .py files in the SRC_DIR excluding __pycache__ and other unwanted directories
find $SRC_DIR -type f -name "*.py" ! -path "*/__pycache__/*" | while read -r file; do
process_file "$file"
# Process each directory
for dir in "${INCLUDE_DIRS[@]}"; do
find $dir -type f ! -path "*/__pycache__/*" | while read -r file; do
process_file "$file"
done
done

# Process each specific file
for file in "${INCLUDE_FILES[@]}"; do
if [[ -f $file ]]; then
process_file "$file"
fi
done

echo "LLM prompt file has been generated at ${OUTPUT_FILE}"
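
One thing worth checking in the new `process_file`: `${file_path##*.}` keeps everything after the last dot in the whole path, or the whole string when there is no dot. The Python sketch below mimics that expansion to show what fence label results for files without an extension. The path `.github/CODEOWNERS` is a hypothetical example, and `fence_label` is only a suggested alternative, not something from this commit.

```python
from pathlib import PurePosixPath

def zsh_suffix(path):
    """Mimic zsh's ${path##*.}: text after the last dot, or the whole string."""
    return path.rpartition(".")[2]

def fence_label(path):
    """Hypothetical alternative: extension of the file name only, else 'text'."""
    suffix = PurePosixPath(path).suffix
    return suffix.lstrip(".") or "text"

for candidate in ("src/main.py", "Dockerfile", ".github/CODEOWNERS"):
    print(candidate, "->", zsh_suffix(candidate), "|", fence_label(candidate))
```

With the shell expansion, `Dockerfile` becomes its own fence label and a dot in a directory name leaks part of the path into the label. Looking only at the file name's extension avoids both cases.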
75 changes: 47 additions & 28 deletions prompt.md
@@ -3,62 +3,82 @@
## Instructions to Create a Radio News Script

### Objective:
Transform a list of news items into a coherent, engaging radio news script.
Transform a list of news items into a coherent, engaging radio news script that lasts exactly five minutes.

### Duration:
The total news script should be no more or less than five minutes long.
**The total news script must be no more and no less than five minutes long. Utilise the provided news items fully to achieve this duration.**

### Step-by-Step Instructions:

1. **Input Structure:**
- The input will be a list of news items.
- Each item will include a timestamp, headline and a brief description.
- Each item will include a timestamp, headline, and a brief description.
- Optionally, some items may have additional details such as quotes or background information.{% if have_weather %}
- The first item of the input is a weather report, it should not be read as the first item - it should be used for the weather in the sign off.{% endif %}
- The first item of the input is a weather report; it should not be read as the first item - it should be used for the weather in the sign-off.{% endif %}
- Select and elaborate on news items as needed to ensure the total script reaches exactly five minutes.

2. **Understanding the Audience:**
- Assume the audience is general and diverse, similar to the listeners of a popular radio station.
- Keep the language clear, concise, and engaging.
- Ensure the tone is professional but accessible, with a touch of warmth and relatability.
- Avoid gendered pronouns where possible.
- Make sure to contextualise the news for the audience, giving necessary background if needed.

3. **Script Format:**{% if is_top_of_the_hour %}
- Your intro should include a time call, eg "And now your {{ current_hour_12 }} o'clock news."{% endif %}
- Your intro should include a time call, e.g., "And now your {{ current_hour_12 }} o'clock news."{% endif %}
- Begin with a brief introduction that sets the stage for the news update.
- Select major world events first, then major national events, then local where the content feed allows. Use the timestamp to decide what's most pressing.
- Present each news item in a logical sequence. Group related items together by category or region for a smoother flow.
- Make sure to use transitions to connect different segments.
- If you have weather information just before you conclude give a brief weather update. Keep the weather friendly and informal. Use whole numbers for the weather information, not decimals.
- Conclude with a closing that reinforces the station's identity.
- Show where the news entry and exit sound effects should occur with "[SFX: NEWS INTRO]"
- Show where the first news story starts with "[SFX: ARTICLE START]"
- Show where story break sound effects should occur "[SFX: ARTICLE BREAK]"
- After the last story have "[SFX: NEWS OUTRO]" immediatly before the conclusion copy.
- Do not have an artical break SFX after the last article. Just denote the news outro.
- Select major world events first, then major national events, and then local where the content feed allows.
- Use the timestamp to decide what's most pressing.
- Present each news item in a logical sequence. Group related items together by category or region for smoother flow.
- If two or more items are about the same topic, ie a change of legislation, make sure their articles are sequential
- Each news item and its details should be thoroughly covered to contribute towards the total 5 minutes of news.{% if have_weather %}
- In the signoff section of the broadcast include a brief weather update.
- Keep the weather friendly and informal. Use whole numbers for the weather information; no decimals.{% endif %}
- Conclude with a closing that reinforces the station's identity and prompts the listener to stay tuned.
- Do not make up information that is not in the news feed

4. **Sound Effects (SFX):**
- Show where the news entry and exit sound effects should occur with:
```
[SFX: NEWS INTRO]
```
- Indicate where the first news story starts with:
```
[SFX: ARTICLE START]
```
- Show where story break sound effects should occur with:
```
[SFX: ARTICLE BREAK]
```
- After the last story, use:
```
[SFX: NEWS OUTRO]
```
- Do not have an article break SFX after the last article. Just denote the news outro.
- Do not modify the naming of the SFX events.
- No other script notes are needed (ie, don't highlight the news reader on each article)

4. **Stylistic Guidelines:**
5. **Stylistic Guidelines:**
- Use active voice and present tense to make the news feel immediate and relevant.
- Vary sentence length to maintain listener interest. Use shorter sentences for clarity and impact.
- Vary the opening and closing but don't deviate too far from the set content.
- Incorporate direct quotes when available to add authenticity and depth.
- Vary the opening and closing but keep within set content.
- Incorporate direct quotes when available to add authenticity and depth. Emphasise notable quotes.
- Include necessary context but avoid overly technical language or jargon.
- Maintain a balanced tone, avoiding sensationalism while highlighting the significance of each story.
- Do not editorialise.
- Fully expound on each news item to help achieve the 5-minute target.

5. **Voice and Pacing:**
6. **Voice and Pacing:**
- Write with the natural rhythm of spoken language in mind. Read the script aloud to ensure it sounds smooth and natural.
- Use punctuation to indicate pauses and emphasis. Ellipses (...) can suggest a brief pause, while commas and periods provide natural breaks.

6. **Sample Script Structure:**
7. **Sample Script Structure:**
- **Introduction:**
```
Good {{ period_of_day }}, this is {{ newsreader_name }} with your latest news update on {{ station_name }}. Here are today's top stories...
```
- **News Items:**
- **Headline:** Introduce the headline.
- **Details:** Provide a brief description and relevant details. Include quotes if available.
- **Full Coverage:** Elaborate on each item thoroughly to ensure the segment fills the 5-minute duration.
- **Transition:** Connect to the next item.
- **Weather:**
```
@@ -74,15 +94,15 @@ The total news script should be no more or less than five minutes long.
**Input:**
1. **Headline:** Weather Report
**Category:** weather
**Description:** Weather in Sydney, Australia: Shower or two. with a 60% chance of precipitation. For tomorrow, expect a low of 12°C and a high of 19°C with Showers easing. and a 80% chance of precipitation.
**Description:** Weather in Sydney, Australia: Shower or two, with a 60% chance of precipitation. For tomorrow, expect a low of 12°C and a high of 19°C, with showers easing and an 80% chance of precipitation.

2. **Headline:** Program hoping to inspire locals to enrol to vote ahead of NT elections
**Category:** Australia
**Description:** Australia boasts some of the highest enrolment rates in the democratic world, but getting people to show up to the ballot box is a different story. In the Northern Territory, where voter turnout is persistently low, it’s hoped a new engagement program will help inspire locals to get involved ahead of elections in August. SBS Reporter Laetitia Lemke travelled with the Northern Territory Electoral Commission to the remote community of Ramingining in Arnhem Land for this story.

3. **Headline:** Albanese government sells its investment in green power
**Category:** Politics
**Description:** The fruits of Labor's budget efforts are emerging, with one company already locking in a green steel plan in central Queensland, that could be up and running in years. While the budget promotion ramps up, the Opposition is targeting Labor's migration settings, saying the Liberal policy to reduce arrivals won't slow economic growth.
**Description:** The fruits of Labor's budget efforts are emerging, with one company already locking in a green steel plan in central Queensland that could be up and running in years. While the budget promotion ramps up, the Opposition is targeting Labor's migration settings, saying the Liberal policy to reduce arrivals won't slow economic growth.

**Output:**
```
@@ -93,15 +113,14 @@ Students from Riverside High School have won the national robotics competition h
[SFX: ARTICLE BREAK]
In other news, the City Council has approved plans for a new park in the downtown area. The park will feature green spaces, a playground, and a community garden. Council member Jane Doe said, "This park will provide much-needed recreational space for our community."
[SFX: NEWS OUTRO]
'currently clearing rain in Sydney, with a top of 16 tomorrow.
That’s all for now. Stay tuned to {{ station_name }} for more updates throughout the day. This is {{ newsreader_name }}.
```

### Final Notes:
- Review the script for accuracy and clarity.
- Ensure the script adheres to the station's style and standards.
- Make sure to use the correct greeting (morning, afternoon, evening) based on the time.
- Make sure that quotes have not been modified, you must be word-for-word accuracte when quoting somebody directly.
- Use the correct greeting (morning, afternoon, evening) based on the time.
- Ensure quotes are not modified, remaining word-for-word accurate when quoting somebody directly.
- Practice reading the script aloud to ensure it flows naturally and engages the listener.

---
---
13 changes: 7 additions & 6 deletions src/main.py
@@ -274,7 +274,6 @@ def cleanup_cache():
cache_file.unlink()
logging.info(f"Deleted cached file: {cache_file}")


def apply_lexicon(text, lexicon):
"""Apply lexicon translations to the given text.
@@ -285,22 +284,24 @@ def apply_lexicon(text, lexicon):
Returns:
str: The text after applying lexicon conversions.
"""
import re


# Sort and apply case-sensitive direct text to translation mappings (prioritize longer matches)
direct_sensitive = sorted(
lexicon.get("direct_sensitive", {}).items(), key=lambda x: -len(x[0])
)
for original, translation in direct_sensitive:
text = text.replace(original, translation)
# Use regex with word boundaries for exact word match
pattern = re.compile(r'\b' + re.escape(original) + r'\b')
text = pattern.sub(translation, text)

# Sort and apply case-insensitive direct text to translation mappings (prioritize longer matches)
direct_insensitive = sorted(
lexicon.get("direct_insensitive", {}).items(), key=lambda x: -len(x[0])
)
for original, translation in direct_insensitive:
pattern = re.compile(re.escape(original), re.IGNORECASE)
text = pattern.sub(lambda m: translation, text)
# Use regex with word boundaries for exact word match and case insensitivity
pattern = re.compile(r'\b' + re.escape(original) + r'\b', re.IGNORECASE)
text = pattern.sub(translation, text)

# Apply regex patterns with named groups
for pattern, translation in lexicon.get("regex", {}).items():
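
A small sketch of why the word boundaries matter, comparing plain substring replacement (the earlier behaviour) with the `\b`-anchored pattern introduced here. `replace_plain` and `replace_bounded` are hypothetical names for illustration, and the sample sentence is invented. The sketch passes a callable to `sub`, as the earlier case-insensitive branch did, because a plain replacement string is treated as a template in which backslashes are interpreted.

```python
import re

def replace_plain(text, mapping):
    """Earlier behaviour: substring replacement, longest keys first."""
    for original, translation in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(original, translation)
    return text

def replace_bounded(text, mapping, flags=0):
    """Whole-word replacement using \\b anchors, longest keys first."""
    for original, translation in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        pattern = re.compile(r"\b" + re.escape(original) + r"\b", flags)
        # A callable keeps the translation literal, even if it contains backslashes.
        text = pattern.sub(lambda m, t=translation: t, text)
    return text

lexicon = {"JS": "JavaScript", "NT": "Northern Territory"}
sample = "The NT government published JSON data."

print(replace_plain(sample, lexicon))
# The Northern Territory government published JavaScriptON data.
print(replace_bounded(sample, lexicon))
# The Northern Territory government published JSON data.
```

Note that `\b` only matches between a word character and a non-word character, so a key that begins or ends with punctuation may not match where expected. The keys visible in this diff all start and end with word characters.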
