Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3rd Iteration of Incubator Tools #700

Merged
merged 9 commits into from
Jan 2, 2024

Conversation

sana-google
Copy link
Contributor

  • Added 9 new tools
  • Updated README.md

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@sana-google sana-google marked this pull request as ready for review December 19, 2023 11:15
@sana-google sana-google requested a review from a team as a code owner December 19, 2023 11:15
@@ -0,0 +1,296 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provide more description of what these are.


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is taken care, adding the description here also The objective of the tool is to evaluate the character error rate which measures the rate of errors at the character level in the generated output compared to the reference. and word error rate which measures the rate of errors at the word level in the generated output compared to the reference.Both CER and WER can be used to provide a quantitative measure of accuracy, with lower values indicating better performance.

@@ -0,0 +1,296 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this download from GCS code should be brought out into the utilities


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible only for json files, the variable groundtruth_files are text files and cannot be downloaded with our functions in utilities. This is an exception so we decided not to include the text downloading function to our utilities module.

@@ -0,0 +1,296 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #79.        print(f"Mean CER = {round(mean_cer,2)}%, Mean WER = {round(mean_wer,2)}%")

You can replace the round functions with {mean_cer:.2f} and {mean_wer:.2f}


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Updated the code.

@@ -0,0 +1,296 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #82.    if __name__ == "__main__":

This isn't needed in a Notebook


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking as done. Replaced the condition and called the main function without the condition.

@@ -0,0 +1,296 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provide more description of what these are.


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

@@ -0,0 +1,296 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this download from GCS code should be brought out into the utilities


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

@@ -0,0 +1,296 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #79.        print(f"Mean CER = {round(mean_cer,2)}%, Mean WER = {round(mean_wer,2)}%")

You can replace the round functions with {mean_cer:.2f} and {mean_wer:.2f}


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

@@ -0,0 +1,296 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #82.    if __name__ == "__main__":

This isn't needed in a Notebook


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

@@ -0,0 +1,354 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make these variable names in code font


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the variable names according to code font

@@ -0,0 +1,354 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #17.        return json_data

This function should be in the utilities


Reply via ReviewNB

@@ -0,0 +1,354 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #34.        df = pd.read_excel(excel_path)

There should be a built in function for pandas to convert this data frame to a dictionary


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A specified format is needed that's why we have used this approach(key_names, values)

@@ -0,0 +1,354 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #85.                )

Why does this variable need to be set twice?


Reply via ReviewNB

@@ -0,0 +1,354 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #89.                    ]  # IF MENTION TEXT NEED NOT BE CHANGED COMMENT THIS LINE

I think this comment should go above the section (after formatting)


Reply via ReviewNB

@@ -0,0 +1,363 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what this notebook does or what the use case is


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tool uses OCR to reorder PDF pages based on unique strings (logical identifiers) provided by the user, ideal for organizing out of sequence PDF pages into their correct order.

@@ -0,0 +1,1455 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the comments


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are removed in updated notebook

@@ -0,0 +1,1455 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this seems over complicated. This code can be simplified/made more modular and readable.


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment taken-care in updated notebook

@@ -0,0 +1,1455 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #490.    def batch_process_documents(

Not sure why this extra functuion is needed


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In updated notebook, this staging function removed.

@@ -0,0 +1,243 @@
{
Copy link
Collaborator

@holtskinner holtskinner Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a very basic tool using base Python libraries. Maybe should just be in utilities? Or not included.


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the notebook. Updated it as a function in the utilities.py

@holtskinner holtskinner merged commit f4c03bd into GoogleCloudPlatform:main Jan 2, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants