
Hacktoberfest 2024 | Google Vision OCR 🤝 Workflows #692

Closed
PawelPeczek-Roboflow opened this issue Sep 30, 2024 · 6 comments

Comments

@PawelPeczek-Roboflow
Collaborator

PawelPeczek-Roboflow commented Sep 30, 2024

Google Vision OCR in Workflows

Are you ready to make a meaningful contribution this Hacktoberfest? We are looking to integrate Google Vision OCR into our Workflows ecosystem! This new OCR block will be a valuable addition, addressing a common challenge that many users face.

Join us in expanding our ecosystem and empowering users to effortlessly extract text and structure from their documents. Whether you’re a seasoned contributor or new to open source, your skills and ideas can help make this project a success. Let’s collaborate and bring this essential functionality to life!

🚧 Task description 🏗️

  • The task is to integrate OCR from the Google Vision API into the Workflows ecosystem
  • The API should be adopted in a way that allows sending the API key as a Workflow input parameter, rather than using Google service account credentials - see Google Vision auth docs
  • We prefer a light integration with the REST API through the requests library - 📖 REST API docs - in particular this may be useful - we only want to enable TEXT_DETECTION and DOCUMENT_TEXT_DETECTION
  • The output should be parsed into an sv.Detections(...) object - the recognised text should be the label, and additional metadata about structure (like the category of a region) should be added to the data field of sv.Detections(...)
  • Please raise any issues with the task in the discussion below
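A minimal sketch of what such a light REST integration could look like. The helper names, the `api_key` parameter, and the base64-content payload shape are assumptions based on Google's `images:annotate` endpoint, not a final design:

```python
import base64

import requests


def build_annotate_payload(image_bytes: bytes, ocr_type: str) -> dict:
    # Google Vision expects base64-encoded image content plus a list of
    # requested features; here we request a single OCR feature type
    # (TEXT_DETECTION or DOCUMENT_TEXT_DETECTION).
    return {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("utf-8")},
                "features": [{"type": ocr_type}],
            }
        ]
    }


def call_google_vision_ocr(
    image_bytes: bytes, api_key: str, ocr_type: str = "TEXT_DETECTION"
) -> dict:
    # The API key travels as a query parameter, so no service account
    # credentials are needed - matching the Workflow input parameter idea.
    response = requests.post(
        "https://vision.googleapis.com/v1/images:annotate",
        params={"key": api_key},
        json=build_annotate_payload(image_bytes, ocr_type),
    )
    response.raise_for_status()
    return response.json()
```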

Cheatsheet

Scaffolding for the block

💻 Code snippet
from typing import List, Literal, Optional, Type, Union

from pydantic import ConfigDict
import supervision as sv
import requests

from inference.core.workflows.execution_engine.entities.base import (
    OutputDefinition,
    WorkflowImageData,
)
from inference.core.workflows.execution_engine.entities.types import (
    StepOutputImageSelector,
    WorkflowImageSelector,
    OBJECT_DETECTION_PREDICTION_KIND,
)
from inference.core.workflows.prototypes.block import (
    BlockResult,
    WorkflowBlock,
    WorkflowBlockManifest,
)


class BlockManifest(WorkflowBlockManifest):
    model_config = ConfigDict(
        json_schema_extra={
            "name": "Google Vision OCR",
            "version": "v1",
            "short_description": "TODO",
            "long_description": "TODO",
            "license": "Apache-2.0",
            "block_type": "model",
        },
        protected_namespaces=(),
    )
    type: Literal["roboflow_core/google_vision_ocr@v1"]
    image: Union[WorkflowImageSelector, StepOutputImageSelector]
    ocr_type: Literal["text_detection", "document_text_detection"]

    @classmethod
    def describe_outputs(cls) -> List[OutputDefinition]:
        return [
            OutputDefinition(
                name="predictions", kind=[OBJECT_DETECTION_PREDICTION_KIND]
            ),
        ]

    @classmethod
    def get_execution_engine_compatibility(cls) -> Optional[str]:
        return ">=1.0.0,<2.0.0"


class GoogleVisionOCRBlockV1(WorkflowBlock):

    @classmethod
    def get_manifest(cls) -> Type[WorkflowBlockManifest]:
        return BlockManifest

    def run(
        self,
        image: WorkflowImageData,
        ocr_type: Literal["text_detection", "document_text_detection"],
    ) -> BlockResult:
        results = requests.post(...)
        return {
            "predictions": sv.Detections(...)
        }
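One way to turn the API response into the arrays that `sv.Detections(...)` consumes - a sketch, assuming the first `textAnnotations` entry is the full-text aggregate and the remaining entries are per-word regions (missing `x`/`y` keys default to 0, since Google omits zero-valued vertex coordinates):

```python
from typing import List, Tuple

import numpy as np


def annotations_to_arrays(text_annotations: List[dict]) -> Tuple[np.ndarray, List[str]]:
    # Skip the first entry (the full-text aggregate) and convert each
    # word's boundingPoly quadrilateral into an axis-aligned xyxy box.
    xyxy, labels = [], []
    for annotation in text_annotations[1:]:
        vertices = annotation["boundingPoly"]["vertices"]
        xs = [vertex.get("x", 0) for vertex in vertices]
        ys = [vertex.get("y", 0) for vertex in vertices]
        xyxy.append([min(xs), min(ys), max(xs), max(ys)])
        labels.append(annotation["description"])
    return np.array(xyxy, dtype=float), labels
```

The resulting arrays could then feed something like `sv.Detections(xyxy=xyxy, data={"text": labels})` inside `run(...)`.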
@brunopicinin
Contributor

I forked the project and started to develop a new block, but one thing is not clear to me.

Given the following image: https://testsigma.com/blog/wp-content/uploads/What-is-the-OCR-Test-How-to-Create-Automate-It.png

Passing this image to Google API as such:

POST https://vision.googleapis.com/v1/images:annotate?key=[YOUR_API_KEY] HTTP/1.1

Authorization: Bearer [YOUR_ACCESS_TOKEN]
Accept: application/json
Content-Type: application/json

{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "https://testsigma.com/blog/wp-content/uploads/What-is-the-OCR-Test-How-to-Create-Automate-It.png"
        }
      },
      "features": [
        {
          "type": "TEXT_DETECTION"
        }
      ]
    }
  ]
}

Results in the following response:

{
  "responses": [
    {
      "textAnnotations": [
        {
          "locale": "en",
          "description": "OCR test\nOCR",
          "boundingPoly": {
            "vertices": [
              {
                "x": 265,
                "y": 261
              },
              {
                "x": 940,
                "y": 261
              },
              {
                "x": 940,
                "y": 324
              },
              {
                "x": 265,
                "y": 324
              }
            ]
          }
        },
        {
          "description": "OCR",
          "boundingPoly": {
            "vertices": [
              {
                "x": 265,
                "y": 281
              },
              {
                "x": 382,
                "y": 282
              },
              {
                "x": 382,
                "y": 321
              },
              {
                "x": 265,
                "y": 320
              }
            ]
          }
        },
        {
          "description": "test",
          "boundingPoly": {
            "vertices": [
              {
                "x": 396,
                "y": 282
              },
              {
                "x": 505,
                "y": 283
              },
              {
                "x": 505,
                "y": 322
              },
              {
                "x": 396,
                "y": 321
              }
            ]
          }
        },
        {
          "description": "OCR",
          "boundingPoly": {
            "vertices": [
              {
                "x": 756,
                "y": 261
              },
              {
                "x": 940,
                "y": 262
              },
              {
                "x": 940,
                "y": 324
              },
              {
                "x": 756,
                "y": 323
              }
            ]
          }
        }
      ],
      "fullTextAnnotation": {
        ...
      }
    }
  ]
}

Should the block output sv.Detections(...) with the full text match only, the word matches only, or both?

@PawelPeczek-Roboflow
Collaborator Author

Hi @brunopicinin,
At first, thanks for taking the challenge 💪

Regarding the question - good point. I believe it would be good to have one Workflow block output that simply dumps the whole recognised text, plus an output with sv.Detections(...) denoting each parsed region
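The two-output shape suggested above could be sketched like this (a hypothetical helper; in the real block the per-region entries would be wrapped into `sv.Detections(...)` rather than returned raw):

```python
def build_block_output(api_response: dict) -> dict:
    # The first textAnnotations entry aggregates all recognised text;
    # the remaining entries describe individual regions.
    annotations = api_response["responses"][0].get("textAnnotations", [])
    full_text = annotations[0]["description"] if annotations else ""
    return {"text": full_text, "predictions": annotations[1:]}
```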

@PawelPeczek-Roboflow PawelPeczek-Roboflow mentioned this issue Oct 1, 2024
@brunopicinin
Contributor

Created a PR for this issue: #709

@PawelPeczek-Roboflow
Collaborator Author

Amazing 💪 reviewing now

@PawelPeczek-Roboflow
Collaborator Author

Posted a PR review - thanks a lot for the contribution

@PawelPeczek-Roboflow
Collaborator Author

Approved the PR and merged to main - thanks a lot for the contribution 🏅
