Skip to content

Latest commit

 

History

History
19 lines (13 loc) · 1.06 KB

README.md

File metadata and controls

19 lines (13 loc) · 1.06 KB

steps to run wikipedia_extractor: pip install wikipedia (to install wikipedia API) pip install json (if json is not already installed) run wiki_extractor.py

steps to run the pdf content extractor: Note: Extracting data from whole pdf is taking very long, so current configuration fetches 1-10 pages/pdf, if required more remove first and last page condtion from

    pages = convert_from_path(PDF_file,poppler_path=POPPLER_PATH,dpi=200,first_page=1,last_page=10,fmt='jpg') (function: extract_text, file: get_pdf.py)

install latest version of all the imports (pip install dependency_name if already exists do pip install --upgrade dependency_name
    download tesseract.exe and poppler
download marathi language pack from https://github.com/tesseract-ocr/tessdata/ or for marathi (https://raw.githubusercontent.com/tesseract-ocr/tessdata/main/mar.traineddata)
    paste the file C:\\Program Files\\Tesseract-OCR\\tessdata (or wherever Tesseract OCR is installed).
    set TESSER_PATH(.exe) and POPPLER_PATH (til bin)

    run get_pdf.py