chore: update readme

Wazzabeee · Apr 24, 2024 · b4a3569
1 parent 4fa35a2
commit b4a3569
Showing 1 changed file with 46 additions and 19 deletions.
diff --git a/README.md b/README.md
@@ -1,30 +1,43 @@
 # Copy Spotter
+
+![PyPI - Version](https://img.shields.io/pypi/v/copy-spotter) ![PyPI - License](https://img.shields.io/pypi/l/copy-spotter)
+![Python](https://img.shields.io/badge/python-3.11-blue)
+
+
 ![GIF demo](data/img/example.gif)
 
 ## About
-This program will proccess pdf, txt, docx, and txt files that can be found in the given input directory, find similar sentences, calculate similarity percentage, display a similarity table with links to side by side comparison where similar sentences are highlighted.
-
-This project was made part of my internship at the "Human Computer Humans Interacting with Computers at University of Primorska" lab (HICUP Lab).
+This program processes pdf, txt, and docx files found in the given input directory, finds similar sentences, calculates a similarity percentage, and displays a similarity table with links to side-by-side comparisons in which similar sentences are highlighted.
 
 **Usage**
 ---
 
 ```
-Usage: python -m scripts.main.py input_directory [OPTIONS]
+pip install copy-spotter
+copy-spotter [-s] [-o] [-h] input_directory
+```
+***Positional Arguments:***
+* `input_directory`: Directory that contains one folder per pdf file (see `data/pdf/plagiarism` for example)
 
-  Performs a similarity analysis of all text files available in given input directory.
-  Developed by Clément Delteil -> (Github: Wazzabeee)
+***Optional Arguments:***
+* `-s`, `--block-size`: Set the minimum number of consecutive similar words to detect. (Default: 2)
+* `-o`, `--out_dir`: Set the output directory for HTML files. (Default: a new directory called `results` is created)
+* `-h`, `--help`: Show this message and exit.
 
-Options:
-  -block_size, -s  Set minimum number of consecutive and similar words detected. (Default is 2)
-  -out_dir, -o     Set the output directory for html files. (Default is creating a new directory)
-  -help, -h        Show this message and exit.
+**Examples**
+---
 ```
+# Analyze documents in 'data/pdf/plagiarism', with default settings
+$ copy-spotter data/pdf/plagiarism
 
-**How to use**
+# Analyze with custom block size and specify output directory
+$ copy-spotter data/pdf/plagiarism -s 5 -o results/output
+```
+
+**Development Setup**
 ---
 
-```bash
+```
 # Clone this repository
 $ git clone https://github.com/Wazzabeee/copy_spotter
 
@@ -33,11 +46,22 @@ $ cd copy_spotter
 
 # Install requirements
 $ pip install -r requirements.txt
+$ pip install -r requirements_lint.txt
+
+# Install pre-commit
+$ pip install pre-commit
+$ pre-commit install
+
+# Run tests
+$ pip install pytest
+$ pytest tests/
+
+# Run package locally
+$ python -m scripts.main [-s] [-o] [-h] input_directory
 
-# Run the app
-$ python -m scripts.main.py data/pdf/plagiarism -s 2
 ```
-**First run**
+
+**Issues**
 ---
 On the first run you might get :
 - an ImportError from pdfminer library 
@@ -63,12 +87,15 @@ To fix this you'll need to modify `class PDF(list):` in `C:/.../slate3k/classes.
 - Please make sure that all text files are closed before running the program.
 - In order to get the best results please provide text files of the same languages.
 - Pdf files that are made from scanned images won't be processed correctly.
+- Ensure you have write access when using the package.
 - If a specific file is not processed correctly feel free to [contact me](mailto:<[email protected]>) so that I can address the issue.
 
 **TODO**
 ---
-- Add more tests
+- Add more tests on existing functions
+- Implement OCR with tesseract for scanned documents
 - Add info in console for timing (tqdm)
-- Add CSS to HTML Template
-- Add support for other folder structures
-- Fix Slate3k by installing custom fork
+- Add CSS to the HTML template to make the results better looking
+- Add support for other folder structures (right now the package expects one pdf file per folder)
+- Add a custom naming option for pdf files instead of using the first part before `_`
+- Fix Slate3k by installing custom fork (check if still relevant)
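
Note on the `-s` / `--block-size` option documented in this README: the sketch below illustrates what a "minimum number of consecutive similar words" threshold could mean in practice. It is not taken from the copy-spotter source. The function name `similar_blocks` is hypothetical, and using the standard library's `difflib.SequenceMatcher` is an assumption made for illustration; the package may detect matches differently.

```python
from difflib import SequenceMatcher


def similar_blocks(text_a: str, text_b: str, block_size: int = 2) -> list[str]:
    """Return runs of at least `block_size` consecutive words shared by both texts.

    Hypothetical helper for illustration only, not copy-spotter's implementation.
    """
    # Compare on lowercase, whitespace-separated words.
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()

    # autojunk=False disables difflib's popularity heuristic on long inputs.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)

    # Keep only matching runs that reach the threshold; this also drops
    # the zero-length sentinel block that difflib appends at the end.
    return [
        " ".join(words_a[m.a : m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= block_size
    ]


if __name__ == "__main__":
    print(similar_blocks("The quick brown fox jumps", "A quick brown fox sleeps"))
```

Under this reading, a larger block size makes the comparison stricter, because short coincidental overlaps such as a single shared word no longer count as a match.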