Showing 1 changed file with 46 additions and 19 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,30 +1,43 @@ | ||
# Copy Spotter | ||
|
||
![PyPI - Version](https://img.shields.io/pypi/v/copy-spotter) ![PyPI - License](https://img.shields.io/pypi/l/copy-spotter) | ||
![Python](https://img.shields.io/badge/python-3.11-blue) | ||
|
||
|
||
![GIF demo](data/img/example.gif) | ||
|
||
## About | ||
This program will proccess pdf, txt, docx, and txt files that can be found in the given input directory, find similar sentences, calculate similarity percentage, display a similarity table with links to side by side comparison where similar sentences are highlighted. | ||
|
||
This project was made part of my internship at the "Human Computer Humans Interacting with Computers at University of Primorska" lab (HICUP Lab). | ||
This program will process pdf, txt, docx, and txt files that can be found in the given input directory, find similar sentences, calculate similarity percentage, display a similarity table with links to side by side comparison where similar sentences are highlighted. | ||
|
||
**Usage** | ||
--- | ||
|
||
``` | ||
Usage: python -m scripts.main.py input_directory [OPTIONS] | ||
pip install copy-spotter | ||
copy-spotter [-s] [-o] [-h] input_directory | ||
``` | ||
***Positional Arguments:*** | ||
* `input_directory`: Directory that contains one folder per pdf file (see `data/pdf/plagiarism` for example) | ||
|
||
Performs a similarity analysis of all text files available in given input directory. | ||
Developed by Clément Delteil -> (Github: Wazzabeee) | ||
***Optional Arguments:*** | ||
* `-s`, `--block-size`: Set minimum number of consecutive and similar words detected. (Default is 2) | ||
* `-o`, `--out_dir`: Set the output directory for html files. (Default is creating a new directory called results) | ||
* `-h`, `--help`: Show this message and exit. | ||
|
||
Options: | ||
-block_size, -s Set minimum number of consecutive and similar words detected. (Default is 2) | ||
-out_dir, -o Set the output directory for html files. (Default is creating a new directory) | ||
-help, -h Show this message and exit. | ||
**Examples** | ||
--- | ||
``` | ||
# Analyze documents in 'data/pdf/plagiarism', with default settings | ||
$ copy-spotter data/pdf/plagiarism | ||
**How to use** | ||
# Analyze with custom block size and specify output directory | ||
$ copy-spotter data/pdf/plagiarism -s 5 -o results/output | ||
``` | ||
|
||
**Development Setup:** | ||
--- | ||
|
||
```bash | ||
``` | ||
# Clone this repository | ||
$ git clone https://github.com/Wazzabeee/copy_spotter | ||
|
@@ -33,11 +46,22 @@ $ cd copy_spotter | |
# Install requirements | ||
$ pip install -r requirements.txt | ||
$ pip install -r requirements_lint.txt | ||
# Install precommit | ||
$ pip install pre-commit | ||
$ pre-commit install | ||
# Run tests | ||
$ pip install pytest | ||
$ pytest tests/ | ||
# Run package locally | ||
$ python -m scripts.main.py [-s] [-o] [-h] input_directory | ||
# Run the app | ||
$ python -m scripts.main.py data/pdf/plagiarism -s 2 | ||
``` | ||
**First run** | ||
|
||
**Issues** | ||
--- | ||
On the first run you might get : | ||
- an ImportError from pdfminer library | ||
|
@@ -63,12 +87,15 @@ To fix this you'll need to modify `class PDF(list):` in `C:/.../slate3k/classes. | |
- Please make sure that all text files are closed before running the program. | ||
- In order to get the best results please provide text files of the same languages. | ||
- Pdf files that are made from scanned images won't be processed correctly. | ||
- Ensure you have writing access when using the package | ||
- If a specific file is not processed correctly feel free to [contact me](mailto:<[email protected]>) so that I can address the issue. | ||
|
||
**TODO** | ||
--- | ||
- Add more tests | ||
- Add more tests on existing functions | ||
- Implement OCR with tesseract for scanned documents | ||
- Add info in console for timing (tqdm) | ||
- Add CSS to HTML Template | ||
- Add support for other folder structures | ||
- Fix Slate3k by installing custom fork | ||
- Add CSS to HTML Template to make the results better looking | ||
- Add support for other folder structures (right now the package is expecting one pdf files per folder) | ||
- Add custom naming option for pdf files instead of using first part before _ | ||
- Fix Slate3k by installing custom fork (check if still relevant) |
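
The `-s`, `--block-size` option above is described as the minimum number of consecutive and similar words detected. As a rough illustration of that idea only, the sketch below finds runs of consecutive matching words between two texts with the standard library's `difflib.SequenceMatcher` and derives a simple overlap percentage. The function names `matching_blocks` and `overlap_percentage` are hypothetical, and neither the use of `difflib` nor the percentage formula (matched words divided by the word count of the first text) is confirmed to be what copy-spotter does internally.

```python
from difflib import SequenceMatcher


def matching_blocks(text_a, text_b, block_size=2):
    """Return runs of consecutive words shared by both texts.

    Hypothetical helper: only runs of at least `block_size` words are kept,
    mirroring the README's description of the block size option.
    """
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent items in long sequences.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    blocks = []
    for match in matcher.get_matching_blocks():
        # The final block returned by difflib always has size 0, so the
        # size check also filters it out.
        if match.size >= block_size:
            blocks.append(words_a[match.a:match.a + match.size])
    return blocks


def overlap_percentage(text_a, text_b, block_size=2):
    """Share of words in text_a covered by matching blocks (assumed formula)."""
    words_a = text_a.lower().split()
    if not words_a:
        return 0.0
    matched = sum(len(block) for block in matching_blocks(text_a, text_b, block_size))
    return 100.0 * matched / len(words_a)


if __name__ == "__main__":
    first = "the quick brown fox jumps"
    second = "a quick brown fox sleeps"
    print(matching_blocks(first, second, block_size=2))
    print(overlap_percentage(first, second, block_size=2))
```

Raising `block_size` makes the comparison stricter: in this sketch, a three-word shared run is reported with `block_size=2` but dropped with `block_size=4`.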