Skip to content

maitrang-ng/FrenchKeyBert

Repository files navigation

Keywords extraction

Introduction

This program is used to extract keywords, keypharses from a French text. This program now supports as a CLI app. The program will accept text from CLI, txt file, and json format. For option of using server, please insert json format and use postman to test.

Prerequistites

Please ensure the following requirements are satisfied:

  • OS system: Linux or Mac OS
  • Python 3
  • You have installed Virtual Environement for Python (pipenv)

Installation

Open your CLI, go to the project path and run following commands

  • pipenv shell Initialize the Virtual Environement in your project directory
  • pipenv install -r requirements.txt Install all requirements for the project
  • pipenv check Check if there is any conflict in the environement
  • python3 -m spacy download fr_core_news_lg Download this model from spacy

Usage

If you want to type the text directly from your CLI, use following command

  • python3 -m run --text You can also insert your txt file path by using following command
  • python3 -m run -t yourtxtfilepath For choosing server option, you need to insert json format with method POST
  • python3 -m run -s

Citations

  • Keybert @misc{grootendorst2020keybert, author = {Maarten Grootendorst}, title = {KeyBERT: Minimal keyword extraction with BERT.}, year = 2020, publisher = {Zenodo}, version = {v0.3.0}, doi = {10.5281/zenodo.4461265}, url = {https://doi.org/10.5281/zenodo.4461265} }
  • Sbert @article{reimers2019sentence, title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks}, author={Nils Reimers, Iryna Gurevych}, journal={https://arxiv.org/abs/1908.10084}, year={2019} }
  • Camembert @article{martin2020camembert, title={CamemBERT: a Tasty French Language Mode}, author={Martin, Louis and Muller, Benjamin and Su{'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {'E}ric Villemonte and Seddah, Djam{'e} and Sagot, Beno{^\i}t}, journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}, year={2020} }

References

Keywords extraction

Models

TODO

Improve performance of program

  • Improve filter POS speed
  • Multi processing of spacy model
  • Model of Alibaba

Improve accuracy

  • Re-training Sbert
  • Preprocessing with Regex
  • Preprocessing HTML file
  • Preprocessing PDF file

Next steps

  • Language detection
  • Topic detection

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages