`libfnl`™

Introduction

libfnl is an API and CLI that facilitates data and text mining through a collection of easy-to-use tools. The library works with Python 3 only. It is tuned towards mining biomedical and scientific texts, but can be used in other contexts as well. It complements the gnamed gene name repository daemon and the medic PubMed mirroring tool collection. In addition, an (orphan) couchpy repository could provide a document storage facility.

The library contains the following packages:

fnl.nlp: tools to linguistically analyze text (tokenization, PoS tagging, phrase chunking, entity detection), plus modules to segment sentences (based on NLTK) and to map text (strings) to entries in dictionaries; this includes a Python wrapper for the GENIA Tagger, a Python wrapper for the NER Suite, and a handler for the GENIA corpus; furthermore, a Maximum Entropy classifier is available via NLTK's wrapper for MegaM
fnl.stat: a module to evaluate inter-rater Kappa scores and a module to develop text classifiers based on Scikit-Learn
fnl.text: wrappers to work with text data (strings, tokens, segments, annotations, etc.)
fnl.utils: additional utilities and tools (currently, just for handling JSON)
scripts: the CLI scripts to manage data/text, representing the main value provided by this collection
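
To illustrate what an inter-rater Kappa score measures, here is a minimal sketch of Cohen's kappa for two raters. It is not the code in fnl.stat: the function name cohen_kappa is hypothetical, and the choice of Cohen's variant is an assumption, since this README does not say which Kappa statistics the module implements.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same, non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["gene", "gene", "other", "other"]
annotator_2 = ["gene", "other", "other", "other"]
print(cohen_kappa(annotator_1, annotator_2))
```

A value of 1 means perfect agreement, while 0 means the raters agree no more often than chance would predict.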

The script directory provides the following command-line interfaces:

fnlclassi: generate a classifier for [NER-tagged] text using Scikit-Learn.
fnlcorpus: store corpora in JSON format in a CouchDB.
fnldgrep: "grep" for tokens using a dictionary.
fnldictag: tag semantic tokens from a dictionary in linguistically annotated text.
fnlgpcounter: count gene/protein symbols in MEDLINE.
fnlkappa: calculate inter-rater agreement scores.
fnlsegment: segment text into sentences using NLTK (PunktSentenceTokenizer).
fnlsegtrain: train an NLTK PunktSentenceTokenizer (nltk.tokenize.punkt).
fnltok: a fast, pure-Python, Unicode-aware string tokenizer.
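
As a rough idea of what a pure-Python, Unicode-aware tokenizer does, the following sketch splits a string into runs of letters, runs of digits, and single symbols using the standard re module. It is not the algorithm used by fnltok; the names TOKEN_PATTERN and tokenize are hypothetical and chosen for illustration only.

```python
import re

# In Python 3, \w, \d, and \s are Unicode-aware for str patterns.
# Alternatives: a run of letters, a run of digits, one symbol, or an underscore.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def tokenize(text):
    """Split text into letter, digit, and symbol tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("IL-2Rα binds 42 sites."))
```

Splitting letters from digits and symbols keeps compound names such as gene symbols recoverable as token sequences, which is one plausible design for dictionary matching.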

Warning

This project is under "continuous development"; you are better off taking your own snapshot.

Requirements

Python 3.2+
NumPy, SciPy, and Scikit-Learn 0.14+ (for fnlclassi)
NLTK 3.0+ (for the sentence segmenting tools fnlseg*)
DAWG (for fnlgpcounter; see Installation below)
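
A quick way to see which of these requirements are present in the active environment is sketched below. The import names in REQUIRED_MODULES are assumptions about how each distribution is imported, the helper missing_modules is hypothetical, and the script only checks importability, not the minimum versions listed above.

```python
import importlib
import sys

# Assumed import names for NumPy, SciPy, Scikit-Learn, NLTK, and DAWG.
REQUIRED_MODULES = ["numpy", "scipy", "sklearn", "nltk", "dawg"]


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    if sys.version_info < (3, 2):
        sys.exit("libfnl requires Python 3.2 or newer")
    absent = missing_modules(REQUIRED_MODULES)
    if absent:
        print("missing modules: " + ", ".join(absent))
    else:
        print("all required modules are importable")
```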

Optional projects that work together with this project:

GENIA Tagger (optional, latest version)
NER Suite (optional, latest version, in turn requires CRF Suite)
MegaM - a MaxEnt classifier for NLTK with a (fast) L-BFGS optimizer
gnamed for creating gene/protein name repositories
medic for mirroring and handling PubMed citations
txtfnnl for natural language processing based on Apache OpenNLP and UIMA

Installation

Into a Python 3 virtual environment:

pip install virtualenv # if virtualenv is not yet installed
git clone git://github.com/fnl/libfnl.git libfnl
virtualenv libfnl
cd libfnl
. bin/activate
pip install argparse # for python3 < 3.2
pip install numpy # because installing scipy fails if numpy isn't installed already
pip install -e . # installs all other dependencies

# if you prefer to install all other dependencies manually
# and/or prefer to use setup.py instead of pip:
# python setup.py install
pip install sqlalchemy
pip install scikit-learn # the PyPI package name; it is imported as sklearn
pip install matplotlib
pip install nltk --pre # to get 3.0

# if you want to install the test environment:
pip install pytest

# special steps to install DAWG
git clone git@github.com:fnl/DAWG.git
cd DAWG
python setup.py install
cd ..

License

All parts of this library are licensed under the GNU Affero GPL v3.

See the attached LICENSE.txt file.
