The Italian version of this README can be found here.

We use Spark to cluster documents based on their content and their Wikipedia categories.
The project is split into different folders:

- `dataset`: contains the input dataset
- `output`: contains the processing output, such as intermediate Spark computations
- `results`: contains some `csv` reports used to make plots
- `latex`: contains the `tex` files of our report
- `src`: contains the code needed to perform computations and plots
To make handling classes and parameters simpler, we wrote a Python script, `make.py`. Its documentation is available through `python3 make.py --help`.
Example usage:

```
python make.py --class Cluster (...args for Java main...)
```
The relevant classes with the main Spark procedures can be found in `src/main/java/it/unipd/dei/dm1617/examples/` and are described in the next section. They are split into different groups, each one providing a different processing step. The parameters of each class can be found in its `main` method.
- preprocessing
  - `CategoriesPreprocessing.java`: counts articles per category
  - `TfidfCategories.java`: ranks categories by their relevance
  - `TfIdf.java`: builds the bag-of-words model
  - `Word2VecFit.java`: trains a word2vec model on the text corpus
  - `Doc2Vec.java`: loads the word2vec model and writes the vector corresponding to each document in the `output/` folder
- clustering
  - `Cluster.java`: clusters the input data and outputs the trained separation model (see the sketch after this list)
- result evaluation
  - `HopkinsStatistic.java`: computes the Hopkins statistic on the vectorized corpus
  - `EvaluationLDA.java`: inspects the LDA fit output
  - `NMIRankedCategories.java`: computes the NMI score considering only one category per document
  - `NMIOverlappingCategories.java`: computes the NMI score considering multiple categories per document
  - `SimpleSilhouetteCoefficient.java`: computes the simple silhouette score given a `(vector, clusterID)` dataset (a sketch of this computation also follows the list)
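
The preprocessing and clustering classes above follow a common Spark pattern: fit a word2vec model on the tokenized corpus, average the word vectors into one vector per document, and then cluster those vectors. The sketch below is not the project's code; it is a minimal, self-contained illustration of that pattern using Spark's DataFrame API on a made-up toy corpus (the class name `PipelineSketch` and all parameter values are hypothetical). The actual classes read the Wikipedia dump, are launched through `make.py`, and keep their intermediate results in `output/`.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class PipelineSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("wiki-clustering-sketch")
                .master("local[*]")
                .getOrCreate();

        // Toy tokenized corpus; the real project reads the Wikipedia dump instead.
        List<Row> docs = Arrays.asList(
                RowFactory.create(Arrays.asList("spark", "clusters", "wikipedia", "articles")),
                RowFactory.create(Arrays.asList("word2vec", "maps", "words", "to", "vectors")),
                RowFactory.create(Arrays.asList("kmeans", "groups", "similar", "vectors")));
        StructType schema = new StructType(new StructField[]{
                new StructField("text", DataTypes.createArrayType(DataTypes.StringType),
                        false, Metadata.empty())});
        Dataset<Row> corpus = spark.createDataFrame(docs, schema);

        // Fit word2vec; transform() averages the word vectors of each document
        // into a single document vector (roughly the Word2VecFit / Doc2Vec step).
        Word2VecModel w2v = new Word2Vec()
                .setInputCol("text")
                .setOutputCol("features")
                .setVectorSize(50)
                .setMinCount(0)
                .fit(corpus);
        Dataset<Row> vectorized = w2v.transform(corpus);

        // Cluster the document vectors with k-means (roughly the Cluster step).
        KMeansModel model = new KMeans()
                .setK(2)
                .setFeaturesCol("features")
                .setPredictionCol("cluster")
                .fit(vectorized);
        model.transform(vectorized).select("cluster", "text").show(false);

        spark.stop();
    }
}
```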
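
For reference, one common definition of the simple (simplified) silhouette replaces the pairwise-distance averages of the classic silhouette with distances to cluster centroids. The plain-Java sketch below shows that computation on an in-memory `(vector, clusterID)` dataset; the class name `SimplifiedSilhouetteSketch` is made up, and the exact formula used by `SimpleSilhouetteCoefficient.java` may differ.

```java
public class SimplifiedSilhouetteSketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Simplified silhouette: each point is compared only with cluster centroids,
     * instead of with every other point of each cluster.
     */
    static double simplifiedSilhouette(double[][] points, int[] clusterIds, int k) {
        int dim = points[0].length;

        // Compute the centroid of each cluster.
        double[][] centroids = new double[k][dim];
        int[] sizes = new int[k];
        for (int i = 0; i < points.length; i++) {
            int c = clusterIds[i];
            sizes[c]++;
            for (int j = 0; j < dim; j++) {
                centroids[c][j] += points[i][j];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int j = 0; j < dim; j++) {
                centroids[c][j] /= sizes[c];
            }
        }

        // For each point: a = distance to its own centroid, b = distance to the
        // closest foreign centroid, s = (b - a) / max(a, b). Average over all points.
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            double a = distance(points[i], centroids[clusterIds[i]]);
            double b = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                if (c != clusterIds[i]) {
                    b = Math.min(b, distance(points[i], centroids[c]));
                }
            }
            total += (b - a) / Math.max(a, b);
        }
        return total / points.length;
    }

    public static void main(String[] args) {
        // Two well-separated toy clusters: the score should be close to 1.
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        int[] clusterIds = {0, 0, 1, 1};
        System.out.println(simplifiedSilhouette(points, clusterIds, 2));
    }
}
```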
In `src/` and `results/` you can find Python scripts that process the `output/` files and build the relevant plots.
`src/hierarchicalClustering.py` tried to use the `scipy` clustering library, but it was dropped because of its RAM requirements.