This tool classifies physics articles from arXiv as technical or non-technical. Each audience covers the following arXiv physics categories:
Non-Technical Audience:
•General Physics
•History and Philosophy of Physics
•Physics and Society
•Physics Education
•Popular Physics
Technical Audience:
•Accelerator Physics
•Atmospheric and Oceanic Physics
•Atomic and Molecular Clusters
•Atomic Physics
•Biological Physics
•Chemical Physics
•Classical Physics
•Computational Physics
•Data Analysis, Statistics and Probability
•Fluid Dynamics
•Geophysics
•Instrumentation and Detectors
•Medical Physics
•Optics
•Plasma Physics
•Space Physics
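The two lists above amount to a lookup from category name to audience. A minimal sketch of that lookup follows; the names NON_TECHNICAL, TECHNICAL, and audience_for are hypothetical and are not taken from main.py, which may organize the labels differently.

```python
# Hypothetical lookup built from the two category lists above.
# These names are illustrative and not taken from the repository.
NON_TECHNICAL = {
    "General Physics",
    "History and Philosophy of Physics",
    "Physics and Society",
    "Physics Education",
    "Popular Physics",
}

TECHNICAL = {
    "Accelerator Physics",
    "Atmospheric and Oceanic Physics",
    "Atomic and Molecular Clusters",
    "Atomic Physics",
    "Biological Physics",
    "Chemical Physics",
    "Classical Physics",
    "Computational Physics",
    "Data Analysis, Statistics and Probability",
    "Fluid Dynamics",
    "Geophysics",
    "Instrumentation and Detectors",
    "Medical Physics",
    "Optics",
    "Plasma Physics",
    "Space Physics",
}


def audience_for(category: str) -> str:
    """Return 'non-technical' or 'technical' for a listed category name."""
    name = category.strip()
    if name in NON_TECHNICAL:
        return "non-technical"
    if name in TECHNICAL:
        return "technical"
    # Fail loudly rather than guess a label for an unlisted category.
    raise ValueError(f"Unlisted category: {category!r}")


if __name__ == "__main__":
    print(audience_for("Popular Physics"))
    print(audience_for("Optics"))
```

Raising an error for unlisted names keeps a misspelled or unexpected category from being silently assigned to either audience.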
Use the package manager pip to install the necessary dependencies.
pip install pandas
pip install pdfminer
pip install nltk
pip install -U scikit-learn
# os is part of the Python standard library and does not need to be installed
pip install plotly
conda install -c plotly plotly-orca
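To confirm the installation before running anything, the packages can be checked for importability. This is a small sketch, not part of the repository; the module list is an assumption based on the install commands above (scikit-learn is imported as sklearn), and missing_modules is a hypothetical helper name.

```python
from importlib.util import find_spec

# Import names assumed for the packages installed above
# (scikit-learn is imported under the name "sklearn").
REQUIRED_MODULES = ["pandas", "pdfminer", "nltk", "sklearn", "plotly"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```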
Run main.py if using Python 3
Run main.ipynb if using Jupyter Notebook
Note: main_directory must be set to the full path of the directory where your physics files are located
main_directory = 'directory repository was cloned to' + r'\physics'
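The string concatenation above relies on a Windows-style backslash. As a portable alternative, the path could be built with pathlib. This is only a sketch: physics_directory is a hypothetical helper, and Path.cwd() stands in for wherever the repository was cloned.

```python
from pathlib import Path


def physics_directory(clone_dir):
    """Return the 'physics' folder inside the cloned repository as a Path."""
    return Path(clone_dir) / "physics"


if __name__ == "__main__":
    # Path.cwd() is a placeholder for the directory the repository was cloned to.
    main_directory = physics_directory(Path.cwd())
    print(main_directory)
    print("exists:", main_directory.is_dir())
```

If main.py expects a plain string, str(main_directory) gives the path using the separator of the current operating system.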
Download the documents from Kaggle. In this experiment, only the documents in the physics folder were downloaded: https://www.kaggle.com/Cornell-University/arxiv
pip install kaggle
The files are hosted in a Google Cloud Storage bucket and can be accessed with gsutil:
# List files available in the arXiv bucket
gsutil ls gs://arxiv-dataset/arxiv/
# Download all of the files in the physics folder into a local directory
gsutil cp -r gs://arxiv-dataset/arxiv/physics/ ./a_local_directory/
If using a regular Python 3 environment:
First run convertpdf.py
Note:
MODIFY: Set the input directory to the location where you downloaded the files
MODIFY: Set lines 72 and 77 to the directory where you would like to save
the files; this should be the same directory as main.py.
Next run main.py
If using the Jupyter Notebook environment:
First run convertpdf.ipynb
Note:
MODIFY: Set the input directory to the location where you downloaded the files
MODIFY: Set lines 72 and 77 to the directory where you would like to save
the files; this should be the same directory as main.ipynb.
Next run main.ipynb
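For orientation, the conversion step (read PDFs from the download directory, write text files next to the main script) can be sketched as follows. This is not the code in convertpdf.py: convert_folder is a hypothetical function, and the text extractor is passed in as a parameter so the sketch makes no assumption about which pdfminer call the repository uses.

```python
from pathlib import Path


def convert_folder(pdf_dir, text_dir, extract):
    """Write one .txt file per PDF found under pdf_dir; return the paths written.

    `extract` is any callable that takes a PDF path (str) and returns its text.
    Files that share a stem in different subfolders would overwrite each other.
    """
    pdf_dir, text_dir = Path(pdf_dir), Path(text_dir)
    text_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # rglob also picks up PDFs in nested subfolders of the download directory.
    for pdf_path in sorted(pdf_dir.rglob("*.pdf")):
        out_path = text_dir / (pdf_path.stem + ".txt")
        out_path.write_text(extract(str(pdf_path)), encoding="utf-8")
        written.append(out_path)
    return written
```

With pdfminer.six installed, extract_text from pdfminer.high_level could be passed as the extract argument; whether the installed pdfminer package exposes that function depends on its version, so treat this as an assumption to verify.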