AOM 2024: Get To Know Your [Text]

This repository shares the materials associated with my presentation "Get To Know Your [Text]" at the 2024 Academy of Management Conference. The presentation is part of the Professional Development Workshop (PDW) titled "Framing Novelty: New perspectives on language and communication in entrepreneurship and innovation research."

While I provide high-level descriptions of how to get started with the materials, I cannot provide detailed support or instruction on how to use Python or implement these tools. For that, the resources listed at the bottom of this document can help you get started with Python and the tools I introduce. I am still happy to answer questions within the scope of my presentation.

High-Level Initial Setup Instructions

These instructions assume a bit of tech-savviness. If that's not you, or you get lost in one step or another, there are step-by-step details on the detailed setup instructions page.

Install Python. I used Python 3.11, but any relatively recent version should work.
Clone this GitHub repository to your computer.
(Suggested) Create and activate a virtual environment for this repository.
Install the required Python packages from the requirements.txt file.
- If you want CUDA acceleration, you can upgrade the spaCy package following details on the spaCy website.
Install Ollama and download the nomic-embed-text embedding model.
Modify the settings.json file to customize the analyses to your needs. Details about the settings file are provided in the configuration instructions.
- At a minimum, you will need to replace "YourEmailHere" with your email address in the Crossref -> User-Agent field.
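Before moving on, it can help to confirm the two setup steps that are easiest to miss. The sketch below is not part of the repository: the function names are hypothetical, it searches every string in settings.json for the placeholder rather than assuming the file's exact layout, and it assumes Ollama is listening on its default local address (http://localhost:11434).

```python
import json
import urllib.error
import urllib.request
from pathlib import Path

PLACEHOLDER = "YourEmailHere"  # placeholder text named in the setup steps


def find_placeholder(value, path=""):
    """Return the key path of every string that still contains the placeholder."""
    hits = []
    if isinstance(value, dict):
        for key, child in value.items():
            hits += find_placeholder(child, f"{path}/{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            hits += find_placeholder(child, f"{path}/{index}")
    elif isinstance(value, str) and PLACEHOLDER in value:
        hits.append(path)
    return hits


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if a server answers with HTTP 200 at base_url.

    The default is Ollama's standard local address (an assumption about your install).
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def check_setup(settings_path="settings.json"):
    """Collect readable problems; an empty list means both checks passed."""
    problems = []
    settings = json.loads(Path(settings_path).read_text(encoding="utf-8"))
    for hit in find_placeholder(settings):
        problems.append(f"Replace {PLACEHOLDER} at {hit}")
    if not ollama_is_running():
        problems.append("Ollama server not reachable; start it before 4_embeddings.py")
    return problems
```

Calling `check_setup()` from the repository root returns a list of anything that still needs attention. It does not check that the nomic-embed-text model has been downloaded.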

Setup Complete!

High-Level Instructions for Running the Analyses

These instructions assume a bit of tech-savviness. If that's not you, or you get lost in one step or another, there are step-by-step details on the running the code page.

Run 1_collect_data.py to collect the data from Crossref.
- If you have not replaced the "YourEmailHere" in the settings file, you will need to do so before running this script.
- Crossref is missing some abstracts (notably from JBV). If you have access to Scopus or another source of titles and abstracts, you can create a scopus_download.parquet file in the data/ folder containing the missing abstracts (only the 'title' and 'abstract' fields are needed). If the file is present, the script automatically uses it to fill in the blanks. If it is not, the script leaves those abstracts blank and the next Python script drops them as 'missing data'.
  - Unfortunately, I cannot provide you with my datafile from Scopus.
  - The code will work fine without filling in missing abstracts, but you will not have abstracts for some articles/journals (e.g., JBV).
Run 2_preprocess_abstracts.py to preprocess the abstracts.
Run 3_scattertext.py to create the Scattertext visualizations.
- The output will be saved in the output/ folder.
Run 4_embeddings.py to create the embeddings and t-SNE visualizations.
- You will need to have the Ollama server running in a separate terminal window for this script to work.
- The embedding similarity results will be displayed in the terminal.
- The scatterplot output will be saved in the output/ folder.
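For the optional fill-in file mentioned under 1_collect_data.py, the following is a minimal sketch of one way to build it from your own export. The helper names are hypothetical, and I am assuming that a plain two-column table is all the script needs; how the script matches these rows to Crossref records is not shown here, so keep titles as close to the published form as you can. Writing parquet with pandas also requires pyarrow or fastparquet to be installed.

```python
from pathlib import Path


def build_fill_in_rows(records):
    """Keep only 'title' and 'abstract', dropping rows where either is blank."""
    rows = []
    for record in records:
        # Surrounding whitespace is trimmed; everything else is left untouched.
        title = (record.get("title") or "").strip()
        abstract = (record.get("abstract") or "").strip()
        if title and abstract:
            rows.append({"title": title, "abstract": abstract})
    return rows


def write_fill_in_file(records, path="data/scopus_download.parquet"):
    """Write the cleaned rows to a parquet file and return its path."""
    import pandas as pd  # imported here so the helper above works without pandas

    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    frame = pd.DataFrame(build_fill_in_rows(records), columns=["title", "abstract"])
    frame.to_parquet(target, index=False)
    return target
```

Here `records` is any list of dictionaries, for example the rows of a CSV export read with `csv.DictReader` after renaming its columns to `title` and `abstract`.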

Repository Structure

data/: Contains the data used in the analyses. This folder is created by the code provided and does not need to be created manually. You generally do not need to modify its contents. The one exception is for data collection, which is described in the details on running the 1_collect_data.py code.
docs/: Contains documentation for the repository. If you need detailed instructions on something, check here.
output/: Contains the output of the analyses - this folder will be created by the code provided and does not need to be created manually.
src/: Contains the 'behind-the-scenes' Python code used in the analyses. You generally do not need to modify the contents of this folder. However, you may benefit from looking at it if you want to see under the hood.
venv/: Contains the virtual environment for this repository. You probably created this during the setup process, if you followed the detailed setup instructions.
settings.json: Contains the settings for the analyses. Leave this as-is to run the analyses like I did for the presentation. Or, edit it to customize the analyses to your needs. Details about the settings file are provided in the configuration instructions.
1_collect_data.py, 2_preprocess_abstracts.py, 3_scattertext.py, 4_embeddings.py: These Python scripts run the analyses. You can run them in order to replicate the analyses I presented. See the details on running the code for more.
- Scripts 1 and 2 need to be run first, in that order. Scripts 3 and 4 can then be run in either order.
__init__.py, .gitignore, LICENSE, README.md, requirements.txt: These files are used to manage the repository and provide information about the repository.
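The ordering rule for the four scripts can be sketched as a small runner. This is an illustration rather than something shipped in the repository: `plan_run` and `run_pipeline` are hypothetical names, and the sketch assumes you launch it from the repository root with your virtual environment active.

```python
import subprocess
import sys

# Scripts 1 and 2 must run first, in this order; 3 and 4 may follow in either order.
REQUIRED_FIRST = ["1_collect_data.py", "2_preprocess_abstracts.py"]
ANALYSES = ["3_scattertext.py", "4_embeddings.py"]


def plan_run(selected_analyses=None):
    """Return the scripts to run, always starting with the two required steps."""
    chosen = ANALYSES if selected_analyses is None else selected_analyses
    unknown = [name for name in chosen if name not in ANALYSES]
    if unknown:
        raise ValueError(f"Unknown analysis script(s): {unknown}")
    return REQUIRED_FIRST + list(chosen)


def run_pipeline(selected_analyses=None):
    """Run each planned script with the current interpreter, stopping on the first failure."""
    for script in plan_run(selected_analyses):
        subprocess.run([sys.executable, script], check=True)
```

For example, `run_pipeline(["4_embeddings.py"])` would run scripts 1, 2, and 4. Note that this sketch always repeats the data collection and preprocessing steps, so once those have completed you may prefer to run scripts 3 and 4 directly.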

Resources

Learn More About
- Git/GitHub
  - Free ebook: Pro Git
  - Good YouTube video: freeCodeCamp.org
- Python
  - Free ebook: Automate the Boring Stuff with Python
  - Good YouTube channel for Python content: Corey Schafer
- Scattertext
  - Scattertext documentation
  - YouTube video
- Word Embeddings
Packages I use
- XML parsing: Beautiful Soup with lxml
- Vector database: Chroma
- Phrase extraction: gensim
- Embedding/LLM integration framework: langchain
- Stop word corpus: nltk
- LLM integration package: ollama
- Data management: pandas
- Scatterplot visualizations: plotly
- API requests: requests
- Scattertext visualizations: scattertext
- t-SNE implementation: scikit-learn
- NLP pipeline: spaCy
Data Sources/Models
- Article Abstracts
  - Main: Crossref
  - Fill-in-the-blanks: Scopus
    - Note: Scopus requires a subscription or access through an institution. I cannot provide access to Scopus or the data I used.
- Embedding Model
  - nomic-embed-text

A Final Note

I structured the repository this way to make it easy for you to tweak/run the analyses you want without having to dig into the code. However, I can't guarantee that the code will work perfectly for all tweaks because it's difficult to predict what people will want to tweak (and time-consuming to try to do so). It also may not be the most efficient way to do what you want to do. You may (will likely, is more like it) want to structure things differently on your research projects. So, please feel free to make whatever changes you want to the code/settings themselves to make them work better for you.

On Stars and Citations

If you do like/use this repository, please consider giving it a "star" on GitHub so that I know that people are using it.

If you use my code directly in your research, please cite my presentation at the Academy of Management Conference. If you just use this for inspiration, no citation is needed - just enjoy (a star on the repository is still appreciated though 😉)!

McKenny, A. F. (2024, August 10). Get to know your [text] [Conference Presentation]. Academy of Management Annual Meeting, Chicago, IL, United States.

Repository: amckenny/AOM2024_GetToKnowYourText