Domain-Specific LM Papers

A compilation of papers on domain-specific language model training and evaluation. We focus on language models trained for biomedicine, finance, law, education, and other specialized domains.

0. Surveys

Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models. Ling, Chen et al. [abs], 2023

Do We Still Need Clinical Language Models? Lehman, Eric P. et al. [abs], 2023

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers. Tay, Yi et al. [abs], 2021

The Shaky Foundations of Clinical Foundation Models: A Survey of Large Language Models and Foundation Models for EMRs. Wornow, Michael et al. [abs], 2023

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity. Longpre, S. et al. [abs], 2023

Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks. Himmi, Anas et al. [abs], 2023

OpenAGI: When LLM Meets Domain Experts. Ge, Yingqiang et al. [abs], 2023

Domain Mastery Benchmark: An Ever-Updating Benchmark for Evaluating Holistic Domain Knowledge of Large Language Model-A Preliminary Release. Gu, Zhouhong et al. [abs], 2023

Adapting a Language Model While Preserving its General Knowledge. Ke, Zixuan et al. [abs], Conference on Empirical Methods in Natural Language Processing, 2023.

Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations. Chen, Qingyu et al. [abs], 2023

1. Domain Specific Pre-Training

1.1 Pre-Training from Scratch

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. Gu, Yu et al. [doi], 2020.

BioMedLM: a Domain-Specific Large Language Model for Biomedical Text. Venigalla, Abhinav et al. [doi], 2022.

BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model. Yuan, Hongyi et al. [workshop], 2022.

SecureBERT: A Domain-Specific Language Model for Cybersecurity. Aghaei, Ehsan et al. 2022.

LEGAL-BERT: The Muppets straight out of Law School. Chalkidis, Ilias et al. [abs], 2020

DarkBERT: A Language Model for the Dark Side of the Internet. Jin, Youngjin et al. [abs], 2023

A Japanese Masked Language Model for Academic Domain. Yamauchi, Hiroki et al. [SDP], 2022.

Galactica: A Large Language Model for Science. Taylor, Ross et al. [abs], 2022

Language Model for Statistics Domain. Jeong, Young-Seob et al. [doi], 2022

SsciBERT: a pre-trained language model for social science texts. Shen, Si et al. [doi], 2022.

XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters. Zhang, Xuanyu et al. [abs], 2023

Strategy to Develop a Domain-specific Pre-trained Language Model: Case of V-BERT, a Language Model for the Automotive Industry. Kim, Younha et al. [source], 2023

ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law. Licari, Daniele and Giovanni Comandé. [conference], 2022.

Unifying Molecular and Textual Representations via Multi-task Language Modelling. Christofidellis, Dimitrios et al. [abs], 2023

Is Domain Adaptation Worth Your Investment? Comparing BERT and FinBERT on Financial Tasks. Peng, Bo et al. [proceedings], 2021

PathologyBERT - Pre-trained Vs. A New Transformer Language Model for Pathology Domain. Santos, Thiago et al. [proceedings], 2022

BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining. Luo, Renqian et al. [article], 2022

AraLegal-BERT: A pretrained language model for Arabic Legal text. Al-Qurishi, Muhammad et al. [abs], 2022

ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence. Hu, Yibo et al. [conf], 2022

MFinBERT: Multilingual Pretrained Language Model For Financial Domain. Nguyen, Duong et al. [doi], 2022.

AKI-BERT: a Pre-trained Clinical Language Model for Early Prediction of Acute Kidney Injury. Mao, Chengsheng et al. [abs], 2022

Bioformer: An Efficient Transformer Language Model for Biomedical Text Mining. Fang, Li et al. [arXiv], 2023

TourBERT: A pretrained language model for the tourism industry. Arefieva, Veronika and Roman Egger. [abs], 2022

LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. Chalkidis, Ilias et al. [abs], 2023

PoliBERTweet: A Pre-trained Language Model for Analyzing Political Content on Twitter. Kawintiranon, Kornraphop and Lisa Singh. [conference], 2022.

CiteCaseLAW: Citation Worthiness Detection in Caselaw for Legal Assistive Writing. Khatri, Mann et al. [abs], 2023

MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain. Bressem, Keno Kyrill et al. [abs], 2023

Constructing and analyzing domain-specific language model for financial text mining. Suzuki, Masahiro et al. [doi], 2023

ChestXRayBERT: A Pretrained Language Model for Chest Radiology Report Summarization. Cai, Xiaoyan et al. [doi], 2023.
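A recurring argument in the from-scratch pretraining papers above (e.g., Gu et al.'s work on biomedical pretraining) is that an in-domain vocabulary avoids shattering domain terms into many subwords. The toy sketch below illustrates that effect with a hand-made vocabulary and a simplified greedy WordPiece segmenter; it is not any model's actual tokenizer:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece segmentation (toy version)."""
    pieces, i = [], 0
    while i < len(word):
        j = len(word)
        # Shrink the candidate span until it appears in the vocabulary.
        while j > i and (word[i:j] if i == 0 else "##" + word[i:j]) not in vocab:
            j -= 1
        if j == i:  # no piece matched at all
            return ["[UNK]"]
        pieces.append(word[i:j] if i == 0 else "##" + word[i:j])
        i = j
    return pieces

# A "general" vocabulary fragments the medical term; a "domain" vocabulary
# that includes the full word keeps it as one token.
general_vocab = {"ne", "##phro", "##path", "##y"}
domain_vocab = {"nephropathy"} | general_vocab

print(wordpiece("nephropathy", general_vocab))  # ['ne', '##phro', '##path', '##y']
print(wordpiece("nephropathy", domain_vocab))   # ['nephropathy']
```

Fewer fragments per domain term means more of the model's context window and capacity goes to real words rather than subword glue.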

1.2 Further Pre-Training

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. Gururangan, Suchin et al. [abs], 2020

SciBERT: A Pretrained Language Model for Scientific Text. Beltagy, Iz et al. [conf], 2019

Gradual Further Pre-training Architecture for Economics/Finance Domain Adaptation of Language Model. Sakaji, Hiroki et al. [doi], 2022.
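Don't Stop Pretraining (Gururangan et al.) motivates further pretraining in part by measuring how little the target domain's vocabulary overlaps with the original pretraining data. A minimal sketch of that measurement, on hypothetical toy corpora:

```python
from collections import Counter

def top_k_vocab(texts, k=10000):
    """Return the k most frequent lowercase word types in a corpus."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w for w, _ in counts.most_common(k)}

def vocab_overlap(corpus_a, corpus_b, k=10000):
    """Jaccard overlap between the top-k vocabularies of two corpora.
    Low overlap suggests domain-adaptive (further) pretraining may help."""
    va, vb = top_k_vocab(corpus_a, k), top_k_vocab(corpus_b, k)
    return len(va & vb) / max(1, len(va | vb))

general = ["the cat sat on the mat", "stocks and news of the day"]
biomed = ["the patient received metformin", "renal function of the patient"]
print(round(vocab_overlap(general, biomed), 3))
```

In the paper this kind of dissimilarity correlates with the gains from continued pretraining on the domain corpus.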

1.3 Mixed Pre-Training

BloombergGPT: A Large Language Model for Finance. Wu, Shijie et al. [abs], 2023
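BloombergGPT trains on a corpus that is roughly half financial and half general-purpose text. A minimal sketch of sampling from such a fixed mixture; the corpus names, contents, and weights here are illustrative, not Bloomberg's actual recipe:

```python
import random
from collections import Counter

def sample_mixture(corpora, weights, n, seed=0):
    """Draw n training examples from several corpora under fixed mixture
    weights, as in mixed general+domain pretraining."""
    rng = random.Random(seed)
    names = list(corpora)
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=[weights[k] for k in names])[0]
        out.append((name, rng.choice(corpora[name])))
    return out

corpora = {
    "finance": ["10-K filing ...", "earnings call transcript ..."],
    "general": ["web page ...", "wiki article ..."],
}
batch = sample_mixture(corpora, {"finance": 0.5, "general": 0.5}, 8)
print(Counter(name for name, _ in batch))
```

Over many draws the realized proportions converge to the configured weights, so the model sees domain and general text in the intended ratio.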

1.4 Fine-Tuning

Large Language Models Encode Clinical Knowledge. Singhal, K. et al. [abs], 2022

Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding. Toma, Augustin et al. [abs], 2023

ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge. Li, Yunxiang et al. [abs], 2023

Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering. Wang, Zezhong et al. [abs], 2023

ExpertPrompting: Instructing Large Language Models to be Distinguished Experts. Xu, Benfeng et al. [abs], 2023

SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models. Thangarasa, Vithursan et al. [abs], 2023

MedJEx: A Medical Jargon Extraction Model with Wiki’s Hyperlink Span and Contextualized Masked Language Model Score. Kwon, Sunjae et al. [proceedings], 2022.

Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task. Wu, Zihao et al. [abs], 2023

Flan-MoE: Scaling Instruction-Finetuned Language Models with Sparse Mixture of Experts. Shen, Sheng et al. [abs], 2023

PMC-LLaMA: Further Finetuning LLaMA on Medical Papers. Wu, Chaoyi et al. [abs], 2023
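Several of the fine-tuning papers above (e.g., ChatDoctor, PMC-LLaMA) convert domain QA pairs into instruction-style training text before fine-tuning a LLaMA-family model. A minimal sketch of that formatting step using an Alpaca-style template; the field names and the example record are hypothetical:

```python
def format_instruction(record, template=None):
    """Render an (instruction, input, output) record into a single training
    string, in the style of Alpaca-like instruction-tuning datasets."""
    template = template or (
        "### Instruction:\n{instruction}\n\n"
        "### Input:\n{input}\n\n"
        "### Response:\n{output}"
    )
    return template.format(**record)

record = {
    "instruction": "Answer the patient's question.",
    "input": "What is metformin used for?",
    "output": "Metformin is a first-line medication for type 2 diabetes.",
}
print(format_instruction(record))
```

The resulting strings are then tokenized and used as supervised fine-tuning targets.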

1.5 Beyond Language Modeling

LinkBERT: Pretraining Language Models with Document Links. Yasunaga, Michihiro et al. [abs], 2022

Deep Bidirectional Language-Knowledge Graph Pretraining. Yasunaga, Michihiro et al. [abs], 2022

BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks. Zhang, Kaiyuan et al. [abs], 2023

Exploiting Language Characteristics for Legal Domain-Specific Language Model Pretraining. Nair, Inderjeet and Natwar Modani. [Findings], 2023.

Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model. Wang, Xiao et al. [abs], 2023

OPAL: Ontology-Aware Pretrained Language Model for End-to-End Task-Oriented Dialogue. Chen, Zhi et al. [article], 2022.

Editing Language Model-based Knowledge Graph Embeddings. Cheng, Siyuan et al. [abs], 2023

CaseEncoder: A Knowledge-enhanced Pre-trained Model for Legal Case Encoding. Ma, Yixiao et al. [abs], 2023

KALA: Knowledge-Augmented Language Model Adaptation. Kang, Minki et al. [conference], 2022

Patton: Language Model Pretraining on Text-Rich Networks. Jin, Bowen et al. [abs], 2023

1.6 Tool Use

GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information. Jin, Qiao et al. [arXiv], 2023

Almanac: Knowledge-Grounded Language Models for Clinical Medicine. Zakka, Cyril et al. [abs], 2023

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. Gou, Zhibin et al. [abs], 2023
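The tool-use papers above share one mechanism: the model emits a structured call, a controller executes it against a domain API, and the result is spliced back into the text (GeneGPT does this with NCBI Web APIs). A toy dispatcher sketch; the `[tool(arg)]` syntax and the registered tools are invented stand-ins, not any paper's actual interface:

```python
import re

# Hypothetical tool registry; a real system would expose domain APIs
# (e.g., GeneGPT calls NCBI E-utilities) rather than these stand-ins.
TOOLS = {
    "gene_lookup": lambda symbol: {"TP53": "tumor protein p53"}.get(symbol, "unknown"),
    "calc": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only
}

CALL = re.compile(r"\[(\w+)\((.*?)\)\]")

def run_with_tools(model_output):
    """Scan a model's output for [tool(arg)] calls, execute each registered
    tool, and splice its result back into the text."""
    def dispatch(m):
        name, arg = m.group(1), m.group(2)
        return TOOLS[name](arg) if name in TOOLS else m.group(0)
    return CALL.sub(dispatch, model_output)

print(run_with_tools("TP53 encodes [gene_lookup(TP53)]; 2+2 = [calc(2+2)]."))
```

In the papers, the tool result is fed back into the model's context so generation can continue conditioned on it.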

2. Using Domain-Knowledge in Large Language Models

2.1 Black-Box Retrieval Augmentation

REPLUG: Retrieval-Augmented Black-Box Language Models. Shi, Weijia et al. [abs], 2023

When Giant Language Brains Just Aren't Enough! Domain Pizzazz with Knowledge Sparkle Dust. Nguyen, Minh-Tien et al. [abs], 2023
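REPLUG's core idea is to treat the LM as a frozen black box: retrieve the top-k documents, prepend each one to the input separately, and ensemble the LM's output distributions under normalized retrieval scores. A minimal sketch of the retrieval and prompt-construction half, with a bag-of-words retriever standing in for REPLUG's dense retriever:

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words term vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def replug_prompts(query, docs, k=2):
    """REPLUG-style input construction for a frozen, black-box LM: retrieve
    top-k documents, prepend each to the query separately, and return
    (prompt, normalized retrieval weight) pairs. The LM's next-token
    distributions would then be averaged under these weights."""
    scored = sorted(docs, key=lambda d: bow_cosine(query, d), reverse=True)[:k]
    weights = [bow_cosine(query, d) for d in scored]
    z = sum(weights) or 1.0
    return [(f"{d}\n\n{query}", w / z) for d, w in zip(scored, weights)]

docs = [
    "metformin treats type 2 diabetes",
    "interest rates affect bond prices",
    "diabetes management and metformin dosing",
]
for prompt, w in replug_prompts("metformin dosing for diabetes", docs):
    print(round(w, 2), prompt.splitlines()[0])
```

Because only the input changes, this works with any API-only model whose weights are inaccessible, which is exactly the black-box setting this section covers.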

2.2 Retrieval-Based Pre-Training

Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models. Pan, Xiaoman et al. [abs], 2022

Atlas: Few-shot Learning with Retrieval Augmented Language Models. Izacard, Gautier et al. [abs], 2022

2.3 Generalist and Domain-Specific Ensembles

Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models. Li, Margaret et al. [abs], 2022

CooK: Empowering General-Purpose Language Models with Modular and Collaborative Knowledge. Feng, Shangbin et al. [abs], 2023

Scaling Expert Language Models with Unsupervised Domain Discovery. Gururangan, Suchin et al. [abs], 2023
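These ensemble papers train independent domain experts and mix them at inference time; Branch-Train-Merge, for instance, weights experts by an estimate of how likely the input is under each expert's domain. A toy sketch of that routing step, with smoothed unigram models standing in for the expert LMs:

```python
import math
from collections import Counter

class UnigramExpert:
    """A per-domain 'expert': an add-alpha-smoothed unigram LM over its corpus."""
    def __init__(self, texts, alpha=1.0):
        self.counts = Counter(w for t in texts for w in t.lower().split())
        self.total = sum(self.counts.values())
        self.alpha = alpha

    def logprob(self, text):
        vocab = len(self.counts) + 1  # +1 for unseen words
        return sum(
            math.log((self.counts[w] + self.alpha) / (self.total + self.alpha * vocab))
            for w in text.lower().split()
        )

def ensemble_weights(query, experts):
    """Route a query across domain experts: softmax each expert's likelihood
    of the query. The experts' output distributions would then be mixed
    under these weights, as in Branch-Train-Merge-style ensembles."""
    lps = [e.logprob(query) for e in experts.values()]
    m = max(lps)  # subtract max for numerical stability
    exps = [math.exp(lp - m) for lp in lps]
    z = sum(exps)
    return {name: x / z for name, x in zip(experts, exps)}

experts = {
    "biomed": UnigramExpert(["the patient received metformin for diabetes"]),
    "finance": UnigramExpert(["bond yields rose as interest rates climbed"]),
}
w = ensemble_weights("metformin dosing for the patient", experts)
print({k: round(v, 3) for k, v in w.items()})
```

A medical query lands mostly on the biomedical expert, so the mixed prediction is dominated by the most relevant domain model.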

3. Miscellaneous

Scaling Data-Constrained Language Models. Muennighoff, Niklas et al. [abs], 2023

Leveraging Domain Knowledge for Inclusive and Bias-aware Humanitarian Response Entry Classification. Tamagnone, Nicolò et al. [abs], 2023

ChatGraph: Interpretable Text Classification by Converting ChatGPT Knowledge to Graphs. Shi, Yucheng et al. [abs], 2023

Language Model Crossover: Variation through Few-Shot Prompting. Meyerson, Elliot et al. [abs], 2023

Domain Knowledge Transferring for Pre-trained Language Model via Calibrated Activation Boundary Distillation. Choi, Dongha et al. [conf], 2022.

Reprogramming Pretrained Language Models for Protein Sequence Representation Learning. Vinod, Ria et al. [abs], 2023

Reasoning with Language Model is Planning with World Model. Hao, Shibo et al. [abs], 2023


Unified Demonstration Retriever for In-Context Learning. Li, Xiaonan et al. [abs], 2023

AutoScrum: Automating Project Planning Using Large Language Models. Schroder, Martin. 2023.

Explainable Automated Debugging via Large Language Model-driven Scientific Debugging. Kang, Sungmin et al. [abs], 2023

ModuleFormer: Learning Modular Large Language Models From Uncurated Data. Shen, Yikang et al. 2023.

Galactic ChitChat: Using Large Language Models to Converse with Astronomy Literature. Ciucă, Ioana and Yuan-sen Ting. [abs], 2023

Grammar Prompting for Domain-Specific Language Generation with Large Language Models. Wang, Bailin et al. [abs], 2023
