Ari Hershowitz edited this page Jan 25, 2023 · 12 revisions

Introduction

This project investigates the use of machine learning to summarize Congressional bills. Initial investigations show that most existing summarization models produce summaries with significant inaccuracies. Newer large language models (LLMs), such as GPT 3.5 from OpenAI, represent a leap forward and provide a working baseline for summarization. Even so, an 'out-of-the-box' approach to this task is likely to fail or produce wrong results (e.g. summaries that mischaracterize significant exceptions in legislation).

However, there are powerful opportunities to combine the knowledge of experts, such as attorneys in the Congressional Research Service (CRS), with specialized workflow processes to produce much higher-value analysis of bills than either human experts or AI systems could produce alone. A key component of such a system is a workflow interface where experts can easily interact with model output, correct it, and produce an edited final product or summary.

One such interface would a) identify similar bills, b) find previous summaries written for those bills, c) compare the bills, d) summarize the differences between the similar bills, and e) update the prior summary with those differences.

Methodology

  • Generate a dataset consisting of a) U.S. Congressional bills and b) the summaries prepared by subject matter experts at the Congressional Research Service. The dataset will be configurable to include only bills of a certain size, measured by number of sections. The initial dataset consists of bills that are 10 sections or fewer in length.

  • Prepare the dataset to be hosted on HuggingFace, a platform for machine learning datasets and models. This Congressional summary dataset is similar to the earlier dataset here.

  • Test existing summarization models with this new dataset to produce auto-generated summaries. These tests and their results are described in the other wiki pages. Some of these models can be tested directly from an interface on HuggingFace. For example:

  • Train a custom model on the dataset to improve results. We used a training framework by Salesforce to do this; the initial results are also presented in this wiki.

  • Propose future avenues to improve automated summarization as a starting point for CRS summaries.
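
The size filter described in the first step above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the `BillRecord` layout, its field names, and the `filter_by_section_count` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BillRecord:
    """Hypothetical record layout; the real dataset's field names may differ."""
    bill_id: str
    sections: List[str]  # text of each section of the bill
    summary: str         # the CRS summary paired with the bill


def filter_by_section_count(records, max_sections=10):
    """Keep only bills with at most `max_sections` sections."""
    return [r for r in records if len(r.sections) <= max_sections]


# Placeholder records for illustration only (not real bills).
sample = [
    BillRecord("example-short", ["Section 1 text", "Section 2 text"], "Short summary."),
    BillRecord("example-long", [f"Section {i} text" for i in range(1, 26)], "Long summary."),
]

kept = filter_by_section_count(sample, max_sections=10)
print([r.bill_id for r in kept])
```

Keeping the threshold as a parameter mirrors the "configurable" size limit described above, so the same code could build datasets of larger bills later.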

Results

We created the dataset and posted it on HuggingFace here. The code in this repository can process bills up to a configurable size (initially set to <10 sections). We then tested ten existing open source summarization models against the dataset and found that the results were poor: the summaries were very generic and often incorrect. In some cases, the summary describes the opposite of the bill’s effect. See Existing summarization models: poor initial results.

After speaking with @Alex-Fabbri, we used the Salesforce query-focused model to build a custom model trained on bills of fewer than 10 sections. The initial results here show some improvement, but the summaries are still inaccurate.

We also tried out-of-the-box summarization with the OpenAI Davinci model (appending tl;dr to the end of the text, which triggers summarization). The initial results were promising: the summaries, while not comprehensive, are accurate and well-written. A tuned LLM such as this one may have potential as a starting point for CRS summaries. See samples:
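
As an aside, the tl;dr prompting mentioned above can be sketched as follows. This is an illustration only and is not one of the project's samples: the `build_tldr_prompt` helper and the example text are hypothetical, and the call to the completion API is deliberately left out.

```python
def build_tldr_prompt(bill_text, cue="tl;dr"):
    """Append a 'tl;dr' cue after the bill text.

    A completion-style model given this prompt is expected to continue
    with a short summary. The helper name and the blank-line separator
    are assumptions made for this sketch.
    """
    return f"{bill_text.strip()}\n\n{cue}"


# Placeholder text for illustration only (not a real bill).
example_text = "SEC. 1. SHORT TITLE. This Act may be cited as the 'Example Act'."
prompt = build_tldr_prompt(example_text)
print(prompt)
```

The resulting string would then be sent to the model as the prompt; long bills would also need to be truncated or split to fit the model's context limit.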

Future Avenues

  • Create a pipeline that improves accuracy. One such approach is to (1) use a bill similarity algorithm to identify 'close matches'; (2) identify the differences between the full text of the current (unsummarized) document and the previously matched documents; (3) summarize those differences; and (4) feed the differences into an ML model to update the existing summary, or show a human user what needs to change in the existing summary. LLMs are notably good at identifying differences in text and describing the significance of those differences. Two examples of prompts to 'summarize the changes' are here:

  • Start with more powerful LLMs like OpenAI’s Davinci model and tune those. Training an LLM with the bill dataset, even when limited to bills of 10 sections, will require chaining multiple queries, using a tool such as LangChain. Cost considerations are discussed below. The most powerful approach will keep a 'human in the loop':

    • tune an LLM with existing summaries

    • produce automated summaries for new documents

    • provide an interface for experts to rate and correct the automatically generated summaries

    • update the model using the expert corrections

  • Combine the two approaches above, using both an expert-system pipeline and more powerful LLMs to produce more accurate results.
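
The similarity-then-diff pipeline in the first item above can be sketched with Python's standard `difflib` module. This is a rough illustration under stated assumptions: character-level `SequenceMatcher` similarity stands in for whatever bill similarity algorithm the project would actually use, and the helper names, sample texts, and prompt wording are all hypothetical.

```python
import difflib


def most_similar(new_text, prior_bills):
    """Return (bill_id, ratio) for the prior bill closest to new_text.

    `prior_bills` maps a bill id to its full text. SequenceMatcher is a
    stand-in for a real bill similarity algorithm.
    """
    best_id, best_ratio = None, 0.0
    for bill_id, text in prior_bills.items():
        ratio = difflib.SequenceMatcher(None, text, new_text).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = bill_id, ratio
    return best_id, best_ratio


def changed_lines(old_text, new_text):
    """List removed ('- ') and added ('+ ') lines between two texts."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


def build_update_prompt(prior_summary, changes):
    """Assemble a prompt asking a model to revise the prior summary.

    The wording is illustrative only.
    """
    return (
        "Existing summary:\n" + prior_summary
        + "\n\nChanged lines (- removed, + added):\n" + "\n".join(changes)
        + "\n\nUpdate the summary to reflect these changes."
    )


# Placeholder texts for illustration only (not real bills).
prior = {
    "bill-a": "SEC. 1. Short title.\nSEC. 2. The limit is 100 dollars.",
    "bill-b": "An unrelated measure concerning something else entirely.",
}
new_bill = "SEC. 1. Short title.\nSEC. 2. The limit is 250 dollars."

match_id, score = most_similar(new_bill, prior)
changes = changed_lines(prior[match_id], new_bill)
update_prompt = build_update_prompt("Sets a limit of 100 dollars.", changes)
print(match_id)
print(changes)
```

The same `changes` list could be shown directly to an expert instead of being sent to a model, which matches the option of indicating to a human user what needs to change in the existing summary.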

Note
Training with commercial LLM APIs such as the Davinci model may require a budget: at $0.0200 / 1K tokens (1 token ≈ 4 characters), training on a dataset of ~10,000 bills of 10 sections each (~50M tokens in total) could cost on the order of $1,000. Open source models could be used for training instead, in which case the compute costs would need to be taken into consideration.
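
The arithmetic behind this estimate can be checked with a short calculation. The price and bill count come from the note above; the figure of 5,000 tokens per bill is only what the note's totals imply (50M tokens / 10,000 bills), not a measured value, and the `estimate_cost` helper is hypothetical.

```python
PRICE_PER_1K_TOKENS = 0.02  # USD per 1K tokens, as quoted in the note above


def estimate_cost(num_bills, tokens_per_bill, price_per_1k=PRICE_PER_1K_TOKENS):
    """Return (total_tokens, cost_in_usd) for one pass over the dataset."""
    total_tokens = num_bills * tokens_per_bill
    return total_tokens, total_tokens / 1000 * price_per_1k


# 5,000 tokens per bill is implied by ~50M tokens across ~10,000 bills.
total, cost = estimate_cost(num_bills=10_000, tokens_per_bill=5_000)
print(f"{total:,} tokens -> ${cost:,.2f}")  # 50,000,000 tokens -> $1,000.00
```

Multiple training passes or longer bills would scale the cost proportionally.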