# HC3 Collection

This repository contains scripts for

- Downloading and assembling the document collection
- Removing unavailable documents from run files and qrels

and other collection resources, including

- Topic jsonl files (`resources/{train,eval}.topics.jsonl`)
- Qrels files containing the relevance judgments (`resources/{zho,fas}.{train,eval}.qrels`)
- TREC run files of all baseline runs reported in the paper (`run_files/*.trec`)

## Assembling Document Collections

To assemble the document collection, please download the Tweets via the Twitter API 2.0. We provide scripts for downloading the Tweets, constructing the document jsonl file, and verifying the MD5 hashes of the documents.

The scripts are lightweight Python scripts that require only `requests` and `tqdm` to run.

### Get Started

Please provide the Bearer token of your Twitter Developer Project through the environment variable `BEARER_TOKEN`.

```bash
export BEARER_TOKEN=<your bearer token>
```
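
For reference, a single batched lookup against the Twitter API v2 tweet-lookup endpoint looks roughly like the sketch below. The endpoint and authorization header are standard API v2; everything else is illustrative, since `download_tweets.py` adds batching, rate-limit handling, and error logging on top of calls like this.

```python
# A minimal sketch of one batched Tweet lookup with the Twitter API v2.
# Illustration only; not the actual logic of download_tweets.py.
import os

import requests

def fetch_tweets(tweet_ids):
    """Look up a batch of up to 100 Tweets by ID."""
    resp = requests.get(
        "https://api.twitter.com/2/tweets",
        params={"ids": ",".join(tweet_ids)},
        headers={"Authorization": f"Bearer {os.environ['BEARER_TOKEN']}"},
    )
    resp.raise_for_status()
    # "data" holds the retrieved Tweets; "errors" lists IDs that are
    # deleted, protected, or otherwise unavailable.
    return resp.json()
```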

### Download

Please extract the Tweet IDs from `resources/{zho,fas}.doc.tweet.ids.*.jsonl.gz` via

```bash
./extract.tweet.ids.sh ./resources/$lang.doc.tweet.ids.jsonl.gz > $lang.tweet.ids.txt
```

or simply

```bash
zcat ./resources/$lang.doc.tweet.ids.*.jsonl.gz | jq -cr '.tweet_ids[] | .[1]' > $lang.tweet.ids.txt
```

or any other method that puts one required Tweet ID on each line of the file.
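
If `jq` is unavailable, the same extraction can be done in a few lines of Python. This assumes, as the jq filter above implies, that each jsonl line carries a `tweet_ids` list of pairs whose second element is the Tweet ID:

```python
# Python equivalent of the jq one-liner above: print the second element of
# each pair in every line's "tweet_ids" list.
import glob
import gzip
import json
import sys

lang = sys.argv[1]  # "zho" or "fas"
for path in sorted(glob.glob(f"./resources/{lang}.doc.tweet.ids.*.jsonl.gz")):
    with gzip.open(path, "rt") as fin:
        for line in fin:
            for pair in json.loads(line)["tweet_ids"]:
                print(pair[1])
```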

The following command will download the Tweets. Passing an existing file to the `--tweet_output` option will resume the download without redownloading the Tweets that are already present.

```bash
python download_tweets.py --tweetlist $lang.tweet.ids.txt \
                          --tweet_output ./downloads/$lang.tweets.jsonl \
                          --error_output ./downloads/err.log
```
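
The resume behavior is not spelled out beyond the sentence above; conceptually it amounts to collecting the IDs already present in the output file and requesting only the rest. A sketch under that assumption (the `id` field is the standard Tweet ID field in API v2 responses, but the script's actual bookkeeping may differ):

```python
# Sketch of resume logic: skip Tweet IDs already present in an existing
# --tweet_output file. Assumes one Tweet JSON object per line with an "id"
# field, which may not match download_tweets.py's actual on-disk format.
import json
import os

def remaining_ids(tweetlist_path, tweet_output_path):
    with open(tweetlist_path) as fin:
        wanted = [line.strip() for line in fin if line.strip()]
    done = set()
    if os.path.exists(tweet_output_path):
        with open(tweet_output_path) as fin:
            done = {json.loads(line)["id"] for line in fin}
    return [tid for tid in wanted if tid not in done]
```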

### Construct the document jsonl file

The following command assembles the individual Tweets into conversations. Note that if any Tweet in the specified language is missing, the whole conversation will be dropped.

```bash
python make_collection.py --downloaded_tweets ./downloads/$lang.tweets.jsonl \
                          --reference_doc_ids ./resources/$lang.doc.tweet.ids.*.jsonl.gz \
                          --lang $lang \
                          --output_file ./downloads/$lang.docs.jsonl
```
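
The drop rule can be pictured as follows: a conversation is emitted only if every Tweet it requires was retrieved. The sketch below simplifies by dropping on any missing Tweet rather than only those in the target language, and the field names are illustrative assumptions, not `make_collection.py`'s actual schema:

```python
# Sketch of the drop-if-missing rule: emit a conversation only when all of
# its required Tweets were downloaded. Field names ("id", "tweet_ids",
# "tweets") are assumptions for illustration.
def assemble(downloaded, reference_docs):
    """downloaded: dict mapping tweet_id -> tweet JSON;
    reference_docs: parsed lines of the doc.tweet.ids jsonl file."""
    for doc in reference_docs:
        ids = [pair[1] for pair in doc["tweet_ids"]]
        if all(tid in downloaded for tid in ids):
            yield {"id": doc["id"], "tweets": [downloaded[tid] for tid in ids]}
```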

### Verify

Please use the following command to verify the collection you assembled.

```bash
python verify.py --doc_file ./downloads/$lang.docs.jsonl --id_files ./resources/$lang.doc.tweet.ids.*.jsonl.gz
```
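
Verification compares hashes of the assembled documents against reference digests distributed with the ID files. A rough sketch of such a check; the `id` and `text` field names and the reference-digest layout are assumptions for illustration, and `verify.py` defines what is actually hashed:

```python
# Rough sketch of MD5 verification: hash each assembled document and compare
# it against a reference digest. Field names are illustrative assumptions;
# see verify.py for the real schema.
import hashlib
import json

def verify(doc_file, reference_md5):
    """reference_md5: dict mapping doc id -> expected hex digest."""
    with open(doc_file) as fin:
        for line in fin:
            doc = json.loads(line)
            digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
            if digest != reference_md5.get(doc["id"]):
                print(f"MD5 mismatch for document {doc['id']}")
```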

## Excluding Unavailable Documents

To ensure comparability, please use the `filter_docs.py` script to exclude documents that are not available in your collection. The script can filter both run files and qrels. Provide a text file that contains all available document IDs (one per line), or simply pass your document jsonl file through `--ids`.

The following command is an example:

```bash
python filter_docs.py --ids your-zho-docs.jsonl --runs ./run_files/zho.comb.BM25-QHT.trec
```
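
What the filtering amounts to: keep only the lines whose document ID appears in the available set. In the TREC run format (`qid Q0 docid rank score tag`) the document ID is the third whitespace-separated column, and the same column position holds the docid in qrels files. A sketch; the `id` field assumed for the jsonl case is illustrative:

```python
# Sketch of run-file filtering: drop lines whose docid (third column of the
# TREC run format) is not in the set of available documents.
import json
import sys

ids_file, run_file = sys.argv[1], sys.argv[2]

available = set()
with open(ids_file) as fin:
    for line in fin:
        line = line.strip()
        if not line:
            continue
        # accept either a bare-ID text file or a document jsonl file
        available.add(json.loads(line)["id"] if line.startswith("{") else line)

with open(run_file) as fin:
    for line in fin:
        if line.split()[2] in available:
            sys.stdout.write(line)
```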

## Citation

Dawn Lawrie, James Mayfield, Douglas W. Oard, Eugene Yang, Suraj Nair, and Petra Galuščáková. 2023. HC3: A Suite of Test Collections for CLIR Evaluation over Informal Text. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23). Association for Computing Machinery, New York, NY, USA, 2880–2889. https://doi.org/10.1145/3539618.3591893