
Quickstart and Setup #28

Open
DelaramRajaei opened this issue Jun 23, 2023 · 64 comments
Assignees
Labels
good first issue (Good for newcomers)

Comments

@DelaramRajaei
Member

@IsaacJ60
This is an issue page to log your progress. Please let us know if you have any concerns or questions.

@IsaacJ60
Contributor

Hi Delaram,

I was just curious if there is a rough timeframe on when I should be done reading through the IR doc. Thanks.

@DelaramRajaei
Copy link
Member Author

DelaramRajaei commented Jun 28, 2023

@IsaacJ60
Hi Isaac,

Can you complete the reading of the IR document by Monday? Once that is done, we can proceed to read the papers on backtranslation.

@IsaacJ60
Contributor

Got it, thanks!

@IsaacJ60
Contributor

IsaacJ60 commented Jul 3, 2023

Finished reading the IR doc. I'm ready for the next steps.

@DelaramRajaei
Member Author

DelaramRajaei commented Jul 3, 2023

@IsaacJ60

Awesome! Do you have any questions about the fundamentals? The metrics part of the document is not fully complete yet; you can find some helpful information in this link.

If you don't have any questions, we can move forward with reading backtranslation papers. We can schedule a brief meeting for me to explain how you can write a paper summary (either online or in person) and discuss how to proceed with your next task.

I found some papers on backtranslation; their links are in this doc under the BackTranslation header.
There are also two surveys about data augmentation in that document. I recommend starting with them; they'll give you a good overview of the topic.

@IsaacJ60
Contributor

IsaacJ60 commented Jul 3, 2023

Not too many questions at the moment, but I'm sure I'll need to refer back to the document from time to time.

Are you available in the mornings anytime before 12? I am tentatively available this week on Thursday and Friday.

@DelaramRajaei
Member Author

Sure, I'm available in the morning on any day except for Friday.

@IsaacJ60
Contributor

IsaacJ60 commented Jul 3, 2023

Okay, can we try and meet at 10 AM on Thursday?

Also, quick question: what does adding noise to text look like? Is there a specific example that can give me a better picture?

Thanks

@DelaramRajaei
Member Author

Yes, that time works for me.

Noise can be added to text in various ways: errors, inconsistencies, or irrelevant information. Here are some examples of text noise sources:

  • Repetitions
  • Spelling and grammatical errors
  • Incomplete sentences
  • Irrelevant information
  • ...

Here is an example of adding noise to a sentence by removing some vowels and doubling some letters:

  • Original Sentence: "The quick brown fox jumps over the lazy dog."
  • Noisy Sentence: "The quck brwn fox jumps ovver the lazy dog."

I also came across this link that might be helpful.

Note that these sources of noise can impact the readability, clarity, and overall quality of the text, which makes noise reduction and text-cleaning techniques essential in many natural language processing applications.
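
To make this concrete, here is a minimal Python sketch (a toy illustration, not part of our codebase) that injects character-level noise like the vowel-dropping and letter-doubling above:

import random

def add_noise(sentence, drop_vowel_p=0.15, double_letter_p=0.1, seed=None):
    """Randomly drop vowels and double some letters to simulate noisy text."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        if ch.lower() in "aeiou" and rng.random() < drop_vowel_p:
            continue  # drop this vowel
        out.append(ch)
        if ch.isalpha() and rng.random() < double_letter_p:
            out.append(ch)  # double this letter
    return "".join(out)

print(add_noise("The quick brown fox jumps over the lazy dog.", seed=42))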

@IsaacJ60
Contributor

IsaacJ60 commented Jul 3, 2023

Thanks for the explanation.

Do you think we can meet virtually? I can't access the university WiFi so I think that would work out better.

@DelaramRajaei
Member Author

Yes, that is fine by me. I will set up a Google Meet then.

@IsaacJ60
Contributor

IsaacJ60 commented Jul 6, 2023

Hi Delaram, just wondering if you are joining the meeting soon?

@DelaramRajaei
Member Author

DelaramRajaei commented Jul 8, 2023

@IsaacJ60

Hello Isaac,

For your next task, start by searching for papers on backtranslation and make a list of them in the provided document. When choosing a paper, read the abstract first and look for the answers to these questions:

  • What is the issue or problem being addressed in the paper?
  • What specific approach or methodology do the authors employ in their research?
  • Why did the authors choose this particular approach?

If you want to know more about a paper, check out its introduction for further details. If the paper is good and relates to our work, add its name to the list. Keep in mind that some papers might be more technical; you can still include them in the list to expand your knowledge and come back to them later.

To find papers, you can search dblp.org or scholar.google.com.
Ensure that you select papers from reputable sources or conferences.

After that, you can dive into reading and summarizing the papers. To make your summaries comprehensive, consider including the following sections:

  • Main problem
  • Proposed Method
  • Input/Output
  • Example
  • Related works + their gaps

I suggest you begin by reading the two survey papers, "A Survey of Data Augmentation Approaches for NLP" and "A Survey of Text Data Augmentation", along with their summaries. After that, you can complete the list of papers.

Make sure to complete the paper list and read at least two papers, summarizing them by Thursday.
If you have any questions, let me know.

@IsaacJ60
Contributor

IsaacJ60 commented Jul 10, 2023

@DelaramRajaei

Hi Delaram, I was reading Iterative Back-Translation for Neural Machine Translation, and came across this particular line: "back-translated data is used to build better translation systems in forward and backward directions, which in turn is used to re-back-translate monolingual data"

From what I understand, it's saying that upon back-translating some data from language 1 -> language 2 -> language 1, you can reuse that back-translated data to do more back-translations. Could you let me know if I'm on the right track here?

Thanks.

@IsaacJ60
Contributor

Also, should I write summaries in Google Docs or in an issue? Thanks!

@DelaramRajaei
Member Author

DelaramRajaei commented Jul 11, 2023

@IsaacJ60

Hey Isaac, yeah, you got it right.
The paper you mentioned introduces a method for performing backtranslation iteratively. Their main objective is to build a translation system that performs well, and iterating the backtranslation is how they suggest achieving this.

As for your second question, please post a summary of each paper you have read as a new issue. Create each issue using the following format for its title:

[screenshot of the issue-title format]

Start with the year of the paper, then mention the name of the venue, and finally, include the title of the paper.

Have you finished compiling the list of papers? If so, could you please provide a log detailing the tasks you have completed?

@IsaacJ60
Contributor

I will start on the two summaries on Wednesday; I still have to pick one more paper. Should I write the log in this issue?

@DelaramRajaei
Member Author

Yes, please record the log of your tasks on this page and provide summaries on a new issue.

@IsaacJ60
Contributor

July 11th, 2023 - Task Log

@IsaacJ60
Contributor

@DelaramRajaei

Hi Delaram,

I was reading up on beam search and sampling, and from what I understood, the difference is that sampling introduces more variance, while beam search produces more repetitive but fluent text. Do you know what goes on in each method and why this is the case? Thanks!

@DelaramRajaei
Member Author

@IsaacJ60

Hello Isaac,

I looked into it, and here's what I found:

Here's how beam search works:

  • Given an input sequence, the model predicts the probabilities of the next possible tokens.
  • Instead of selecting the token with the highest probability at each step, beam search keeps track of the top K tokens with the highest probabilities, where K is a predetermined beam width.
  • The model continues generating tokens for each partial sequence, considering all possible extensions for the tokens in the beam.
  • At each step, each candidate sequence's score is the product of its token probabilities, and the top K sequences with the highest scores are retained.
  • This process continues until a complete sequence is generated or a predefined maximum length is reached.
  • Finally, the sequence with the highest overall probability is selected as the output.

Beam search helps overcome the issue of locally optimal decisions by exploring multiple possibilities simultaneously. It allows for more diverse and coherent output sequences compared to greedy decoding, which selects the most likely token at each step.

Here's how sampling works:

  • Similar to beam search, the model predicts the probabilities of the next possible tokens given an input sequence.
  • Instead of selecting the token with the highest probability, sampling chooses tokens probabilistically based on their predicted probabilities. The higher the probability, the more likely the token is to be selected.
  • Sampling introduces a temperature parameter that controls the level of randomness. Higher values of temperature result in more diverse and random outputs, while lower values make the sampling process more focused and deterministic.
  • The sampling process continues until a complete sequence is generated or a maximum length is reached.

Comparison:

Sampling brings more variety and allows for greater exploration and creativity in the generated sequences. However, it can also lead to less coherent or less reliable outputs compared to beam search, which takes the overall probability of the entire sequence into account. In simpler terms, sampling makes a local choice about the next token, while beam search looks at the bigger picture of the whole sequence for better results.
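
To make the comparison concrete, here is a minimal Python sketch (a toy example with a made-up next-token distribution, not tied to any real model) of beam search and temperature sampling:

import numpy as np

def beam_search(step_fn, beam_width=2, max_len=3):
    """Keep the top-K partial sequences, scored by the product of token probabilities."""
    beams = [((), 1.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq).items():
                candidates.append((seq + (tok,), score * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]  # the highest-probability sequence and its score

def sample_next(probs, temperature=1.0, rng=None):
    """Draw one token index; higher temperature flattens the distribution (more random)."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(probs) ** (1.0 / temperature)
    scaled = scaled / scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# toy next-token distribution that ignores the context
def step_fn(seq):
    return {"the": 0.5, "cat": 0.3, "sat": 0.2}

print(beam_search(step_fn))  # deterministic: (('the', 'the', 'the'), 0.125)
print([sample_next([0.5, 0.3, 0.2], temperature=1.5) for _ in range(5)])  # varies per run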

@IsaacJ60
Contributor

That makes sense, thanks!

@IsaacJ60
Contributor

July 12th, 2023 - Task Log

Completed summaries for 2 papers and posted as issues.

@DelaramRajaei
Member Author

DelaramRajaei commented Jul 13, 2023

@IsaacJ60

Thank you for the logs and the helpful summaries. I'll make sure to check them out.

Now, for the next task, we need to find datasets related to medicine and law. Currently, we're working with these datasets:

  • clueweb09b: ClueWeb09 (Category B) web document collection.
  • antique: ANTIQUE is a non-factoid question answering dataset based on the questions and answers of Yahoo! Answers.
  • robust04: News articles
  • gov2: Web document collection
  • dbpedia: Extracted from Wikipedia dumps

Unfortunately, the results for backtranslation weren't great, especially with clueweb09b. However, I've come across some papers and websites mentioning that backtranslation is often used with medical, law, and financial datasets.

To put this to the test, we need datasets from the medical and law fields. Once you've found some, provide a summary of each dataset here so we can add them to our project.

To discover information retrieval (IR) datasets, you may find the following resources useful: ir-datasets.com and huggingface.co/datasets.

Please let me know if you have any questions or want to schedule a meeting.
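
As a side note, ir-datasets also ships a Python package; here is a minimal sketch of loading a collection with it (the dataset ID below is an assumed example, check ir-datasets.com for the exact IDs):

import ir_datasets

dataset = ir_datasets.load("medline/2004")  # assumed example ID
for doc in dataset.docs_iter()[:3]:  # docs_iter supports slicing
    print(doc.doc_id)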

@IsaacJ60
Contributor

Medical Datasets:

Medline

https://ir-datasets.com/medline.html#medline
Biomedical articles & abstracts.


Clinical Trials

https://ir-datasets.com/clinicaltrials.html
Data from clinical trials


CORD-19 (may contain medical documents)

https://ir-datasets.com/cord19.html
Covid-19 related scientific articles


Highwire (TREC Genomics 2006-07)

https://ir-datasets.com/highwire.html
Biomedical journal articles from Highwire Press


NFCorpus (NutritionFacts)

https://ir-datasets.com/nfcorpus.html
Non-technical documents from nutritionfacts.org and technical documents mostly from PubMed


PubMed Central (TREC CDS)

https://ir-datasets.com/pmc.html
Bio-medical articles from PubMed Central.


Law Datasets:

Pile of Law

https://huggingface.co/datasets/pile-of-law/pile-of-law
Legal and administrative data

@IsaacJ60
Contributor

IsaacJ60 commented Jul 13, 2023

I'll continue to add to this list, but before I do so, I have a few questions:

  • Do the datasets have to be in English?
  • Do they need to be documents or question/answer pairs, as in the datasets we are working with right now, or can they be in other formats?

@DelaramRajaei
Member Author

Thank you for the list.

  • Yes, they should be in English.
  • They can be in any format.

Please include a brief summary covering each dataset's size, format, and variables/features, as well as any relevant statistics or distributions.

@DelaramRajaei
Member Author

@IsaacJ60
Hello Isaac,

Could you please create an analysis function within the compare.py file?

These are the two files we are currently working on:
topics.robust04.bm25.map.all.csv
Analyzing_Results.txt

Please convert the .csv file to .txt format, just like the example file provided above. Make sure to compute, for every language, the difference between the backtranslated query's map and the original query's map (backtranslated query - original query).

Here's an example line, field by field:

qid: 224109
abstractqueryexpansion: what is citriscidal?
abstractqueryexpansion.bm25.map: 0
backtranslation_pes_arab: What is citricidal
semsim: 0.906306744
backtranslation_pes_arab.bm25.map: 0.25
subtraction: 0.25
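
A minimal sketch of what this analysis function could look like (column names taken from the header above; the exact layout of the .csv is an assumption):

import pandas as pd

def analyze(csv_path="topics.robust04.bm25.map.all.csv",
            txt_path="Analyzing_Results.txt",
            original_col="abstractqueryexpansion.bm25.map"):
    """Write a tab-separated .txt with (backtranslated map - original map) per language."""
    df = pd.read_csv(csv_path)
    # every other *.bm25.map column holds one backtranslation language's scores
    for col in list(df.columns):
        if col.endswith(".bm25.map") and col != original_col:
            df[col + ".subtraction"] = df[col] - df[original_col]
    df.to_csv(txt_path, sep="\t", index=False)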

@IsaacJ60
Contributor

Sure, I'll aim to finish it by Tuesday, but hopefully sooner.

@IsaacJ60
Contributor

Hi Delaram,

I wasn't able to make time for a meeting today, but I'll try to finish up the analysis function tonight.

@IsaacJ60
Contributor

Hi Delaram, I finished the analysis function. Where should I upload it for you to review?

@DelaramRajaei
Member Author

Hello Isaac,

Would you be able to fork the ReQue project from my GitHub page to your repository and then submit a pull request? That way, your commits and contributions to the project will be visible and trackable.

@IsaacJ60
Contributor

IsaacJ60 commented Aug 2, 2023

Hi Delaram,

I'm running into some issues with the conda environment when running:

conda env create -f environment.yml

The full output from the console is below. Most importantly, it says that many specifications were found to be incompatible with each other. (Text omitted in the middle.)

Thanks!

Collecting package metadata (repodata.json): done
Solving environment: failed
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
Solving environment: failed
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package openblas conflicts for:
pandas[version='>=0.23.3'] -> numpy[version='>=1.16,<2.0a0'] -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']
seaborn -> numpy[version='>=1.15'] -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']
numpy[version='>=1.18.1'] -> libblas[version='>=3.8.0,<4.0a0'] -> openblas[version='0.3.5.*|0.3.6|>=0.3.6,<0.3.7.0a0',build=h828a276_2]
nltk[version='>=3.4'] -> numpy -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']
scikit-learn[version='>=0.22'] -> numpy[version='>=1.14.6,<2.0a0'] -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']
gensim=4 -> numpy[version='>=1.11.3'] -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']
transformers==4.0.0 -> numpy -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']

Package ucrt conflicts for:
seaborn -> matplotlib-base[version='>=3.1,!=3.6.1'] -> ucrt[version='>=10.0.20348.0']
nltk[version='>=3.4'] -> regex[version='>=2021.8.3'] -> ucrt[version='>=10.0.20348.0']
python=3.8 -> openssl[version='>=3.0.9,<4.0a0'] -> ucrt[version='>=10.0.20348.0']
transformers==4.0.0 -> numpy -> ucrt[version='>=10.0.20348.0']
tqdm==4.45.0 -> python -> ucrt[version='>=10.0.20348.0']
networkx[version='>=2.2'] -> python[version='>=3.11,<3.12.0a0'] -> ucrt[version='>=10.0.20348.0']
spacy==2.2.4 -> cymem[version='>=2.0.2,<2.1.0'] -> ucrt[version='>=10.0.20348.0']
gensim=4 -> ucrt[version='>=10.0.20348.0']
urllib3=1.25.8 -> cryptography[version='>=1.3.4'] -> ucrt[version='>=10.0.20348.0']
pandas[version='>=0.23.3'] -> ucrt[version='>=10.0.20348.0']
numpy[version='>=1.18.1'] -> ucrt[version='>=10.0.20348.0']
scikit-learn[version='>=0.22'] -> ucrt[version='>=10.0.20348.0']

Package certifi conflicts for:
nltk[version='>=3.4'] -> requests -> certifi[version='>=2017.4.17']
seaborn -> matplotlib-base[version='>=3.1,!=3.6.1'] -> certifi[version='>=2020.06.20']
requests=2.22.0 -> urllib3[version='>=1.21.1,<1.26,!=1.25.0,!=1.25.1'] -> certifi

...

Package fonttools conflicts for:
seaborn -> matplotlib-base[version='>=3.1,!=3.6.1'] -> fonttools[version='>=4.22.0']
networkx[version='>=2.2'] -> matplotlib-base[version='>=3.4'] -> fonttools[version='>=4.22.0']

networkx[version='>=2.2'] -> matplotlib[version='>=3.3'] -> matplotlib-base[version='>=3.3.0,<3.3.1.0a0|>=3.3.1,<3.3.2.0a0|>=3.3.2,<3.3.3.0a0|>=3.3.4,<3.3.5.0a0|>=3.4.2,<3.4.3.0a0|>=3.4.3,<3.4.4.0a0|>=3.5.0,<3.5.1.0a0|>=3.5.1,<3.5.2.0a0|>=3.5.2,<3.5.3.0a0|>=3.5.3,<3.5.4.0a0|>=3.6.2,<3.6.3.0a0|>=3.7.0,<3.7.1.0a0|>=3.7.1,<3.7.2.0a0|>=3.7.2,<3.7.3.0a0|>=3.6.3,<3.6.4.0a0|>=3.6.1,<3.6.2.0a0|>=3.6.0,<3.6.1.0a0|>=3.4.1,<3.4.2.0a0|>=3.3.3,<3.3.4.0a0']

The following specifications were found to be incompatible with your system:

  - feature:/win-64::__win==0=0
  - feature:|@/win-64::__win==0=0
  - nltk[version='>=3.4'] -> click -> __unix
  - nltk[version='>=3.4'] -> click -> __win
  - urllib3=1.25.8 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __unix
  - urllib3=1.25.8 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __win

Your installed version is: 0

@DelaramRajaei
Member Author

Hello Isaac,

Have you tried installing with the requirements.txt file? Did you have the same issue with that?
Also, this is my issue for installing the ReQue project. I logged all the problems I had with the installation; it may help you.

@DelaramRajaei
Member Author

@IsaacJ60
Hi Isaac,

I updated the versions of the libraries and also added some libraries to the environment.yml and requirements.txt files. Could you please try these files? Let me know if you encounter any errors.

requirements.txt
environment.zip
(GitHub attachments don't support .yml files, so I sent you the zip.)

@IsaacJ60
Contributor

IsaacJ60 commented Aug 3, 2023

Hi Delaram,

I'll try those when I get home. So far, installing via requirements.txt returned no errors (using the environment.yml file never works; it always finds conflicts and fails to resolve them), but the anserini installation leads me to believe that there is something wrong with the libraries. I will try to post more specific details if the new files still don't work.

@IsaacJ60
Contributor

IsaacJ60 commented Aug 5, 2023

Hi Delaram,

So installing with the new environment.yml file works with no errors, and I was able to install anserini after some tinkering.

During the pyserini installation process, conda install faiss-cpu -c pytorch doesn't work; it fails to solve the environment. I'm following these steps. It still thinks I'm using Python 3.11, even though I deleted it and removed the environment path variables. I'll try to find any leftover traces later today.

@IsaacJ60
Contributor

IsaacJ60 commented Aug 6, 2023

Hi Delaram,

conda install faiss-cpu -c pytorch runs without returning any errors now, so I believe the installation is done. Where can I find the robust04 dataset? The link in the README file shows this:

[screenshot of the page the README link leads to]

Could you help clarify if there are any additional steps for downloading the datasets?

Thanks.

@DelaramRajaei
Member Author

Hello Isaac,

Are you available to visit the lab this week? I already have robust04 indexed, so you do not need to index it again. I can share the indexed files with you.

@IsaacJ60
Contributor

IsaacJ60 commented Aug 7, 2023

Yeah, sure, I can drop by Wednesday morning at around 10:30-11:00 AM if that works for you.

@DelaramRajaei
Member Author

DelaramRajaei commented Aug 8, 2023

Yes, that's perfect.
Also, I can help you with the internet access.

@IsaacJ60
Contributor

IsaacJ60 commented Aug 9, 2023

Hi Delaram,

I arrived at the building but the door to 215 is closed. Where can I find you?

@DelaramRajaei
Member Author

Hello Isaac,

Apologies, I was waiting downstairs for you at first.

For your next task, please write all the translated queries in a text file. The format could be something like this:

query_id + '\t' + original_query + '\t' + translated_query + '\t' + backtranslated_query

After that, please add the new datasets and make sure to index them first. You can find the necessary commands in ReQue's readme.
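
Something like this minimal sketch would do for the file (the values below are placeholders based on the example earlier in this thread; the real ones come from the expander at runtime):

query_id = "224109"
original_query = "what is citriscidal?"
translated_query = "..."  # intermediate translation, elided here
backtranslated_query = "What is citricidal"

with open("translated_queries.txt", "a", encoding="utf-8") as f:
    f.write("\t".join([query_id, original_query, translated_query, backtranslated_query]) + "\n")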

@IsaacJ60
Contributor

Hi Delaram,

Running python -u main.py --corpus robust04 --output ./output/robust04/ --ranker bm25 --metric map 2>&1 | tee robust04.bm25.log & returns this error:
[screenshot of the error]

Running with the tee command returns this error (after downloading some json files):

Some weights of the model checkpoint at C:\Users\isaac/.cache\torch\sentence_transformers\johngiorgi_declutr-small were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Starting threads per expanders for ['generate', 'search', 'evaluate'] ...
INFO: MAIN: GENERATE: There has been error in <expanders.abstractqexpander.AbstractQExpander object at 0x00000279E2B31940>!
Traceback (most recent call last):
  File "main.py", line 51, in generate
    expander.write_expanded_queries(Qfilename, Q_filename)
  File "C:\Users\isaac\Documents\Isaac\CS\ReQue\ReQue\qe\expanders\abstractqexpander.py", line 42, in write_expanded_queries      
  File "main.py", line 51, in generate
    expander.write_expanded_queries(Qfilename, Q_filename)
  File "C:\Users\isaac\Documents\Isaac\CS\ReQue\ReQue\qe\expanders\abstractqexpander.py", line 42, in write_expanded_queries      
    with open(Q_filename, 'w', encoding='UTF-8') as Q_file:
FileNotFoundError: [Errno 2] No such file or directory: './output/robust04/robust04/topics.robust04.backtranslation_fra_latn.txt' 


INFO: MAIN: THREAD: abstractqueryexpansion: There has been error in <expanders.abstractqexpander.AbstractQExpander object at 0x00000279E2B31940>!
Traceback (most recent call last):
  File "main.py", line 199, in worker_thread
    if 'generate' in op: generate(Qfilename=param.corpora[corpus]['topics'], expander=expander, output=output_)
  File "main.py", line 51, in generate
    expander.write_expanded_queries(Qfilename, Q_filename)
  File "C:\Users\isaac\Documents\Isaac\CS\ReQue\ReQue\qe\expanders\abstractqexpander.py", line 42, in write_expanded_queries      
    with open(Q_filename, 'w', encoding='UTF-8') as Q_file:
FileNotFoundError: [Errno 2] No such file or directory: './output/robust04/robust04/topics.robust04.abstractqueryexpansion.txt'   
INFO: MAIN: THREAD: backtranslation_fra_latn: There has been error in <expanders.backtranslation.BackTranslation object at 0x00000279E2B31820>!
Traceback (most recent call last):
  File "main.py", line 199, in worker_thread
    if 'generate' in op: generate(Qfilename=param.corpora[corpus]['topics'], expander=expander, output=output_)
  File "main.py", line 51, in generate
    expander.write_expanded_queries(Qfilename, Q_filename)
  File "C:\Users\isaac\Documents\Isaac\CS\ReQue\ReQue\qe\expanders\abstractqexpander.py", line 42, in write_expanded_queries      
    with open(Q_filename, 'w', encoding='UTF-8') as Q_file:
FileNotFoundError: [Errno 2] No such file or directory: './output/robust04/robust04/topics.robust04.backtranslation_fra_latn.txt' 


Traceback (most recent call last):
  File "main.py", line 291, in <module>
    run(corpus=args.corpus.lower(),
  File "main.py", line 247, in run
    result = initialize(corpus, rankers, metrics, output_, rf, op, topicreader)
  File "main.py", line 225, in initialize
    result = aggregate(expanders=expanders, rankers=rankers, metrics=metrics, output=output)
  File "main.py", line 153, in aggregate
    df.to_csv(filename, index=False)
  File "C:\Users\isaac\anaconda3\envs\Reque\lib\site-packages\pandas\core\generic.py", line 3772, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "C:\Users\isaac\anaconda3\envs\Reque\lib\site-packages\pandas\io\formats\format.py", line 1186, in to_csv
    csv_formatter.save()
  File "C:\Users\isaac\anaconda3\envs\Reque\lib\site-packages\pandas\io\formats\csvs.py", line 240, in save
    with get_handle(
  File "C:\Users\isaac\anaconda3\envs\Reque\lib\site-packages\pandas\io\common.py", line 737, in get_handle
    check_parent_directory(str(handle))
  File "C:\Users\isaac\anaconda3\envs\Reque\lib\site-packages\pandas\io\common.py", line 600, in check_parent_directory
    raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
OSError: Cannot save file into a non-existent directory: 'output\robust04\robust04'

Running this in PowerShell, where tee is available, returns this error:

python : Traceback (most recent call last):
At line:1 char:1
+ python -u main.py --corpus robust04 --output ./output/robust04/ --ran ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "main.py", line 5, in <module>
    from pyserini.search.lucene import querybuilder
ModuleNotFoundError: No module named 'pyserini.search.lucene'

Could this be due to the conda environment? I was having issues activating it. I can share more details on that if you need.

@DelaramRajaei
Member Author

Hello Isaac,

I faced the same error I mentioned in this issue.

  File "main.py", line 5, in <module>
    from pyserini.search.lucene import querybuilder
ModuleNotFoundError: No module named 'pyserini.search.lucene'

The pip installation runs into castorini/pyserini#1025, so it's better to use the development installation.

If you encounter any errors while running the program, uninstall Pyserini with pip uninstall pyserini and then reinstall it with pip install pyserini; I recall that uninstalling and reinstalling resolved the issue for me.

@IsaacJ60
Contributor

Hi Delaram,

I believe it fixed that error, but I am getting some other ModuleNotFound errors. It leads me to believe that the environment isn't being activated properly. When I run conda activate ReQue, it returns the following error:
[screenshot of the conda activate error]

Unless you have any other ideas, I will try uninstalling and reinstalling anaconda.

@DelaramRajaei
Member Author

I'm not sure why this problem is happening.

In PyCharm's settings, you can also switch the Python interpreter to the conda environment. If you're using Visual Studio Code, there should be a similar option; look for it in the settings before reinstalling Anaconda.

@IsaacJ60
Contributor

I will try reinstalling conda shortly.

If I cannot get the project to run, could we set up an online meeting to debug?

@IsaacJ60
Contributor

Hi Delaram, reinstalling conda worked. I will provide a more detailed log of all the steps I took to run the project soon.

@IsaacJ60
Contributor

I also wanted to inform you that I'll be attending Hack the 6ix in Toronto over the weekend, so I may have to wait until I'm back to resume with the full log and tasks.

@DelaramRajaei
Member Author

Great. Apologies for the delay in getting back to you.
Please keep recording your findings in the log.
Additionally, once you're back from Toronto, focus on completing this task before proceeding to add the new datasets.

@IsaacJ60
Contributor

> Hello Isaac,
>
> Apologies, I was waiting downstairs for you at first.
>
> For your next task, please write all the translated queries in a text file. The format could be something like this:
>
> query_id + '\t' + original_query + '\t' + translated_query + '\t' + backtranslated_query
>
> After that, please add the new datasets and make sure to index them first. You can find the necessary commands in ReQue's readme.

Hi Delaram,

After running ReQue, this is the file that was output.
topics.robust04.bm25.map.all.csv
Where would I get the translated query?

I'm also getting the following error:

"../anserini/eval/trec_eval.9.0.4/trec_eval" -q -m map ../ds/robust04/qrels.robust04.txt ./output/robust04/robust04/topics.robust04.backtranslation_fra_latn.bm25.txt > ./output/robust04/robust04/topics.robust04.backtranslation_fra_latn.bm25.map.txt
The system cannot find the path specified.

It looks like nothing is being written to that map file. I'm not sure whether this affects anything?

@DelaramRajaei
Member Author

Hello Isaac,

The file you've shared shows that the backtranslation expander didn't function correctly, resulting in empty columns. The issue seems to be related to trec_eval. To resolve it, use Cygwin to run make on trec_eval, which will produce the .exe file.

After that, please go to the backtranslation.py file. In the get_expanded_query function, make sure to save the translated_query into a .txt file in CSV format.

Please make sure to do this as soon as you can, and then you can move on to adding new datasets.

@IsaacJ60
Contributor

IsaacJ60 commented Sep 5, 2023

Hi Delaram,

Here is the new file it generated. Could you please verify that it is correct? I don't seem to see any error messages upon running the project.

topics.robust04.bm25.map.all.csv

@DelaramRajaei
Member Author

Yes, everything looks good. Please make sure to save the translated queries as I mentioned earlier, and then you can go ahead with adding the new datasets.

@IsaacJ60
Contributor

IsaacJ60 commented Sep 9, 2023

Hi Delaram, sorry to bother you again. I'm having trouble understanding how to retrieve the qid. Can I retrieve it from some dataframe, or should I pass it into get_expanded_query?

@DelaramRajaei
Member Author

Hello Isaac, the qid is already passed in the args; I believe you can find it in args[0]. You can confirm it is correct by debugging or printing.
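
A hypothetical sketch of how to check it (the signature is assumed; adapt it to the actual one in the expander):

def get_expanded_query(self, q, args=None):
    qid = args[0] if args else None  # the qid should arrive in args[0]
    print(f"qid={qid}, query={q}")  # temporary debug print; remove once verified
    # ... the expander's real logic continues here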

@IsaacJ60
Contributor

IsaacJ60 commented Sep 27, 2023

Hi Delaram,

Sorry for the delay. I submitted a PR to your repo two weeks ago, and I will begin working on adding the datasets. How do you suggest I go about doing that? Thanks!
