
Quickstart and Setup #28

Open
DelaramRajaei opened this issue Jun 23, 2023 · 64 comments
Assignees
Labels
good first issue (Good for newcomers)

Comments

@DelaramRajaei
Member

@IsaacJ60
This is an issue page to log your progress. Please let us know if you have any concerns or questions.

@IsaacJ60
Contributor

Hi Delaram,

I was just curious if there is a rough timeframe on when I should be done reading through the IR doc. Thanks.

@DelaramRajaei
Copy link
Member Author

DelaramRajaei commented Jun 28, 2023

@IsaacJ60
Hi Isaac,

Can you complete the reading of the IR document by Monday? Once that is done, we can proceed to read the papers on backtranslation.

@IsaacJ60
Contributor

Got it, thanks!

@IsaacJ60
Contributor

IsaacJ60 commented Jul 3, 2023

Finished reading the IR doc. I'm ready for the next steps.

@DelaramRajaei
Member Author

DelaramRajaei commented Jul 3, 2023

@IsaacJ60

Awesome! Do you have any questions about the fundamentals? The metrics part of the document is not fully complete yet; you can find some helpful information in this link.

If you don't have any questions, we can move forward with reading backtranslation papers. We can schedule a brief meeting for me to explain how you can write a paper summary (either online or in person) and discuss how to proceed with your next task.

I found some papers on backtranslation; their links are in this doc under the BackTranslation header.
There are also two surveys about data augmentation in that document. I recommend starting with them; they'll give you a good overview of the topic.

@IsaacJ60
Contributor

IsaacJ60 commented Jul 3, 2023

Not too many questions at the moment, but I'm sure I'll need to refer back to the document from time to time.

Are you available in the mornings anytime before 12? I am tentatively available this week on Thursday and Friday.

@DelaramRajaei
Member Author

Sure, I'm available in the morning on any day except for Friday.

@IsaacJ60
Contributor

IsaacJ60 commented Jul 3, 2023

Okay, can we try and meet at 10 AM on Thursday?

Also, quick question: what does adding noise to text look like? Is there a specific example that can give me a better picture?

Thanks

@DelaramRajaei
Member Author

Yes, that time works for me.

Noise can be added to text in various ways: errors, inconsistencies, or irrelevant information. Here are some examples of text noise sources:

  • Repetitions
  • Spelling and grammatical errors
  • Incomplete sentences
  • Irrelevant information
  • ...

Here is an example of adding noise to a sentence by removing some vowels and doubling some letters:

  • Original Sentence: "The quick brown fox jumps over the lazy dog."
  • Noisy Sentence: "The quck brwn fox jumps ovver the lazy dog."

I also came across this link that might be helpful.

Note that these sources of noise can impact the readability, clarity, and overall quality of the text, which makes noise reduction and text-cleaning techniques essential in many natural language processing applications.
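
To make this concrete, here is a minimal Python sketch (a toy illustration, not part of our codebase) that injects character-level noise like the vowel-dropping and letter-doubling above:

import random

def add_noise(sentence, drop_vowel_p=0.15, double_letter_p=0.1, seed=None):
    """Randomly drop vowels and double some letters to simulate noisy text."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        if ch.lower() in "aeiou" and rng.random() < drop_vowel_p:
            continue  # drop this vowel
        out.append(ch)
        if ch.isalpha() and rng.random() < double_letter_p:
            out.append(ch)  # double this letter
    return "".join(out)

print(add_noise("The quick brown fox jumps over the lazy dog.", seed=42))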

@IsaacJ60
Contributor

IsaacJ60 commented Jul 3, 2023

Thanks for the explanation.

Do you think we can meet virtually? I can't access the university WiFi so I think that would work out better.

@DelaramRajaei
Member Author

Yes, that is fine by me. I will set up a Google Meet then.

@IsaacJ60
Contributor

IsaacJ60 commented Jul 6, 2023

Hi Delaram, just wondering if you are joining the meeting soon?

@DelaramRajaei
Member Author

DelaramRajaei commented Jul 8, 2023

@IsaacJ60

Hello Isaac,

For your next task, start by searching for papers on backtranslation and make a list of them in the provided document. When choosing a paper, read the abstract first and look for the answers to these questions:

  • What is the issue or problem being addressed in the paper?
  • What specific approach or methodology do the authors employ in their research?
  • Why did the authors choose this particular approach?

If you want to know more about a paper, check out its introduction for further details. If the paper is good and relates to our work, add its name to the list. Keep in mind that some papers might be more technical; you can still include them in the list to expand your knowledge and come back to them later.

To find papers, you can search dblp.org or scholar.google.com.
Ensure that you select papers from reputable sources or conferences.

After that, you can dive into reading and summarizing the papers. To make your summaries comprehensive, consider including the following sections:

  • Main problem
  • Proposed Method
  • Input/Output
  • Example
  • Related works + their gaps

I suggest you begin by reading the two survey papers, "A Survey of Data Augmentation Approaches for NLP" and "A Survey of Text Data Augmentation", along with their summaries. After that, you can complete the list of papers.

Make sure to complete the paper list and read at least two papers, summarizing them by Thursday.
If you have any questions, let me know.

@IsaacJ60
Contributor

IsaacJ60 commented Jul 10, 2023

@DelaramRajaei

Hi Delaram, I was reading Iterative Back-Translation for Neural Machine Translation, and came across this particular line: "back-translated data is used to build better translation systems in forward and backward directions, which in turn is used to re-back-translate monolingual data"

From what I understand, it's saying that upon back-translating some data from language 1 -> language 2 -> language 1, you can reuse that back-translated data to do more back-translations. Could you let me know if I'm on the right track here?

Thanks.

@IsaacJ60
Contributor

Also, should I write summaries in Google Docs or in an issue? Thanks!

@DelaramRajaei
Member Author

DelaramRajaei commented Jul 11, 2023

@IsaacJ60

Hey Isaac, yeah, you got it right.
The paper you mentioned introduces a method for performing backtranslation iteratively. Their main objective is to build a translation system that performs well, and iterating the backtranslation is how they suggest achieving this.

As for your second question, please post a summary of each paper you have read as a new issue. Create each issue using the following format for its title:

[screenshot of the issue-title format]

Start with the year of the paper, then mention the name of the venue, and finally, include the title of the paper.

Have you finished compiling the list of papers? If so, could you please provide a log detailing the tasks you have completed?

@IsaacJ60
Contributor

I will start on the two summaries on Wednesday; I still have to pick one more paper. Should I write the log in this issue?

@DelaramRajaei
Member Author

Yes, please record the log of your tasks on this page and provide summaries on a new issue.

@IsaacJ60
Contributor

July 11th, 2023 - Task Log

@IsaacJ60
Contributor

@DelaramRajaei

Hi Delaram,

I was reading up on beam search and sampling, and from what I understood, the difference is that sampling introduces more variance, while beam search produces more repetitive but fluent text. Do you know what goes on in each method and why this is the case? Thanks!

@DelaramRajaei
Member Author

@IsaacJ60

Hello Isaac,

I looked into it, and here's what I found:

Here's how beam search works:

  • Given an input sequence, the model predicts the probabilities of the next possible tokens.
  • Instead of selecting the token with the highest probability at each step, beam search keeps track of the top K tokens with the highest probabilities, where K is a predetermined beam width.
  • The model continues generating tokens for each partial sequence, considering all possible extensions for the tokens in the beam.
  • At each step, each candidate sequence's score is the product of its token probabilities, and the top K sequences with the highest scores are retained.
  • This process continues until a complete sequence is generated or a predefined maximum length is reached.
  • Finally, the sequence with the highest overall probability is selected as the output.

Beam search helps overcome the issue of locally optimal decisions by exploring multiple possibilities simultaneously. It allows for more diverse and coherent output sequences compared to greedy decoding, which selects the most likely token at each step.

Here's how sampling works:

  • Similar to beam search, the model predicts the probabilities of the next possible tokens given an input sequence.
  • Instead of selecting the token with the highest probability, sampling chooses tokens probabilistically based on their predicted probabilities. The higher the probability, the more likely the token is to be selected.
  • Sampling introduces a temperature parameter that controls the level of randomness. Higher values of temperature result in more diverse and random outputs, while lower values make the sampling process more focused and deterministic.
  • The sampling process continues until a complete sequence is generated or a maximum length is reached.

Comparison:

Sampling brings more variety and allows for greater exploration and creativity in the generated sequences. However, it can also lead to less coherent or less reliable outputs compared to beam search, which takes the overall probability of the entire sequence into account. In simpler terms, sampling makes a local choice about the next token, while beam search looks at the bigger picture of the whole sequence for better results.
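
To make the comparison concrete, here is a minimal Python sketch (a toy example with a made-up next-token distribution, not tied to any real model) of beam search and temperature sampling:

import numpy as np

def beam_search(step_fn, beam_width=2, max_len=3):
    """Keep the top-K partial sequences, scored by the product of token probabilities."""
    beams = [((), 1.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq).items():
                candidates.append((seq + (tok,), score * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]  # the highest-probability sequence and its score

def sample_next(probs, temperature=1.0, rng=None):
    """Draw one token index; higher temperature flattens the distribution (more random)."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(probs) ** (1.0 / temperature)
    scaled = scaled / scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# toy next-token distribution that ignores the context
def step_fn(seq):
    return {"the": 0.5, "cat": 0.3, "sat": 0.2}

print(beam_search(step_fn))  # deterministic: (('the', 'the', 'the'), 0.125)
print([sample_next([0.5, 0.3, 0.2], temperature=1.5) for _ in range(5)])  # varies per run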

@IsaacJ60
Contributor

That makes sense, thanks!

@IsaacJ60
Contributor

July 12th, 2023 - Task Log

Completed summaries for 2 papers and posted as issues.

@DelaramRajaei
Member Author

DelaramRajaei commented Jul 13, 2023

@IsaacJ60

Thank you for the logs and the helpful summaries. I'll make sure to check them out.

Now, for the next task, we need to find datasets related to medicine and law. Currently, we're working with these datasets:

  • clueweb09b: ClueWeb09 (Category B) web document collection.
  • antique: ANTIQUE is a non-factoid question answering dataset based on the questions and answers of Yahoo! Answers.
  • robust04: News articles
  • gov2: Web document collection
  • dbpedia: Extracted from Wikipedia dumps

Unfortunately, the results for backtranslation weren't great, especially with clueweb09b. However, I've come across some papers and websites mentioning that backtranslation is often used with medical, law, and financial datasets.

To put this to the test, we need datasets from the medical and law fields. Once you've found some, provide a summary of each dataset here so we can add them to our project.

To discover information retrieval (IR) datasets, you may find the following resources useful: ir-datasets.com and huggingface.co/datasets.

Please let me know if you have any questions or want to schedule a meeting.
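
As a side note, ir-datasets also ships a Python package; here is a minimal sketch of loading a collection with it (the dataset ID below is an assumed example, check ir-datasets.com for the exact IDs):

import ir_datasets

dataset = ir_datasets.load("medline/2004")  # assumed example ID
for doc in dataset.docs_iter()[:3]:  # docs_iter supports slicing
    print(doc.doc_id)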

@IsaacJ60
Contributor

Medical Datasets:

Medline

https://ir-datasets.com/medline.html#medline
Biomedical articles & abstracts.


Clinical Trials

https://ir-datasets.com/clinicaltrials.html
Data from clinical trials


CORD-19 (may contain medical documents)

https://ir-datasets.com/cord19.html
Covid-19 related scientific articles


Highwire (TREC Genomics 2006-07)

https://ir-datasets.com/highwire.html
Biomedical journal articles from Highwire Press


NFCorpus (NutritionFacts)

https://ir-datasets.com/nfcorpus.html
Non-technical documents from nutritionfacts.org and technical documents mostly from PubMed


PubMed Central (TREC CDS)

https://ir-datasets.com/pmc.html
Bio-medical articles from PubMed Central.


Law Datasets:

Pile of Law

https://huggingface.co/datasets/pile-of-law/pile-of-law
Legal and administrative data

@IsaacJ60
Contributor

IsaacJ60 commented Jul 13, 2023

I'll continue to add to this list, but before I do so, I have a few questions:

  • Do the datasets have to be in English?
  • Do they need to be documents or question/answer pairs, as in the datasets we are working with right now, or can they be in other formats?

@DelaramRajaei
Member Author

Thank you for the list.

  • Yes, they should be in English.
  • They can be in any format.

Please include a brief summary covering each dataset's size, format, and variables/features, as well as any relevant statistics or distributions.

@DelaramRajaei
Member Author

@IsaacJ60
Hello Isaac,

Could you please create an analysis function within the compare.py file?

These are the two files we are currently working on:
topics.robust04.bm25.map.all.csv
Analyzing_Results.txt

Please convert the .csv file to .txt format, just like the example file provided above. Make sure to compute, for every language, the difference between the backtranslated query's map and the original query's map (backtranslated query - original query).

Here's an example line, field by field:

qid: 224109
abstractqueryexpansion: what is citriscidal?
abstractqueryexpansion.bm25.map: 0
backtranslation_pes_arab: What is citricidal
semsim: 0.906306744
backtranslation_pes_arab.bm25.map: 0.25
subtraction: 0.25
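
A minimal sketch of what this analysis function could look like (column names taken from the header above; the exact layout of the .csv is an assumption):

import pandas as pd

def analyze(csv_path="topics.robust04.bm25.map.all.csv",
            txt_path="Analyzing_Results.txt",
            original_col="abstractqueryexpansion.bm25.map"):
    """Write a tab-separated .txt with (backtranslated map - original map) per language."""
    df = pd.read_csv(csv_path)
    # every other *.bm25.map column holds one backtranslation language's scores
    for col in list(df.columns):
        if col.endswith(".bm25.map") and col != original_col:
            df[col + ".subtraction"] = df[col] - df[original_col]
    df.to_csv(txt_path, sep="\t", index=False)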

@IsaacJ60
Contributor

Sure, I'll aim to finish it by Tuesday, but hopefully sooner.

@IsaacJ60
Contributor

Hi Delaram,

I wasn't able to make time for a meeting today, but I'll try to finish up the analysis function tonight.

@IsaacJ60
Contributor

Hi Delaram, I finished the analysis function. Where should I upload it for you to review?

@DelaramRajaei
Member Author

Hello Isaac,

Would you be able to fork the ReQue project from my GitHub page to your repository and then submit a pull request? That way, your commits and contributions to the project will be visible and trackable.

@IsaacJ60
Contributor

IsaacJ60 commented Aug 2, 2023

Hi Delaram,

I'm running into some issues with the conda environment when running:

conda env create -f environment.yml

The full output from the console is below. Most importantly, it says that many specifications were found to be incompatible with each other. (Text omitted in the middle.)

Thanks!

Collecting package metadata (repodata.json): done
Solving environment: failed
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
Solving environment: failed
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package openblas conflicts for:
pandas[version='>=0.23.3'] -> numpy[version='>=1.16,<2.0a0'] -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']
seaborn -> numpy[version='>=1.15'] -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']
numpy[version='>=1.18.1'] -> libblas[version='>=3.8.0,<4.0a0'] -> openblas[version='0.3.5.*|0.3.6|>=0.3.6,<0.3.7.0a0',build=h828a276_2]
nltk[version='>=3.4'] -> numpy -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']
scikit-learn[version='>=0.22'] -> numpy[version='>=1.14.6,<2.0a0'] -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']
gensim=4 -> numpy[version='>=1.11.3'] -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']
transformers==4.0.0 -> numpy -> openblas[version='0.2.20|0.2.20.*|>=0.2.20,<0.2.21.0a0|>=0.3.3,<0.3.4.0a0']

Package ucrt conflicts for:
seaborn -> matplotlib-base[version='>=3.1,!=3.6.1'] -> ucrt[version='>=10.0.20348.0']
nltk[version='>=3.4'] -> regex[version='>=2021.8.3'] -> ucrt[version='>=10.0.20348.0']
python=3.8 -> openssl[version='>=3.0.9,<4.0a0'] -> ucrt[version='>=10.0.20348.0']
transformers==4.0.0 -> numpy -> ucrt[version='>=10.0.20348.0']
tqdm==4.45.0 -> python -> ucrt[version='>=10.0.20348.0']
networkx[version='>=2.2'] -> python[version='>=3.11,<3.12.0a0'] -> ucrt[version='>=10.0.20348.0']
spacy==2.2.4 -> cymem[version='>=2.0.2,<2.1.0'] -> ucrt[version='>=10.0.20348.0']
gensim=4 -> ucrt[version='>=10.0.20348.0']
urllib3=1.25.8 -> cryptography[version='>=1.3.4'] -> ucrt[version='>=10.0.20348.0']
pandas[version='>=0.23.3'] -> ucrt[version='>=10.0.20348.0']
numpy[version='>=1.18.1'] -> ucrt[version='>=10.0.20348.0']
scikit-learn[version='>=0.22'] -> ucrt[version='>=10.0.20348.0']

Package certifi conflicts for:
nltk[version='>=3.4'] -> requests -> certifi[version='>=2017.4.17']
seaborn -> matplotlib-base[version='>=3.1,!=3.6.1'] -> certifi[version='>=2020.06.20']
requests=2.22.0 -> urllib3[version='>=1.21.1,<1.26,!=1.25.0,!=1.25.1'] -> certifi

...

Package fonttools conflicts for:
seaborn -> matplotlib-base[version='>=3.1,!=3.6.1'] -> fonttools[version='>=4.22.0']
networkx[version='>=2.2'] -> matplotlib-base[version='>=3.4'] -> fonttools[version='>=4.22.0']

networkx[version='>=2.2'] -> matplotlib[version='>=3.3'] -> matplotlib-base[version='>=3.3.0,<3.3.1.0a0|>=3.3.1,<3.3.2.0a0|>=3.3.2,<3.3.3.0a0|>=3.3.4,<3.3.5.0a0|>=3.4.2,<3.4.3.0a0|>=3.4.3,<3.4.4.0a0|>=3.5.0,<3.5.1.0a0|>=3.5.1,<3.5.2.0a0|>=3.5.2,<3.5.3.0a0|>=3.5.3,<3.5.4.0a0|>=3.6.2,<3.6.3.0a0|>=3.7.0,<3.7.1.0a0|>=3.7.1,<3.7.2.0a0|>=3.7.2,<3.7.3.0a0|>=3.6.3,<3.6.4.0a0|>=3.6.1,<3.6.2.0a0|>=3.6.0,<3.6.1.0a0|>=3.4.1,<3.4.2.0a0|>=3.3.3,<3.3.4.0a0']

The following specifications were found to be incompatible with your system:

  - feature:/win-64::__win==0=0
  - feature:|@/win-64::__win==0=0
  - nltk[version='>=3.4'] -> click -> __unix
  - nltk[version='>=3.4'] -> click -> __win
  - urllib3=1.25.8 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __unix
  - urllib3=1.25.8 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __win

Your installed version is: 0

@DelaramRajaei
Member Author

Hello Isaac,

Have you tried installing with the requirements.txt file? Did you have the same issue with that?
Also, this is my issue for installing the ReQue project. I logged all the problems I had with the installation; it may help you.

@DelaramRajaei
Member Author

@IsaacJ60
Hi Isaac,

I updated the versions of the libraries and also added some libraries to the environment.yml and requirements.txt files. Could you please try these files? Let me know if you encounter any errors.

requirements.txt
environment.zip
(GitHub attachments don't support .yml files, so I sent you the zip.)

@IsaacJ60
Contributor

IsaacJ60 commented Aug 3, 2023

Hi Delaram,

I'll try those when I get home. So far, installing via requirements.txt returned no errors (using the environment.yml file never works; it always finds conflicts and fails to resolve them), but the anserini installation leads me to believe that there is something wrong with the libraries. I will try to post more specific details if the new files still don't work.

@IsaacJ60
Contributor

IsaacJ60 commented Aug 5, 2023

Hi Delaram,

So installing with the new environment.yml file works with no errors, and I was able to install anserini after some tinkering.

During the pyserini installation process, conda install faiss-cpu -c pytorch doesn't work; it fails to solve the environment. I'm following these steps. It still thinks I'm using Python 3.11, even though I deleted it and removed the environment path variables. I'll try to find any leftover traces later today.

@IsaacJ60
Contributor

IsaacJ60 commented Aug 6, 2023

Hi Delaram,

conda install faiss-cpu -c pytorch runs without returning any errors now, so I believe the installation is done. Where can I find the robust04 dataset? The link in the README file shows this:

[screenshot of the page the README link leads to]

Could you help clarify if there are any additional steps for downloading the datasets?

Thanks.

@DelaramRajaei
Member Author

Hello Isaac,

Are you available to visit the lab this week? I already have robust04 indexed, so you do not need to index it again. I can share the indexed files with you.

@IsaacJ60
Contributor

IsaacJ60 commented Aug 7, 2023

Yeah, sure, I can drop by Wednesday morning at around 10:30-11:00 AM if that works for you.

@DelaramRajaei
Member Author

DelaramRajaei commented Aug 8, 2023

Yes, that's perfect.
Also, I can help you with the internet access.

@IsaacJ60
Contributor

IsaacJ60 commented Aug 9, 2023

Hi Delaram,

I arrived at the building but the door to 215 is closed. Where can I find you?

@DelaramRajaei
Member Author

Hello Isaac,

Apologies, I was waiting downstairs for you at first.

For your next task, please write all the translated queries in a text file. The format could be something like this:

query_id + '\t' + original_query + '\t' + translated_query + '\t' + backtranslated_query

After that, please add the new datasets and make sure to index them first. You can find the necessary commands in ReQue's readme.
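
Something like this minimal sketch would do for the file (the values below are placeholders based on the example earlier in this thread; the real ones come from the expander at runtime):

query_id = "224109"
original_query = "what is citriscidal?"
translated_query = "..."  # intermediate translation, elided here
backtranslated_query = "What is citricidal"

with open("translated_queries.txt", "a", encoding="utf-8") as f:
    f.write("\t".join([query_id, original_query, translated_query, backtranslated_query]) + "\n")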

@IsaacJ60
Contributor

Hi Delaram,

Running python -u main.py --corpus robust04 --output ./output/robust04/ --ranker bm25 --metric map 2>&1 | tee robust04.bm25.log & returns this error:
[screenshot of the error]

Running with the tee command returns this error (after downloading some json files):

Some weights of the model checkpoint at C:\Users\isaac/.cache\torch\sentence_transformers\johngiorgi_declutr-small were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Starting threads per expanders for ['generate', 'search', 'evaluate'] ...
INFO: MAIN: GENERATE: There has been error in <expanders.abstractqexpander.AbstractQExpander object at 0x00000279E2B31940>!
Traceback (most recent call last):
  File "main.py", line 51, in generate
    expander.write_expanded_queries(Qfilename, Q_filename)
  File "C:\Users\isaac\Documents\Isaac\CS\ReQue\ReQue\qe\expanders\abstractqexpander.py", line 42, in write_expanded_queries      
  File "main.py", line 51, in generate
    expander.write_expanded_queries(Qfilename, Q_filename)
  File "C:\Users\isaac\Documents\Isaac\CS\ReQue\ReQue\qe\expanders\abstractqexpander.py", line 42, in write_expanded_queries      
    with open(Q_filename, 'w', encoding='UTF-8') as Q_file:
FileNotFoundError: [Errno 2] No such file or directory: './output/robust04/robust04/topics.robust04.backtranslation_fra_latn.txt' 


INFO: MAIN: THREAD: abstractqueryexpansion: There has been error in <expanders.abstractqexpander.AbstractQExpander object at 0x00000279E2B31940>!
Traceback (most recent call last):
  File "main.py", line 199, in worker_thread
    if 'generate' in op: generate(Qfilename=param.corpora[corpus]['topics'], expander=expander, output=output_)
  File "main.py", line 51, in generate
    expander.write_expanded_queries(Qfilename, Q_filename)
  File "C:\Users\isaac\Documents\Isaac\CS\ReQue\ReQue\qe\expanders\abstractqexpander.py", line 42, in write_expanded_queries      
    with open(Q_filename, 'w', encoding='UTF-8') as Q_file:
FileNotFoundError: [Errno 2] No such file or directory: './output/robust04/robust04/topics.robust04.abstractqueryexpansion.txt'   
INFO: MAIN: THREAD: backtranslation_fra_latn: There has been error in <expanders.backtranslation.BackTranslation object at 0x00000279E2B31820>!
Traceback (most recent call last):
  File "main.py", line 199, in worker_thread
    if 'generate' in op: generate(Qfilename=param.corpora[corpus]['topics'], expander=expander, output=output_)
  File "main.py", line 51, in generate
    expander.write_expanded_queries(Qfilename, Q_filename)
  File "C:\Users\isaac\Documents\Isaac\CS\ReQue\ReQue\qe\expanders\abstractqexpander.py", line 42, in write_expanded_queries      
    with open(Q_filename, 'w', encoding='UTF-8') as Q_file:
FileNotFoundError: [Errno 2] No such file or directory: './output/robust04/robust04/topics.robust04.backtranslation_fra_latn.txt' 


Traceback (most recent call last):
  File "main.py", line 291, in <module>
    run(corpus=args.corpus.lower(),
  File "main.py", line 247, in run
    result = initialize(corpus, rankers, metrics, output_, rf, op, topicreader)
  File "main.py", line 225, in initialize
    result = aggregate(expanders=expanders, rankers=rankers, metrics=metrics, output=output)
  File "main.py", line 153, in aggregate
    df.to_csv(filename, index=False)
  File "C:\Users\isaac\anaconda3\envs\Reque\lib\site-packages\pandas\core\generic.py", line 3772, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "C:\Users\isaac\anaconda3\envs\Reque\lib\site-packages\pandas\io\formats\format.py", line 1186, in to_csv
    csv_formatter.save()
  File "C:\Users\isaac\anaconda3\envs\Reque\lib\site-packages\pandas\io\formats\csvs.py", line 240, in save
    with get_handle(
  File "C:\Users\isaac\anaconda3\envs\Reque\lib\site-packages\pandas\io\common.py", line 737, in get_handle
    check_parent_directory(str(handle))
  File "C:\Users\isaac\anaconda3\envs\Reque\lib\site-packages\pandas\io\common.py", line 600, in check_parent_directory
    raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
OSError: Cannot save file into a non-existent directory: 'output\robust04\robust04'

Running this in PowerShell, where tee is available, returns this error:

python : Traceback (most recent call last):
At line:1 char:1
+ python -u main.py --corpus robust04 --output ./output/robust04/ --ran ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "main.py", line 5, in <module>
    from pyserini.search.lucene import querybuilder
ModuleNotFoundError: No module named 'pyserini.search.lucene'

Could this be due to the conda environment? I was having issues activating it. I can share more details on that if you need.

@DelaramRajaei
Member Author

Hello Isaac,

I faced the same error I mentioned in this issue.

  File "main.py", line 5, in <module>
    from pyserini.search.lucene import querybuilder
ModuleNotFoundError: No module named 'pyserini.search.lucene'

The pip installation runs into castorini/pyserini#1025, so it's better to use the development installation.

If you encounter any errors while running the program, uninstall Pyserini with pip uninstall pyserini and then reinstall it with pip install pyserini; I recall that uninstalling and reinstalling resolved the issue for me.

@IsaacJ60
Contributor

Hi Delaram,

I believe it fixed that error, but I am getting some other ModuleNotFound errors. It leads me to believe that the environment isn't being activated properly. When I run conda activate ReQue, it returns the following error:
[screenshot of the conda activate error]

Unless you have any other ideas, I will try uninstalling and reinstalling anaconda.

@DelaramRajaei
Member Author

I'm not sure why this problem is happening.

In PyCharm's settings, you can also switch the Python interpreter to the conda environment. If you're using Visual Studio Code, there should be a similar option; look for it in the settings before reinstalling Anaconda.

@IsaacJ60
Contributor

I will try reinstalling conda shortly.

If I cannot get the project to run, could we set up an online meeting to debug?

@IsaacJ60
Contributor

Hi Delaram, reinstalling conda worked. I will provide a more detailed log of all the steps I took to run the project soon.

@IsaacJ60
Contributor

I also wanted to inform you that I'll be attending Hack the 6ix in Toronto over the weekend, so I may have to wait until I'm back to resume with the full log and tasks.

@DelaramRajaei
Member Author

Great. Apologies for the delay in getting back to you.
Please keep recording your findings in the log.
Additionally, once you're back from Toronto, focus on completing this task before proceeding to add the new datasets.

@IsaacJ60
Contributor

> Hello Isaac,
>
> Apologies, I was waiting downstairs for you at first.
>
> For your next task, please write all the translated queries in a text file. The format could be something like this:
>
> query_id + '\t' + original_query + '\t' + translated_query + '\t' + backtranslated_query
>
> After that, please add the new datasets and make sure to index them first. You can find the necessary commands in ReQue's readme.

Hi Delaram,

After running ReQue, this is the file that was output.
topics.robust04.bm25.map.all.csv
Where would I get the translated query?

I'm also getting the following error:

"../anserini/eval/trec_eval.9.0.4/trec_eval" -q -m map ../ds/robust04/qrels.robust04.txt ./output/robust04/robust04/topics.robust04.backtranslation_fra_latn.bm25.txt > ./output/robust04/robust04/topics.robust04.backtranslation_fra_latn.bm25.map.txt
The system cannot find the path specified.

It looks like nothing is being written to that map file. I'm not sure whether this affects anything?

@DelaramRajaei
Member Author

Hello Isaac,

The file you've shared shows that the backtranslation expander didn't function correctly, resulting in empty columns. The issue seems to be related to trec_eval. To resolve it, use Cygwin to run make on trec_eval, which will produce the .exe file.

After that, please go to the backtranslation.py file. In the get_expanded_query function, make sure to save the translated_query into a .txt file in CSV format.

Please make sure to do this as soon as you can, and then you can move on to adding new datasets.

@IsaacJ60
Contributor

IsaacJ60 commented Sep 5, 2023

Hi Delaram,

Here is the new file it generated. Could you please verify that it is correct? I don't seem to see any error messages upon running the project.

topics.robust04.bm25.map.all.csv

@DelaramRajaei
Member Author

Yes, everything looks good. Please make sure to save the translated queries as I mentioned earlier, and then you can go ahead with adding the new datasets.

@IsaacJ60
Contributor

IsaacJ60 commented Sep 9, 2023

Hi Delaram, sorry to bother you again. I'm having trouble understanding how to retrieve the qid. Can I retrieve it from some dataframe, or should I pass it into get_expanded_query?

@DelaramRajaei
Member Author

Hello Isaac, the qid is already passed in the args; I believe you can find it in args[0]. You can confirm it is correct by debugging or printing.
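
A hypothetical sketch of how to check it (the signature is assumed; adapt it to the actual one in the expander):

def get_expanded_query(self, q, args=None):
    qid = args[0] if args else None  # the qid should arrive in args[0]
    print(f"qid={qid}, query={q}")  # temporary debug print; remove once verified
    # ... the expander's real logic continues here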

@IsaacJ60
Contributor

IsaacJ60 commented Sep 27, 2023

Hi Delaram,

Sorry for the delay. I submitted a PR to your repo two weeks ago, and I will begin working on adding the datasets. How do you suggest I go about doing that? Thanks!
