Quickstart and Setup #28
Hi Delaram, I was just curious if there is a rough timeframe on when I should be done reading through the IR doc. Thanks. |
@IsaacJ60 Can you complete the reading of the IR document by Monday? Once that is done, we can proceed to read the papers on backtranslation. |
Got it, thanks! |
Finished reading the IR doc. I'm ready for the next steps. |
Awesome! Do you have any questions about the fundamentals? The metrics section of the document is not complete yet, but you can find some helpful information in this link. If you don't have any questions, we can move forward with reading backtranslation papers. We can schedule a brief meeting (either online or in person) for me to explain how to write a paper summary and discuss how to proceed with your next task. I found some papers on backtranslation; their links are in this doc under the BackTranslation header. |
Not too many questions at the moment, but I'm sure there will be some moments where I'll need to refer back to the document. Are you available in the mornings anytime before 12? I am tentatively available this week on Thursday and Friday. |
Sure, I'm available in the morning on any day except for Friday. |
Okay, can we try and meet at 10 AM on Thursday? Also, quick question: what does adding noise to text look like? Is there a specific example that can give me a better picture? Thanks |
Yes, that time works for me. Noise can be added to a text in various ways, by adding errors, inconsistencies, or irrelevant information. Here are some examples of text noise sources:
Here is an example for adding noise to a sentence by removing some vowels and doubling some letters:
I also came across this link that might be helpful. Keep in mind, though, that these sources of noise can reduce the readability, clarity, and overall quality of the text, which is why noise reduction or text cleaning is essential in many natural language processing applications. |
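To make the vowel-removal and letter-doubling idea concrete, here is a minimal sketch. The function name `add_noise`, its parameters, and the probabilities are assumptions chosen for illustration; they do not come from the project or from the linked resource.

```python
import random

VOWELS = set("aeiouAEIOU")


def add_noise(text, drop_vowel_prob=0.3, double_prob=0.1, seed=None):
    """Return a noisy copy of `text`.

    Each vowel is dropped with probability `drop_vowel_prob`; each letter
    that is kept is doubled with probability `double_prob`.
    (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in VOWELS and rng.random() < drop_vowel_prob:
            continue  # drop this vowel
        out.append(ch)
        if ch.isalpha() and rng.random() < double_prob:
            out.append(ch)  # double this letter
    return "".join(out)


if __name__ == "__main__":
    # The exact output depends on the seed and the probabilities.
    print(add_noise("Information retrieval is fun", seed=0))
```

Setting both probabilities to 0 returns the text unchanged, which gives a quick sanity check before raising the noise level.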
Thanks for the explanation. Do you think we can meet virtually? I can't access the university WiFi so I think that would work out better. |
Yes, that is fine by me. I will set up a google meet then. |
Hi Delaram, just wondering if you are joining the meeting soon? |
Hello Isaac, For your next task, start by searching for papers on backtranslation and make a list of them in the provided document. When choosing a paper, read the abstract first and look for the answers to these questions:
If you want to know more about the paper, check out the introduction for further info. If the paper is good and relates to our work, add the paper's name to the list. Just keep in mind that some papers might be more technical, so you can include them in the list to expand your knowledge and come back to them later. To find papers, you can search dblp.org or scholar.google.com. After that, you can dive into reading and summarizing the papers. To make your summaries comprehensive, consider including the following sections:
I suggest you begin by reading the two survey papers and their summaries, "A Survey of Data Augmentation Approaches for NLP" and "A Survey of Text Data Augmentation". After that, you can complete the list of papers. Make sure to complete the paper list and to read and summarize at least two papers by Thursday. |
Hi Delaram, I was reading Iterative Back-Translation for Neural Machine Translation, and came across this particular line: "back-translated data is used to build better translation systems in forward and backward directions, which in turn is used to re-back-translate monolingual data". From what I understand, it's saying that upon back-translating some data from language 1 -> language 2 -> language 1, you can reuse that back-translated data to do more back-translations. Could you let me know if I'm on the right track here? Thanks. |
Also, should I write summaries in google docs or in an issue? Thanks! |
Hey Isaac, yeah, you got it right. About your second question, please provide a summary of each paper you have read in a new issue. Start with the year of the paper, then mention the name of the venue, and finally, include the title of the paper. Have you finished compiling the list of papers? If so, could you please provide a log detailing the tasks you have completed? |
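The loop described in the quoted sentence can be sketched roughly as follows. This is a schematic under stated assumptions, not the paper's implementation: `train` is a hypothetical callable that takes a list of (input, output) sentence pairs and returns a translate function, and `toy_train` is a dictionary lookup that only stands in for a real translation model.

```python
def iterative_back_translation(parallel, mono_src, mono_tgt, train, rounds=2):
    """Schematic iterative back-translation.

    parallel: list of (source, target) sentence pairs.
    mono_src / mono_tgt: monolingual sentences in each language.
    train: hypothetical callable, pairs -> translate(sentence) function.
    """
    reverse_parallel = [(t, s) for s, t in parallel]
    forward = train(parallel)            # source -> target
    backward = train(reverse_parallel)   # target -> source
    for _ in range(rounds):
        # Back-translate target-side text to get synthetic sources
        # paired with real targets: extra data for the forward model.
        synthetic_fwd = [(backward(t), t) for t in mono_tgt]
        # Translate source-side text to get synthetic targets paired
        # with real sources: extra data for the backward model.
        synthetic_bwd = [(forward(s), s) for s in mono_src]
        # Retrain both directions; the improved models are used to
        # re-back-translate the monolingual data in the next round.
        forward = train(parallel + synthetic_fwd)
        backward = train(reverse_parallel + synthetic_bwd)
    return forward, backward


def toy_train(pairs):
    """Toy stand-in for model training: memorize the pairs."""
    table = dict(pairs)
    return lambda sentence: table.get(sentence, sentence)


if __name__ == "__main__":
    fwd, bwd = iterative_back_translation(
        [("hallo", "hello")], ["welt"], ["world"], toy_train
    )
    print(fwd("hallo"), bwd("hello"))
```

The point of the sketch is the data flow: each round's synthetic pairs come from the previous round's models, so better models yield better synthetic data.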
I will start doing the two summaries on Wednesday. Still have to pick one more paper. Should I write the log in this issue? |
Yes, please record the log of your tasks on this page and provide summaries on a new issue. |
July 11th, 2023 - Task Log
|
Hi Delaram, I was reading up on beam search and sampling, and from what I understood, the difference is that sampling introduces more variance than beam search, which produces more repetitive but fluent text. Do you know what goes on in each method and why this is the case? Thanks! |
Hello Isaac, I searched about it and here's what I found: Here's how beam search works:
Here's how sampling works:
Comparing: Sampling brings more variety and allows for greater exploration and creativity in the generated sequences. However, it can also lead to less organized or less reliable outputs compared to beam search, which takes the overall probability of the entire sequence into consideration. In simpler terms, sampling picks each next word at random according to the model's probabilities, while beam search keeps several of the most probable partial sequences and ranks them by the probability of the whole sequence. |
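A toy example may help to see the difference. The sketch below uses a made-up next-token table (`MODEL`) in place of a real language model; the table, the function names, and the probabilities are all assumptions for illustration.

```python
import math
import random

# Toy "language model" (an assumption for illustration): the next-token
# distribution depends only on the previous token.
MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.3, "bird": 0.3},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}


def beam_search(model, beam_width=2, max_len=3):
    """Keep the `beam_width` best partial sequences by total log-probability."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for logp, toks in beams:
            if toks[-1] == "</s>":
                candidates.append((logp, toks))  # finished sequence
                continue
            for tok, p in model[toks[-1]].items():
                candidates.append((logp + math.log(p), toks + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]


def sample(model, rng, max_len=3):
    """Draw each next token at random from the model's distribution."""
    toks = ["<s>"]
    for _ in range(max_len):
        if toks[-1] == "</s>":
            break
        options = model[toks[-1]]
        toks.append(rng.choices(list(options), weights=list(options.values()))[0])
    return toks


if __name__ == "__main__":
    print("beam width 2:", beam_search(MODEL, beam_width=2))
    print("beam width 1 (greedy):", beam_search(MODEL, beam_width=1))
    print("sample:", sample(MODEL, random.Random()))
```

In this table, a greedy search (beam width 1) commits to "the" because it is the most probable first word, while a beam of width 2 also keeps "a" alive and ends up with "a cat", whose whole-sequence probability (0.4 × 0.9 = 0.36) beats "the cat" (0.6 × 0.4 = 0.24). Sampling returns a different sequence from run to run, which is where the extra variety comes from.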
That makes sense, thanks! |
July 12th, 2023 - Task Log: Completed summaries for 2 papers and posted them as issues. |
Thank you for the logs and the helpful summaries. I'll make sure to check them out. Now, for the next task, we need to find datasets related to medicine and law. Currently, we're working with these datasets:
Unfortunately, the results for backtranslation weren't great, especially with clueweb. However, I've come across some papers and websites mentioning that backtranslation is often used with medical, law, and financial datasets. To put this to the test, we need to find different datasets in the medical and law fields. To discover information retrieval (IR) datasets, you may find the following resources useful: ir-datasets.com and huggingface.co/datasets. Please let me know if you have any questions or want to schedule a meeting. |
Medical Datasets:
Medline: https://ir-datasets.com/medline.html#medline
Clinical Trials: https://ir-datasets.com/clinicaltrials.html
CORD-19 (may contain medical documents): https://ir-datasets.com/cord19.html
Highwire (TREC Genomics 2006-07): https://ir-datasets.com/highwire.html
NFCorpus (NutritionFacts): https://ir-datasets.com/nfcorpus.html
PubMed Central (TREC CDS): https://ir-datasets.com/pmc.html
Law Datasets:
Pile of Law: https://huggingface.co/datasets/pile-of-law/pile-of-law |
I'll continue to add to this list, but before I do so, I have a few questions:
|
Thank you for the list.
Please include a brief summary that covers the dataset's size, format, variables/features, as well as any relevant statistics or distributions. |
@IsaacJ60 Could you please create an analysis function within the compare.py file? These are the two files we are currently working on: Please convert the .csv file to .txt format, just like the example file provided above. For every language, compute the difference between the map of the backtranslated query and the map of the original query (backtranslated query - original query). Here's some additional information about a line:
|
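As a rough sketch of the subtraction step only: the file layout is not shown in this thread, so the example below assumes the scores have already been loaded into dictionaries. The function name `map_deltas`, the dictionary layout, the language codes, and the sample values are all hypothetical and are not taken from compare.py.

```python
def map_deltas(original_map, backtranslated_maps):
    """Compute (backtranslated map - original map) per language and query.

    original_map: {qid: map score of the original query}
    backtranslated_maps: {language: {qid: map score of the backtranslated query}}
    Returns {language: {qid: delta}}; queries without an original score are skipped.
    (Assumed in-memory layout, for illustration only.)
    """
    deltas = {}
    for lang, per_query in backtranslated_maps.items():
        deltas[lang] = {
            qid: round(score - original_map[qid], 4)
            for qid, score in per_query.items()
            if qid in original_map
        }
    return deltas


if __name__ == "__main__":
    original = {"301": 0.25, "302": 0.5}
    backtranslated = {"fra": {"301": 0.3, "302": 0.4}}
    # A positive delta means the backtranslated query scored higher.
    print(map_deltas(original, backtranslated))
```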
Sure, I'll aim to finish it by Tuesday, but hopefully sooner. |
Hi Delaram, I wasn't able to make time for a meeting today, but I'll try and finish up the analysis function tonight. |
Hi Delaram, I finished with the analysis function. Where should I upload it for you to review? |
Hello Isaac, Would you be able to fork the ReQue project from my github page to your repository and then submit a pull request? By doing so, your commits and contributions to the project will be visible and trackable. |
Hi Delaram, I'm running into some issues with the conda environment when running:
The full output from the console is below. Most importantly it says that many specifications were found to be incompatible with each other. (Omitted text in the middle) Thanks!
|
Hello Isaac, Have you tried the installation with requirement.txt file? Did you have the same issue with that? |
@IsaacJ60 I updated the version of the libraries and also added some libraries to the environment.yml and requirement files. Could you please try these files? Let me know if you encounter any errors. requirements.txt |
Hi Delaram, I'll try those when I get home. So far, installing via the requirements.txt returned no errors (using the environment.yml file never works, it always finds conflicts and fails to resolve them), but the anserini installation leads me to believe that there is something wrong with the libraries. I will try and post more specific details if the new files still don't work. |
Hi Delaram, So installing with the new environment.yml file works with no errors, and I was able to install anserini after some tinkering. During the pyserini installation process, |
Hello Isaac, Are you available to visit the lab this week? I already have the indexed robust04, so you do not need to index them again. I can share the indexed files with you. |
Yeah sure I can drop by Wednesday morning at around 10:30-11:00 AM if that works for you. |
Yes, that's perfect. |
Hi Delaram, I arrived at the building but the door to 215 is closed. Where can I find you? |
Hello Isaac, Apologies, at first, I was waiting downstairs for you. For your next task, please write all the translated queries in a text file. The format could be something like this:
After that, please add the new datasets and make sure to index them first. You can find the necessary commands in ReQue's readme. |
Hello Isaac, I faced the same error I mentioned in this issue.
|
I'm not sure why this problem is happening. In PyCharm settings, you can also switch to the Conda environment for the Python interpreter. If you're using Visual Studio Code, there might be a similar option. You can look for it in the settings, before reinstalling anaconda. |
I will try reinstalling conda shortly. If I cannot get the project to run, could we set up an online meeting to debug? |
Hi Delaram, reinstalling conda worked. I will provide a more detailed log of all the steps I took to run the project soon. |
I also wanted to inform you that I'll be attending Hack the 6ix in Toronto over the weekend, so I may have to wait until I'm back to resume with the full log and tasks. |
Great. Apologies for the delay in getting back to you. |
Hi Delaram, After running ReQue, this is the file that was outputted. I'm also getting the following error:
Looks like it's not outputting to that map file. Not sure if this would affect anything? |
Hello Isaac, The file you've shared reveals that the backtranslated expander didn't function correctly, resulting in empty columns. The issue seems to be related to trec_eval. To resolve this, you can use Cygwin to build trec_eval with make, which will produce the .exe file. After that, please go to the backtranslated.py file. In the get_expanded_query function, make sure to save the translated_query into a .txt file in CSV format. Please do this as soon as you can, and then you can move on to adding new datasets. |
Hi Delaram, Here is the new file it generated. Could you please verify that it is correct? I don't seem to see any error messages upon running the project. |
Yes, everything looks good. Please make sure to save the translated queries as I mentioned earlier, and then you can go ahead with adding the new datasets. |
Hi Delaram, sorry to bother you again. I'm having trouble understanding how to retrieve the qid. Am I able to retrieve it from some dataframe or should I pass it into the get_expanded_query? |
Hello Isaac, the qid is already passed in through args. I believe you can find it in args[0]. You can confirm that it is correct by debugging and printing it. |
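One possible way to save each translated query in CSV format is sketched below. The helper name `save_translated_query`, the column order (qid, language, translated query), and the file name are assumptions; the actual format requested earlier in the thread is not shown, and this is not the code from backtranslated.py.

```python
import csv


def save_translated_query(path, qid, language, translated_query):
    """Append one row (qid, language, translated query) to a CSV-formatted text file.

    Hypothetical helper: inside get_expanded_query, `qid` would be the
    value read from args[0], as suggested above.
    """
    # newline="" lets the csv module control line endings; "a" appends
    # so that earlier queries are kept.
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([qid, language, translated_query])


if __name__ == "__main__":
    save_translated_query(
        "translated_queries.txt", "301", "fra", "international organized crime"
    )
```

Using `csv.writer` instead of joining strings by hand keeps queries that contain commas or quotes readable when the file is loaded again.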
Hi Delaram, Sorry for the delay. I submitted a PR to your repo 2 weeks ago, and will begin working on adding in datasets. How do you suggest I go about doing that? Thanks! |
@IsaacJ60
This is an issue page to log your progress. Please let us know if you have any concerns or questions.