-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance documentation #53
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kouloumos, I've reviewed all the files that you've updated and provided comments wherever necessary. Please feel free to add TODO
and merge this PR to the main after resolving conflicts. Once merged, I'll start refactoring the code you highlighted along with some I found while investigating the code.
src/xml_utils.py
Outdated
if len(xmls_list) > 0 and not any(combined_filename in item for item in files_list): | ||
# If individual summaries exist but no combined summary, | ||
# extract and append their content to the dictionary | ||
# @? in what scenario is this true? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using it as a fallback strategy, there are many cases where individual summaries are present and not the combined ones, for example:
i. The combined summary file generated isn't up to the mark due to some garbage individual posts. So, we have to manually delete the combined summary file and regenerate it.
ii. For delving bitcoin data we haven't generated all the XML files for the month before November 2023 so it might help in some exceptional cases of individual summary files.
So, let's keep it that way only.
@@ -55,6 +55,8 @@ | |||
logger.success(f"TOTAL THREADS RECEIVED FOR '{dev_name}': {len(data_list)}") | |||
|
|||
# NEW THREADS POSTS | |||
# @TODO you already identify the original post by type==original_post |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out, It seems like we can implement this change now. Earlier not all the domains had a type
field so we used to filter the original post based on created_at
and title
.
src/xml_utils.py
Outdated
|
||
# For each individual summary (XML file) that exists for the | ||
# given thread, extract and append their content to the dictionary | ||
# @? This method is called for every post without a summary, which means that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function file_not_present_df
indeed needs a refactoring. Especially, improvements in file handling, reducing the number of operations inside the loop, and streamlining the data-appending logic.
Please transform it to TODO
so I can make the necessary changes in it.
src/xml_utils.py
Outdated
@@ -180,6 +204,9 @@ def file_not_present_df(self, columns, source_cols, df_dict, files_list, dict_da | |||
self.append_columns(df_dict, file, title, namespace) | |||
|
|||
if combined_filename in file: | |||
# @? the code will never reach this point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, Let's add it to TODO
will update this function.
@urvishp80 I rebased and converted my observations to During rebase, I also modified a comment based on my observation here. |
Refined project documentation and added comments during an in-depth review to map out the project's architecture and component interactions.
@urvishp80 I would appreciate if you can review and verify that my attempt to explain the flow and what each script does is correct. Also, I prefixed a few comments with
@?
, those are questions that I have and I thought that it would be easier to discuss/answer them in-line. I will delete them (or transform them to@TODO
) before merging this.The cron-jobs are also documented on the sequence diagram that I created for Bitcoin Search & friends. I would also appreciate a review of that one.