-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate new aggregation subsystem based content importer module #1361
Conversation
…Store in favour of newly introduced PDF aggregation service Initial implementation, proof of concept.
…Store in favour of newly introduced PDF aggregation service Refined implementation. Subworkflow still to be integrated with IIS main workflow.
…Store in favour of newly introduced PDF aggregation service Limiting entries to the ones with actual_url not null in other not to violate the DocumentContentURL schema constraints.
…Store in favour of newly introduced PDF aggregation service Renaming `actual_url` to `location` as a column reference pointing to S3 content location.
…Store in favour of newly introduced PDF aggregation service Enabling the new hive-based content metadata importer with content_url importer uber workflow and metadataextraction cache builder.
…Store in favour of newly introduced PDF aggregation service Excluding conflicting jackson libraries.
…he builder workflow
…Store in favour of newly introduced PDF aggregation service The following new properties are supported: * `import_content_pdfaggregation_table_name` to be specified in at runtime 'dbname.tablename' format to indicate both hive database and table * `import_content_pdfaggregation_hive_metastore_uris` with URIs pointing to the hive metastore utilized by the PDF aggregation subsystem, to be defined statically in IIS environment (default-config.xml file) New hive-based PDF aggregation service support is automatically enabled by providing explicit `import_content_pdfaggregation_table_name` parameter value at runtime. When the parameter is unspecified then content importer module works in legacy mode (objectstore compatible).
...mport/src/main/resources/eu/dnetlib/iis/wf/importer/content_url/chain/oozie_app/workflow.xml
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I think I understand what's going on here.
|
||
@BeforeEach | ||
public void beforeEach() { | ||
super.beforeEach(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this convention rather confusing. If we have a method in the superclass that is explicitly annotated with @BeforeEach
, overriding it and then using super
instead of just letting the inherited method be run seems like the wrong thing to do. But I see that this is a common convention in iis so this PR is not a place to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I didn't pay attention to this, just copy&pasted from some other test class.
I guess it might be some leftover from the time when the IIS test code base was switched from junit4 to junit5 as it seems the default behavior has changed. Again, I didn't check this thoroughly so let's do the cleanup as a part of the dedicated #1365 task.
} | ||
|
||
//@Test | ||
//TODO reenable this test once Hive is installed on CI Test hadoop cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A perfect case for @Disabled("TODO reenable this test once Hive is installed on CI Test hadoop cluster")
, I think.
Even @DisabledIf
could be fun if Hive is easily detectable, but that's not necessarily a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will stick to @Disabled
instead of @DisabledIf
because in the end we want to have it enabled just need to fix the environment.
@@ -239,7 +280,7 @@ | |||
|
|||
<decision name="decision-existence-filter"> | |||
<switch> | |||
<case to="content-url-dispatcher">${input_id eq "$UNDEFINED$"}</case> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular change could be confusing but it is bound to a specific commit which is closing the #1342 issue spotted when working on #1298.
Maybe it is not worth to elevate it to a dedicated PR but we should find a nice way to indicate this solution somehow in the commit comment once the commits get squshed before the merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was surprised at first but found the reason for this change.
I personally prefer just not squashing, but adding "Also closes #1342, which was found during this work, by going to content-url-dedup from decision-existence-filter in the content_url_chain" should be enough, I think.
…Store in favour of newly introduced PDF aggregation service Applying code review fixes.
Already merged with IIS master branch with this squashed commit. |
This pull request closes #1298 by introducing support for importing
DocumentContentUrl
records from the newly introduced hive-based content aggregation subsystem.The aggregation subsystem support is automatically enabled after providing explicit
import_content_pdfaggregation_table_name
parameter value at runtime. When the parameter is unspecified then content importer module works in legacy mode (objectstore compatible).This pull request includes
agg_subsystem_based_importer
integration test which works properly on the main ocean cluster but cannot be run on a dedicated CI cluster due to the missing hive-service therefore it is currently disabled ineu.dnetlib.iis.wf.importer.content.WorkflowTest
class and should be reenabled after installing Hive service on CI cluster (dedicated issue covering this task was already crated: #1360).