[270] GtR mapping (also table name typo) #283
@@ -292,7 +292,7 @@ class SoftwareAndTechnicalProducts(Base):
 class DocumentClusters(Base):
-    __tablename__ = 'grt_doc_clusters'
+    __tablename__ = 'gtr_doc_clusters'
Fixing this issue is a little moot, since the table gtr_doc_clusters is empty.
Note @mindrones that this PR is in advance of #271, where I will pipe the GtR data with this mapping into ES7. From there it can be validated, including deeper iteration on the schema as features are understood or desired.
I've left a couple of questions below.
"currencyCode": {
  "type": "keyword"
},
"end": {
Possibly I'd use date_end and date_start here.
"terms_iso2_project": {
  "type": "keyword"
},
What about terms_countryId_project, with the assumption that we always use ISO alpha 2?
Can we add some custom docstring in ES mappings? Like:
"terms_iso2_project": {
  "type": "keyword",
  "__doc": "country ISO alpha 2 codes"
},
Regarding countryId: for consistency I will stick with iso2, and then issue #285 will consider replacing iso2 --> countryId so that all indexes can be consistent.
Regarding the docstring: it doesn't seem so, but there's nothing stopping us adding custom docstrings and then stripping them out again, so I added an issue, #284.
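The "add docstrings, then strip them out again" idea could look roughly like the sketch below. It assumes the annotation key is the `__doc` name suggested above and that the mapping is loaded as plain Python dicts; `strip_docstrings` is a hypothetical helper, not something that exists in this repo.

```python
import json

DOC_KEY = "__doc"  # assumed annotation key, taken from the suggestion above


def strip_docstrings(mapping, doc_key=DOC_KEY):
    """Return a copy of `mapping` with every `doc_key` entry removed, at any depth."""
    if isinstance(mapping, dict):
        return {
            key: strip_docstrings(value, doc_key)
            for key, value in mapping.items()
            if key != doc_key
        }
    if isinstance(mapping, list):
        return [strip_docstrings(item, doc_key) for item in mapping]
    return mapping


# Annotated mapping as it might be kept in the repo
annotated = {
    "terms_iso2_project": {
        "type": "keyword",
        "__doc": "country ISO alpha 2 codes",
    }
}

# Clean mapping, as it would be sent to Elasticsearch
clean = strip_docstrings(annotated)
print(json.dumps(clean, indent=2, sort_keys=True))
```

The helper returns a new structure rather than mutating its input, so the annotated version stays available for generating documentation.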
awesome, thanks a lot
"terms_institutes_project": {
  "type": "keyword"
},
Not sure about this one, should we be able to search it? If so I'd use text+keyword
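For reference, "text+keyword" here would usually mean an Elasticsearch multi-field: the field is analysed as `text` for search, with a `keyword` sub-field for exact matching and aggregations. A minimal sketch as a Python dict; the sub-field name `keyword` is a common convention and an assumption here, not something taken from this PR.

```python
import json

# Multi-field: full-text search on the main field, exact values on the sub-field
terms_institutes_project = {
    "type": "text",
    "fields": {
        "keyword": {"type": "keyword"},
    },
}

mapping_fragment = {"terms_institutes_project": terms_institutes_project}
print(json.dumps(mapping_fragment, indent=2, sort_keys=True))
```

With this layout, queries would target `terms_institutes_project` for analysed search and `terms_institutes_project.keyword` for exact terms.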
},
"type": "nested"
},
"json_outcomes_project": {
Not sure about this one: does this mean that the type is determined at runtime, or that it can have multiple types?
What's the reason for using this?
Sorry, I should have put a note here (I will write it into the metadata):
The dynamic/nested bit will mean that this field is basically just an unstructured NoSQL store, and that the nested types will be determined on ingestion. This will lead to bad behaviour for inconsistently formatted numeric, date and boolean fields. However, I took this decision because there are currently fourteen additional/supplementary outcomes tables in GtR, each with subtly different schemas:
artisticandcreativeproducts
collaborations
disseminations
furtherfundings
impactsummaries
intellectualproperties
keyfindings
policyinfluences
products
publications
researchdatabaseandmodels
researchmaterials
softwareandtechnicalproducts
spinouts
which all reflect different outcomes after the projects finished. For most projects, I believe that these are empty, but this is in desperate need of EDA before we can decide how to structure the data more formally.
The plan was to dump all of these fields (where they exist) directly into this json_outcomes_project field, where they can be analysed later. I'm confident that structure can be added to unify the interface to the outcomes fields, but I think it's a little out of scope for this PR.
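As a rough illustration of the "dump everything into json_outcomes_project" plan, the sketch below pools whatever outcome rows exist for one project under their table names. The function name `pool_outcomes`, the input shape (a dict of table name to list of row dicts) and the example rows are all assumptions for illustration; the real pipeline may lay the field out differently.

```python
from typing import Any, Dict, List

# The fourteen GtR outcomes tables listed above
OUTCOME_TABLES = [
    "artisticandcreativeproducts",
    "collaborations",
    "disseminations",
    "furtherfundings",
    "impactsummaries",
    "intellectualproperties",
    "keyfindings",
    "policyinfluences",
    "products",
    "publications",
    "researchdatabaseandmodels",
    "researchmaterials",
    "softwareandtechnicalproducts",
    "spinouts",
]

Rows = List[Dict[str, Any]]


def pool_outcomes(rows_by_table: Dict[str, Rows]) -> Dict[str, Rows]:
    """Keep only known outcomes tables that have at least one row for this project."""
    return {
        table: rows_by_table[table]
        for table in OUTCOME_TABLES
        if rows_by_table.get(table)
    }


# Hypothetical rows for a single project
example_rows = {
    "publications": [{"title": "A hypothetical paper"}],
    "spinouts": [],                        # empty, so dropped
    "not_an_outcome_table": [{"x": 1}],    # unknown table, so ignored
}

project_doc = {"json_outcomes_project": pool_outcomes(example_rows)}
print(project_doc)
```

Because nothing is renamed or coerced, the subtly different schemas survive intact for later analysis, at the cost of leaving type inference to Elasticsearch on ingestion.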
OK, thanks a lot for the explainer. Once this is up, we'll discuss how to do EDA on these kinds of fields. Thanks!
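One possible starting point for that EDA, sketched under the assumption that outcome rows are available as plain Python dicts: list the fields whose values arrive with more than one type, since those are the ones likely to misbehave when types are determined on ingestion. The function names and sample rows are hypothetical.

```python
from collections import defaultdict


def field_types(rows):
    """Map each field name to the set of Python type names seen across rows."""
    seen = defaultdict(set)
    for row in rows:
        for field, value in row.items():
            seen[field].add(type(value).__name__)
    return dict(seen)


def inconsistent_fields(rows):
    """Return only the fields that appear with more than one type."""
    return {
        field: sorted(types)
        for field, types in field_types(rows).items()
        if len(types) > 1
    }


# Hypothetical rows: "amount" is an int in one row and a string in another
rows = [
    {"amount": 1000, "start": "2015-01-01"},
    {"amount": "1,000", "start": "2015-01-01"},
]

print(inconsistent_fields(rows))  # {'amount': ['int', 'str']}
```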
Good to merge for my part.
* make sure conf dir is empty
* simplified es config
* added orm es config reader
* modified setup_es to pick up new es config
* swapped es_mode for boolean
* aliases now consistent with config
* aliases now automatically located
* added endpoint field to estasks
* added endpoint field to sql2estasks
* changed branch name
* mappings build
* updated docs
* updated docs
* updated docs
* added docstrings
* pruned deprecated schema transformations
* updated fos fieldname on arxlive
* unified data set schema transformations
* restructured directory
* refactored references to schema_transformation
* refactored references to schema_transformation
* slimmed down transformations, and included entity_type
* pruned ontology
* tidied schemas
* consistency tests
* reverted unrelated json file
* added dynamic strict to settings
* removed index.json in favour of a single defaults file
* harmonised name fieldsofstudy across arxiv
* using soft alias until a future PR to minimise changes
* added novelty back in
* sorted json
* sorted json
* sorted json
* changed schema_transformor to use new simpler mapping
* removed to/from keys
* new null syntax mapping implemented
* cleaned and sorted json
* adding temporary eurito-dev index to avoid conflating es7 compatibility issues
* adding temporary eurito-dev index to avoid conflating es7 compatibility issues
* testing es7 on cordis only
* testing es7 on cordis only
* testing es7 on cordis only
* changes to make cordis es7 run
* eurito-dev iteration
* compatibility issues between arxlive and eurito arxiv
* sorted json
* pycountry change no longer assumes not null country
* needed to split pathstub args
* removed redundant es mappings
* empty gtr transformation
* [267] Pool ES mappings across datasets (#280)
* changed branch name
* mappings build
* updated docs
* updated docs
* updated docs
* added docstrings
* added dynamic strict to settings
* removed index.json in favour of a single defaults file
* using soft alias until a future PR to minimise changes
* cleaned and sorted json
* [267] Tidy & slim schema transformations (#281)
* pruned deprecated schema transformations
* updated fos fieldname on arxlive
* unified data set schema transformations
* restructured directory
* refactored references to schema_transformation
* refactored references to schema_transformation
* slimmed down transformations, and included entity_type
* pruned ontology
* tidied schemas
* consistency tests
* reverted unrelated json file
* harmonised name fieldsofstudy across arxiv
* added novelty back in
* sorted json
* sorted json
* sorted json
Co-authored-by: Joel Klinger <[email protected]>
Co-authored-by: Joel Klinger <[email protected]>
* patched out es config setup from tests
* removed redundant tests
* fixed json formatting
* fixed bad table name (NB table was empty anyway)
* fixed bad table name (NB table was empty anyway)
* gtr ontology
* none included for testing
* added schema transformation
* picked up bug in test
* gtr ontology is self consistent
* added gtr mapping
* added gtr to config
* fixed merge conflicts
* fixed merge conflicts
* changed json field names
* instiutes are now analyzed and text
* sorted and cleaned json
* added geopoint
* fixed bad json
* fixed bad json
Co-authored-by: Joel Klinger <[email protected]>
Closes #270
Closes #206
tier_1/datasets/gtr.json
tier_1/mapping/datasets/gtr_mapping.json