This repository has been archived by the owner on Aug 13, 2021. It is now read-only.

[270] GtR mapping (also table name typo) #283

Merged
jaklinger merged 81 commits into dev from 270_gtr_mapping
Jun 26, 2020

Conversation

jaklinger
Contributor

Closes #270
Closes #206

  • Address Typo in GtR table name #206
  • Add GtR schema transformation (tier_1/datasets/gtr.json)
  • Add GtR ES mapping base (tier_1/mapping/datasets/gtr_mapping.json)

@@ -292,7 +292,7 @@ class SoftwareAndTechnicalProducts(Base):


 class DocumentClusters(Base):
-    __tablename__ = 'grt_doc_clusters'
+    __tablename__ = 'gtr_doc_clusters'
Contributor Author

Fixing this issue is a little moot, since the table gtr_doc_clusters is empty

@jaklinger
Contributor Author

@mindrones, note that this PR precedes #271, in which I will pipe the GtR data into ES7 using this mapping. From there the data can be validated, and the schema iterated on further as features are understood or requested.

Contributor

@mindrones mindrones left a comment

I've left a couple of questions below.

"currencyCode": {
  "type": "keyword"
},
"end": {
Contributor

Possibly I'd use date_end and date_start here

Comment on lines +66 to +68
"terms_iso2_project": {
  "type": "keyword"
},
Contributor

@mindrones mindrones Jun 9, 2020

What about terms_countryId_project, on the assumption that we always use ISO alpha-2 codes?

Can we add a custom docstring to ES mappings? For example:

"terms_iso2_project": {
  "type": "keyword",
  "__doc": "country ISO alpha 2 codes"
},

Contributor Author

Regarding countryId: for consistency I will stick with iso2 for now, and issue #285 will consider replacing iso2 --> countryId so that all indexes are consistent.

Regarding the docstring: it doesn't seem so, but there's nothing stopping us from adding custom docstrings and then stripping them out again, so I added issue #284.
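
A minimal sketch of the "add docstrings, then strip them out again" idea, assuming the `__doc` key from the suggestion above. The helper name `strip_doc_keys` is hypothetical and is not part of this repository; it only illustrates how annotated mappings could be cleaned before being sent to Elasticsearch.

```python
DOC_KEY = "__doc"  # assumed marker key, taken from the suggestion above


def strip_doc_keys(mapping, doc_key=DOC_KEY):
    """Return a copy of `mapping` with every `doc_key` entry removed."""
    if isinstance(mapping, dict):
        # Drop the docstring key at this level and recurse into the rest
        return {key: strip_doc_keys(value, doc_key)
                for key, value in mapping.items() if key != doc_key}
    if isinstance(mapping, list):
        return [strip_doc_keys(item, doc_key) for item in mapping]
    return mapping  # scalars are returned unchanged


annotated = {
    "properties": {
        "terms_iso2_project": {
            "type": "keyword",
            "__doc": "country ISO alpha 2 codes",
        }
    }
}
clean = strip_doc_keys(annotated)
```

One caveat with this approach: a real field that happened to be named `__doc` would also be stripped, so the marker key would need to be reserved.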

Contributor

awesome, thanks a lot

Comment on lines 63 to 65
"terms_institutes_project": {
  "type": "keyword"
},
Contributor

Not sure about this one: should we be able to search it? If so, I'd use text+keyword.
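
For reference, a "text+keyword" field is usually expressed as an Elasticsearch multi-field. The sketch below builds such a fragment as a Python dict. The sub-field name `keyword` and the `ignore_above` value of 256 are assumptions that mirror Elasticsearch's default dynamic mapping for strings, not what was committed in this PR, and `text_with_keyword` is a hypothetical helper.

```python
def text_with_keyword(ignore_above=256):
    """Mapping for a field that is analysed for full-text search and
    also indexed verbatim for exact matching and aggregations."""
    return {
        "type": "text",
        "fields": {
            # Sub-field name "keyword" is a convention, not a requirement
            "keyword": {"type": "keyword", "ignore_above": ignore_above}
        },
    }


mapping_fragment = {"terms_institutes_project": text_with_keyword()}
```

With a mapping like this, full-text queries would target `terms_institutes_project`, while exact-match filters and aggregations would target `terms_institutes_project.keyword`.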

  },
  "type": "nested"
},
"json_outcomes_project": {
Contributor

@mindrones mindrones Jun 9, 2020

Not sure about this one: does this mean that the type is determined at runtime? or that it can have multiple types?
What's the reason for using this?

Contributor Author

Sorry I should have put a note here (I will write into the metadata):

The dynamic/nested bit will mean that this field is basically just an unstructured nosql store, and that the nested types will be determined on ingestion. This will lead to bad behaviour for inconsistently formatted numeric, date and boolean fields. However, I took this decision because there are currently fourteen additional/supplementary outcomes tables in GtR, each with subtly different schemas:

artisticandcreativeproducts 
collaborations              
disseminations              
furtherfundings             
impactsummaries             
intellectualproperties      
keyfindings                 
policyinfluences            
products                    
publications                
researchdatabaseandmodels   
researchmaterials           
softwareandtechnicalproducts
spinouts    

which all reflect different outcomes after the projects finished. For most projects, I believe that these are empty, but this is in desperate need of EDA before we can decide how to structure the data more formally.

The plan was to dump all of these fields (where they exist) directly into this json_outcomes_project field, where they can be analysed later. I'm confident that structure can be added to unify the interface to the outcomes fields, but I think it's a little out of scope for this PR
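
A rough sketch of the "dump everything into json_outcomes_project" plan, under stated assumptions: the function name `pool_outcomes`, the input shape (a dict of table name to list of row dicts) and the choice to key the pooled field by table name are all hypothetical and are not taken from the repository. Only the table names come from the list above.

```python
OUTCOME_TABLES = (
    "artisticandcreativeproducts",
    "collaborations",
    "disseminations",
    "furtherfundings",
    "impactsummaries",
    "intellectualproperties",
    "keyfindings",
    "policyinfluences",
    "products",
    "publications",
    "researchdatabaseandmodels",
    "researchmaterials",
    "softwareandtechnicalproducts",
    "spinouts",
)


def pool_outcomes(outcomes_by_table):
    """Collect the non-empty outcome rows for one project under a
    single unstructured `json_outcomes_project` field."""
    pooled = {}
    for table in OUTCOME_TABLES:
        rows = outcomes_by_table.get(table)
        if rows:  # skip tables that are missing or empty for this project
            pooled[table] = rows
    return {"json_outcomes_project": pooled}


# "companyName" is an illustrative field name, not a confirmed GtR column
doc = pool_outcomes({"spinouts": [{"companyName": "Example Ltd"}],
                     "products": []})
```

Keeping the rows grouped by source table preserves provenance, which should make later EDA on the differing schemas easier.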

Contributor

@mindrones mindrones Jun 10, 2020

OK, thanks a lot for the explainer. Once this is up we'll discuss how to do EDA on these kinds of fields, thanks!

@mindrones mindrones self-requested a review June 10, 2020 09:39
Contributor

@mindrones mindrones left a comment

Good to merge for my part.

@jaklinger jaklinger merged commit 5229238 into dev Jun 26, 2020
@jaklinger jaklinger deleted the 270_gtr_mapping branch June 26, 2020 10:09
jaklinger added a commit that referenced this pull request Sep 28, 2020
* make sure conf dir is empty

* simplified es config

* added orm es config reader

* modified setup_es to pick up new es config

* swapped es_mode for boolean

* aliases now consistent with config

* aliases now automatically located

* added endpoint field to estasks

* added endpoint field to sql2estasks

* changed branch name

* mappings build

* updated docs

* updated docs

* updated docs

* added docstrings

* pruned deprecated schema transformations

* updated fos fieldname on arxlive

* unified data set schema transformations

* restructured directory

* refactored references to schema_transformation

* refactored references to schema_transformation

* slimmed down transformations, and included entity_type

* pruned ontology

* tidied schemas

* consistency tests

* reverted unrelated json file

* added dynamic strict to settings

* removed index.json in favour of a single defaults file

* harmonised name fieldsofstudy across arxiv

* using soft alias until a future PR to minimise changes

* added novelty back in

* sorted json

* sorted json

* sorted json

* changed schema_transformor to use new simpler mapping

* removed to/from keys

* new null syntax mapping implemented

* cleaned and sorted json

* adding temporary eurito-dev index to avoid conflating es7 compatibility issues

* adding temporary eurito-dev index to avoid conflating es7 compatibility issues

* testing es7 on cordis only

* testing es7 on cordis only

* testing es7 on cordis only

* changes to make cordis es7 run

* eurito-dev iteration

* compatibility issues between arxlive and eurito arxiv

* sorted json

* pycountry change no longer assumes not null country

* needed to split pathstub args

* removed redundant es mappings

* empty gtr transformation

* [267] Pool ES mappings across datasets (#280)

* changed branch name

* mappings build

* updated docs

* updated docs

* updated docs

* added docstrings

* added dynamic strict to settings

* removed index.json in favour of a single defaults file

* using soft alias until a future PR to minimise changes

* cleaned and sorted json

* [267] Tidy & slim schema transformations (#281)

* pruned deprecated schema transformations

* updated fos fieldname on arxlive

* unified data set schema transformations

* restructured directory

* refactored references to schema_transformation

* refactored references to schema_transformation

* slimmed down transformations, and included entity_type

* pruned ontology

* tidied schemas

* consistency tests

* reverted unrelated json file

* harmonised name fieldsofstudy across arxiv

* added novelty back in

* sorted json

* sorted json

* sorted json

Co-authored-by: Joel Klinger <[email protected]>

Co-authored-by: Joel Klinger <[email protected]>

* patched out es config setup from tests

* removed redundant tests

* fixed json formatting

* fixed bad table name (NB table was empty anyway)

* fixed bad table name (NB table was empty anyway)

* gtr ontology

* none included for testing

* added schema transformation

* picked up bug in test

* gtr ontology is self consistent

* added gtr mapping

* added gtr to config

* fixed merge conflicts

* fixed merge conflicts

* changed json field names

* institutes are now analyzed and text

* sorted and cleaned json

* added geopoint

* fixed bad json

* fixed bad json

Co-authored-by: Joel Klinger <[email protected]>
Successfully merging this pull request may close these issues.

  • [DAPS] Create GtR mapping
  • Typo in GtR table name