This repository has been archived by the owner on Aug 13, 2021. It is now read-only.

[270] GtR mapping (also table name typo) #283

Merged
merged 81 commits into from
Jun 26, 2020
Changes from 77 commits
3999574
make sure conf dir is empty
Jun 1, 2020
d5539dd
simplified es config
Jun 1, 2020
b51f3ef
Merge branch 'dev' into 266_es_base
Jun 1, 2020
bb8c330
added orm es config reader
Jun 1, 2020
9f4b2b5
modified setup_es to pick up new es config
Jun 1, 2020
6b6dbc1
swapped es_mode for boolean
Jun 2, 2020
a0b4030
aliases now consistent with config
Jun 2, 2020
ec05346
aliases now automatically located
Jun 2, 2020
dae44aa
added endpoint field to estasks
Jun 2, 2020
ac88d8f
added endpoint field to sql2estasks
Jun 2, 2020
ea98848
changed branch name
Jun 3, 2020
99e837c
mappings build
Jun 3, 2020
da053d0
updated docs
Jun 3, 2020
8101b92
updated docs
Jun 3, 2020
ad9ae75
updated docs
Jun 3, 2020
3df222f
added docstrings
Jun 3, 2020
2cb692b
pruned deprecated schema transformations
Jun 3, 2020
ed10697
updated fos fieldname on arxlive
Jun 4, 2020
7184032
unified data set schema transformations
Jun 4, 2020
7267b3d
restructured directory
Jun 4, 2020
29389bd
refactored references to schema_transformation
Jun 4, 2020
bdb92bf
refactored references to schema_transformation
Jun 4, 2020
984291a
slimmed down transformations, and included entity_type
Jun 5, 2020
309f823
pruned ontology
Jun 5, 2020
b933014
tidied schemas
Jun 5, 2020
832c77f
Merge branch '267_es_mappings' into 267a_schematrans
Jun 5, 2020
2c2482b
consistency tests
Jun 5, 2020
9bab34e
reverted unrelated json file
Jun 5, 2020
cea95a8
added dynamic strict to settings
Jun 5, 2020
0f190d5
Merge branch '267_es_mappings' into 267a_schematrans
Jun 5, 2020
326cf02
removed index.json in favour of a single defaults file
Jun 5, 2020
08fe6da
rmd old files
Jun 5, 2020
cec4dd0
harmonised name fieldsofstudy across arxiv
Jun 5, 2020
1ef6e0e
using soft alias until a future PR to minimise changes
Jun 5, 2020
c38af84
Merge branch '267_es_mappings' into 267a_schematrans
Jun 5, 2020
b5617fc
added novelty back in
Jun 5, 2020
908cdec
sorted json
Jun 5, 2020
cde6765
sorted json
Jun 5, 2020
697a7b5
sorted json
Jun 5, 2020
af41059
changed schema_transformor to use new simpler mapping
Jun 5, 2020
a146f01
removed to/from keys
Jun 5, 2020
2b460e7
new null syntax mapping implemented
Jun 5, 2020
fce71f2
cleaned and sorted json
Jun 5, 2020
3797ab7
Merge branch '267_es_mappings' into 267a_schematrans
Jun 5, 2020
8fc238d
Merge branch '267a_schematrans' into 267b_devpipes
Jun 5, 2020
74a0de0
adding temporary eurito-dev index to avoid conflating es7 compatibili…
Jun 8, 2020
ff52858
adding temporary eurito-dev index to avoid conflating es7 compatibili…
Jun 8, 2020
9682dc7
testing es7 on cordis only
Jun 8, 2020
3d23881
testing es7 on cordis only
Jun 8, 2020
74f6d3c
testing es7 on cordis only
Jun 8, 2020
b00aeac
changes to make cordis es7 run
Jun 8, 2020
ebad7dd
eurito-dev iteration
Jun 8, 2020
fa48641
compatibility issues between arxlive and eurito arxiv
Jun 8, 2020
63c98fa
sorted json
Jun 9, 2020
9fb40e8
pycountry change no longer assumes not null country
Jun 9, 2020
414c62b
needed to split pathstub args
Jun 9, 2020
04921ff
removed redundant es mappings
Jun 9, 2020
2e0f881
empty gtr transformation
Jun 9, 2020
aac29f1
[267] Pool ES mappings across datasets (#280)
jaklinger Jun 9, 2020
e84f6f1
patched out es config setup from tests
Jun 9, 2020
3d9e912
removed redundant tests
Jun 9, 2020
8ccc9e5
fixed json formatting
Jun 9, 2020
5a07b60
fixed bad table name (NB table was empty anyway)
Jun 9, 2020
da1a84b
fixed bad table name (NB table was empty anyway)
Jun 9, 2020
2945be2
gtr ontology
Jun 9, 2020
717afa6
none included for testing
Jun 9, 2020
92234b5
added schema transformation
Jun 9, 2020
4b05e43
picked up bug in test
Jun 9, 2020
ce88847
manual fix of merge conflicts
Jun 9, 2020
2e9a1c5
gtr ontology is self consistent
Jun 9, 2020
639cb44
added gtr mapping
Jun 9, 2020
9224666
added gtr to config
Jun 9, 2020
cc21f32
fixed merge conflicts
Jun 9, 2020
ba62e05
fixed merge conflicts
Jun 9, 2020
13e0f9e
fixed merge conflicts
Jun 9, 2020
92ce96d
changed json field names
Jun 10, 2020
1931acb
instiutes are now analyzed and text
Jun 10, 2020
bd60992
sorted and cleaned json
Jun 10, 2020
292f171
added geopoint
Jun 11, 2020
eb5b052
fixed bad json
Jun 11, 2020
4359565
fixed bad json
Jun 11, 2020
Binary file modified nesta/core/config/elasticsearch.yaml
Binary file not shown.
4 changes: 2 additions & 2 deletions nesta/core/orms/gtr_orm.py
@@ -72,7 +72,7 @@ class Participant(Base):


class OrganisationLocation(Base):
- """This table is not in the orginal data. It contains all organisations and location
+ """This table is not in the original data. It contains all organisations and location
details where it has been possible to ascertain them."""
__tablename__ = "gtr_organisations_locations"

@@ -292,7 +292,7 @@ class SoftwareAndTechnicalProducts(Base):


class DocumentClusters(Base):
- __tablename__ = 'grt_doc_clusters'
+ __tablename__ = 'gtr_doc_clusters'
Contributor Author

Fixing this issue is a little moot, since the table gtr_doc_clusters is empty


doc_id = Column(VARCHAR(36), ForeignKey('gtr_projects.id'), primary_key=True)
cluster_id = Column(INT, primary_key=True, index=True)
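
A typo such as `grt_doc_clusters` is easy to miss in review. Below is a minimal sketch of a check that could flag it, assuming every GtR table is expected to carry the `gtr_` prefix (an assumption drawn from the table names visible in this diff, not a documented rule). The classes are plain-Python stand-ins rather than the real SQLAlchemy models, and `find_bad_tablenames` is a hypothetical helper.

```python
class OrganisationLocation:
    """Stand-in for the ORM class; only the table name matters here."""
    __tablename__ = "gtr_organisations_locations"


class DocumentClusters:
    """Stand-in carrying the misspelled table name from before this PR."""
    __tablename__ = "grt_doc_clusters"


def find_bad_tablenames(classes, prefix="gtr_"):
    """Return (class name, table name) pairs whose table name lacks the prefix."""
    return [
        (cls.__name__, cls.__tablename__)
        for cls in classes
        if not cls.__tablename__.startswith(prefix)
    ]


bad = find_bad_tablenames([OrganisationLocation, DocumentClusters])
print(bad)
```

In a real test module the list of classes would come from the ORM module itself rather than being written out by hand.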
24 changes: 24 additions & 0 deletions nesta/core/schemas/tier_1/datasets/gtr.json
@@ -0,0 +1,24 @@
{
"entity_type": "project",
"tier0_to_tier1": {
"_json_funding_project": "json_funding_project",
"_json_outcomes_project": "json_outcomes_project",
"_terms_continent_project": "terms_continent_project",
"_terms_countries_project": "terms_countries_project",
"_terms_instituteIds_project": "terms_instituteIds_project",
"_terms_institutes_project": "terms_institutes_project",
"_terms_iso2_project": "terms_iso2_project",
"_terms_topics_project": "terms_topics_project",
"abstractText": "textBody_abstract_project",
"end": "date_end_project",
"grantCategory": "type_category_funding",
"id": "id_of_project",
"leadFunder": "name_of_funder",
"leadOrganisationDepartment": "name_leadOrgDepartment_project",
"potentialImpact": "textBody_potentialImpact_project",
"start": "date_start_project",
"status": "status_of_project",
"techAbstractText": "textBody_techAbstract_project",
"title": "title_of_project"
}
}
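
To illustrate what the `tier0_to_tier1` block describes, here is a minimal sketch of renaming the keys of one source row to their tier-1 names. It uses a trimmed copy of the file above; `to_tier1` is a hypothetical helper, and dropping unmapped keys is an assumption, so the repository's own schema transformation code may behave differently.

```python
import json

# A trimmed copy of the transformation added in gtr.json (four of the fields).
GTR_TRANSFORMATION = json.loads("""
{
  "entity_type": "project",
  "tier0_to_tier1": {
    "abstractText": "textBody_abstract_project",
    "id": "id_of_project",
    "leadFunder": "name_of_funder",
    "title": "title_of_project"
  }
}
""")


def to_tier1(row, transformation):
    """Rename tier-0 keys to tier-1 names; unmapped keys are dropped (an assumption)."""
    field_map = transformation["tier0_to_tier1"]
    return {field_map[key]: value for key, value in row.items() if key in field_map}


# "internal_flag" is an invented field with no entry in the transformation.
row = {"id": "abc-123", "title": "A project", "leadFunder": "EPSRC", "internal_flag": 1}
print(to_tier1(row, GTR_TRANSFORMATION))
```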
121 changes: 121 additions & 0 deletions nesta/core/schemas/tier_1/mappings/datasets/gtr_mapping.json
@@ -0,0 +1,121 @@
{
"mappings": {
"dynamic": "strict",
"properties": {
"date_end_project": {
"type": "date"
},
"date_start_project": {
"type": "date"
},
"json_funding_project": {
"properties": {
"amount": {
"type": "integer"
},
"category": {
"type": "keyword"
},
"currency_code": {
"type": "keyword"
},
"end_date": {
"type": "date"
},
"start_date": {
"type": "date"
}
},
"type": "nested"
},
"json_outcomes_project": {
Contributor

@mindrones mindrones Jun 9, 2020

Not sure about this one: does this mean that the type is determined at runtime, or that it can have multiple types?
What's the reason for using this?

Contributor Author

Sorry, I should have put a note here (I will write it into the metadata):

The dynamic/nested setting means that this field is basically just an unstructured NoSQL store, and that the types of the nested fields will be determined on ingestion. This will lead to bad behaviour for inconsistently formatted numeric, date and boolean fields. However, I took this decision because there are currently fourteen additional/supplementary outcomes tables in GtR, each with subtly different schemas:

artisticandcreativeproducts
collaborations
disseminations
furtherfundings
impactsummaries
intellectualproperties
keyfindings
policyinfluences
products
publications
researchdatabaseandmodels
researchmaterials
softwareandtechnicalproducts
spinouts

which all reflect different outcomes after the projects finished. For most projects, I believe that these are empty, but this is in desperate need of EDA before we can decide how to structure the data more formally.

The plan was to dump all of these fields (where they exist) directly into this json_outcomes_project field, where they can be analysed later. I'm confident that structure can be added to unify the interface to the outcomes fields, but I think it's a little out of scope for this PR.
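
The pooling described above might look roughly like the following sketch. Everything here is illustrative: `pool_outcomes`, the `project_id` key, the `outcome_type` tag and the example row fields are hypothetical names, not taken from the GtR ORM or from this PR.

```python
from collections import defaultdict


def pool_outcomes(outcome_tables):
    """Group rows from several outcome tables by project id.

    `outcome_tables` maps a table name to a list of row dicts, each with a
    hypothetical `project_id` key. All other fields are copied unchanged.
    """
    pooled = defaultdict(list)
    for table_name, rows in outcome_tables.items():
        for row in rows:
            item = {k: v for k, v in row.items() if k != "project_id"}
            item["outcome_type"] = table_name  # hypothetical tag recording the source table
            pooled[row["project_id"]].append(item)
    return dict(pooled)


# Invented rows for two of the fourteen outcome tables.
tables = {
    "spinouts": [{"project_id": "p1", "companyName": "Acme"}],
    "keyfindings": [
        {"project_id": "p1", "description": "first finding"},
        {"project_id": "p2", "description": "second finding"},
    ],
}
outcomes = pool_outcomes(tables)

# Projects with no outcomes get an empty list.
doc = {"id_of_project": "p1", "json_outcomes_project": outcomes.get("p1", [])}
print(doc)
```

Tagging each item with its source table is one way to keep the subtly different schemas distinguishable during later EDA.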

Contributor

@mindrones mindrones Jun 10, 2020

OK, thanks a lot for the explainer. Once this is up we'll discuss how to do EDA on this kind of field, thanks!

"dynamic": true,
"type": "nested"
},
"name_leadOrgDepartment_project": {
"fields": {
"keyword": {
"type": "keyword"
}
},
"type": "text"
},
"name_of_funder": {
"fields": {
"keyword": {
"type": "keyword"
}
},
"type": "text"
},
"status_of_project": {
"type": "keyword"
},
"terms_continent_project": {
"type": "keyword"
},
"terms_countries_project": {
"type": "keyword"
},
"terms_instituteIds_project": {
"type": "keyword"
},
"terms_institutes_project": {
"analyzer": "terms_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
},
"type": "text"
},
"terms_iso2_project": {
"type": "keyword"
},
Comment on lines +75 to +77
Contributor

@mindrones mindrones Jun 9, 2020

What about terms_countryId_project, with the assumption that we always use ISO alpha 2?

Can we add some custom doc string in ES mappings? Like:

"terms_iso2_project": {
  "type": "keyword",
  "__doc": "country ISO alpha 2 codes"
},

Contributor Author

Regarding countryId: for consistency I will stick with iso2, and then issue #285 will consider replacing iso2 --> countryId so that all indexes can be consistent.

Regarding the docstring: it doesn't seem so, but there's nothing stopping us from adding custom docstrings and then stripping them out again, so I added issue #284.
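
The "add custom docstrings, then strip them out again" idea could be sketched as below. The `__doc` key follows the reviewer's suggestion; `strip_doc_keys` is a hypothetical helper and is not what issue #284 actually implemented.

```python
def strip_doc_keys(node, doc_key="__doc"):
    """Return a copy of a mapping with every `doc_key` entry removed, at any depth."""
    if isinstance(node, dict):
        return {k: strip_doc_keys(v, doc_key) for k, v in node.items() if k != doc_key}
    if isinstance(node, list):
        return [strip_doc_keys(item, doc_key) for item in node]
    return node


annotated = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "terms_iso2_project": {
                "type": "keyword",
                "__doc": "country ISO alpha 2 codes",
            }
        },
    }
}

# The cleaned copy is what would be sent to Elasticsearch; the annotated
# original stays on disk as documentation.
clean = strip_doc_keys(annotated)
print(clean)
```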

Contributor

awesome, thanks a lot

"terms_topics_project": {
"analyzer": "terms_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
},
"type": "text"
},
"textBody_abstract_project": {
"fields": {
"keyword": {
"type": "keyword"
}
},
"type": "text"
},
"textBody_potentialImpact_project": {
"fields": {
"keyword": {
"type": "keyword"
}
},
"type": "text"
},
"textBody_techAbstract_project": {
"fields": {
"keyword": {
"type": "keyword"
}
},
"type": "text"
},
"title_of_project": {
"fields": {
"keyword": {
"type": "keyword"
}
},
"type": "text"
},
"type_category_funding": {
"type": "keyword"
}
}
}
}
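
With `"dynamic": "strict"`, Elasticsearch rejects documents containing fields that are absent from the mapping, so it helps to confirm that every tier-1 name produced by the schema transformation has a matching property. The sketch below uses trimmed, hand-written copies of the two files, and `unmapped_fields` is a hypothetical helper. The trimmed mapping deliberately omits `id_of_project` to show what a reported gap looks like.

```python
def unmapped_fields(transformation, mapping):
    """Return tier-1 field names that have no entry in the mapping's properties."""
    targets = set(transformation["tier0_to_tier1"].values())
    properties = set(mapping["mappings"]["properties"])
    return sorted(targets - properties)


# Trimmed stand-ins for gtr.json and gtr_mapping.json.
transformation = {
    "tier0_to_tier1": {
        "end": "date_end_project",
        "id": "id_of_project",
        "status": "status_of_project",
    }
}
mapping = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "date_end_project": {"type": "date"},
            "status_of_project": {"type": "keyword"},
        },
    }
}

print(unmapped_fields(transformation, mapping))
```

A check like this only compares the two files passed to it. Fields supplied from elsewhere, for example a shared defaults file, would need to be merged into the mapping first.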
9 changes: 8 additions & 1 deletion nesta/core/schemas/tier_1/ontology.json
@@ -56,14 +56,17 @@
"fieldsOfStudy",
"fiscal",
"framework",
"funding",
"health",
"institutes",
"instituteIds",
"ipc",
"iso2",
"iso2lang",
"iso3",
"isoNumeric",
"last",
"leadOrgDepartment",
"linkedIn",
"location",
"member",
@@ -77,9 +80,11 @@
"nuts2",
"nuts3",
"of",
"outcomes",
"parent",
"personCountry",
"personNuts",
"potentialImpact",
"region",
"regions",
"rhodonite",
@@ -90,6 +95,7 @@
"state",
"subcategory",
"summary",
"techAbstract",
"techFieldNumber",
"tokens",
"topics",
@@ -113,6 +119,7 @@
"country",
"description",
"entity",
"funder",
"funders",
"funding",
"group",
@@ -121,4 +128,4 @@
"project"
]
}
- ]
+ ]
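
The terms added to the ontology line up with the three underscore-separated parts of the new GtR field names (for example `textBody_techAbstract_project`). Below is a minimal sketch of a field-name check built on that reading. The `first`/`middle`/`last` layout of `ONTOLOGY` is an assumption, since the full structure of `ontology.json` is not visible in this diff, and `check_field_name` is a hypothetical helper.

```python
# A hand-picked subset of terms seen in this PR, arranged in an assumed layout.
ONTOLOGY = {
    "first": {"json", "name", "terms", "textBody"},
    "middle": {"funding", "iso2", "leadOrgDepartment", "of", "techAbstract"},
    "last": {"funder", "funding", "project"},
}


def check_field_name(name, ontology=ONTOLOGY):
    """Return a list of problems with `name`; an empty list means it conforms."""
    parts = name.split("_")
    if len(parts) != 3:
        return [f"expected 3 underscore-separated terms, got {len(parts)}"]
    problems = []
    for position, term in zip(("first", "middle", "last"), parts):
        if term not in ontology[position]:
            problems.append(f"{term!r} is not a known {position} term")
    return problems


print(check_field_name("name_of_funder"))
print(check_field_name("terms_countryId_project"))
```

Under this sketch, a rename such as iso2 to countryId would first require adding the new term to the ontology.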