Feature/mx 1677 orcid extractor #361

Open
wants to merge 24 commits into
base: main
24 commits
3e392cf
started orcid extractor
vyvytranngoc Oct 31, 2024
20d9727
working OrcidConnector (name and orcidId) with running tests
vyvytranngoc Nov 5, 2024
82a859d
Merge branch 'main' of https://github.com/robert-koch-institut/mex-co…
vyvytranngoc Nov 5, 2024
57cd6ac
Merge branch 'main' of https://github.com/robert-koch-institut/mex-co…
vyvytranngoc Nov 12, 2024
06040ad
Merge branch 'main' of https://github.com/robert-koch-institut/mex-co…
vyvytranngoc Nov 19, 2024
fa27e0e
Extended OrcidPerson class, added extract and transform methods (unfi…
vyvytranngoc Nov 19, 2024
136c725
Merge branch 'main' of https://github.com/robert-koch-institut/mex-co…
vyvytranngoc Dec 10, 2024
67ba02b
Merge branch 'main' of https://github.com/robert-koch-institut/mex-co…
vyvytranngoc Dec 10, 2024
90a2b65
finished extract methods for orcid
vyvytranngoc Dec 12, 2024
9f7f13e
Finished orcid connector
vyvytranngoc Jan 9, 2025
b39c714
Finished orcid connector
vyvytranngoc Jan 9, 2025
d86029b
Merge branch 'main' of https://github.com/robert-koch-institut/mex-co…
vyvytranngoc Jan 9, 2025
19fa4a4
Removed dummy file
vyvytranngoc Jan 9, 2025
93d50c3
fixed requested changes and remodeled OrcidPerson class to OrcidRecord
vyvytranngoc Jan 16, 2025
781ccaa
moved get_data_by_id from extract to connector
vyvytranngoc Jan 16, 2025
74c0346
Merge branch 'main' of https://github.com/robert-koch-institut/mex-co…
vyvytranngoc Jan 16, 2025
bef1ee6
removed get_data_by_name from extract to connector
vyvytranngoc Jan 21, 2025
3ca8a6d
Merge branch 'main' of https://github.com/robert-koch-institut/mex-co…
vyvytranngoc Jan 28, 2025
ff09d07
added unit tests
vyvytranngoc Jan 30, 2025
21edbda
Merge branch 'main' of https://github.com/robert-koch-institut/mex-co…
vyvytranngoc Jan 30, 2025
ebf54ec
update changelog
vyvytranngoc Jan 30, 2025
f69a18f
re-installed dependencies and merge
vyvytranngoc Feb 4, 2025
4da3dd7
adjusted unit test for orcid connector
vyvytranngoc Feb 4, 2025
96ab8fb
Merge branch 'main' of https://github.com/robert-koch-institut/mex-co…
vyvytranngoc Feb 4, 2025
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Connector class for retrieving ORCID data by ID or name
- methods for extracting data from orcid
- methods to transform from OrcidPerson to mex person
- model class for orcid data
- unit tests for orcid connector

### Changes

### Deprecated
9 changes: 9 additions & 0 deletions assets/raw-data/primary-sources/primary-sources.json
@@ -34,5 +34,14 @@
        "value": "Wikidata APIs"
      }
    ]
  },
  {
    "identifier": "orcid",
    "title": [
      {
        "language": "en",
        "value": "Open Researcher Contributor Identification Initiative"
      }
    ]
  }
]
Empty file added mex/common/orcid/__init__.py
Empty file.
89 changes: 89 additions & 0 deletions mex/common/orcid/connector.py
@@ -0,0 +1,89 @@
from typing import Any

from mex.common.connector.http import HTTPConnector
from mex.common.exceptions import EmptySearchResultError, FoundMoreThanOneError
from mex.common.settings import BaseSettings


class OrcidConnector(HTTPConnector):
    """Connector class for querying Orcid records."""

    def _set_url(self) -> None:
        """Set url of the host."""
        settings = BaseSettings.get()
        self.url = str(settings.orcid_api_url)

    def _check_availability(self) -> None:
        """Send a HEAD request to verify the host is available."""
        url = f"{self.url.rstrip('/')}/search"
        response = self._send_request("HEAD", url=url, params={})
        response.raise_for_status()

    def check_orcid_id_exists(self, orcid_id: str) -> bool:
Contributor

This method is unused in mex-common. Do you need it for mex-backend?

        """Check whether a record exists for the given ORCID ID."""
        query_dict = {"orcid": orcid_id}
        response = self.fetch(query_dict)
        return bool(response.get("num-found", 0))

    @staticmethod
    def build_query(filters: dict[str, Any]) -> str:
        """Construct the ORCID API query string."""
        return " AND ".join([f"{key}:{value}" for key, value in filters.items()])

    def fetch(self, filters: dict[str, Any]) -> dict[str, Any]:
        """Perform a search query against the ORCID API."""
        query = OrcidConnector.build_query(filters)
Contributor

Suggested change
-        query = OrcidConnector.build_query(filters)
+        query = self.build_query(filters)

        return self.request(method="GET", endpoint="search", params={"q": query})

    @staticmethod
    def get_data_by_id(orcid_id: str) -> dict[str, Any]:
        """Retrieve data by UNIQUE ORCID ID.

        Args:
            orcid_id: Unique identifier in ORCID system.

        Returns:
            Personal data of the single matching id.
        """
        orcidapi = OrcidConnector.get()
        # or endpoint = f"{orcid_id}/person"
        endpoint = f"{orcid_id}/record"
        return dict(orcidapi.request(method="GET", endpoint=endpoint))
Comment on lines +38 to +51
Contributor

Why does this have to be static?

Suggested change
-    @staticmethod
-    def get_data_by_id(orcid_id: str) -> dict[str, Any]:
-        """Retrieve data by UNIQUE ORCID ID.
-        Args:
-            orcid_id: Unique identifier in ORCID system.
-        Returns:
-            Personal data of the single matching id.
-        """
-        orcidapi = OrcidConnector.get()
-        # or endpoint = f"{orcid_id}/person"
-        endpoint = f"{orcid_id}/record"
-        return dict(orcidapi.request(method="GET", endpoint=endpoint))
+    def get_data_by_id(self, orcid_id: str) -> dict[str, Any]:
+        """Retrieve data by UNIQUE ORCID ID.
+        Args:
+            orcid_id: Unique identifier in ORCID system.
+        Returns:
+            Personal data of the single matching id.
+        """
+        endpoint = f"{orcid_id}/record"
+        return dict(self.request(method="GET", endpoint=endpoint))


    @staticmethod
    def get_data_by_name(
Contributor

same here, why make this static?

instead of orcidapi = OrcidConnector.get() you could just use self

        given_names: str = "*",
        family_name: str = "*",
        **filters: str,
    ) -> dict[str, Any]:
        """Get ORCID record of a single person for the given filters.

        Args:
            given_names: Given name of a person, defaults to non-null
            family_name: Surname of a person, defaults to non-null
            **filters: Key-value pairs representing ORCID search filters.

        Raises:
            EmptySearchResultError: If no person matches the filters.
            FoundMoreThanOneError: If more than one person matches the filters.

        Returns:
            Orcid data of the single matching person by name.
        """
        if given_names:
            filters["given-names"] = given_names
        if family_name:
            filters["family-name"] = family_name
        orcidapi = OrcidConnector.get()
        search_response = orcidapi.fetch(filters=filters)
        num_found = search_response.get("num-found", 0)
        if num_found == 0:
            msg = f"Cannot find orcid person for filters '{filters}'"
            raise EmptySearchResultError(msg)
        if num_found > 1:
            msg = f"Found multiple orcid persons for filters '{filters}'"
            raise FoundMoreThanOneError(msg)

        orcid_id = search_response["result"][0]["orcid-identifier"]["path"]
        return OrcidConnector.get_data_by_id(orcid_id)
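
For readers following the review, the "exactly one match" rule in `get_data_by_name` can be isolated from any network access. The following is a minimal sketch, not the PR's code: the two exception classes are local stand-ins for the ones imported from `mex.common.exceptions`, `pick_single_orcid_id` is a hypothetical helper name, and the sample response only mirrors the keys the connector reads (`num-found`, `result`, `orcid-identifier`, `path`).

```python
from typing import Any


class EmptySearchResultError(Exception):
    """Local stand-in for mex.common.exceptions.EmptySearchResultError."""


class FoundMoreThanOneError(Exception):
    """Local stand-in for mex.common.exceptions.FoundMoreThanOneError."""


def pick_single_orcid_id(search_response: dict[str, Any]) -> str:
    """Return the ORCID ID of the only search hit, or raise (hypothetical helper)."""
    num_found = search_response.get("num-found", 0)
    if num_found == 0:
        raise EmptySearchResultError("no orcid person found")
    if num_found > 1:
        raise FoundMoreThanOneError(f"found {num_found} orcid persons")
    # Exactly one hit: read the identifier the same way the connector does.
    return str(search_response["result"][0]["orcid-identifier"]["path"])


# Assumed sample payload, reduced to the keys used above.
hit = {
    "num-found": 1,
    "result": [{"orcid-identifier": {"path": "0009-0004-3041-5706"}}],
}
print(pick_single_orcid_id(hit))  # prints 0009-0004-3041-5706
```

Keeping this decision in a pure function would let the zero, one, and many cases be tested without mocking `fetch`; whether that split is wanted here is a design choice for the authors.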
39 changes: 39 additions & 0 deletions mex/common/orcid/extract.py
@@ -0,0 +1,39 @@
from mex.common.orcid.connector import OrcidConnector
from mex.common.orcid.models.person import OrcidRecord
from mex.common.orcid.transform import map_orcid_data_to_orcid_record


def get_orcid_record_by_name(
    given_names: str = "*", family_name: str = "*"
) -> OrcidRecord:
    """Return the OrcidRecord of a single person for the given names.

    Args:
        given_names: Given name of a person, defaults to non-null
        family_name: Surname of a person, defaults to non-null

    Raises:
        EmptySearchResultError
        FoundMoreThanOneError

    Returns:
        OrcidRecord of the matching person by name.
    """
    orcid_data = OrcidConnector.get_data_by_name(
Contributor

When you make get_data_by_name non-static, this would become:

Suggested change
-    orcid_data = OrcidConnector.get_data_by_name(
+    orcid_data = OrcidConnector.get().get_data_by_name(

        given_names=given_names, family_name=family_name
    )
    return map_orcid_data_to_orcid_record(orcid_data)


def get_orcid_record_by_id(orcid_id: str) -> OrcidRecord:
    """Return the OrcidRecord for a UNIQUE ORCID ID.

    Args:
        orcid_id: Unique identifier in ORCID system.

    Returns:
        OrcidRecord of the matching id.
    """
    orcid_data = OrcidConnector.get_data_by_id(orcid_id=orcid_id)
    return map_orcid_data_to_orcid_record(orcid_data)
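
The review suggestion of calling `OrcidConnector.get().get_data_by_name(...)` relies on `get()` returning a shared connector instance whose methods then use `self`. A minimal sketch of that pattern follows. `ToyConnector` is a hypothetical stand-in, its `get()` is a simplified assumption about how a cached accessor could work (not the actual mex-common implementation), and `request` only echoes its arguments instead of doing I/O.

```python
from typing import Any, Optional


class ToyConnector:
    """Hypothetical connector with a cached `get()` accessor."""

    _instance: Optional["ToyConnector"] = None

    @classmethod
    def get(cls) -> "ToyConnector":
        """Return the shared instance, creating it on first use."""
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def request(self, method: str, endpoint: str) -> dict[str, Any]:
        """Pretend to call the API; echoes the request instead of doing I/O."""
        return {"method": method, "endpoint": endpoint}

    def get_data_by_id(self, orcid_id: str) -> dict[str, Any]:
        """Instance method: uses `self` rather than looking up the singleton."""
        return dict(self.request(method="GET", endpoint=f"{orcid_id}/record"))


# Call site in the style the reviewer proposes for extract.py.
data = ToyConnector.get().get_data_by_id("0009-0004-3041-5706")
print(data["endpoint"])  # prints 0009-0004-3041-5706/record
```

With instance methods, the singleton lookup happens once at the call site, and tests can construct or patch an instance without touching static attributes.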
Empty file.
56 changes: 56 additions & 0 deletions mex/common/orcid/models/person.py
@@ -0,0 +1,56 @@
from pydantic import Field

from mex.common.models import BaseModel


class OrcidIdentifier(BaseModel):
    """Model class for OrcidID."""

    path: str
    uri: str


class OrcidEmail(BaseModel):
    """Model class for Orcid email."""

    email: list[str]


class OrcidEmails(BaseModel):
    """Model class for Orcid emails."""

    email: list[OrcidEmail]


class OrcidFamilyName(BaseModel):
    """Model class for orcid family names."""

    value: str


class OrcidGivenNames(BaseModel):
    """Model class for Orcid given names."""

    value: str


class OrcidName(BaseModel):
    """Model class for Orcid name."""

    family_name: OrcidFamilyName = Field(alias="family-name")
    given_names: OrcidGivenNames = Field(alias="given-names")
    visibility: str


class OrcidPerson(BaseModel):
    """Model class for Orcid person."""

    emails: OrcidEmails
    name: OrcidName


class OrcidRecord(BaseModel):
    """Model class for Orcid record."""

    orcid_identifier: OrcidIdentifier = Field(alias="orcid-identifier")
    person: OrcidPerson
46 changes: 46 additions & 0 deletions mex/common/orcid/transform.py
@@ -0,0 +1,46 @@
from typing import Any

from mex.common.models import (
    ExtractedPerson,
)
from mex.common.orcid.models.person import OrcidRecord
from mex.common.primary_source.helpers import get_all_extracted_primary_sources


def map_orcid_data_to_orcid_record(orcid_data: dict[str, Any]) -> OrcidRecord:
    """Wraps orcid data into an OrcidRecord."""
    return OrcidRecord.model_validate(orcid_data)


def transform_orcid_person_to_mex_person(
    orcid_record: OrcidRecord,
) -> ExtractedPerson:
    """Transforms a single ORCID person to an ExtractedPerson.

    Args:
        orcid_record: OrcidRecord object of a person.

    Returns:
        ExtractedPerson.
    """
    primary_source = get_all_extracted_primary_sources()["orcid"]
    had_primary_source = primary_source.stableTargetId

    id_in_primary_source = orcid_record.orcid_identifier.path
    email = orcid_record.person.emails.email[0].email
    if orcid_record.person.name.visibility == "public":
        given_names = orcid_record.person.name.given_names.value
        family_name = orcid_record.person.name.family_name.value
    else:
        given_names = None
        family_name = None
    orcid_id = orcid_record.orcid_identifier.uri

    return ExtractedPerson(
        identifierInPrimarySource=id_in_primary_source,
        hadPrimarySource=had_primary_source,
        givenName=given_names,
        familyName=family_name,
        orcidId=orcid_id,
        email=email,
    )
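
The visibility rule in `transform_orcid_person_to_mex_person` (names are only copied when the name block is marked `public`) can be illustrated on a plain dictionary. This is a sketch under assumptions: `public_name` is a hypothetical helper, and the sample records are hand-written to match the field layout of the model classes in this PR, not taken from real ORCID responses.

```python
from typing import Any, Optional


def public_name(record: dict[str, Any]) -> tuple[Optional[str], Optional[str]]:
    """Return (given_names, family_name), or (None, None) if the name is not public."""
    name = record["person"]["name"]
    if name["visibility"] != "public":
        return None, None
    return name["given-names"]["value"], name["family-name"]["value"]


# Assumed sample records shaped like OrcidRecord / OrcidPerson / OrcidName.
public_record = {
    "person": {
        "name": {
            "given-names": {"value": "Josiah"},
            "family-name": {"value": "Carberry"},
            "visibility": "public",
        }
    }
}
limited_record = {
    "person": {
        "name": {
            "given-names": {"value": "Josiah"},
            "family-name": {"value": "Carberry"},
            "visibility": "limited",
        }
    }
}

print(public_name(public_record))   # prints ('Josiah', 'Carberry')
print(public_name(limited_record))  # prints (None, None)
```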
5 changes: 5 additions & 0 deletions mex/common/settings.py
@@ -186,6 +186,11 @@ def get(cls) -> Self:
        "services ",
        validation_alias="MEX_WEB_USER_AGENT",
    )
    orcid_api_url: AnyUrl = Field(
        Url("https://orcid"),
        description="URL of orcid api.",
        validation_alias="MEX_ORCID_API_URL",
    )

    def text(self) -> str:
        """Dump the current settings into a readable table."""
68 changes: 67 additions & 1 deletion mex/common/testing/plugin.py
@@ -12,13 +12,15 @@
from typing import Any, cast
from unittest.mock import MagicMock, Mock

import pytest
import requests
from langdetect import DetectorFactory
from pydantic import AnyUrl
-from requests import Response
+from requests import HTTPError, Response

from mex.common.connector import CONNECTOR_STORE
from mex.common.models import ExtractedPrimarySource
from mex.common.orcid.connector import OrcidConnector
from mex.common.primary_source.helpers import get_all_extracted_primary_sources
from mex.common.settings import SETTINGS_STORE, BaseSettings
from mex.common.wikidata.connector import (
@@ -191,3 +193,67 @@ def get_wikidata_item_details_by_id(
        "get_wikidata_item_details_by_id",
        get_wikidata_item_details_by_id,
    )


@pytest.fixture
def orcid_person_raw() -> dict[str, Any]:
    """Return a raw orcid person."""
    with open(Path(__file__).parent / "test_data" / "orcid_person_raw.json") as fh:
        return cast(dict[str, Any], json.load(fh))


@pytest.fixture
def orcid_multiple_matches() -> dict[str, Any]:
    """Return a raw orcid search response with multiple matches."""
    with open(
        Path(__file__).parent / "test_data" / "orcid_multiple_matches.json"
    ) as fh:
        return cast(dict[str, Any], json.load(fh))


@pytest.fixture
def mocked_orcid(
    monkeypatch: pytest.MonkeyPatch,
    orcid_person_raw: dict[str, Any],
    orcid_multiple_matches: dict[str, Any],
) -> None:
    """Mock orcid connector."""
    response_query = Mock(spec=Response, status_code=200)

    session = MagicMock(spec=requests.Session)
    session.get = MagicMock(side_effect=[response_query])

    def mocked_init(self: OrcidConnector) -> None:
        self.session = session

    monkeypatch.setattr(OrcidConnector, "__init__", mocked_init)

    def check_orcid_id_exists(_self: OrcidConnector, _orcid_id: str) -> bool:
        return _orcid_id == "0009-0004-3041-5706"

    monkeypatch.setattr(OrcidConnector, "check_orcid_id_exists", check_orcid_id_exists)

    def fetch(_self: OrcidConnector, filters: dict[str, Any]) -> dict[str, Any]:
        if filters.get("given-names") == "John":
            return {"num-found": 1, "result": [orcid_person_raw]}
        if filters.get("given-names") == "Multiple":
            return orcid_multiple_matches
        return {"result": None, "num-found": 0}

    monkeypatch.setattr(OrcidConnector, "fetch", fetch)

    def get_data_by_id(orcid_id: str) -> dict[str, Any]:
        if orcid_id == "0009-0004-3041-5706":
            return orcid_person_raw
        msg = "404 Not Found"
        raise HTTPError(msg)

    monkeypatch.setattr(OrcidConnector, "get_data_by_id", staticmethod(get_data_by_id))

    def build_query(filters: dict[str, Any]) -> str:
        """Construct the ORCID API query string."""
        if "givennames" in filters:
            return "givennames:Josiah AND familyname:Carberry"
        return "given-names:Josiah AND family-name:Carberry"

    monkeypatch.setattr(OrcidConnector, "build_query", staticmethod(build_query))
Comment on lines +252 to +259
Contributor

Suggested change
-    def build_query(filters: dict[str, Any]) -> str:
-        """Construct the ORCID API query string."""
-        if "givennames" in filters:
-            return "givennames:Josiah AND familyname:Carberry"
-        return "given-names:Josiah AND family-name:Carberry"
-    monkeypatch.setattr(OrcidConnector, "build_query", staticmethod(build_query))

You don't need to mock this method. Since it does not do any network I/O, it is "unit-test-safe".
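
To illustrate the reviewer's point, a pure function like `build_query` can be exercised directly in a unit test. The sketch below copies the joining logic from the connector into a standalone function so it runs without mex-common installed. The test name is hypothetical, and in the real test suite one would presumably call `OrcidConnector.build_query` instead.

```python
from typing import Any


def build_query(filters: dict[str, Any]) -> str:
    """Join filters into a query string, mirroring OrcidConnector.build_query."""
    return " AND ".join(f"{key}:{value}" for key, value in filters.items())


def test_build_query() -> None:
    """Check the real logic instead of a mocked return value."""
    filters = {"given-names": "Josiah", "family-name": "Carberry"}
    assert build_query(filters) == "given-names:Josiah AND family-name:Carberry"
    # A single filter produces no "AND" separator.
    assert build_query({"orcid": "0009-0004-3041-5706"}) == "orcid:0009-0004-3041-5706"


test_build_query()
```

Because dictionaries preserve insertion order, the expected string is deterministic and no fixture or monkeypatching is needed.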
