Bringing Crossref, Semantic Scholar, Open Citations and Open Alex lookup + auto-import to Cita for Zotero 7 #300

Open
wants to merge 30 commits into
base: zotero7

Conversation

thebluepotato

Hi! I adapted the (now stale) PR #139 to the new Zotero 7 branch so it has a chance to be swept up in the new release. The general logic is unchanged from the other PR, but I've made quite a few updates for efficiency, code clarity and type safety as well as fixed a few failing Promises here and there. I've tested quite a bit already, but it could definitely use more in-depth testing.

And I've also added a button to citations to auto-import that reference into Zotero with one click and then link it. It's similar to what https://github.com/MuiseDestiny/zotero-reference does, but I find that addon confusing at best and it doesn't help that all the info is in Mandarin Chinese...

All in all, probably still a WIP, but happy to receive code reviews and have some people test this!

@thebluepotato thebluepotato marked this pull request as draft September 22, 2024 21:22
@thebluepotato thebluepotato marked this pull request as ready for review September 22, 2024 21:22
@Dominic-DallOsto
Collaborator

Thanks a lot for this!

It'll take me a little while to review this in detail sorry, but this is great!

@thebluepotato
Author

One thing that should be considered: while adding references for which Crossref has a DOI or ISBN is quite robust, adding items as a book or journal based merely on their title is unsatisfactory. For instance, with DOI:10.1145/2786451.2786465, some of the references are sections from the same book (a different author per section), yet they all appear in Crossref as Author + Book title (instead of section title). Maybe it should be up to the user to choose what is actually imported.

To avoid type errors and to avoid overusing `any`, I copied the TypeScript definitions from zotero/translators and slightly tweaked them.
@thebluepotato thebluepotato changed the title from "Bringing Crossref lookup and auto-import to Cita for Zotero 7" to "Bringing Crossref, Semantic Scholar and Open Alex lookup + auto-import to Cita for Zotero 7" Sep 26, 2024
@thebluepotato
Author

thebluepotato commented Sep 26, 2024

The latest commit adds a new IndexerBase abstract class that abstracts the common logic between various "indexers" (couldn't think of a better name). This allows us to more simply add various such "indexers", which now includes Semantic Scholar and Open Alex as well. They all have their pros and cons, but this should give the user a lot of options to automatically fetch these citations.

Based on initial (limited) experimentation:

  • Crossref: citations seem more "official" than the other sources, but not all items with DOIs have references
  • Semantic Scholar: because it analyses the indexed papers, it includes many references, but also some random entries that are not actually cited
  • Open Alex: usually has fewer citations than the others
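
The abstraction described above can be pictured roughly as follows. This is only a minimal sketch, not the PR's actual code: apart from the class name `IndexerBase`, every type, method and property name here (`SupportedId`, `ParsedReference`, `pickIdentifier`, `fetchReferences`, `StubCrossref`) is an assumption made for illustration, and the subclass returns canned data instead of calling a real API.

```typescript
// Hypothetical shapes; only the name IndexerBase comes from the PR description.
type IdType = "DOI" | "arXiv" | "OpenAlex";

interface SupportedId {
  type: IdType;
  value: string;
}

interface ParsedReference {
  doi?: string;
  title?: string;
}

abstract class IndexerBase {
  /** Human-readable name, e.g. for menu entries. */
  abstract readonly indexerName: string;
  /** Identifier types this indexer can search by, in priority order. */
  abstract readonly supportedTypes: IdType[];

  /** Indexer-specific lookup; each subclass talks to its own API here. */
  protected abstract fetchReferences(id: SupportedId): Promise<ParsedReference[]>;

  /** Shared logic: pick the highest-priority identifier this indexer supports. */
  pickIdentifier(ids: SupportedId[]): SupportedId | null {
    for (const type of this.supportedTypes) {
      const match = ids.find((id) => id.type === type);
      if (match) return match;
    }
    return null;
  }

  /** Shared logic: resolve an identifier, then delegate to the subclass. */
  async getReferences(ids: SupportedId[]): Promise<ParsedReference[]> {
    const id = this.pickIdentifier(ids);
    return id ? this.fetchReferences(id) : [];
  }
}

class StubCrossref extends IndexerBase {
  readonly indexerName = "Crossref";
  readonly supportedTypes: IdType[] = ["DOI"];

  protected async fetchReferences(id: SupportedId): Promise<ParsedReference[]> {
    // A real subclass would query the indexer's API; this stub returns canned data.
    return [{ title: `placeholder reference for ${id.value}` }];
  }
}
```

Adding another indexer then only means writing a new subclass that declares its supported identifier types and implements `fetchReferences`; the identifier selection stays in one place.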

One issue that this "abstraction" brings is that the context menu when clicking on an item shows the translation keys instead of the corresponding strings.

@Dominic-DallOsto
Collaborator

Hi, I just had a chance to quickly test this and so far things look nice, thanks so much! I haven't been able to fully review the code yet but here are some observations from testing.

Openalex build error

I get the following build error at the moment because of the openalex-sdk. Did you encounter this on your end?

    node_modules/openalex-sdk/dist/src/utils/works.js:7:37:
      7 │ const fs_1 = __importDefault(require("fs"));
                                               ~~~~

  The package "fs" wasn't found on the file system but is built into node. Are you trying to bundle
  for node? You can use "platform: 'node'" to do that, which will remove this error.

I removed the openalex SDK to test a bit further.

Auto import citations

Firstly, the auto import by identifier button is really nice! It would solve #40. One thing that might also be nice is, if the citation already has a QID attached, that this should be applied to the newly created item when it's imported?

Getting Crossref citations

Testing the auto import of citations from crossref I found some bugs, but they're mostly related to crossref's data so it was just unlucky I happened to pick a bad item haha

  1. Add this item by DOI - 10.1007/BF01700692
  2. Get citations from crossref
    • newlines in text aren't rendered properly
    • it says I will get 64 citations

image

  3. Press OK
    • actually I only get 2 citations, and they're both the same
      • Checking the API response, this is actually a crossref problem:
      • we get a response with 64 citations, but 62 are unstructured - maybe the message could be edited to exclude unstructured citations if we don't attempt to parse them, or a message after importing could say "imported 2/64 citations from crossref"
      • here crossref is just a bit strange in that 2 of the references have the same DOI. Could we check for duplicates within the crossref response and remove them?

image
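
The two suggestions above (skip unstructured references, drop duplicates within one Crossref response, and report "imported N/M") could be sketched along these lines. This is a hedged illustration rather than Cita's implementation: `CrossrefReference`, `ImportPlan` and `planImport` are hypothetical names, the `DOI` and `unstructured` field names are assumed to match the Crossref response, and only DOIs are considered here (the PR also handles ISBNs).

```typescript
// Assumed minimal shape of one entry in a Crossref reference list.
interface CrossrefReference {
  key: string;
  DOI?: string;
  unstructured?: string;
}

interface ImportPlan {
  total: number;
  importable: CrossrefReference[];
  message: string;
}

function planImport(references: CrossrefReference[]): ImportPlan {
  const seen = new Set<string>();
  const importable: CrossrefReference[] = [];
  for (const ref of references) {
    // Entries without an identifier (e.g. unstructured-only) are not parsed.
    if (!ref.DOI) continue;
    // DOIs are case-insensitive, so compare a normalised form.
    const normalised = ref.DOI.trim().toLowerCase();
    if (seen.has(normalised)) continue;
    seen.add(normalised);
    importable.push(ref);
  }
  return {
    total: references.length,
    importable,
    message: `Imported ${importable.length}/${references.length} citations from Crossref`,
  };
}
```

The same `importable.length` value could also feed the confirmation prompt, so that the number announced before pressing OK matches what is actually added.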

Getting Semantic Scholar citations

I tested with using the item with DOI - 10.1109/ITW.2015.7133169. It got 11/14 citations because 3 had no identifiers in semantic scholar. The request was very slow though compared to getting citations from crossref. Here is an overview of the timing.

image

The slowdown is because the requests to arXiv are really slow. I tested the same request in the browser and it also took ~10 seconds to complete, so it doesn't seem that this is a problem with Cita. Does arXiv have an alternative (faster) API? Maybe a workaround would be to update the progress message with the number of citations already downloaded, so users can see that things are progressing?
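
A progress message that counts completed downloads might look like the sketch below. It is an illustration under assumptions: `ProgressTracker`, `fetchAllWithProgress`, the `report` callback and the message wording are all hypothetical, and the actual progress window used by Cita is not modelled.

```typescript
/** Counts finished lookups and reports a human-readable status line. */
class ProgressTracker {
  private done = 0;
  private failed = 0;

  constructor(
    private readonly total: number,
    private readonly report: (message: string) => void,
  ) {}

  /** Call once per finished request, whether it succeeded or not. */
  record(success: boolean): void {
    this.done += 1;
    if (!success) this.failed += 1;
    const suffix = this.failed > 0 ? ` (${this.failed} failed)` : "";
    this.report(`Fetched ${this.done}/${this.total} citations${suffix}`);
  }
}

/** Fetches items one by one so that each slow request updates the message. */
async function fetchAllWithProgress<T>(
  ids: string[],
  fetchOne: (id: string) => Promise<T>,
  report: (message: string) => void,
): Promise<T[]> {
  const tracker = new ProgressTracker(ids.length, report);
  const results: T[] = [];
  for (const id of ids) {
    try {
      results.push(await fetchOne(id));
      tracker.record(true);
    } catch {
      // A failed lookup should not abort the remaining ones.
      tracker.record(false);
    }
  }
  return results;
}
```

Even when individual requests take several seconds, the user would then see the counter advance instead of a static message.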

@thebluepotato
Author

thebluepotato commented Sep 30, 2024

Openalex build error

> I get the following build error at the moment because of the openalex-sdk. Did you encounter this on your end?

Yes sorry, I'm actually entirely new to npm so I forgot to commit the patch to openalex-sdk, fixed in latest commit.

Auto import citations

> Firstly, the auto import by identifier button is really nice! It would solve #40. One thing that might also be nice is, if the citation already has a QID attached, that this should be applied to the newly created item when it's imported?

I didn't really look into the Wikidata side of things, but will definitely look into ensuring the QID is imported as well. Is it usually stored in the Extra field?

  • Import QID

Getting Crossref citations

> Testing the auto import of citations from crossref I found some bugs, but they're mostly related to crossref's data so it was just unlucky I happened to pick a bad item haha

  • Get newlines to show in the alert
  • Rephrase alert to clarify (parsed does not mean the citations will be added in the end, rephrase)
  • Apply duplicate filter to the citations to be added as well

Getting Semantic Scholar citations

> I tested with using the item with DOI - 10.1109/ITW.2015.7133169. It got 11/14 citations because 3 had no identifiers in semantic scholar. The request was very slow though compared to getting citations from crossref. Here is an overview of the timing.

As it currently stands, the PR relies heavily on Zotero's own existing translators to avoid doing too much heavy lifting and to avoid code duplication. Therefore, if it's slow to import with Cita, it's also slow to import when using the "magic wand" tool that imports items based on their identifiers. Will look into alternatives, but it seems likely that Zotero's own translator is already quite optimized as it is.

@thebluepotato
Author

thebluepotato commented Oct 1, 2024

Regarding arXiv, I updated the translator locally (see: zotero/translators#3366) to use another endpoint which, based on limited testing, should be faster than the one the translator currently uses. However, when testing within Cita, it's just as slow...

EDIT: rather, depending on luck I guess, it can be as "fast" as 1s per request, but still can sometimes be as slow as the other endpoint.

@Dominic-DallOsto
Collaborator

Dominic-DallOsto commented Oct 3, 2024

That's great, thanks a lot! And thanks for addressing the issues with the arXiv translator, doing it upstream in Zotero is definitely the right way.

A couple of little things I noticed:

  • If I right click an item, in the Cita menu it says "Get citations from Semantic" instead of "Get citations from Semantic Scholar" like it says in the More... menu
  • If I have an item that only has an ISBN, in the right click menu all the options for getting citations are still enabled, whereas in the More... menu they're all rightfully disabled

Otherwise this all looks good

@thebluepotato thebluepotato changed the title from "Bringing Crossref, Semantic Scholar and Open Alex lookup + auto-import to Cita for Zotero 7" to "Bringing Crossref, Semantic Scholar, Open Citations and Open Alex lookup + auto-import to Cita for Zotero 7" Oct 4, 2024
@thebluepotato
Author

Got a little crazy and added OpenCitations capabilities again. However, within all the confusion, I need your input on whether we could/should expand the definition of PIDType to include all "IDs" we're now using and that the various indexers support searching for, or at least OpenAlex identifier and Semantic Scholar Corpus ID. In particular, it would streamline the code by using getPID everywhere

@thebluepotato
Author

> • If I have an item that only has an ISBN, in the right click menu all the options for getting citations are still enabled, whereas in the More... menu they're all rightfully disabled

For this, I'd like to improve the logic so it is only disabled when no supported identifiers are present. While CrossRef requires a DOI, the other indexers often can search with more identifiers.

@Dominic-DallOsto
Collaborator

> Got a little crazy and added OpenCitations capabilities again. However, within all the confusion, I need your input on whether we could/should expand the definition of PIDType to include all "IDs" we're now using and that the various indexers support searching for, or at least OpenAlex identifier and Semantic Scholar Corpus ID. In particular, it would streamline the code by using getPID everywhere

Yeah, I think that's great to abstract this out like you have.

> For this, I'd like to improve the logic so it is only disabled when no supported identifiers are present. While CrossRef requires a DOI, the other indexers often can search with more identifiers.

Yeah, that makes sense. With how you've set it up, I guess you could just check whether IndexerBase.extractSupportedUID returns null? Maybe it'd be nice to have a dedicated function that does this check.
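
Such a dedicated check could be as small as the sketch below. The real signature of `IndexerBase.extractSupportedUID` is not shown in this thread, so the standalone function here, together with `IndexerLike`, `ItemPIDs`, `canFetchCitations` and the example indexers, is an assumption for illustration only.

```typescript
type PIDType = "DOI" | "ISBN" | "arXiv" | "OMID" | "OpenAlex";
type ItemPIDs = Partial<Record<PIDType, string>>;

interface UID {
  type: PIDType;
  value: string;
}

interface IndexerLike {
  name: string;
  /** Identifier types the indexer can search by, in priority order. */
  supportedPIDs: readonly PIDType[];
}

/** Stand-in for the method mentioned above: first supported identifier, or null. */
function extractSupportedUID(indexer: IndexerLike, pids: ItemPIDs): UID | null {
  for (const type of indexer.supportedPIDs) {
    const value = pids[type];
    if (value) return { type, value };
  }
  return null;
}

/** The dedicated check: can this indexer do anything with this item? */
function canFetchCitations(indexer: IndexerLike, pids: ItemPIDs): boolean {
  return extractSupportedUID(indexer, pids) !== null;
}

/** Names of the indexers whose menu entries should stay enabled. */
function enabledIndexers(indexers: IndexerLike[], pids: ItemPIDs): string[] {
  return indexers.filter((indexer) => canFetchCitations(indexer, pids)).map((indexer) => indexer.name);
}
```

Both the right-click menu and the More... menu could call the same check, which would keep their enabled and disabled states consistent.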

@Dominic-DallOsto
Collaborator

Do you get the same styling problem as me with the PID rows now?

image

If I make it so all the identifiers are visible, it looks a bit weird but I guess OK

image

Also, do you think it makes sense to grey out the fetch icons for identifiers that can't be fetched? I think it is more intuitive than clicking the button and then finding out that it isn't supported?

Additionally, could fetching the OMID and OPENALEX ids give a progress popup similar to fetching QIDs? I found that this took a few seconds to run so I wasn't sure whether anything was happening until the identifier finally appeared.

@thebluepotato
Author

> Do you get the same styling problem as me with the PID rows now?

image

Yes I do, I guess we should also no longer uppercase them all. Should we hide PMID and PMCID from this view? They're supported for searching and all, but I don't think you can get citations from them, so there's no need to highlight them as much.

> Also, do you think it makes sense to grey out the fetch icons for identifiers that can't be fetched? I think it is more intuitive than clicking the button and then finding out that it isn't supported?

That'll probably encourage us to further abstract the checking logic, good idea!

> Additionally, could fetching the OMID and OPENALEX ids give a progress popup similar to fetching QIDs? I found that this took a few seconds to run so I wasn't sure whether anything was happening until the identifier finally appeared.

In my testing it was nearly instant, but we sure can have a progress indicator.

I'll be away for the week so I won't be able to look at this PR much, feel free to tweak it to your liking if you want!

@Dominic-DallOsto
Collaborator

Dominic-DallOsto commented Oct 5, 2024

I played around with things quickly so now they look like this

image

> I'll be away for the week so I won't be able to look at this PR much, feel free to tweak it to your liking if you want!

No worries - thanks a lot for your hard work! I'll try to fully review the code by then and make a roadmap for what we need before merging

Edit: hiding the PMID and PMCID rows makes sense I think, yeah

image

And the progress messages work great, thanks

  • localise PID row fetch button "Fetch" text

@thebluepotato
Author

thebluepotato commented Oct 5, 2024

TODOs:

  • Implement DOI fetcher inspired by https://github.com/bwiernik/zotero-shortdoi
  • Fetch DOIs from Datacite?
  • Auto-generate DOIs from arXiv? EDIT: Semantic Scholar does not seem to find documents by arXiv-DOI. Therefore, we should just be happy with the arXiv ID and instead, give the "arXiv" PIDType priority over DOI for Semantic Scholar
  • Adapt rate-limiting to the specific indexer and provide a meaningful error message when request was rate-limited
  • Implement Semantic Scholar's Corpus ID as PIDType
  • When fetching PIDs such as OMID, also add other PIDs if contained in the response (such as DOI)
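
For the rate-limiting item in the list above, one possible shape is a per-indexer minimum interval plus a readable message when a request is rejected. This is a sketch under assumptions: `MinIntervalLimiter` and `describeRateLimit` are hypothetical names, the intervals would have to come from each service's own documentation, and it is assumed (not confirmed in this thread) that the services signal rate limiting with HTTP 429 and possibly a numeric `Retry-After` header.

```typescript
/** Spaces requests at least minIntervalMs apart; one instance per indexer. */
class MinIntervalLimiter {
  private nextSlot = 0;

  constructor(
    private readonly minIntervalMs: number,
    // Injectable clock so the limiter can be tested without real delays.
    private readonly now: () => number = () => Date.now(),
  ) {}

  /** Reserves the next slot and returns how long the caller should wait (ms). */
  reserve(): number {
    const current = this.now();
    const start = Math.max(current, this.nextSlot);
    this.nextSlot = start + this.minIntervalMs;
    return start - current;
  }
}

/** Builds a user-facing message for a rate-limited response, or null otherwise. */
function describeRateLimit(
  indexerName: string,
  status: number,
  retryAfterHeader: string | null,
): string | null {
  if (status !== 429) return null;
  const seconds = retryAfterHeader === null ? NaN : Number(retryAfterHeader);
  if (isFinite(seconds) && seconds >= 0) {
    return `${indexerName} is rate-limiting requests; try again in ${seconds} s.`;
  }
  return `${indexerName} is rate-limiting requests; please try again later.`;
}
```

A caller would wait for the number of milliseconds returned by `reserve()` before sending each request, and show the string from `describeRateLimit` instead of a generic failure when it is not null.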

name: string;
id: "qid" | "doi" | "omid";
}[] = [
// https://opencitations.net/oci
Collaborator

Hmm, where did the 030 and 050 come from? https://opencitations.net/oci only has 010, 020, 040, and 06[1-9]0? Did the specification change at some point?

In saying that, I don't know if the parseOci function below works if OMIDs can be of arbitrary length? https://registry.identifiers.org/registry/oci

Author

From my understanding, those prefixes come from old code and from a time when the OpenCitations Corpus was still a separate thing. Seems they should no longer be in use though.

Collaborator

Ok cool, that makes sense.
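
On the arbitrary-length question raised in this thread, a parser that splits on the hyphen and matches supplier prefixes, rather than relying on fixed positions, would not care how long each identifier is. The sketch below is not the PR's `parseOci`: the prefix-to-type mapping (010 to "qid", 020 to "doi", 06[1-9]0 to "omid") is an assumption derived from the prefixes and id types mentioned here, both halves are required to share one supplier as a simplification, and the numeric parts are returned undecoded.

```typescript
type OciIdType = "qid" | "doi" | "omid";

interface OciSupplier {
  pattern: RegExp;
  idType: OciIdType;
}

// Assumed mapping; check against https://opencitations.net/oci before relying on it.
const ociSuppliers: OciSupplier[] = [
  { pattern: /^010/, idType: "qid" },
  { pattern: /^020/, idType: "doi" },
  { pattern: /^06[1-9]0/, idType: "omid" },
];

interface ParsedOci {
  citing: string;
  cited: string;
  idType: OciIdType;
}

function parseOciSketch(oci: string): ParsedOci | null {
  // Accept both "oci:<citing>-<cited>" and the bare "<citing>-<cited>" form.
  const body = oci.trim().replace(/^oci:/i, "");
  const parts = body.split("-");
  if (parts.length !== 2) return null;
  const [citing, cited] = parts;
  // Each half must be purely numeric, but may have any length.
  if (!/^\d+$/.test(citing) || !/^\d+$/.test(cited)) return null;
  const supplier = ociSuppliers.find(
    (candidate) => candidate.pattern.test(citing) && candidate.pattern.test(cited),
  );
  if (!supplier) return null;
  return { citing, cited, idType: supplier.idType };
}
```

Unknown prefixes such as the legacy 030 and 050 simply yield `null` here, which leaves the decision about how to treat them to the caller.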

const suppliers: {
prefix: string;
name: string;
id: "qid" | "doi" | "omid";
Collaborator

Maybe these could be capitalised to match PID types? What do you think?

@@ -15,7 +15,7 @@ class Citation {
ocis: {
citingId: string;
citedId: string;
- idType: "qid" | "doi" | "occ";
+ idType: "qid" | "doi" | "omid";
Collaborator

These could maybe be capitalised too?

case "arXiv": {
const field = this.item.getField("archiveID");
if (field && field.startsWith("arXiv:")) {
pid = field;
Collaborator

Because we explicitly call this an arXiv type, maybe we can strip out the arXiv: prefix?

Author

Not really with that field. Zotero (and the arXiv translator) store the arXiv ID in the "archiveID" field and in the Extra field. The "archiveID" field is meant to hold IDs of other resources as well based on the scarce documentation.
In short, in this field, the prefix is required, whereas in the Extra field, "arXiv:" is the name of the field (and therefore not part of the value).

type.toUpperCase(),
),
);
case "OMID": {
Collaborator

All of these switch statements start to make me think it'd be easier to have a PID class with fetching/getting/setting/... methods, similar to how you did for the indexers. Do you think that would be clearer?
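
The idea of replacing the switch statements with a class per identifier type could look roughly like this. It is a sketch with assumed names (`PIDHandler`, `DoiHandler`, `ArxivHandler`, `ItemLike`, `getPIDSketch`, `setPIDSketch`), not a proposal for the final API; the arXiv handler follows the behaviour discussed in this review, where the "archiveID" field keeps its "arXiv:" prefix.

```typescript
/** Minimal stand-in for the item methods used here. */
interface ItemLike {
  getField(name: string): string;
  setField(name: string, value: string): void;
}

abstract class PIDHandler {
  abstract readonly type: string;
  abstract get(item: ItemLike): string | null;
  abstract set(item: ItemLike, value: string): void;
}

class DoiHandler extends PIDHandler {
  readonly type = "DOI";
  get(item: ItemLike): string | null {
    const value = item.getField("DOI").trim();
    return value || null;
  }
  set(item: ItemLike, value: string): void {
    item.setField("DOI", value.trim());
  }
}

class ArxivHandler extends PIDHandler {
  readonly type = "arXiv";
  get(item: ItemLike): string | null {
    // The prefix stays part of the value because archiveID can hold other IDs too.
    const field = item.getField("archiveID");
    return field.startsWith("arXiv:") ? field : null;
  }
  set(item: ItemLike, value: string): void {
    item.setField("archiveID", value.startsWith("arXiv:") ? value : `arXiv:${value}`);
  }
}

// One registry lookup replaces each switch over the PID type.
const allHandlers: PIDHandler[] = [new DoiHandler(), new ArxivHandler()];
const handlersByType = new Map<string, PIDHandler>();
for (const handler of allHandlers) handlersByType.set(handler.type, handler);

function getPIDSketch(item: ItemLike, type: string): string | null {
  const handler = handlersByType.get(type);
  return handler ? handler.get(item) : null;
}

function setPIDSketch(item: ItemLike, type: string, value: string): boolean {
  const handler = handlersByType.get(type);
  if (!handler) return false;
  handler.set(item, value);
  return true;
}
```

Fetching could be added as a further method on the handlers that support it, which would also give the UI a simple way to decide which fetch icons to grey out.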

Crossref.getCitations();
const items = await this.getSelectedItems(menuName, true);
if (items.length) {
new Crossref().addCitationsToItems(items);
Collaborator

Could this be a static method so we don't need to recreate the indexer every time?

Author

That was my first try but from my (very limited and recent) TypeScript understanding, you can't enforce static functions in abstract classes. So if we want static, we lose the abstraction. I might be wrong though!

Collaborator

Yeah, looks like you're right: microsoft/TypeScript#34516

Maybe this would work with an interface instead?
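
One commonly used way to get a checked "static side" with an interface is to describe the static members in an interface and assign the class itself to a variable of that type. The sketch below only illustrates that TypeScript pattern; `IndexerStatic`, `CrossrefStaticSketch` and the simplified `addCitationsToItems` signature (item IDs in, a count out) are assumptions and not the PR's code.

```typescript
/** Describes the members every indexer class must expose statically. */
interface IndexerStatic {
  readonly indexerName: string;
  addCitationsToItems(itemIds: string[]): number;
}

class CrossrefStaticSketch {
  static readonly indexerName = "Crossref";

  static addCitationsToItems(itemIds: string[]): number {
    // A real implementation would fetch and attach citations; this only counts items.
    return itemIds.length;
  }
}

// Compile-time check: this assignment stops type-checking if a static member
// required by IndexerStatic is missing or has the wrong signature.
const crossrefIndexer: IndexerStatic = CrossrefStaticSketch;

// Call sites can then use the static side without creating an instance.
function runIndexer(indexer: IndexerStatic, itemIds: string[]): string {
  const count = indexer.addCitationsToItems(itemIds);
  return `${indexer.indexerName}: processed ${count} item(s)`;
}
```

The trade-off is that shared behaviour can no longer live in inherited instance methods of an abstract base class, so common helpers would need to become plain functions that take an `IndexerStatic`.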


return citations;

// const citations = await Promise.all(
Collaborator

Can we remove this?

Services.prompt.alert(
window as mozIDOMWindowProxy,
Wikicite.formatString(
"wikicite.indexer.get-citations.no-doi-title",
Collaborator

Shouldn't these messages say "No items with a supported identifier provided" instead of "No items with a DOI provided"?

This message could then also include the list of supported identifiers?
