Archives for DOIs are set to the wrong values #1324

sdruskat · 2024-03-13T14:27:04Z

Hi 👋,

Stumbled over two bad software_archives:

For https://joss.theoj.org/papers/10.21105/joss.05883:

It seems it has been inadvertently set to https://joss.theoj.org/papers/10.21105/v1.0.7 after it was previously correctly set to https://doi.org/10.5281/zenodo.10162614.

For https://joss.theoj.org/papers/10.21105/joss.04684:

The Zenodo DOI has been truncated to https://doi.org/10.5281/zenodo.714. The correct one is https://doi.org/10.5281/zenodo.7143971.

Perhaps this can be fixed in the metadata?
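
A minimal sketch of a shape check that would flag a value like `v1.0.7` before it reaches the metadata. The function name `archive_problems` and the DOI pattern are assumptions for illustration, not part of the JOSS tooling. A shape check cannot detect a truncated but well-formed DOI such as `10.5281/zenodo.714`; that still requires resolving the DOI.

```python
import re

# Pattern for the common modern DOI shape; an assumption for this sketch,
# not something the JOSS tooling is known to use.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$")
# Resolver prefixes that may precede the bare DOI.
DOI_URL_PREFIX = re.compile(r"^https?://(?:dx\.)?doi\.org/", re.IGNORECASE)


def archive_problems(value: str) -> list[str]:
    """Return the shape problems found in a software_archive value (hypothetical helper)."""
    problems = []
    doi = DOI_URL_PREFIX.sub("", value.strip())
    if doi.endswith("."):
        # A trailing period usually comes from copying the DOI out of a sentence.
        problems.append("trailing period")
        doi = doi.rstrip(".")
    if not DOI_PATTERN.match(doi):
        problems.append("not a DOI")
    return problems
```

With this sketch, `archive_problems("https://joss.theoj.org/papers/10.21105/v1.0.7")` reports `"not a DOI"`, while the truncated `https://doi.org/10.5281/zenodo.714` passes, because it is still a syntactically valid DOI.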

sdruskat · 2024-03-21T19:51:02Z

Also, please let me know if

- there is a better place or way for reporting these things, or
- there is a place I could fix this myself and put up a PR (I'll also have a peep at the infra docs).

- Fixes one of two wrong archive URLs in the Crossref XML - Issue: openjournals/joss#1324

- Fixes another of two wrong archive URLs in the Crossref XML - Issue: openjournals/joss#1324

sdruskat · 2024-03-21T20:13:46Z

See above, put up openjournals/joss-papers#5158 as the only place I could find the offending strings in the GH org :).

sdruskat · 2024-03-22T09:13:17Z

Found another one (reported in openjournals/buffy#103 (comment)):

JOSS paper: https://joss.theoj.org/papers/10.21105/joss.00041
Existing software_archive: https://doi.org/10.5281/zenodo.23671 (a French paper about a bee from 1910)
Correct DOI: https://doi.org/10.5281/zenodo.61965

- Fixes a wrong archive URLs in the Crossref XML that points to a French paper about a species of 🐝 from 1910 - Issue: openjournals/joss#1324

sdruskat · 2024-03-22T09:19:05Z

> Found another one (reported in openjournals/buffy#103 (comment)):

Fixed in openjournals/joss-papers@2232b31

xuanxu · 2024-03-22T09:23:53Z

@sdruskat Thanks for reporting this!
The easy way to correct the wrong values is for an EiC to regenerate the PDF and the metadata files. That can be done by reaccepting the paper, so the best place to report these things is the review issue of the affected papers. I'll ping the EiC for these two cases.

sdruskat · 2024-03-22T09:26:51Z

> the best place to report these things is the review issue of the affected papers.

Thanks for the pointer @xuanxu! I'll do this for future issues. 👍

> I'll ping the EiC for these two cases.

Three cases with the 🐝, and thanks!

sneakers-the-rat · 2024-03-22T22:20:49Z

Continuing this, I just went and checked all the archive links in joss-papers, and these are the papers that have problems:

| file | archive | status |
| --- | --- | --- |
| 10.21105.joss.00040.crossref.xml | http://dx.doi.org/10.5281/zenodo.59387 | 500 |
| 10.21105.joss.00612.crossref.xml | | (Missing link) |
| 10.21105.joss.00971.crossref.xml | | (Missing link) |
| 10.21105.joss.02314.crossref.xml | https://doi.org/10.5281/zenodo.3877690 | 500 |
| 10.21105.joss.04439.crossref.xml | https://dx.doi.org/10.5281/zenodo.6767313. | 404 |
| 10.21105.joss.04591.crossref.xml | https://dx.doi.org/v1.1.0 | 404 |
| 10.21105.joss.04684.crossref.xml | https://dx.doi.org/10.5281/zenodo.714 | 404 |
| 10.21105.joss.05395.crossref.xml | https://dx.doi.org/10.5281/zenodo.10050346 | 410 |
| 10.21105.joss.05883.crossref.xml | https://dx.doi.org/v1.0.7 | 404 |

not bad overall :)

here's the script (nothing special, just a one-off thing):

"""
Check whether the archive DOI for each paper resolves to a page.

run this from within the joss-papers directory.

because of the handling of ratelimiting, you'll have to run this a few times
until you no longer skip for ratelimits.

generates 
- `joss_archive_links.csv` - see `Results` for columns
- `joss_archive_links_clean.csv` - see `clean_csv`
- `joss_doi_pages` - xz compressed cache of the resolved archive pages

requires:
- requests
- tqdm
- pandas
"""

import csv
from xml.etree import ElementTree
from pathlib import Path
from dataclasses import dataclass, fields, asdict
from typing import Optional
import lzma
from multiprocessing import Pool, Lock, Event
from time import sleep, time
from math import ceil

import requests
from tqdm import tqdm
import pandas as pd

data_file = Path('joss_archive_links.csv')
cache_dir = Path('joss_doi_pages')
NAMESPACES = {
    'rel': "http://www.crossref.org/relations.xsd"
}

@dataclass
class Results:
    file: str
    archive: Optional[str] = None
    valid: bool = False
    status: Optional[int] = None
    error: Optional[str] = None
    retry_after: Optional[float] = None


def process_paper(path:Path) -> Optional[Results]:
    out_file = cache_dir / path.with_suffix('.html.xz').name
    if out_file.exists():
        return
    paper = ElementTree.parse(path).getroot()
    
    res = {}
    res['file'] = path.name
    try:
        archive = paper.find(".//rel:inter_work_relation[@relationship-type='references']", NAMESPACES).text
        archive = archive.lstrip('“').rstrip('”')
        if not archive.startswith('http'):
            archive = 'https://dx.doi.org/' + archive
        res['archive'] = archive

        # hold if we are currently in a ratelimit cooldown.
        lock.wait()
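        # a timeout keeps one unresponsive host from stalling a worker forever;
        # a timeout error is recorded in `error` by the except block below
        req = requests.get(res['archive'], timeout=30)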
        res['status'] = req.status_code

        match res['status']:
            case 429:
                res['retry_after'] = float(req.headers['x-ratelimit-reset'])
            case 200:
                res['valid'] = True

        if res['status'] != 429:
            with lzma.open(out_file, 'w') as cache_file:
                cache_file.write(req.content)
        
    except Exception as e:
        res['error'] = str(e)

    return Results(**res)

def init_lock(l):
    """make a lock (now an event) available as a global across processes in a pool"""
    global lock
    lock = l

def wait(lock:Event, result:Results, message:tqdm):
    """if we get a 429, acquire the lock until we can start again"""
    lock.clear()
    wait_time = ceil(result.retry_after - time())
    message.reset(wait_time)
    for i in range(int(wait_time)):
        sleep(1)
        message.update()
    lock.set()


def main():
    rate_lock = Event()
    rate_lock.set()

    cache_dir.mkdir(exist_ok=True)
    # ya i know i ruin the generator but i like progress bars with totals
    files = list(Path('.').glob("joss*/*crossref.xml"))

    try:
        all_pbar = tqdm(total=len(files), position=0)
        good = tqdm(position=1)
        bad = tqdm(position=2)
        message = tqdm(position=3)
        pool = Pool(16, initializer=init_lock, initargs=(rate_lock,))

        if not data_file.exists():
            with open(data_file, 'w', newline='') as dfile:
                writer = csv.DictWriter(dfile, [field.name for field in fields(Results)])
                writer.writeheader()

        with open(data_file, 'a', newline='') as dfile:
            writer = csv.DictWriter(dfile, [field.name for field in fields(Results)])

            for result in pool.imap_unordered(process_paper, files):
                all_pbar.update()

                if result is None:
                    continue

                if result.retry_after:
                    wait(rate_lock, result, message)

                if result.valid:
                    good.update()
                else:
                    bad.update()
                writer.writerow(asdict(result))

    finally:
        all_pbar.close()
        good.close()
        bad.close()
        message.close()

    clean_csv()

def clean_csv(path:Path = data_file):
    """
    - remove 429s
    - deduplicate rows (if identical)
    - sorts by `valid` and then `file`
    """
    df = pd.read_csv(path)
    df = df.loc[df['status'] != 429]
    df = df.drop_duplicates()
    df = df.sort_values(['valid', 'file'], ignore_index=True)

    out_fn = (path.parent / (path.stem + '_clean')).with_suffix('.csv')
    df.to_csv(out_fn, index=False)


if __name__ == "__main__":
    main()
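
The lookup that `process_paper` performs can be tried offline against a small fragment, which helps when checking why a file ends up with a missing link. This is a sketch under assumptions: the `SAMPLE` XML is hand-written and heavily simplified rather than a real JOSS deposit, and `archive_url` is a hypothetical helper that returns `None` for a missing element instead of relying on the caught exception as the script does.

```python
from typing import Optional
from xml.etree import ElementTree

NAMESPACES = {"rel": "http://www.crossref.org/relations.xsd"}

# Hand-written, simplified fragment for illustration; real deposits contain much more.
SAMPLE = """\
<doi_data xmlns:rel="http://www.crossref.org/relations.xsd">
  <rel:program>
    <rel:related_item>
      <rel:inter_work_relation relationship-type="references" identifier-type="doi">10.5281/zenodo.7143971</rel:inter_work_relation>
    </rel:related_item>
  </rel:program>
</doi_data>
"""


def archive_url(xml_text: str) -> Optional[str]:
    """Return the archive link as a URL, or None if the element is absent or empty."""
    root = ElementTree.fromstring(xml_text)
    node = root.find(
        ".//rel:inter_work_relation[@relationship-type='references']", NAMESPACES
    )
    if node is None or not (node.text or "").strip():
        return None
    # Same clean-up as the script: drop curly quotes, then prefix bare DOIs.
    archive = node.text.strip().lstrip("“").rstrip("”")
    if not archive.startswith("http"):
        archive = "https://dx.doi.org/" + archive
    return archive
```

Calling `archive_url(SAMPLE)` yields the resolver URL for the sample DOI, and a document without the element yields `None`, which corresponds to the "(Missing link)" rows.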

xuanxu · 2024-04-11T08:05:25Z

Closing this issue as PDFs and metadata on the three papers have been corrected and re-deposited.

arfon · 2024-04-12T07:51:27Z

Quick update:

10.21105.joss.05883.crossref.xml

Now fixed.

10.21105.joss.05395.crossref.xml

Is a 410 which looks to be some kind of "User was blocked" thing. I'm not sure what to do about this one.

10.21105.joss.04684.crossref.xml

Looks like there was some kind of error with the reaccept compilation here. @xuanxu – any ideas what is going on there?

10.21105.joss.04591.crossref.xml

Now fixed.

10.21105.joss.04439.crossref.xml

Looks like the paper is missing. I've asked the author to re-add it: openjournals/joss-reviews#4439 (comment)

10.21105.joss.02314.crossref.xml

Seems to resolve for me now?

10.21105.joss.00971.crossref.xml

Looks like it's missing from the PDF and the XML files? This probably needs manual handling.

10.21105.joss.00612.crossref.xml

Same issue as 10.21105.joss.00971.crossref.xml. It's missing from the paper and the Crossref XML but the DOI is resolving.

10.21105.joss.00040.crossref.xml

I think we should report this to Zenodo as an issue.

sdruskat changed the title ~~Archive for DOI is set to the wrong value~~ Archives for DOIs are set to the wrong values Mar 21, 2024

sdruskat added a commit to sdruskat/joss-papers that referenced this issue Mar 21, 2024

openjournals#1324: Fix bad archive URLs (1/2)

5738d95

sdruskat added a commit to sdruskat/joss-papers that referenced this issue Mar 21, 2024

openjournals#1324: Fix bad archive URLs (2/2)

c2a3e9a

sdruskat mentioned this issue Mar 21, 2024

#1324: Fix archive DOIs openjournals/joss-papers#5158

Closed

sneakers-the-rat mentioned this issue Mar 22, 2024

Archive DOI not validated openjournals/buffy#103

Open

sdruskat added a commit to sdruskat/joss-papers that referenced this issue Mar 22, 2024

openjournals#1324: Fix bad archive URLs (3/2)

2232b31

xuanxu mentioned this issue Mar 22, 2024

[REVIEW]: PsychoPhysioPipeline openjournals/joss-reviews#41

Closed

16 tasks

xuanxu closed this as completed Apr 11, 2024
