Archives for DOIs are set to the wrong values #1324
Comments
Also, please let me know if
- Fixes one of two wrong archive URLs in the Crossref XML - Issue: openjournals/joss#1324
- Fixes another of two wrong archive URLs in the Crossref XML - Issue: openjournals/joss#1324

See above; I put up openjournals/joss-papers#5158, as that was the only place I could find the offending strings in the GH org :).
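For context, the archive DOI lives in the Crossref deposit as an `inter_work_relation` element. A minimal sketch of pulling it out with the stdlib; the XML fragment here is illustrative only (a hypothetical, simplified deposit shape), not copied from a real JOSS file:

```python
from xml.etree import ElementTree

# Illustrative fragment roughly in the shape of a JOSS Crossref deposit;
# the real files live under joss-papers as *.crossref.xml.
xml = """
<doi_batch xmlns:rel="http://www.crossref.org/relations.xsd">
  <rel:program>
    <rel:related_item>
      <rel:inter_work_relation relationship-type="references"
          identifier-type="doi">10.5281/zenodo.7143971</rel:inter_work_relation>
    </rel:related_item>
  </rel:program>
</doi_batch>
"""

root = ElementTree.fromstring(xml)
ns = {'rel': "http://www.crossref.org/relations.xsd"}

# ElementTree's limited XPath supports attribute-value predicates,
# so the archive relation can be selected directly.
archive = root.find(
    ".//rel:inter_work_relation[@relationship-type='references']", ns).text
print(archive)  # 10.5281/zenodo.7143971
```

A bad `software_archive` shows up here as the wrong text inside that element, which is why fixing it means editing the deposited XML and re-depositing.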
Found another one (reported in openjournals/buffy#103 (comment)):
- Fixes a wrong archive URL in the Crossref XML that points to a French paper about a species of 🐝 from 1910 - Issue: openjournals/joss#1324
Fixed in openjournals/joss-papers@2232b31
@sdruskat Thanks for reporting this!
Thanks for the pointer @xuanxu! I'll do this for future issues. 👍
Three cases with the 🐝, and thanks!
Continuing this, I just went and checked all the archive links in joss-papers.

Not bad overall :) here's the script (nothing special, just a one-off thing):

```python
"""
Check whether the archive DOI for each paper resolves to a page.

Run this from within the joss-papers directory.

Because of the ratelimit handling, you'll have to run this a few times
until you no longer skip any papers due to ratelimits.

Generates:

- `joss_archive_links.csv` - see `Results` for columns
- `joss_archive_links_clean.csv` - see `clean_csv`
- `joss_doi_pages` - xz-compressed cache of the resolved archive pages

Requires:

- requests
- tqdm
- pandas
"""
import csv
import lzma
from dataclasses import asdict, dataclass, fields
from math import ceil
from multiprocessing import Event, Pool
from pathlib import Path
from time import sleep, time
from typing import Optional
from xml.etree import ElementTree

import pandas as pd
import requests
from tqdm import tqdm

data_file = Path('joss_archive_links.csv')
cache_dir = Path('joss_doi_pages')

NAMESPACES = {
    'rel': "http://www.crossref.org/relations.xsd"
}


@dataclass
class Results:
    file: str
    archive: Optional[str] = None
    valid: bool = False
    status: Optional[int] = None
    error: Optional[str] = None
    retry_after: Optional[float] = None


def process_paper(path: Path) -> Optional[Results]:
    out_file = cache_dir / path.with_suffix('.html.xz').name
    if out_file.exists():
        # already resolved and cached on a previous run
        return None
    paper = ElementTree.parse(path).getroot()
    res = {'file': path.name}
    try:
        archive = paper.find(
            ".//rel:inter_work_relation[@relationship-type='references']",
            NAMESPACES,
        ).text
        # some deposits have typographic quotes around the DOI
        archive = archive.lstrip('“').rstrip('”')
        if not archive.startswith('http'):
            archive = 'https://dx.doi.org/' + archive
        res['archive'] = archive
        # hold if we are currently in a ratelimit cooldown
        lock.wait()
        req = requests.get(res['archive'])
        res['status'] = req.status_code
        match res['status']:
            case 429:
                res['retry_after'] = float(req.headers['x-ratelimit-reset'])
            case 200:
                res['valid'] = True
        if res['status'] != 429:
            with lzma.open(out_file, 'w') as cache_file:
                cache_file.write(req.content)
    except Exception as e:
        res['error'] = str(e)
    return Results(**res)


def init_lock(l):
    """make the rate lock (now an event) available as a global across processes in a pool"""
    global lock
    lock = l


def wait(lock, result: Results, message: tqdm):
    """if we get a 429, clear the event until we can start again"""
    lock.clear()
    wait_time = ceil(result.retry_after - time())
    message.reset(wait_time)
    for _ in range(wait_time):
        sleep(1)
        message.update()
    lock.set()


def main():
    rate_lock = Event()
    rate_lock.set()
    cache_dir.mkdir(exist_ok=True)
    # ya i know i ruin the generator but i like progress bars with totals
    files = list(Path('.').glob("joss*/*crossref.xml"))
    try:
        all_pbar = tqdm(total=len(files), position=0)
        good = tqdm(position=1)
        bad = tqdm(position=2)
        message = tqdm(position=3)
        pool = Pool(16, initializer=init_lock, initargs=(rate_lock,))
        if not data_file.exists():
            with open(data_file, 'w', newline='') as dfile:
                writer = csv.DictWriter(dfile, [field.name for field in fields(Results)])
                writer.writeheader()
        with open(data_file, 'a', newline='') as dfile:
            writer = csv.DictWriter(dfile, [field.name for field in fields(Results)])
            for result in pool.imap_unordered(process_paper, files):
                all_pbar.update()
                if result is None:
                    continue
                if result.retry_after:
                    wait(rate_lock, result, message)
                if result.valid:
                    good.update()
                else:
                    bad.update()
                writer.writerow(asdict(result))
    finally:
        all_pbar.close()
        good.close()
        bad.close()
        clean_csv()


def clean_csv(path: Path = data_file):
    """
    - removes 429s
    - deduplicates rows (if identical)
    - sorts by `valid` and then `file`
    """
    df = pd.read_csv(path)
    df = df.loc[df['status'] != 429]
    df = df.drop_duplicates()
    df = df.sort_values(['valid', 'file'], ignore_index=True)
    out_fn = (path.parent / (path.stem + '_clean')).with_suffix('.csv')
    df.to_csv(out_fn, index=False)


if __name__ == "__main__":
    main()
```
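As a hypothetical follow-up, the cleaned CSV can be skimmed for the rows that need attention with pandas. The sample data below is made up for illustration; real output comes from `joss_archive_links_clean.csv`, whose columns match the `Results` dataclass above:

```python
from io import StringIO

import pandas as pd

# Made-up sample rows in the shape clean_csv() writes.
sample = StringIO(
    "file,archive,valid,status,error,retry_after\n"
    "10.21105.joss.00001.crossref.xml,https://doi.org/10.5281/zenodo.1,False,404,,\n"
    "10.21105.joss.00002.crossref.xml,https://doi.org/10.5281/zenodo.2,True,200,,\n"
)
df = pd.read_csv(sample)

# Only the rows whose archive DOI did not resolve need manual handling.
bad = df.loc[~df['valid'], ['file', 'archive', 'status']]
print(bad)
```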
Closing this issue as PDFs and metadata on the three papers have been corrected and re-deposited.
Quick update:

- Now fixed.
- Is a
- Looks like there was some kind of error with the
- Now fixed.
- Looks like the paper is missing. I've asked the author to re-add it: openjournals/joss-reviews#4439 (comment)
- Seems to resolve for me now?
- Looks like it's missing from the PDF and the XML files? This probably needs manual handling.
- Same issue as
- I think we should report this to Zenodo as an issue.
Hi 👋,
Stumbled over two bad `software_archive`s:

- It seems it has been inadvertently set to https://joss.theoj.org/papers/10.21105/v1.0.7 after it was previously correctly set to https://doi.org/10.5281/zenodo.10162614.
- The Zenodo DOI has been truncated to https://doi.org/10.5281/zenodo.714. The correct one is https://doi.org/10.5281/zenodo.7143971.
Perhaps this can be fixed in the metadata?
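A side note on why truncation is easy to miss: a truncated Zenodo DOI is still a syntactically plausible DOI, so a format check alone cannot catch it; the link actually has to be resolved over HTTP. A minimal sketch of that distinction, using a simplified DOI pattern (my own approximation, not an official validator):

```python
import re

# Rough DOI shape: "10." + registrant code, a slash, then a suffix.
# This is an approximation for illustration, not an official validator.
DOI_RE = re.compile(r'^10\.\d{4,9}/\S+$')

truncated = "10.5281/zenodo.714"    # the bad, truncated DOI from this issue
correct = "10.5281/zenodo.7143971"  # the intended DOI

# Both pass the format check: truncation cannot be caught syntactically,
# which is why the script above resolves each archive link instead.
print(bool(DOI_RE.match(truncated)), bool(DOI_RE.match(correct)))  # True True
```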