init
YaoxiangLi committed Oct 3, 2024
0 parents commit 0eff993
Showing 80 changed files with 3,788 additions and 0 deletions.
25 changes: 25 additions & 0 deletions .Rbuildignore
@@ -0,0 +1,25 @@
^medrxivr\.Rproj$
^\.Rproj\.user$
^LICENSE\.md$
^README\.Rmd$
^\.travis\.yml$
^appveyor\.yml$
^data-raw$
^codecov\.yml$
pdf/
^_pkgdown\.yml$
^docs$
^pkgdown$
^CODE_OF_CONDUCT\.md$
data-extraction/
.github/
ropensci.*
figs/
.here
^\.github$
^vignettes/medrxiv-api.Rmd
^vignettes/building-complex-search-strategies.Rmd
^cran-comments\.md$
medrxiv_export.bib
^codemeta\.json$
^CRAN-RELEASE$
1 change: 1 addition & 0 deletions .github/.gitignore
@@ -0,0 +1 @@
*.html
32 changes: 32 additions & 0 deletions .github/CONTRIBUTING.md
@@ -0,0 +1,32 @@
# Contributing

Development is a community effort, and we welcome participation.

## Code of Conduct

Please note that this package is released with a [Contributor Code of Conduct](https://ropensci.org/code-of-conduct/).

## Issues

<https://github.com/ropensci/medrxivr/issues> is for maintenance tasks and feature requests. When you post, please abide by the same guidelines as <https://books.ropensci.org/targets/help.html>.

## Development

External code contributions are extremely helpful in the right circumstances. Here are the recommended steps.

1. Prior to contribution, please propose your idea in a discussion thread so you and the maintainer can define the intent and scope of your work.
2. [Fork the repository](https://help.github.com/articles/fork-a-repo/).
3. Follow the [GitHub flow](https://guides.github.com/introduction/flow/index.html) to create a new branch, add commits, and open a pull request.
4. Discuss your code with the maintainer in the pull request thread.
5. If everything looks good, the maintainer will merge your code into the project.

Please also follow these additional guidelines.

* Respect the architecture and reasoning of the package. Depending on the scope of your work, you may want to read the design documents (package vignettes).
* If possible, keep contributions small enough to easily review manually. It is okay to split up your work into multiple pull requests.
* Format your code according to the [tidyverse style guide](https://style.tidyverse.org/) and check your formatting with the `lint_package()` function from the [`lintr`](https://github.com/jimhester/lintr) package.
* For new features or functionality, add tests in `tests`. Tests that can be automated should go in `tests/testthat/`. Tests that cannot be automated should go in `tests/interactive/`. For features affecting performance, it is good practice to add profiling studies to `tests/performance/`.
* Check code coverage with `covr::package_coverage()`. Automated tests should cover all the new or changed functionality in your pull request.
* Run overall package checks with `devtools::check()` and `goodpractice::gp()`.
* Describe your contribution in the project's [`NEWS.md`](https://github.com/ropensci/targets/blob/main/NEWS.md) file. Be sure to mention relevant GitHub issue numbers and your GitHub name as done in existing news entries.
* If you feel your contribution is substantial enough for official author or contributor status, please add yourself to the `Authors@R` field of the [`DESCRIPTION`](https://github.com/ropensci/targets/blob/main/DESCRIPTION) file.
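
The checks named in the guidelines above could be run together from the package root with something like the following. This is only a sketch: `run_contribution_checks()` is a hypothetical wrapper, not a function shipped with the package, and it assumes `lintr`, `covr`, `devtools` and `goodpractice` are installed.

```r
# Hypothetical wrapper (not part of medrxivr): runs the checks listed in the
# contributing guidelines and returns their results as a named list.
run_contribution_checks <- function(pkg = ".") {
  tools <- c("lintr", "covr", "devtools", "goodpractice")
  # Find any tool that is not installed before starting the long-running checks
  missing <- tools[!vapply(tools, requireNamespace, logical(1), quietly = TRUE)]
  if (length(missing) > 0) {
    stop("Install these packages first: ", paste(missing, collapse = ", "))
  }
  list(
    lint     = lintr::lint_package(pkg),      # tidyverse style checks
    coverage = covr::package_coverage(pkg),   # test coverage of the package
    check    = devtools::check(pkg),          # R CMD check
    gp       = goodpractice::gp(pkg)          # general good-practice advice
  )
}

# Usage (from the package root; slow, so call it deliberately):
# results <- run_contribution_checks()
```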
10 changes: 10 additions & 0 deletions .github/issue_template.md
@@ -0,0 +1,10 @@
<!-- IF THIS INVOLVES AUTHENTICATION: DO NOT SHARE YOUR USERNAME/PASSWORD, OR API KEYS/TOKENS IN THIS ISSUE - MOST LIKELY THE MAINTAINER WILL HAVE THEIR OWN EQUIVALENT KEY -->

<!-- If this issue relates to usage of the package, whether a question, bug or similar, along with your query, please paste your devtools::session_info() or sessionInfo() into the code block below, AND include a reproducible example (consider using a "reprex" https://cran.rstudio.com/web/packages/reprex/) If not, delete all this and proceed :) -->

<details> <summary><strong>Session Info</strong></summary>

```r

```
</details>
20 changes: 20 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,20 @@
<!-- IF THIS INVOLVES AUTHENTICATION: DO NOT SHARE YOUR USERNAME/PASSWORD, OR API KEYS/TOKENS IN THIS ISSUE - MOST LIKELY THE MAINTAINER WILL HAVE THEIR OWN EQUIVALENT KEY -->

<!-- If you've updated a file in the man-roxygen directory, make sure to update the man/ files by running devtools::document() or similar as .Rd files should be affected by your change -->

<!--- Provide a general summary of your changes in the Title above -->

## Description
<!--- Describe your changes in detail -->

## Related Issue
<!--- if this closes an issue, make sure to include e.g., "fix #4"
or similar - or if it just relates to an issue, make sure to mention
it like "#4" -->

## Example
<!--- if introducing a new feature or changing behavior of existing
methods/functions, include an example if possible to do in brief form -->

<!--- Did you remember to include tests? Unless you're just changing
grammar, please include new tests for your change -->
84 changes: 84 additions & 0 deletions .github/workflows/R-CMD-check.yaml
@@ -0,0 +1,84 @@
# For help debugging build failures open an issue on the RStudio community with the 'github-actions' tag.
# https://community.rstudio.com/new-topic?category=Package%20development&tags=github-actions
on:
  push:
    branches:
      - master
  pull_request:
    branches:
      - master

name: R-CMD-check

jobs:
  R-CMD-check:
    runs-on: ${{ matrix.config.os }}
    name: ${{ matrix.config.os }} (${{ matrix.config.r }})

    strategy:
      fail-fast: false
      matrix:
        config:
          - {os: windows-latest, r: 'release'}
          - {os: macOS-latest, r: 'release'}
          - {os: ubuntu-20.04, r: 'release', rspm: "https://packagemanager.rstudio.com/cran/__linux__/focal/latest"}
          - {os: ubuntu-20.04, r: 'devel', rspm: "https://packagemanager.rstudio.com/cran/__linux__/focal/latest"}

    env:
      R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
      RSPM: ${{ matrix.config.rspm }}

    steps:
      - uses: actions/checkout@v4

      - uses: r-lib/actions/setup-r@v2
        with:
          r-version: ${{ matrix.config.r }}

      - uses: r-lib/actions/setup-pandoc@v2

      - name: Install system dependencies (Linux)
        if: runner.os == 'Linux'
        run: |
          sudo apt-get update
          sudo apt-get install -y libcurl4-openssl-dev libssl-dev

      - name: Query dependencies
        run: |
          install.packages('remotes')
          saveRDS(remotes::dev_package_deps(dependencies = TRUE), ".github/depends.Rds", version = 2)
          writeLines(sprintf("R-%i.%i", getRversion()$major, getRversion()$minor), ".github/R-version")
        shell: Rscript {0}

      - name: Cache R packages (Linux and macOS)
        if: runner.os != 'Windows'
        uses: actions/cache@v2
        with:
          path: ${{ env.R_LIBS_USER }}
          key: ${{ runner.os }}-${{ hashFiles('.github/R-version') }}-1-${{ hashFiles('.github/depends.Rds') }}
          restore-keys: ${{ runner.os }}-${{ hashFiles('.github/R-version') }}-1-

      - name: Install dependencies
        run: |
          remotes::install_deps(dependencies = TRUE)
          remotes::install_cran("rcmdcheck")
        shell: Rscript {0}

      - name: Install missing packages on Windows
        if: runner.os == 'Windows'
        run: |
          install.packages(c('curl', 'openssl', 'testthat', 'hms', 'dplyr', 'rlang', 'tibble', 'progress', 'remotes', 'rcmdcheck'))
        shell: Rscript {0}

      - name: Check
        env:
          _R_CHECK_CRAN_INCOMING_REMOTE_: false
        run: rcmdcheck::rcmdcheck(args = c("--no-manual", "--as-cran"), error_on = "warning", check_dir = "check")
        shell: Rscript {0}

      - name: Upload check results
        if: failure()
        uses: actions/upload-artifact@main
        with:
          name: ${{ runner.os }}-r${{ matrix.config.r }}-results
          path: check
68 changes: 68 additions & 0 deletions .github/workflows/test-coverage.yaml
@@ -0,0 +1,68 @@
on:
  push:
    branches:
      - master
  pull_request:
    branches:
      - master

name: test-coverage

jobs:
  test-coverage:
    runs-on: ubuntu-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v4

      - uses: r-lib/actions/setup-r@v2

      - uses: r-lib/actions/setup-pandoc@v2

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libcurl4-openssl-dev libssl-dev

      - name: Query dependencies
        run: |
          install.packages('remotes')
          saveRDS(remotes::dev_package_deps(dependencies = TRUE), ".github/depends.Rds", version = 2)
          writeLines(sprintf("R-%i.%i", getRversion()$major, getRversion()$minor), ".github/R-version")
        shell: Rscript {0}

      - name: Cache R packages
        uses: actions/cache@v3
        with:
          path: ${{ env.R_LIBS_USER }}
          key: ${{ runner.os }}-R-${{ hashFiles('.github/R-version') }}-1-${{ hashFiles('.github/depends.Rds') }}
          restore-keys: ${{ runner.os }}-R-${{ hashFiles('.github/R-version') }}-1-

      - name: Install dependencies
        run: |
          install.packages(c("remotes", "covr"))
          remotes::install_deps(dependencies = TRUE)
        shell: Rscript {0}

      - name: Check for missing packages
        run: |
          missing_pkgs <- setdiff(c("covr", "testthat"), rownames(installed.packages()))
          if (length(missing_pkgs) > 0) {
            cat("Missing packages:", paste(missing_pkgs, collapse = ", "), "\n")
            install.packages(missing_pkgs)
          } else {
            cat("All required packages are installed.\n")
          }
        shell: Rscript {0}

      - name: Run tests
        run: |
          testthat::test_local()
        shell: Rscript {0}

      - name: Test coverage
        run: covr::codecov()
        shell: Rscript {0}
        env:
          GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
10 changes: 10 additions & 0 deletions .gitignore
@@ -0,0 +1,10 @@
.Rproj.user
.Rhistory
.RData
inst/doc
pdf/
docs/
extract/keys
ropensci.*
inst/paper.html
medrxiv_export.bib
60 changes: 60 additions & 0 deletions DESCRIPTION
@@ -0,0 +1,60 @@
Package: medrxivr
Title: Access and Search MedRxiv and BioRxiv Preprint Data
Version: 0.1.0
Authors@R: c(
    person("Yaoxiang", "Li",
           role = c("aut", "cre"),
           email = "[email protected]",
           comment = c(ORCID="0000-0001-9200-1016")),
    person("Luke", "McGuinness",
           role = c("aut")),
    person("Lena", "Schmidt",
           role = "aut"),
    person("Tuija", "Sonkkila",
           role = "rev"),
    person("Najko", "Jahn",
           role = "rev"))
Description: An increasingly important source of health-related bibliographic
    content are preprints - preliminary versions of research articles that have
    yet to undergo peer review. The two preprint repositories most relevant to
    health-related sciences are medRxiv <https://www.medrxiv.org/> and
    bioRxiv <https://www.biorxiv.org/>, both of which are operated by the Cold
    Spring Harbor Laboratory. 'medrxivr' provides programmatic access to the
    'Cold Spring Harbor Laboratory (CSHL)' API <https://api.biorxiv.org/>,
    allowing users to easily download medRxiv and bioRxiv preprint metadata
    (e.g. title, abstract, publication date, author list, etc) into R.
    'medrxivr' also provides functions to search the downloaded preprint records
    using regular expressions and Boolean logic, as well as helper functions
    that allow users to export their search results to a .BIB file for easy
    import to a reference manager and to download the full-text PDFs of
    preprints matching their search criteria.
License: GPL-2
Encoding: UTF-8
LazyData: true
Language: en-US
URL: https://github.com/ropensci/medrxivr
BugReports: https://github.com/ropensci/medrxivr/issues
Imports:
    methods,
    dplyr,
    curl,
    jsonlite,
    httr,
    stringr,
    rlang,
    bib2df,
    tibble,
    progress,
    lubridate,
    purrr,
    data.table
Suggests:
    testthat (>= 2.1.0),
    knitr,
    rmarkdown,
    covr,
    kableExtra,
    spelling
VignetteBuilder:
    knitr
RoxygenNote: 7.3.2
16 changes: 16 additions & 0 deletions NAMESPACE
@@ -0,0 +1,16 @@
# Generated by roxygen2: do not edit by hand

export(mx_api_content)
export(mx_api_doi)
export(mx_caps)
export(mx_crosscheck)
export(mx_download)
export(mx_export)
export(mx_search)
export(mx_snapshot)
importFrom(dplyr,"%>%")
importFrom(methods,is)
importFrom(rlang,.data)
importFrom(stats,runif)
importFrom(utils,download.file)
importFrom(utils,read.csv)
62 changes: 62 additions & 0 deletions NEWS.md
@@ -0,0 +1,62 @@
# medrxivr (development version)

* `data.table::fread()` is now used in place of `vroom::vroom()` to import the snapshot.

# medrxivr 0.0.5

Major changes:

* Improved error handling to address a common bug that causes extraction from the API to fail. The "total number" of records element of the API metadata is frequently artificially inflated. This leads to an overestimation of the number of pages of records, which in turn caused the extraction function to fail at the very end when `mx_api_content()` encountered an empty page. This error has been replaced with informative messaging about the expected (as per the metadata) and actual (`nrow()` of the returned dataset) number of retrievable records.
* New functionality added to `mx_search()` allows users to view the number of "hits" (records returned) for each individual element of the search. An extra parameter called `report` has been added, which gives the user the option to switch this functionality on or off. The default value for this parameter is set to FALSE. This functionality was added by [James O'Hare](https://github.com/jamesohare1) in response to [Issue #13.](https://github.com/ropensci/medrxivr/issues/13)
* Users can now pass a vector of terms to the `NOT` parameter rather than a single exclusion term.
* New functionality to allow for user-friendly search operators, including wildcards ("randomi*ation" will now find "randomisation" and "randomization") and the NEAR operator ("systematic NEAR1 review" will find "systematic review" and "systematic _<any-other-word>_ review").
* A new argument, `auto_caps`, in the `mx_search()` function to allow for automatic capitalisation of the first character in each search term (e.g. with `auto_caps = TRUE`, "dementia" will be automatically converted to "[Dd]ementia" which will find "**d**ementia" and also "**D**ementia"). This replaces the recommendation that users capitalise the first character themselves using square brackets. However, if user-defined alternatives are already in place for the first character of the search term, then these are left untouched.
* A helper function, `mx_caps()`, allows users to wrap search terms to find all possible combinations of upper- and lower-case letters in that term. For example, `mx_caps("ncov")` converts the term to "[Nn][Cc][Oo][Vv]" which will find "NCOV", "Ncov", "nCOV", "nCoV", etc.
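
The two capitalisation conversions described above can be pictured with a small base R sketch. The names `caps_sketch()` and `auto_caps_sketch()` are hypothetical and this is not the package's implementation; the functions only reproduce the conversions quoted in the entries above.

```r
# Hypothetical stand-in for the mx_caps() idea: every letter becomes a
# two-character class matching its upper- and lower-case forms.
caps_sketch <- function(term) {
  chars <- strsplit(term, "")[[1]]
  converted <- vapply(chars, function(ch) {
    if (grepl("[A-Za-z]", ch)) paste0("[", toupper(ch), tolower(ch), "]") else ch
  }, character(1), USE.NAMES = FALSE)
  paste0(converted, collapse = "")
}

# Hypothetical stand-in for the auto_caps idea: only the first character is
# converted, and terms that do not start with a letter (for example a term
# already written as "[Dd]ementia") are returned untouched.
auto_caps_sketch <- function(term) {
  first <- substr(term, 1, 1)
  if (!grepl("^[A-Za-z]$", first)) return(term)
  paste0("[", toupper(first), tolower(first), "]", substr(term, 2, nchar(term)))
}

caps_sketch("ncov")              # "[Nn][Cc][Oo][Vv]"
auto_caps_sketch("dementia")     # "[Dd]ementia"
auto_caps_sketch("[Dd]ementia")  # "[Dd]ementia" (left as is)
grepl(caps_sketch("ncov"), "A novel nCoV outbreak")  # TRUE
```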

# medrxivr 0.0.4

Major changes:

* Fixed error which occurred when downloading the whole bioRxiv database. This was caused by any record above 100000 being presented in scientific notation (e.g. 1e+05), which meant the API returned an invalid response.
* Changed tests so that their runtime stays fixed regardless of future growth of the repositories.
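
The scientific-notation problem described above is easy to reproduce in base R whenever a large number is pasted into a URL. The sketch below shows one way to avoid it; `format_cursor()` and the `example.org` URL are placeholders chosen for illustration, not the package's code or the real API path.

```r
# Hypothetical helper: always render a record offset as plain digits.
format_cursor <- function(x) format(x, scientific = FALSE, trim = TRUE)

cursor <- 100000

# Pasting the raw number relies on R's default number-to-text conversion,
# which can switch to scientific notation for values such as 100000.
risky_url <- paste0("https://example.org/records/", cursor)

# Formatting explicitly keeps the offset as plain digits.
safe_url <- paste0("https://example.org/records/", format_cursor(cursor))

format_cursor(cursor)  # "100000"
```

An alternative under the same assumptions is to store offsets as integers (for example `100000L`), which are not printed in scientific notation.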

# medrxivr 0.0.3

Version created for submission to JOSS and CRAN, and onboarded to rOpenSci following peer-review.

Major changes:

* `mx_snapshot()` now takes a `commit` argument, allowing you to specify exactly which snapshot of the database you would like to use. Details on the commit keys needed are [here](https://github.com/mcguinlu/medrxivr-data/commits/master/snapshot.csv). In addition, the process of taking the snapshot is now managed by GitHub actions, meaning it should be a lot more robust and regular.
* Importing the snapshot to R is now significantly faster, as `vroom::vroom()` is used in place of `read.csv()`.
* All functions that return a data frame now return ungrouped tibbles.
* The to/from date arguments for both `mx_search()` and `mx_api_content()` have been standardized to snake case and now expect the same "YYYY-MM-DD" character format.
* A progress indicator has been added to `mx_api_content()` to provide useful information when downloading from the API.
* Some refactoring of code has taken place to reduce duplication of code chunks and to make future maintenance easier.
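
As an illustration of the "YYYY-MM-DD" character format mentioned above, a date string could be validated in base R before being passed to either function. `is_ymd()` is a hypothetical helper written for this sketch and is not part of the package.

```r
# Hypothetical validator: TRUE only for strings shaped like "YYYY-MM-DD"
# that also parse as a date.
is_ymd <- function(x) {
  well_formed <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  parsed <- as.Date(x, format = "%Y-%m-%d")
  well_formed & !is.na(parsed)
}

is_ymd("2020-01-31")   # TRUE
is_ymd("31/01/2020")   # FALSE
```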

Minor changes:

* `mx_crosscheck()` no longer uses web-scraping when providing the number of
* Documentation has been updated to reflect the changes, and some additional sections added to the vignettes. This includes removing references to older versions of the functions names (e.g. `mx_raw()`).
* Additional tests have been written, and the overall test coverage has been increased. Some lines (handling exceptionally rare errors that can't be mocked) have been marked as `#nocov`.
* `\dontrun` has been replaced with `\donttest` in all examples across the package.
* All examples for `mx_download()` and `mx_export()` now use `tempfile()` and `tempdir()`, so as not to modify the user's home filespace when running the examples.




# medrxivr 0.0.2

Major changes:

* Following the release of the [medRxiv API](https://api.biorxiv.org/), the way the snapshot of the medRxiv site is taken has changed, resulting in a more accurate snapshot of the entire repository being taken daily (as opposed to just new articles being captured, as was previously the case). This has introduced some breaking changes (e.g. in the `fields` argument, "subject" has become "category", and "link" has become "doi"), but will result in better long-term stability of the package.
* Two new functions, `mx_api_content()` and `mx_api_doi()`, have been added to allow users to interact with the medRxiv API endpoints directly. A new vignette documenting these functions has been added.
* The API has also allowed for improved data collection. The "authors" variable searched/returned now contains all authors of the paper as opposed to just the first one. Several additional fields are now returned, including corresponding author's institution, preprint license, and the DOI of the published, peer-reviewed version of preprint (if available).
* A companion app was launched, which allows you to build the search strategy using a user-friendly interface and then export the code needed to run it directly from R.
* You can now define the field(s) you wish to search. By default, the Title, Abstract, First Author, Subject, and Link (which includes the DOI) fields are searched.
* There is no longer a limit on the number of distinct topics you can search for (previously it was 5).
* The output of `mx_search()` has been cleaned to make it more useful to future end-users. Of note, some of the column names have changed, and the "pdf_name" and "extraction_date" variables are no longer returned.


# medrxivr 0.0.1

* Added a `NEWS.md` file to track changes to the package.