
# Open Science {#OpenScience} 

\index{open science}

*Chapter lead: Kees van Bochove*

From the inception of the OHDSI community, the goal was to establish an international collaborative by building on open science values, such as the use of open source software, public availability of all conference proceedings and materials, and transparent, open access publication of generated medical evidence. But what exactly is open science? And how could OHDSI build an open science or open data strategy around medical data, which is very privacy sensitive and typically not open at all for good reasons? Why is it so important to have reproducibility of analysis, and how does the OHDSI community aim to achieve this? These are some of the questions that we touch on in this chapter.

## Open Science

The term 'open science' has been used since the nineties, but really gained traction in the 2010s, during the same period OHDSI was born. Wikipedia [@wiki:Open_science] defines it as "the movement to make scientific research (including publications, data, physical samples, and software) and its dissemination accessible to all levels of an inquiring society, amateur or professional", and goes on to state that it is typically developed through collaborative networks. Although the OHDSI community never positioned itself explicitly as an 'open science' collective or network, the term is frequently used to explain the driving concepts and principles behind OHDSI. For example, in 2015, Jon Duke presented OHDSI as "An Open Science Approach to Medical Evidence Generation"[^1], and in 2019, the EHDEN project's introductory webinar hailed the OHDSI network approach as "21st Century Real World Open Science"[^2]. Indeed, as we shall see in this chapter, many of the practices of open science can be found in today's OHDSI community. One could argue that the OHDSI community is a grassroots open science collective driven by a shared desire for improving the transparency and reliability of medical evidence generation.

Open science or "Science 2.0" [@wiki:Science_2.0] approaches aim to address a number of perceived problems within current scientific practice. Information technology has led to an explosion of data generation and analysis methods, and individual researchers find it very hard to keep up with all the literature published in their area of expertise. This holds even more true for medical doctors who have a practice to run as a day job, but still need to keep abreast of the latest medical evidence. In addition, there is growing concern that many experiments may suffer from poor statistical designs, publication bias, p-hacking and similar statistical problems, and are hard to reproduce. The traditional method of correcting these problems, peer review of published articles, often fails to identify and tackle them. The special 2018 Nature edition on "Challenges in irreproducible research"[^3] includes several examples of this. A group of authors attempting to apply systematic peer review to the articles in their field found that, for various reasons, it was very hard to get the errors they identified rectified. Especially hard to correct are those experiments that have a flawed experimental design to begin with. In the words of Ronald Fisher: "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." [@wikiquote:Ronald_Fisher] The authors encountered common statistical problems such as poor randomization designs leading to false conclusions about statistical significance, miscalculations in meta-analyses, and inappropriate baseline comparisons. [@allison_2016] Another paper from the same collection, taking experiences from physics as an example, argues that it is critical to not only provide access to the underlying data, but also to publish and properly document the data processing and analysis scripts to achieve full reproducibility. [@Chen2018]

The OHDSI community addresses these challenges in its own way, and puts significant emphasis on the importance of generating medical evidence at scale. As stated in @schuemie_2018b, while the current paradigm "centers on generating one estimate at a time using a unique study design with unknown reliability and publishing (or not) one estimate at a time", the OHDSI community "advocates for high-throughput observational studies using consistent and standardized methods, allowing evaluation, calibration and unbiased dissemination to generate a more reliable and complete evidence base." This is achieved by a combination of a network of medical data sources that map their data to the OMOP common data model, open source analytics code that can be used and verified by all, and large-scale baseline data such as the condition occurrences published at howoften.org. In the following paragraphs, concrete examples are provided and the open science approach of OHDSI is detailed further using the four principles of Open Standards, Open Source, Open Data and Open Discourse as a guide. The chapter is concluded with a brief reference to the FAIR principles and outlook for OHDSI from an open science perspective.

## Open Science in Action: the Study-a-thon 

\index{study-a-thon}

A recent development in the community is the emergence of 'study-a-thons': short, concentrated face-to-face gatherings of a multidisciplinary group of scientists aimed at answering an important, clinically relevant research question using the OMOP data model and the OHDSI tools. A nice example is the 2018 Oxford study-a-thon, which is explained in an EHDEN webinar (https://youtu.be/X5yuoJoL6xs) that provides a walkthrough of the process and also highlights the openly available results. In the period leading up to the study-a-thon, the participants propose medically relevant research questions, and one or more of these are selected for study during the study-a-thon itself. Data is provided through participants who have access to patient-level data in OMOP format and are able to run queries on these data sources. Much of the actual study-a-thon time is devoted to discussing the statistical approach (see also the next chapter \@ref(WhereToBegin)), the suitability of the data sources, the results which are interactively produced and the follow-up questions that are inevitably raised by these results. In the case of the Oxford study-a-thon, the questions centered around studying adverse post-surgical effects of different knee replacement methods, and the results were published interactively during the study-a-thon using the OHDSI forums and tools (see Chapter \@ref(OhdsiAnalyticsTools)). OHDSI tools such as ATLAS facilitate rapid creation, exchange, discussion and testing of cohort definitions, which greatly speeds up the initial process of achieving consensus on problem definition and choice of methods. Thanks to the usage of the OMOP Common Data Model by the involved data sources and the availability of the OHDSI open source patient-level prediction packages (see Chapter \@ref(PatientLevelPrediction)), it was possible to create a prediction model for 90-day post-operative mortality in one day, and validate the model externally in several large data sources the day after.
The study-a-thon also resulted in a traditional peer-reviewed paper (Development and validation of patient-level prediction models for adverse outcomes following total knee arthroplasty, Ross Williams, Daniel Prieto-Alhambra et al., manuscript in preparation), which, of course, took months to process through peer review. But the fact that the analysis scripts and results for several healthcare databases covering hundreds of millions of patient records were conceived, produced and published from scratch within a week illustrates the fundamental improvements OHDSI can bring to medical science, reducing the turnaround time for evidence to become available from months to days.

## Open Standards 

\index{open science!open standards}

A very significant community resource that is maintained in the OHDSI community is the OMOP Common Data Model (Chapter \@ref(CommonDataModel)) and the associated Standardized Vocabularies (Chapter \@ref(StandardizedVocabularies)). The model itself is scoped to capture observational healthcare data. It was originally meant to analyze associations between exposures such as drugs, procedures and devices, and outcomes such as conditions and measurements, and is now extended for various analysis use cases (see also \@ref(DataAnalyticsUseCases)). However, harmonizing healthcare data worldwide from a wide variety of coding systems, healthcare paradigms and different types of healthcare sources requires a massive amount of 'mappings' between source codes and their closest standardized counterparts. The OMOP Standardized Vocabulary is further described in chapter 6. It includes mappings from hundreds of medical coding systems that are in use worldwide, and is browseable through the OHDSI Athena tool. By providing these vocabularies and mappings as a freely available community resource, OMOP and the OHDSI community make a significant contribution to healthcare data analytics. The OMOP model is by several accounts the most comprehensive model for this purpose, representing approximately 1.2 billion healthcare records worldwide [@garza_2016] [^6].

## Open Source 

\index{open science!open source}

Another key resource the OHDSI community provides is its open source programs. These can be divided into several categories, such as the helper tools to map data to OMOP (Chapter \@ref(ExtractTransformLoad)), the OHDSI Methods Library, which contains a powerful suite of commonly used statistical methods, open source code for published observational studies, and ATLAS, Athena and other infrastructure-related software that underpins the OHDSI ecosystem (Chapter \@ref(OhdsiAnalyticsTools)). See chapter 9 for a detailed overview.
From an open science perspective, one of the most important resources is the code for the actual execution of studies, such as studies from the OHDSI Research Network \@ref(NetworkResearch). In turn, these programs leverage the fully open source OHDSI stack, which can be inspected, reviewed and contributed to via GitHub. For example, network studies often build on the Methods Library, which ensures a consistent re-use of statistical methods across analytical use cases. See the \@ref(SoftwareValidity) chapter for a more detailed overview of how the use of and collaboration on open source software in OHDSI ultimately underpins the quality and reliability of the generated evidence.

## Open Data 

\index{open science!open data}

Because of the privacy-sensitive nature of healthcare data, fully open comprehensive patient-level datasets are typically not available. However, it is possible to leverage OMOP-mapped datasets to publish important aggregated data and result sets, such as the earlier mentioned howoften.org and other public result sets that are published to data.ohdsi.org. Also, the OHDSI community provides simulated datasets such as SynPUF for testing and development purposes, and the OHDSI Research Network (Chapter \@ref(NetworkResearch)) can be leveraged to run studies in a network of available data sources that have mapped their data to OMOP. To make the mapping between the source data and the OMOP CDM transparent, data sources are encouraged to re-use the OHDSI ETL or 'mapping' tools and to publish their mapping code as open source as well.

## Open Discourse 

\index{open science!open discourse}

Open standards, open source and open data are great assets, but left by themselves, they will not impact medical practice. Key to the open science practice and impact of OHDSI is the implementation of medical evidence generation and the translation of the science to medical practice. The OHDSI community now has several annual OHDSI Symposia, in the United States and Europe, and dedicated communities of practice in, among others, China and Korea. These symposia discuss the advancements in statistical methods, data and software tooling, the standardized vocabularies, and all other aspects of the OHDSI open source community. The OHDSI forums [^8] and wiki [^9] facilitate thousands of researchers worldwide in practicing observational research. The community calls [^10] and the code, issues and pull requests in GitHub [^11] constantly evolve the open community assets such as code and the CDM, and in the OHDSI Network Studies, global observational research is practiced in an open and transparent way using hundreds of millions of patient records worldwide. Openness and open discourse are encouraged throughout the community, and this very book is written via an open process facilitated by the OHDSI wiki, community calls and a GitHub repository [^12]. It needs to be stressed, however, that without all the OHDSI collaborators, the processes and tools would be empty shells. Indeed, one could argue that the true value of the OHDSI community lies with its members, who share a vision of improving health through collaborative and open science, as discussed in Chapter \@ref(MissionVisionValues).

## OHDSI and the FAIR Guiding Principles 

\index{FAIR}

This last section of the chapter takes a look at the current state of the OHDSI community and tooling, using the FAIR Data Guiding Principles published in @wilkinson2016.

### Findability

Any healthcare database that is mapped to OMOP and used for analytics should, from a scientific perspective, be persisted for future reference and reproducibility. The use of persistent identifiers for OMOP databases is not yet widespread, partly because these databases are often kept behind firewalls on internal networks and are not necessarily connected to the internet. However, it is entirely possible to publish summaries of the databases as a descriptor record that can be referenced, for example for citation purposes. This method is followed in, for example, the EMIF catalog [^7], which provides a comprehensive record of the database in terms of data gathering purpose, sources, vocabularies and terms, access control mechanisms, license, consents etc. [@Oliveira2019] This approach is further developed in the IMI EHDEN project.

### Accessibility

Accessibility of OMOP-mapped data through an open protocol is typically achieved through the SQL interface, which combined with the OMOP CDM provides a standardized and well-documented method for accessing OMOP data. However, as discussed above, OMOP sources are often not directly available over the internet for security reasons. Creating a secure worldwide healthcare data network that is accessible for researchers is an active research topic and operational goal of projects like IMI EHDEN. What can be openly published, however, are the results of analyses in multiple OMOP databases, as shown through OHDSI initiatives such as LEGEND and howoften.org.
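
The SQL-based access described above can be illustrated with a small sketch. It is not taken from any OHDSI tool: it builds a toy in-memory SQLite database holding a reduced subset of columns from the OMOP CDM `person` and `condition_occurrence` tables, fills it with made-up rows, and computes an aggregate count of the kind that could be shared without exposing patient-level records. The condition concept IDs (`1001`, `1002`) and the function name `persons_per_condition` are hypothetical placeholders chosen for illustration.

```python
import sqlite3

# Toy in-memory database with a reduced subset of two OMOP CDM tables.
# A real CDM instance has many more tables and columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (
        person_id         INTEGER PRIMARY KEY,
        gender_concept_id INTEGER,
        year_of_birth     INTEGER
    );
    CREATE TABLE condition_occurrence (
        condition_occurrence_id INTEGER PRIMARY KEY,
        person_id               INTEGER,
        condition_concept_id    INTEGER,
        condition_start_date    TEXT
    );
    """
)

# Made-up persons; 8507 and 8532 are the standard gender concepts
# for male and female.
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, 8507, 1950), (2, 8532, 1962), (3, 8532, 1971)],
)

# Made-up condition records; 1001 and 1002 are placeholder concept IDs,
# not real Standardized Vocabulary concepts.
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1001, "2015-03-01"),
        (2, 1, 1001, "2016-07-12"),
        (3, 2, 1001, "2017-01-20"),
        (4, 3, 1002, "2018-05-05"),
    ],
)


def persons_per_condition(connection):
    """Return {condition_concept_id: number of distinct persons}."""
    rows = connection.execute(
        """
        SELECT condition_concept_id,
               COUNT(DISTINCT person_id) AS person_count
        FROM condition_occurrence
        GROUP BY condition_concept_id
        ORDER BY condition_concept_id
        """
    ).fetchall()
    return dict(rows)


counts = persons_per_condition(conn)
print(counts)
```

Because every OMOP-mapped source shares the same table and column names, a query like this could in principle be run unchanged at each site, with only the aggregated counts leaving the local network.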

### Interoperability

Interoperability is arguably the strong suit of the OMOP data model and OHDSI tooling. In order to build a strong network of medical data sources worldwide which can be leveraged for evidence generation, achieving interoperability between healthcare data sources is key, and this is achieved through the OMOP model and Standardized Vocabularies. However, by sharing cohort definitions and statistical approaches, the OHDSI community goes beyond code mapping and also provides a platform to build an interoperable understanding of the analysis methods for healthcare data.
Since healthcare systems such as hospitals are often the source of record for OMOP data, the interoperability of the OHDSI approach could be further enhanced by alignment with operational healthcare interoperability standards such as HL7 FHIR, HL7 CIMI and openEHR. The same is true for alignment with clinical interoperability standards such as CDISC and biomedical ontologies. Especially in areas such as oncology, this is an important topic, and the Oncology Working Group and Clinical Trials Working Group in the OHDSI community provide good examples of forums where these issues are actively discussed.
In terms of references to other data and specifically ontology terms, ATLAS and OHDSI Athena are important tools, as they allow the exploration of the OMOP Standardized Vocabularies in the context of other available medical coding systems.

### Reusability

The FAIR principles around reusability focus on important issues such as the data license, provenance (clarifying how the data came into existence) and the link to relevant community standards.
Data licensing is a complicated topic, especially across jurisdictions, and it would fall outside of the scope of this book to cover it extensively. However, it is important to state that if you intend for your data (e.g. analysis results) to be freely used by others, it is good practice to explicitly provide these permissions via a data license. This is not yet a common practice for most data that can be found on the internet, and the OHDSI community is unfortunately not an exception here.
The data provenance of OMOP databases is a very interesting topic, as there is potential for making it available in an automated way, provided the ETL and mapping tools persist metadata about, for example, the CDM version used, the Standardized Vocabularies release, and custom code lists. The OHDSI ETL tools do not currently produce this information automatically, but working groups such as the Data Quality Working Group and Metadata Working Group actively work on this. Another important aspect is the provenance of the underlying databases themselves. For example, it is important to know if a hospital or GP information system was replaced or changed, and when known data omissions or other data issues occurred historically. Exploring ways to attach this metadata systematically in the OMOP CDM is the domain of the Metadata Working Group.
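
To make the idea of persisted provenance metadata more concrete, the following minimal sketch shows what a machine-readable provenance record written by an ETL run might look like. This is purely an illustration and not an existing OHDSI format or tool output: the class `EtlProvenance`, its field names and all example values are assumptions, loosely based on the items mentioned above (CDM version, vocabulary release, custom code lists, known data issues).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EtlProvenance:
    """Hypothetical provenance record for one ETL run into the OMOP CDM."""

    source_name: str
    cdm_version: str
    vocabulary_release: str
    custom_code_lists: list = field(default_factory=list)
    known_data_issues: list = field(default_factory=list)


def to_json(record):
    """Serialize a provenance record to a stable, human-readable JSON string."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# All values below are made up for illustration.
record = EtlProvenance(
    source_name="Example Hospital EHR",
    cdm_version="5.3",
    vocabulary_release="example-release-2019",
    custom_code_lists=["local_lab_codes"],
    known_data_issues=["information system replaced in 2012"],
)

serialized = to_json(record)
print(serialized)

# Round trip: the JSON document carries enough information to rebuild the record.
restored = EtlProvenance(**json.loads(serialized))
```

A record like this could be stored alongside the mapped database or published with its descriptor, so that later analyses can state exactly which CDM and vocabulary versions the data was mapped with.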

## Conclusions

To conclude, the OHDSI community itself can be seen as an open science community that is actively pursuing the interoperability and reproducibility of medical evidence generation. It also advocates a paradigm shift from single study and single estimate medical research to large-scale systematic evidence generation, where facts such as baseline occurrence are known and the evidence focuses on statistically estimating the effects of interventions and treatments from real world healthcare sources.

[^1]: https://www.ohdsi.org/wp-content/uploads/2014/07/ARM-OHDSI_Duke.pdf
[^2]: https://www.ehden.eu/webinars/
[^3]: https://www.nature.com/collections/prbfkwmwvz
[^6]: https://www.ema.europa.eu/en/events/common-data-model-europe-why-which-how
[^7]: https://emif-catalogue.eu
[^8]: https://forums.ohdsi.org
[^9]: https://www.ohdsi.org/web/wiki
[^10]: https://www.ohdsi.org/web/wiki/doku.php?id=projects:overview
[^11]: https://github.com/ohdsi
[^12]: https://github.com/OHDSI/TheBookOfOhdsi