This project started in the same way that many educational research projects do: we wanted to understand more about the secondary institutions that graduate prospective CU Boulder students.
Once we ran into the well-known crosswalk problem, our team added the goal of crowdsourcing an open-source, publicly available NCES-to-CEEB crosswalk.
To finish this, we need your help.
If you would like to augment this data set with currently unknown or unlisted schools, there are two ways to contribute.
- The GitHub Way - Fork this repository, augment the data file, and then submit a pull request with the data file in the same format as oda_nces_ceeb_crosswalk.csv. We will review the updates and then merge your branch with the master file you see here.
- The Email Way - Clone this repository or download the master data file. Update oda_nces_ceeb_crosswalk.csv with any new or updated information and email it to brad dot weiner at colorado dot edu. We will update the file twice a year with all received additions.
Merging two different data sets requires a unique key. Unfortunately, there are often situations when two or more related data sets exist, yet no useful crosswalk is available.
This is a long-standing challenge in educational policy because performance and conditions at one level of education (Pre-K, K-12) might longitudinally explain performance in post-secondary institutions and beyond.
This project attempts to create a crosswalk between the National Center for Education Statistics (NCES) secondary school data and the codes used by national post-secondary entrance examinations like the ACT and SAT. These codes, often referred to as "CEEB" codes, uniquely identify the secondary school from which a test-taking student is graduating.
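To make the role of the crosswalk concrete, here is a minimal Python sketch of joining records keyed by CEEB code to records keyed by NCES ID. Every identifier, field name, and value in it is invented for illustration; none of it comes from the actual data files, and the function name `attach_school` is hypothetical.

```python
# All identifiers, field names, and values below are made up for illustration.
applicants = [
    {"student": "A", "ceeb": "000001"},
    {"student": "B", "ceeb": "000002"},
    {"student": "C", "ceeb": "999999"},  # school missing from the crosswalk
]
nces_schools = {
    "100000000001": {"school_name": "Example High School"},
    "100000000002": {"school_name": "Sample Academy"},
}
# The crosswalk is the missing key: CEEB code -> NCES ID.
crosswalk = {"000001": "100000000001", "000002": "100000000002"}


def attach_school(applicants, crosswalk, nces_schools):
    """Left-join applicant rows to NCES school rows through the crosswalk."""
    joined = []
    for row in applicants:
        nces_id = crosswalk.get(row["ceeb"])  # None when the code is unmatched
        school = nces_schools.get(nces_id, {})
        joined.append({**row, "nces_id": nces_id, **school})
    return joined


joined = attach_school(applicants, crosswalk, nces_schools)
```

Rows whose CEEB code is absent from the crosswalk keep an `nces_id` of `None`, which makes the unmatched schools easy to list and follow up on.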
Other institutional researchers and scholars 1,2 have tackled this problem, often using fuzzy matching methods, which are helpful but incomplete. This project attempts to provide a reproducible, open-source framework to solve this problem.
The Office of Data Analytics at the University of Colorado Boulder used three methods to create as complete a file as possible, covering all secondary schools from which our institution had recently received applications.
First, we looked for others who had done some or all of this work successfully. To our knowledge, the best data set was created by Mark Davenport at the University of North Carolina - Greensboro. This set (known in the code as the "Davenport Set") was treated as the "base". In our first pass, we relied on the Davenport set for any records that had already been matched.
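As a rough illustration of this first pass, the sketch below splits a list of CEEB codes into those already covered by a base crosswalk and those that still need matching. The function name `split_by_base` and the sample codes are hypothetical and are not taken from the repository's code.

```python
def split_by_base(ceeb_codes, base_crosswalk):
    """Separate codes already matched in the base set from those still unmatched."""
    matched = {c: base_crosswalk[c] for c in ceeb_codes if c in base_crosswalk}
    unmatched = [c for c in ceeb_codes if c not in base_crosswalk]
    return matched, unmatched


# Made-up example values: a tiny "base" set and three schools of interest.
base = {"000001": "100000000001"}
matched, unmatched = split_by_base(["000001", "000002", "000003"], base)
```

Only the `unmatched` codes would then move on to the later, more expensive matching steps.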
Second, we used fuzzy matching techniques to merge data from both the NCES Common Core of Data and the NCES Private School Survey to any records not in the Davenport Data Set. Matches were scored based on school name and zip code. It was assumed that few, if any, schools would share all of those characteristics.
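The text above does not say which fuzzy matching algorithm, weights, or cutoff were used, so the following is only a sketch of one plausible approach using Python's standard `difflib` module. The weights, the threshold, the field names, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not lower the score."""
    return " ".join(name.lower().split())


def match_score(record, candidate, name_weight=0.7, zip_weight=0.3):
    """Score a candidate on name similarity plus exact 5-digit zip agreement.

    The weights are arbitrary illustrative values, not the project's settings.
    """
    name_sim = SequenceMatcher(
        None, normalize(record["name"]), normalize(candidate["name"])
    ).ratio()
    zip_sim = 1.0 if record["zip"][:5] == candidate["zip"][:5] else 0.0
    return name_weight * name_sim + zip_weight * zip_sim


def best_match(record, candidates, threshold=0.85):
    """Return the highest-scoring candidate, or None if nothing clears the threshold."""
    if not candidates:
        return None
    score, candidate = max(
        ((match_score(record, c), c) for c in candidates),
        key=lambda pair: pair[0],
    )
    return candidate if score >= threshold else None


# Made-up records for demonstration only.
record = {"name": "Example High School", "zip": "80000"}
candidates = [
    {"name": "EXAMPLE  HIGH SCHOOL", "zip": "80000-1234", "nces_id": "X1"},
    {"name": "Sample Academy", "zip": "80001", "nces_id": "X2"},
]
match = best_match(record, candidates)
```

Records for which `best_match` returns `None` are the ones a purely automated pass cannot resolve and that need human review.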
Third, the remaining cases were set up as jobs for workers using Amazon's Mechanical Turk. Mechanical Turk is a crowdsourcing website that allows requesters like CU Boulder to hire crowdworkers to assist with work that requires human intervention. In our case, we wanted crowdworkers to manually search for high schools in the NCES database and input the correct code whenever possible. To increase accuracy, we assigned each request to 3-4 independent workers so that discrepancies could be resolved with a tie-breaking aggregation of the workers' answers.
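The exact aggregation rule is not described here. One common choice is a simple majority vote, sketched below as an assumption rather than as the method actually used; `resolve_votes` is a hypothetical name.

```python
from collections import Counter


def resolve_votes(answers):
    """Return the most common non-blank answer, or None when the top answers tie."""
    votes = Counter(a for a in answers if a)  # ignore blank submissions
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # no clear winner; flag the school for manual review
    return ranked[0][0]
```

With three workers, two agreeing answers settle the case. With four workers, a 2-2 split returns `None`, so the school can be sent back for another look.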
The final file, which is included here, contains 21,592 matched secondary schools in the United States with both CEEB codes and NCES IDs.
This repo contains all of the raw data files, including:
- The 'Davenport' data set generously provided by Mark Davenport at the University of North Carolina-Greensboro
- High schools from which The University of Colorado Boulder has received applications, intended to be the full universe of matches for our purposes
- A listing of state codes for matching
- The master crosswalk found here: https://github.com/cu-boulder/ceeb_nces_crosswalk/blob/master/oda_nces_ceeb_crosswalk.csv
We have also included the code so others may reproduce or improve upon it.
Education Data Portal (Version 0.8.0), Urban Institute, Center on Education Data and Policy, accessed August 28, 2020, https://educationdata.urban.org/documentation/, [US Department of Education Common Core of Data]
The team members below are all affiliated with The Office of Data Analytics (ODA) at The University of Colorado, Boulder.
The ODA mission is to foster an understanding of institutional knowledge. This is done by transforming data into actionable intelligence and aiding the decision-making processes throughout the university.
Danielle Lyles - Data and Evaluation Scientist
Peter Shaffery - Postdoctoral Research Assistant
Ulises Guzman Sol - Data Engineer
Seth Spielman - Chief Data Officer
Brad Weiner - Director of Data Science
Pull requests and data additions are welcome. Please feel free to use these data or this strategy on your own campus. When using, please cite the following DOI.
Contributors and users of this repository are asked to adhere to the University of Colorado code of conduct which is stated below.
The University of Colorado recognizes that people are the most important resource in accomplishing its mission. The University of Colorado values academic freedom, diversity, and respect for all persons. The university is committed to the principle of non-discrimination and does not tolerate harassment on any basis, including race, color, national origin, sex, pregnancy, age, disability, creed, religion, sexual orientation, gender identity, gender expression, veteran status, political affiliation, or political philosophy. Members of the university community are expected to treat colleagues, co-workers, and students with respect, professionalism, and dignity in all interactions and communications.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.