Tool to help document archaeological data #3
Thank you for this interesting idea @zackbatist. The underlying problem of meta-documentation is IMHO very real and important. I do have difficulty, though, seeing how the interface of this software would work. Each archaeological project produces very different kinds of data, and each archaeologist uses a different personal workflow to manage that data. Where would these prompts appear, and how could this metadata be stored permanently in a human- and machine-readable way? I wonder whether not a new piece of software, but a well-described workflow -- a how-to guide to producing metadata -- would be a better and more universal solution? This would be very compatible with the aims of this session. I have a feeling this proposal could be interesting to you, @florianthiery.
I had a chance to think about this some more. I envision this as a post-hoc documentation tool, partially inspired by tools that help generate data management plans, such as DMPonline and Portage, which are meant to prompt researchers to think about how they will handle data before they begin their work (though those are laughably ineffective, since they are unenforceable and lack sufficient specificity).

I think this would be most useful for projects that keep data stored across a series of vaguely-named Excel spreadsheets, which is still quite common practice. Links between spreadsheets are inherently implicit, and this tool would simply force the user to document them more explicitly.[^1] That would make it easier to do manual queries, especially for people who are reusing shared data that they were not involved in creating. The goal is therefore human readability.

A user would identify a series of Excel files to be read by the system. The system doesn't care what the files' contents are. The user would be prompted to explain the significance or scope of this group of files, which could represent a relevant subset of data from a bigger project (e.g. lithics data from archaeological project x). The script would read the name of each file and prompt the user to explain or describe its scope. If there are multiple worksheets, the user would be prompted to describe each one. Users would also be able to provide contact info for the people who created or maintain each file.

The script would then dive deeper into each worksheet by parsing the values stored within it. Column names would be identified with user assistance (is the first row column names? y/n). The values under each column would then be read, and if there seem to be repeated/standardized values (shorthand, abbreviations, etc.) the system would prompt for their meanings to be defined more clearly. Index columns (i.e. independent variables) would be identified, and their relations to indexes in the other input spreadsheets would be declared and explained in a subsequent stage. I imagine that Excel formulas might also be parsed somehow. The result would be a brief and professional-looking report.

Much of this could be done using base R, dplyr, and whatever Excel-parsing package is most extensive or up to date, all wrapped up as a shiny app. But because the imagined user is someone who hasn't bothered to explicitly relate their data, stacking this on R (which requires installing and launching R first, if run locally) might be the wrong path to take. It would be better to code this in Python and then wrap it as an application bundle using py2app, or whatever the equivalent may be for Windows systems. The sketch below illustrates the flow.

[^1]: It would also be very useful for documenting SQL databases, since the reasons for various database design decisions are rarely documented, at least in my experience. To keep things simple, though, it's best to start with the scattered-spreadsheets scenario.
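To make the prompting flow concrete, here is a minimal sketch in Python (assuming pandas with openpyxl installed as its Excel engine). The script name, prompts, thresholds, and report format are all placeholder assumptions, not a settled design:

```python
"""Sketch only: walk the user through documenting a set of Excel files.

Assumes pandas with openpyxl as its Excel engine. All prompts,
heuristics, and the report layout are placeholders.
"""
import sys
import pandas as pd

REPEAT_THRESHOLD = 10   # guess: <= 10 distinct values suggests a code list
SHORTHAND_LENGTH = 8    # guess: short strings suggest abbreviations

def describe_workbook(path):
    notes = [f"# {path}"]
    notes.append("Scope: " + input(f"Describe the scope of '{path}': "))
    notes.append("Contact: " + input("Who created or maintains this file? "))

    # sheet_name=None loads every worksheet; header=None defers the
    # column-name question to the user instead of guessing.
    for name, raw in pd.read_excel(path, sheet_name=None, header=None).items():
        notes.append(f"\n## Worksheet: {name}")
        notes.append("Description: " + input(f"Describe worksheet '{name}': "))

        df = raw
        if input("Is the first row column names? [y/n] ").strip().lower() == "y":
            df = raw.iloc[1:].copy()
            df.columns = raw.iloc[0]

        # Positional access avoids trouble with duplicate column names.
        for i, col in enumerate(df.columns):
            values = df.iloc[:, i].dropna()
            uniques = values.unique()
            # Repeated short values probably encode shorthand worth defining.
            if 0 < len(uniques) <= REPEAT_THRESHOLD and \
                    values.astype(str).str.len().max() <= SHORTHAND_LENGTH:
                notes.append(f"\nColumn '{col}' repeats values: {list(uniques)}")
                for v in uniques:
                    notes.append(f"- {v}: " +
                                 input(f"What does '{v}' in column '{col}' mean? "))
    return "\n".join(notes)

if __name__ == "__main__":
    # e.g. python document_sheets.py lithics.xlsx ceramics.xlsx
    report = "\n\n".join(describe_workbook(p) for p in sys.argv[1:])
    with open("data_documentation.md", "w") as out:  # placeholder output name
        out.write(report)
```

The real tool would of course replace the terminal prompts with a GUI, and the subsequent stage for declaring relations between index columns across files isn't sketched here; this only shows the per-file and per-worksheet interview loop.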
I think this would be a very valuable project that goes hand in hand with other efforts to create a more standardized workflow for processing (archaeological) data. It would certainly be very useful, and would integrate well into an overall analysis tool for standardized evaluations, e.g. in the sense of SDS processing as already implemented by Clemens, or as a revival of a dormant old project such as quantAAR. However, I am not sure whether such an extensive project could be implemented within a hackathon at the CAA, especially since I think some conceptual groundwork would have to be done first in order to create a meaningful and sustainable interface. But I would be very pleased if we took this as a general project idea and started implementing it, inshallah, as soon as there is time for that.
@zackbatist Thanks for this explanation. I understand it better now and agree with @MartinHinz that this could be very useful if implemented well. 👍 I'm not sure, though, whether it makes sense at this point to distinguish between projects that are sufficiently simple or too complex for the session. Pretty much none of the ideas we have so far can be fully realized in the handful of hours we have together. I see this session more as a kickstarting event to discuss and develop first prototypes and to gather collaborators for future work.
An interface that prompts users to document various aspects of their datasets and to highlight or explain the implicit relationships between tables or variables. It may be especially valuable for organizing series of scattered spreadsheets.