This is a template repository
that can be used for the archiving process of a big dataset
at the end of a research project's lifetime.
For this purpose,
the excel2xml module of DSP-TOOLS was developed.
This README guides you through the steps to set up a repository.
After that, the documentation of the excel2xml module
explains how to use excel2xml in a Python script.
- Open this repository in Visual Studio Code
- The benefits of the Debugging Mode
- The benefits of Version Control with Git
- Some extras
From within GitHub Desktop,
clone this repository,
and open it in Visual Studio Code.
You will be prompted to install some extensions
that are recommended for importing data in DaSCH.
After installing them, navigate to README.md
and press ⌘⇧V (Ctrl+Shift+V on Windows) to get a preview.
Before importing data into DSP, you need a data model.
We have created a very simple one to get you started: import_project.json.
To avoid typing everything by hand in JSON,
you can create your data model in Excel,
and then use the command dsp-tools excel2json to convert it to JSON.
The folder data_model_files contains the Excel files necessary to create import_project.json.
Open import_script.py.
You can now choose a Python interpreter by clicking on the version number in the bottom right corner.
You can either work with the global (system-wide) Python,
or you can create a virtual environment for your project.
DaSCH employees who have installed Python via Homebrew can choose that one.
To find it, type brew --prefix in a Terminal.
You probably already have a symlink (/usr/local/bin/python3 or /usr/bin/python3)
that redirects to the Homebrew-installed Python.
The only thing you shouldn't do is select a virtual environment of another project.
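To double-check which interpreter was actually picked, a small sketch like the following can be run in the selected environment. It uses only the standard library; the helper name describe_interpreter is hypothetical and not part of this template.

```python
import sys


def describe_interpreter() -> dict:
    """Collect basic facts about the Python interpreter running this code."""
    return {
        "executable": sys.executable,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        # Inside a virtual environment, sys.prefix points to the environment,
        # while sys.base_prefix still points to the base installation.
        "in_virtual_env": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

If the printed executable lies inside the folder of another project, the interpreter selection should be changed.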
To start the debugging process, switch to the "Run and Debug" tab.
- set a break point
- click "Run and Debug"
- choose "Debug the currently active Python file"
- The control bar appears, and debugging starts.
Code execution pauses at your break point, that is, before the line with the break point is executed. Use this opportunity to inspect what has happened so far in the "Variables" area on the left, which shows the current state of the program.
If one of the dependencies is not installed,
code execution will not reach your break point,
but stop at the missing dependency.
In this case, install the missing dependency with pip install package
(replacing package with the name of the missing dependency)
in the Terminal of Visual Studio Code.
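Instead of discovering missing dependencies one by one, they can be checked up front. The following is a minimal sketch using the standard library; the helper name find_missing_modules and the example list of modules are assumptions and should be adapted to what your script imports. Note that the import name of a module can differ from the name of the package on PyPI.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list: replace it with the modules your script imports.
    wanted = ["json", "pandas"]
    missing = find_missing_modules(wanted)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All modules are available.")
```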
If you want to try out different ways to proceed,
go to the "Debug Console", where you can execute code.
For example, let's inspect the Pandas DataFrame by typing main_df.info.
You see that there are some empty rows at the end which don't contain useful data.
The next two lines of code will eliminate them.
Click on "Step Over" two times,
or set a new break point two lines further down and click on "Continue".
Now, type main_df.info again in the Debug Console.
You will see that the empty rows are gone.
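The two lines of import_script.py are not reproduced here. As an illustration only, the following sketch shows one common way to remove completely empty rows from a Pandas DataFrame. The data and the names example_df and cleaned_df are hypothetical stand-ins for main_df.

```python
import pandas as pd

# Hypothetical stand-in for main_df: two real records followed by two empty rows.
example_df = pd.DataFrame(
    {
        "ID": ["obj_1", "obj_2", None, None],
        "Title": ["First object", "Second object", None, None],
    }
)

# Drop every row in which all cells are empty, then renumber the index.
cleaned_df = example_df.dropna(how="all").reset_index(drop=True)

print(len(example_df), "rows before,", len(cleaned_df), "rows after")
```

Rows that are only partially empty are kept, because how="all" removes a row only if every cell in it is missing.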
As you can see, the debugging mode is a useful tool for understanding code and checking it for correctness.
Tip |
---|
Make regular use of the debugging mode to check if your code really does what you think it should do! |
One of the big benefits of version control is the diff viewer. Visual Studio Code highlights the changes you have introduced since your last commit.
- Deletions are shown as a red triangle.
- Additions are shown as green bars.
- Changed lines are shown as striped bars.
Click on these visual elements to see a small popup that shows you the difference. In the popup, you can stage the change, revert it, or jump to the next/previous change.
Once you have a set of code changes that can be meaningfully grouped together, you should make a commit. If you click on "Commit", all staged changes will be committed.
Tips |
---|
Test your code (e.g. with the debugging mode) before committing it. |
Make small commits that contain only one new feature. |
If you work on a big project that takes weeks or months, you might want to have a backup, or to invite colleagues for collaboration or a code review. For this purpose, follow these steps:
- create your own repository on https://github.com/dasch-swiss/
- name it according to the scheme [project_shortcode]-[project_shortname]-scripts
- in the "Source Control" tab of VS Code,
  click on the three dots,
  choose Remote > Add Remote...,
  and set the upstream to
  https://github.com/dasch-swiss/shortcode-shortname-scripts
- push your local repo to your new GitHub repo
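The naming scheme from the steps above can be sketched as a small helper. The function names are hypothetical, and the check that a shortcode consists of four hexadecimal characters (like 00A1) is an assumption. The helper refuses to produce the name of the template repository, so that it cannot accidentally be used as a remote.

```python
import re

GITHUB_ORG_URL = "https://github.com/dasch-swiss"
TEMPLATE_REPO = "00A1-import-scripts"


def build_repo_name(shortcode: str, shortname: str) -> str:
    """Build a name of the form [project_shortcode]-[project_shortname]-scripts."""
    # Assumption: a shortcode consists of four hexadecimal characters, like 00A1.
    if not re.fullmatch(r"[0-9A-Fa-f]{4}", shortcode):
        raise ValueError(f"Unexpected shortcode: {shortcode!r}")
    if not shortname:
        raise ValueError("The shortname must not be empty")
    name = f"{shortcode}-{shortname}-scripts"
    if name == TEMPLATE_REPO:
        raise ValueError("This is the template repository; do not push to it")
    return name


def build_remote_url(shortcode: str, shortname: str) -> str:
    """Build the URL of the GitHub repository to add as a remote."""
    return f"{GITHUB_ORG_URL}/{build_repo_name(shortcode, shortname)}"
```

For example, with the made-up values "0123" and "myproject", build_remote_url returns https://github.com/dasch-swiss/0123-myproject-scripts.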
WARNING |
---|
Do NOT push to https://github.com/dasch-swiss/00A1-import-scripts ! This is a template and should remain as it is! |
OpenRefine is a tool for working with messy data. Once downloaded and installed, it runs as a local server that you access through your browser, so all data remains on your own machine. If you work on a Mac and have Homebrew installed, you can simply type:
brew install openrefine
OpenRefine can support the everyday work of the Research Data Unit at DaSCH in two ways:
- Data cleaning (recommended): For this purpose, you can think of OpenRefine as a much better version of Excel. You can perform operations which would be very tedious in Excel.
- Conversion to our DSP-specific XML format for bulk upload (not recommended)
Git can be complicated, so you will appreciate working with one of these GUIs: