A Python script to process XML files created by the 'Qualitivity' plugin for Trados Studio. The script creates CSV file(s) with details of the language translation process including keystroke counts, count and duration of pauses in typing, and the duration of 'Records' or segment visits.
The script uses pandas (https://pandas.pydata.org/) to create the CSV files and NumPy (https://numpy.org/) for date parsing and millisecond calculations.
Set up a virtual environment:
python3 -m venv venv
source venv/bin/activate
Install libraries used by the script:
pip install --upgrade pip
pip install -r requirements.txt
On Windows, the command to activate the virtual environment differs; see https://docs.python.org/3/tutorial/venv.html
The script takes two arguments: a directory containing the input XML files and a directory for the output files. The script creates the output directory if it does not already exist.
The following command creates two files for each input XML file. The first file holds the duration and count values for each 'Record' in the source XML. The second is an 'audit' file that lists every interval between keystrokes without categorisation. The 'audit' file also includes information about any 'system keystrokes' omitted from the main output files. These are keystroke logs generated by Trados Studio (e.g. when a segment is automatically populated with a machine translation match).
python process_xml.py ./sample ./output
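As a rough illustration of the two-argument interface, the sketch below shows one way the arguments could be read and the output directory created. The names `parse_args`, `prepare_output_dir`, `input_dir` and `output_dir` are hypothetical and are not taken from process_xml.py.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    # Hypothetical argument names; the real script may name them differently.
    parser = argparse.ArgumentParser(description="Process Qualitivity XML files.")
    parser.add_argument("input_dir", type=Path, help="directory of input XML files")
    parser.add_argument("output_dir", type=Path, help="directory for output CSV files")
    return parser.parse_args(argv)


def prepare_output_dir(output_dir):
    # Create the output directory (and any missing parents) if needed.
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir
```

Calling `parse_args(["./sample", "./output"])` mirrors the command line shown above.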
The results can also be combined so that only two files are produced: 'combined.csv' and 'combined-audit.csv'. The first holds the duration and count values and the second provides the 'audit'. The combined files have an additional column, 'File', that indicates the source XML file.
python process_xml.py ./sample ./output --combine
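The combining step can be pictured as stacking the per-file tables and tagging each row with its source. This is a minimal sketch using pandas, not the script's actual code; the function name `combine_frames` and its input shape (a mapping from file name to DataFrame) are assumptions.

```python
import pandas as pd


def combine_frames(frames_by_file):
    # frames_by_file maps a source XML file name to its per-file DataFrame
    # (hypothetical structure, chosen for illustration).
    tagged = [
        frame.assign(File=file_name)  # add the 'File' column naming the source
        for file_name, frame in frames_by_file.items()
    ]
    # Stack all rows into one table with a fresh 0..n-1 index.
    return pd.concat(tagged, ignore_index=True)
```

The result could then be written out with `DataFrame.to_csv`, for example to 'combined.csv'.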
The features for each 'Record' in the Qualitivity XML files are as follows. All pause measures are provided based on three minimum pause duration thresholds: 300 milliseconds and above (_300), 500 milliseconds and above (_500), and 1 second and above (_1s).
- Record ID: The ID for each 'Record' in the Qualitivity output
- Segment ID: The ID for each text segment
- Total pause duration: The total duration of pauses - milliseconds.
- Pause count: The number of pauses - count.
- Keystrokes: The number of keystrokes ('ks created' elements) - count.
- Active ms: The duration of each 'Record' copied from the 'activeMiliseconds' attribute in the Qualitivity XML file - milliseconds.
- Record duration: The duration of each 'Record' computed as the difference between 'stopped' and 'started' times for the record - milliseconds.
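- Total duration: The same value as Record duration, but obtained by adding up all intervals between consecutive keystrokes, including the intervals from the beginning of a record to its first keystroke and from its last keystroke to the end of the record - milliseconds.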
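The features above can be sketched for a single 'Record' as follows. This is an illustrative reading of the definitions, not the script's implementation: the function name `pause_measures`, the column naming, and the choice to count the intervals at the beginning and end of a record as candidate pauses are all assumptions. Timestamps are assumed to be ISO 8601 strings without a time zone.

```python
import numpy as np

# Minimum pause durations in milliseconds, keyed by the column suffixes above.
THRESHOLDS_MS = {"_300": 300, "_500": 500, "_1s": 1000}


def pause_measures(started, stopped, keystroke_times):
    # Timeline of events: record start, each keystroke, record stop.
    stamps = np.array([started, *keystroke_times, stopped], dtype="datetime64[ms]")
    # Intervals between consecutive events as integer milliseconds.
    intervals = np.diff(stamps).astype("timedelta64[ms]").astype(np.int64)

    measures = {
        "Keystrokes": len(keystroke_times),
        "Record duration": int((stamps[-1] - stamps[0]).astype(np.int64)),
        "Total duration": int(intervals.sum()),
    }
    for suffix, threshold in THRESHOLDS_MS.items():
        # A pause is any interval at or above the threshold (assumption:
        # intervals touching the record start/stop are included).
        pauses = intervals[intervals >= threshold]
        measures["Pause count" + suffix] = int(pauses.size)
        measures["Total pause duration" + suffix] = int(pauses.sum())
    return measures
```

For a record that starts at 10:00:00.000, has keystrokes at 10:00:00.200, 10:00:00.600 and 10:00:01.700, and stops at 10:00:02.000, the intervals are 200, 400, 1100 and 300 ms. Under these assumptions the record has three pauses at the 300 ms threshold (1800 ms in total) and one pause at both the 500 ms and 1 s thresholds (1100 ms), with Record duration and Total duration both equal to 2000 ms.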