Skip to content

Latest commit

 

History

History
105 lines (81 loc) · 7.05 KB

DATASET.md

File metadata and controls

105 lines (81 loc) · 7.05 KB

Documentation regarding the EVE dataset

Introduction

The EVE dataset is a video dataset of Point-of-Gaze (on a 25 inch computer display) during the viewing of image, video, and text-based (wikipedia) stimuli. It consists of 54 participants (39 train / 5 validation / 10 test), with video taken from 4 different views while they gaze upon the presented stimuli.

For further information, and to gain access to the dataset, please follow the instructions at the EVE project page.

Dataset folder structure

On receipt (and download) of the dataset, you will find that it consists of 54 folders, each corresponding to a single dataset participant. The following folders should exist:

train01, train02, ..., train39

val01, val02, ..., val05

test01, test02, ..., test10

Each folder will consist of subfolders with names similar to the following:

step008_image_MIT-i2263021117
step009_image_MIT-i2267703789
step010_image_MIT-istatic-outdoor-street-city-cambridge-uk-IMG-8893
step011_image_MIT-i1325514089
...
step030_video_diem-harry-potter-6-trailer
step031_video_diem-movie-trailer-ice-age-3
step032_video_vagba-track01
...
step116_video_Wikimedia-Washlets-high-tech-toilets-in-Japan
step117_video_Wikimedia-Barack-Obama-inaugural-address
step120_wikipedia_wikipedia-random

which corresponds to approximately:

  • 60 image stimuli (each exposed for 3 seconds)
  • 12-minutes worth of video stimuli
  • 3x 2-minute Wikipedia gazing

The image and video stimuli are sourced from the following

Stimulus folder contents

In each stimulus folder, one can find data files pertaining to the 4 different camera views (basler, webcam_l, webcam_c, webcam_r):

  • {camera}.mp4: the video recording after removal of camera distortion
  • {camera}_eyes.mp4: a video of the 2 eyes, after "data normalization"
  • {camera}_face.mp4: a video of the face, after "data normalization"
  • {camera}.timestamps.txt: a file containing the timestamps of each frame present in the MP4 files above
  • {camera}.h5: an HDF archive containing intermediate values from pre-processing and the ground-truth labels associated with the above files

In addition, we provide information regarding the screen content with the following files:

  • screen.mouse.txt: a file where each line contains <timestamp> <mouse x-position in pixels> <mouse y-position in pixels>
  • screen.mp4: a full 1080p recording of the screen
  • screen.128x72.mp4: a downscaled version of the full video for the purpose of faster data loading during training
  • screen.timestamps.txt: a file containing the timestamps of each frame present in the MP4 files above

HDF file format

The following keys are provided directly as shown (presented with name - shape of array):

  • camera_matrix - (3, 3) - camera calibration matrix using the pinhole camera model
  • camera_transformation - (4, 4) - the full transformation to bring a point from the screen coordinate system to the given camera's coordinate system
  • inv_camera_transformation - (4, 4) - the inverse of the previous line
  • millimeters_per_pixel - (2, ) - the x/y scaling factors to convert pixels to millimeters
  • pixels_per_millimeter- (2, ) - the inverse of the previous line

The following values consist of both the actual values (under the field data) and whether the value is valid as described by the Tobii firmware, and via failures in the data pre-processing procedure (under the field validity). The shapes described below are of the actual values (under the field data).

  • facial_landmarks - (N, 68, 2) - facial landmarks (3D prediction, but only u,v retained) detected using FAN wrt the full camera frame
  • head_rvec - (N, 180, 3, 1) - the rotation of the head as determined via cv2.solvePnP
  • head_tvec - (N, 180, 3, 1) - the translation of the head as determined via cv2.solvePnP
  • left_p - (N, ) - pupil size in millimeters of the left eye
  • right_p - (N, ) - pupil size in millimeters of the right eye

The so-called data normalization procedure (described further later) is used to yield image patches specifically of the face, left-eye and right-eye. This procedure yields the following values, where N is the number of frames:

  • {face,left,right}_PoG_tobii - (N, 2) - on-screen pixel coordinates for Point-of-Gaze as estimated by the Tobii Pro Spectrum device
  • {face,left,right}_g_tobii - (N, 2) - the roll-removed gaze direction after data normalization in spherical coordinates
  • {face,left,right}_R - (N, 3, 3) - the rotation correction applied to the raw gaze direction vector (line 72 of example code)
  • {face,left,right}_W - (N, 3, 3) - the perspective transform matrix (line 74 of example code)
  • {face,left,right}_h - (N, 2) - the roll-removed head orientation after data normalization in spherical coordinates
  • {face,left,right}_o - (N, 3) - the 3D origin of gaze as defined for the particular patch (face, left-eye, or right-eye)

A note on the data normalization procedure

For a detailed explanation of this procedure, please refer to the paper, example code, and materials provided at https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/gaze-based-human-computer-interaction/revisiting-data-normalization-for-appearance-based-gaze-estimation/

Simply speaking, this procedure provides improved consistency when producing eye or face patches via careful control of a virtual camera. In this paper by Zhang et al. an improvement was proposed to the data normalization method, and thus we use this adjusted method in processing EVE's ground-truth.

Please refer to the official example Python script for data normalization (https://www.mpi-inf.mpg.de/fileadmin/inf/d2/xucong/data_normalization_code.zip) to find the corresponding measures as shown above, demonstrated by code.

A note on the timestamps

For the purpose of this dataset, all timestamps may be considered as being synchronized. However, our camera synchronization is a best effort and due to the hardware and firmware involved, reliable synchronization exists only between the basler camera and the eye tracking data. The webcams (webcam_l, webcam_c, webcam_r) were synchronized via the system timestamp as provided by the Video4Linux driver and may be imprecise.