GitHub - midusi/LSA-T at v1.0

LSA-T: The first continuous LSA dataset

LSA-T is the first continuous Argentinian Sign Language (LSA) dataset. It contains 14,880 sentence-level videos of LSA extracted from the CN Sordos YouTube channel, with labels and keypoint annotations for each signer. Videos are 30 FPS, full HD (1920x1080).

Download link (45GB compressed)
Visualization notebook
Presentation paper (preprint PDF)

Format

Samples are organized in directories according to the playlists and video they belong to. For each sample i there are four files:

i.mp4: the clip corresponding to the ith line of subtitles.
i.json contains:
- label: the line of subtitles corresponding to the clip.
- start: time in seconds where the subtitle starts.
- end: time in seconds where the subtitle ends.
- video: title of the video which the clip belongs to.
- playlist: title of the playlist which the clip belongs to.
i_ap.json: the raw AlphaPose results for the clip, using Halpe keypoints in AlphaPose's default output format.
i_signer.json contains:
- scores: for each person in the clip, the amount of "movement" in their hands. It is used to infer who the signer is.
- roi: the region of interest considered for the clip (bounding box of the inferred signer).
- keypoints: list of keypoints of the inferred signer for each frame, in the same format as in i_ap.json.
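
As a rough illustration of the layout above, the following sketch reads the metadata and signer files of one sample using only the Python standard library. The helper name `load_sample` is hypothetical and not part of the `lsat` package, and it assumes that `i` is the integer stem of the file names and that all files of a sample sit in the same directory.

```python
import json
from pathlib import Path


def load_sample(video_dir, i):
    """Read i.json and i_signer.json for sample i (hypothetical helper)."""
    video_dir = Path(video_dir)
    with open(video_dir / f"{i}.json", encoding="utf-8") as f:
        meta = json.load(f)
    with open(video_dir / f"{i}_signer.json", encoding="utf-8") as f:
        signer = json.load(f)
    return {
        "clip_path": video_dir / f"{i}.mp4",       # the clip itself is not opened here
        "label": meta["label"],                    # subtitle line for the clip
        "duration": meta["end"] - meta["start"],   # clip length in seconds
        "roi": signer["roi"],                      # bounding box of the inferred signer
        "keypoints": signer["keypoints"],          # per-frame keypoints of the signer
    }
```

The raw AlphaPose output in i_ap.json is left out on purpose, since i_signer.json already holds the keypoints of the inferred signer.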

Usage

This repository can be installed via pip and contains the LSA_Dataset class (in the lsat.dataset.LSA_Dataset module). This class inherits from the PyTorch Dataset class and implements all the methods needed to use it with a PyTorch DataLoader. It also manages downloading and extracting the database.

Useful transforms for the clips and keypoints are also provided in lsat.dataset.transforms.

Statistics and comparison with other DBs

	LSA-T	PHOENIX*	SIGNUM	CSL	GSL	KETI
language	Spanish	German	German	Chinese	Greek	Korean
sign language	LSA	GSL	GSL	CSL	GSL	KLS
real life	Yes	Yes	No	No	No	No
signers	103	9	25	50	7	14
duration (h)	21.78	10.71	55.3	100+	9.51	28
# samples	14,880	7096	33,210	25,000	10,295	14,672
# unique sentences	14,254	5672	780	100	331	105
% unique sentences	95.79%	79.93%	2.35%	0.4%	3.21%	0.71%
vocab. size (w)	14,239	2887	N/A	178	N/A	419
# singletons (w)	7150	1077	0	0	0	0
% singletons (w)	50.21%	37.3%	0%	0%	0%	0%
vocab. size (gl)	-	1066	450	-	310	524
# singletons (gl)	-	337	0	-	0	0
% singletons (gl)	-	31.61%	0%	-	0%	0%
resolution	1920x1080	210x260	776x578	1920x1080	848x480	1920x1080
fps	30	25	30	30	30	30

*Data was not available for the whole PHOENIX dataset, so the table shows statistics for its train set.
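
To make the word-level (w) rows of the table concrete, here is a minimal sketch of how such statistics could be computed from a list of subtitle lines. It assumes that words are lowercased, whitespace-separated tokens and that a singleton is a word occurring exactly once in the whole dataset; the tokenization actually used for the table is not stated here, so real numbers may differ. The function name `sentence_stats` is hypothetical.

```python
from collections import Counter


def sentence_stats(labels):
    """Sentence- and word-level statistics for a list of subtitle lines."""
    unique_sentences = set(labels)
    # Assumption: lowercase, whitespace-separated tokens.
    word_counts = Counter(w for s in labels for w in s.lower().split())
    # Assumption: a singleton is a word seen exactly once overall.
    singletons = sum(1 for c in word_counts.values() if c == 1)
    n_samples = len(labels)
    n_vocab = len(word_counts)
    return {
        "samples": n_samples,
        "unique_sentences": len(unique_sentences),
        "pct_unique_sentences": 100 * len(unique_sentences) / n_samples if n_samples else 0.0,
        "vocab_size": n_vocab,
        "singletons": singletons,
        "pct_singletons": 100 * singletons / n_vocab if n_vocab else 0.0,
    }
```

The percentage rows follow directly from the count rows: for LSA-T, 14,254 / 14,880 gives 95.79% unique sentences and 7150 / 14,239 gives 50.21% singletons.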

Evaluation splits

	LSA-T	Full version (train)	Full version (test)	Reduced version (train)	Reduced version (test)
signers	103	X	X	X	X
duration [h]	21.78	17.49	4.29	15.85	3.89
# sentences	14,880	11,065	2735	3767	910
% unique sentences	95.79%	96.64%	92.78%	96.88%	98.35%
vocab. size	14,239	12,385	5546	2694	1579
% singletons	50.21%	52.01%	61.9%	23.2%	48.83%
% sentences with singletons	34.97%	40.98%	67.97%	14.36%	54.29%
% sentences with words not in train vocabulary	-	-	59.2%	-	84.5%
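
The last row of the table measures how many test sentences contain at least one word never seen during training. A minimal sketch of that computation follows, assuming lowercased, whitespace-separated tokens; the function name `pct_sentences_with_oov` is hypothetical and the tokenization behind the reported figures may differ.

```python
def pct_sentences_with_oov(train_labels, test_labels):
    """Percentage of test sentences with at least one word absent from the train vocabulary."""
    # Assumption: lowercase, whitespace-separated tokens.
    train_vocab = {w for s in train_labels for w in s.lower().split()}
    with_oov = sum(
        1
        for s in test_labels
        if any(w not in train_vocab for w in s.lower().split())
    )
    return 100 * with_oov / len(test_labels) if test_labels else 0.0
```

For example, with train labels `["hola mundo", "buen dia"]` and test labels `["hola dia", "hola amigo"]`, only the second test sentence contains an unseen word ("amigo"), so the function returns 50.0.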

Citation

@article{bianco2022lsa,
  title={LSA-T: The first continuous Argentinian Sign Language dataset for Sign Language Translation}, 
  author={Bianco, Pedro Dal and R{\'\i}os, Gast{\'o}n and Ronchetti, Franco and Quiroga, Facundo and Stanchi, Oscar and Hasperu{\'e}, Waldo and Rosete, Alejandro},
  journal={arXiv preprint arXiv:2211.15481},
  year={2022}
}
