About dataset statistics #15

yangb0o · 2024-12-24T06:30:18Z

Hi,

I noticed that the description mentions:

In addition, to help researchers become familiar with our data and run quick experiments, we are releasing a demo and a small version of the EB-NeRD by randomly sampling 5,000 and 50,000 users and their behavior logs from the full dataset.

However, when I count the distinct users in the small version, I get 18,827 rather than 50,000. The code I used for the count is as follows:

import os
import pyarrow.parquet as pq


DATA_DIR = './small'


# Load the behavior logs of both splits (test_df holds the validation split).
train_df = pq.ParquetFile(os.path.join(DATA_DIR, 'train', 'behaviors.parquet')).read().to_pandas()
test_df = pq.ParquetFile(os.path.join(DATA_DIR, 'validation', 'behaviors.parquet')).read().to_pandas()
# or
# train_df = pq.ParquetFile(os.path.join(DATA_DIR, 'train', 'history.parquet')).read().to_pandas()
# test_df = pq.ParquetFile(os.path.join(DATA_DIR, 'validation', 'history.parquet')).read().to_pandas()

# Count the distinct user IDs that appear in either split.
user_ids = set(train_df['user_id']) | set(test_df['user_id'])

print(len(user_ids))
# Output: 18827
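
A per-split breakdown might help narrow down where the gap comes from. The following is only a minimal sketch and not part of the script above: `user_overlap_stats` is a hypothetical helper that takes two collections of user IDs (for example, the `user_id` column of each split) and reports the distinct count in each split, the number of users present in both, and the size of the union. The IDs in the example call are toy values, not data from EB-NeRD.

```python
def user_overlap_stats(train_user_ids, val_user_ids):
    """Count distinct user IDs per split, in both splits, and overall."""
    train_users = set(train_user_ids)
    val_users = set(val_user_ids)
    return {
        'train': len(train_users),
        'validation': len(val_users),
        'both': len(train_users & val_users),
        'total': len(train_users | val_users),
    }


# Toy IDs for illustration only; with the real data one would pass the
# user_id column of the train and validation DataFrames instead.
stats = user_overlap_stats([1, 2, 2, 3], [3, 4])
print(stats)
# {'train': 3, 'validation': 2, 'both': 1, 'total': 4}
```

If `total` equals the single-number count, the two approaches agree, and the `train`, `validation`, and `both` figures show how the users are distributed between the splits.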

I would like to know the reason for this. Thanks!
