Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename main branch #3

Merged
merged 10 commits into from
Feb 12, 2025
Merged

Rename main branch #3

merged 10 commits into from
Feb 12, 2025

Conversation

dey4ss
Copy link
Member

@dey4ss dey4ss commented Feb 6, 2025

This PR adapts the GH action to the renamed main branch, adds a comparison of generated data and query parameters to the action, and fixes small compiler warnings encountered with gcc-13.

Note on the change in phash.c

jcch-dbgen/skew/phash.c

Lines 85 to 91 in 55da3ba

// Hyrise: cast subtrahend as long. Otherwise, we got inconsistent offsets on macOS/clang/ARM. E.g., for
// `key = 79`, `tbl_size = 100`, we got `row = 3`, `(0.18 + row * 0.2) * tbl_size) = 78.0`, but `offset = 0`
// (instead of 1). With the cast, data is consistent across our tested systems.
// Note that this means we have a few customers/suppliers in the dataset (5 for SF 0.01, 8 for SFs 1 and 10)
// that have a different `nationkey` compared to the version without the cast. We chose to trade off these
// negligible discrepancies for consistency.
long offset = key - (long)((0.18 + row * 0.2) * tbl_size);

I generated some data and added a comparison to the GH action. Thus, I noticed that there were discrepancies between my machine (+ GH macOS runners) and the GH Ubuntu runners (+ nemea). I am not 100% sure if it's an STL or a CPU architecture thing.

The generated data is not 100% accurate compared to the current master branch BUT at least it is consistent now. For the customer/supplier tuples in the skewed dataset, we introduce a systematic bias in the nationkey assignment, but it is not clear if that error also favors a higher bin here:

jcch-dbgen/skew/phash.c

Lines 62 to 67 in a7cfdd2

static uint16_t nations_map[25] = /* mapping between countries and their keys */
{15, 0, 5, 14, 16, /* AFRICA (MOROCCO | ALGERIA, ETHIOPIA, KENYA, MOZAMBIQUE) */
24, 1, 2, 3, 17, /* AMERICA (UNITED STATES | ARGENTINA, BRAZIL, CANADA, PERU)*/
18, 9, 12, 21, 8, /* ASIA (CHINA | INDONESIA, JAPAN, VIETNAM, INDIA)*/
7, 6, 22, 19, 23, /* EUROPE (GERMANY | FRANCE, RUSSIA, ROMANIA, UNITED KINGDOM*/
4, 10, 11, 13, 20}; /* MIDDLE EAST (EGYPT | IRAN, IRAQ, JORDAN, SAUDI ARABIA)*/

However, as it is a consistant bias affecting all tuples/nationkeys the same, it should not make a difference in the end.

An alternative to avoid the systematic error would be to use lround(). The results are also consistent across the systems. The differences for both compared to the master are the same: 5 tuples for SF 0.01, 8 tuples for SFs 1 and 10.

@dey4ss dey4ss requested a review from Bouncner February 12, 2025 14:08
@dey4ss dey4ss changed the title Prepare branch renaming Rename main branch Feb 12, 2025
@dey4ss dey4ss merged commit e6c7df6 into main Feb 12, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants