Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Hash @indices can grow larger than Int32::MAX bytes #15347

Conversation

ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Jan 15, 2025

fixes #15341

See #15341 (comment) for details.

@ysbaddaden ysbaddaden added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:collection labels Jan 15, 2025
@ysbaddaden ysbaddaden self-assigned this Jan 15, 2025
@straight-shoota
Copy link
Member

straight-shoota commented Jan 15, 2025

We should add a spec for this. Something like Hash(String, String).new(initial_capacity: Int32::MAX) would do.

@ysbaddaden
Copy link
Contributor Author

Oh, pw2ceil(Int32::MAX) becomes Int32::MAX + 1 and thus raises a math overflow (oops). We shall fix that too.

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jan 16, 2025

The spec is painfully slow, wasting over 9s to allocate & initialize the memory 😕

  Hash creates with max capacity
    9.21 seconds spec/std/hash_spec.cr:1396

The overall std spec runtime is 45s on my computer. Adding 9s has a +20% impact 😰

It does allocate too much memory, which breaks on CI, is definitely too large on 32-bit targets, and is far too slow overall, even on fast hardware.
@straight-shoota
Copy link
Member

The spec is painfully slow, wasting over 9s to allocate & initialize the memory

Ouch that's a lot 🙈 Of course, we're trying to allocate 10GB total (8 for the indices, 2 for the entries).
Certainly we shouldn't run this by default. Maybe put it with the manual specs? (although that effectively means it'll almost never be tested 🤷).

What if we try to allocate just (Int32::MAX // 4) + 1 (allocates 2.5GB). This should validate fixing the original issue. It still takes 6 seconds, though (on my machine). So we still shouldn't run it on default. But at least it works doesn't require quite as much memory.

@ysbaddaden
Copy link
Contributor Author

Where are the manual specs 🔍

@straight-shoota
Copy link
Member

Where are the manual specs 🔍

That's exactly what I meant with "it'll almost never be tested" 😆

👉 https://github.com/crystal-lang/crystal/tree/master/spec/manual

@ysbaddaden
Copy link
Contributor Author

It actually allocates even more memory than that.

With Int32::MAX that's ~18GB:

  • @entries: 2147483647 bytes (~2GB)
  • @indices: 17179869176 bytes (~16GB)

With Int32::MAX // 4 + 1 that's 4.5GB:

  • @entries: 536870912 bytes (512MB)
  • @indices: 4294967296 bytes (4GB)

In fact I wonder if the indices are worth taking so much memory.

@straight-shoota
Copy link
Member

Oh, right. It's capacity * 2 * 4 for the indices malloc.

Yeah that's quite a lot for huge collections. Probably quite a lot for smaller ones, too?

@straight-shoota straight-shoota added this to the 1.16.0 milestone Jan 16, 2025
@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jan 16, 2025

Argh, I got the megabytes wrong for @entries I only reported the number of entries, the actual bytes are size * sizeof(Entry(K, V)), so 12 times more bytes per entry for Hash(Int32, Int32) 😨

@straight-shoota
Copy link
Member

One thing is quite certain then: It's pretty hard to understand the code and reason about the allocations 🙈

@crysbot
Copy link
Collaborator

crysbot commented Jan 16, 2025

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/goldbachs-conjecture-and-crystal-code/7579/27

@straight-shoota straight-shoota merged commit c4682db into crystal-lang:master Jan 18, 2025
69 of 70 checks passed
@ysbaddaden ysbaddaden deleted the fix/hash-indices-size-can-grow-larger-than-int32 branch January 18, 2025 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:collection
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inexplicable arithmetic overflow ??
3 participants