
Remove bech32 data columns (or atleast make them optional) #1835

Open
rdlrt opened this issue Sep 1, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@rdlrt

rdlrt commented Sep 1, 2024

Problem Report

Currently dbsync stores bech32 encoded strings for various columns across tables.

  1. These text fields are indexed as hash type, which is sub-optimal because hash indexes cannot be built in parallel. Thus, restoring from a snapshot ends up waiting 10-12 hours for the index against tx_out.address, as the build only uses a single CPU/worker (at least as of PostgreSQL 16, as listed here).

  2. Secondly, the encoding standards are often defined by a CIP, and a change to a CIP (e.g. CIP-129) would mean dbsync can only adopt it after rewriting previously stored data, which could be a sizeable and unnecessary data operation. The larger the history grows, the bigger the resistance will be to any naming improvement that carries a large data-migration dependency.

The request is for dbsync not to store bech32 data at all, or, if it must, to at least make those columns optional. Instead, consumers can perform the decoding/encoding in the WHERE clause of their queries by making use of extensions (e.g. pg_bech32, pg_base58, etc.). While the address field itself is hopefully moving to its own table soon as part of #1820, it is still overkill to store a duplicate copy of essentially the same data.
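To illustrate that the bech32 text column is fully derivable from the raw bytes (and vice versa), here is a minimal Python sketch of bech32 encoding/decoding following the BIP-173 reference algorithm (bech32 checksum, length limits omitted since Cardano relaxes them); the `addr` human-readable part and the 57-byte payload are illustrative, and a real consumer would more likely rely on an extension such as pg_bech32 inside Postgres:

```python
# Minimal bech32 encode/decode sketch (BIP-173 style). Illustrative only.
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"

def _polymod(values):
    gen = [0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3]
    chk = 1
    for v in values:
        b = chk >> 25
        chk = (chk & 0x1FFFFFF) << 5 ^ v
        for i in range(5):
            chk ^= gen[i] if ((b >> i) & 1) else 0
    return chk

def _hrp_expand(hrp):
    return [ord(c) >> 5 for c in hrp] + [0] + [ord(c) & 31 for c in hrp]

def _convertbits(data, frombits, tobits, pad):
    # Regroup a stream of frombits-sized values into tobits-sized values.
    acc, bits, out = 0, 0, []
    maxv = (1 << tobits) - 1
    for value in data:
        acc = (acc << frombits) | value
        bits += frombits
        while bits >= tobits:
            bits -= tobits
            out.append((acc >> bits) & maxv)
    if pad and bits:
        out.append((acc << (tobits - bits)) & maxv)
    return out

def bech32_encode(hrp: str, raw: bytes) -> str:
    data = _convertbits(raw, 8, 5, pad=True)
    poly = _polymod(_hrp_expand(hrp) + data + [0] * 6) ^ 1
    checksum = [(poly >> 5 * (5 - i)) & 31 for i in range(6)]
    return hrp + "1" + "".join(CHARSET[d] for d in data + checksum)

def bech32_decode(bech: str) -> bytes:
    hrp, _, datapart = bech.rpartition("1")
    data = [CHARSET.index(c) for c in datapart]
    if _polymod(_hrp_expand(hrp) + data) != 1:
        raise ValueError("bad checksum")
    return bytes(_convertbits(data[:-6], 5, 8, pad=False))

raw = bytes(range(57))        # e.g. a 57-byte Shelley base-address payload
text = bech32_encode("addr", raw)
assert bech32_decode(text) == raw  # round-trip: the text column is redundant
```

The round-trip shows why the text column carries no information the raw bytes don't already have.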

*** Positive Outcomes ***

  • Reduces the time taken to restore a snapshot (especially the additional 8-12 hours it currently waits on a single index), since btree index builds can be parallelised, allowing better utilisation of resources.
  • Reduces the overall storage size of the dbsync database.
  • Removes data that is essentially derivable from another column.
  • Improves maintenance time for keeping table statistics up to date, which in turn helps the query planner make better estimates (especially helpful when maintaining a materialized view or a separate cache table).
  • Consistency across users (currently some users depend on bech32-encoded fields, others on the raw format).
  • Lookups on bytea are faster than on text.
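As a rough illustration of the storage point above: bech32 stores 5 bits per character and adds a human-readable part, a separator and a 6-character checksum, so the text form is noticeably larger than the raw bytes. A small sketch (the 57-byte figure assumed below is the typical Shelley base-address payload size, an assumption not stated in this thread):

```python
from math import ceil

def bech32_text_len(hrp: str, n_raw_bytes: int) -> int:
    """Characters needed for the bech32 text form of n_raw_bytes raw bytes."""
    # hrp + '1' separator + data characters (5 bits each) + 6 checksum chars
    return len(hrp) + 1 + ceil(8 * n_raw_bytes / 5) + 6

# A typical Shelley base address is 57 raw bytes
# (1 header byte + 28-byte payment credential + 28-byte stake credential):
print(bech32_text_len("addr", 57))  # 103 text characters vs 57 raw bytes
```

So each stored text address is roughly 1.8x the raw size before any index overhead.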

*** Cons ***

  • Users would have to install a PostgreSQL extension and update their SQL queries once, replacing current bech32 columns with something like b32_encode(,encode(,'hex') ) and filters with something like b32_decode() (additional bits for the network indicator on addresses, or the script indicator on DReps, could be added to the extension itself to make it Cardano-specific).
@rdlrt rdlrt added the bug Something isn't working label Sep 1, 2024
@rdlrt rdlrt changed the title Make bech32 output optional (or remove them entirely) Remove bech32 data columns (or atleast make them optional) Sep 1, 2024
@gufmar
Contributor

gufmar commented Sep 1, 2024

  • Lookup on bytea being faster than text

This alone sounds like a good reason.
I generally like the proposed change.

@Fell-x27

Fell-x27 commented Sep 1, 2024

It's not just about Bech32; it's about text address representation in general. So, this applies to both Bech32 for Shelley and Base58 for Byron.

I don't think removal is a good idea, but making them optional would be the best solution. In fact, raw address data is sufficient for everything. It can be naturally indexed without hashing, and it takes up less storage, even if we're talking about a separate table for addresses in the future.

Cardano is about CBOR, not text. If you need to find any text address in the database, just convert it to raw data—whether on the client/backend side or even at the PostgreSQL level. This will be even faster than implicit text hashing.

Moreover, as for Shelley, we don't use text view at all due to the nature of the addresses. We only need the spending part, which is already stored as binary data. If we need to build a text address, we do it manually by deriving stake keys ourselves to ensure they are correct, and then we build the full address.

It might be a good idea to use the payment_cred column for Byron addresses too and define an explicit column for the address era. Of course, we can determine whether it's Byron or not by certain characteristics like the length of the binary, but this is not a reliable solution—it's an opportunistic workaround. Tomorrow, there could be a fourth address type, and this hack would no longer be effective.

@kderme
Contributor

kderme commented Sep 1, 2024

It's a bit unfortunate that we recently removed the raw bytea address, on the assumption that the text address is used more frequently.

These text fields are indexed as hash type

It doesn't really change for bytea, especially since the size of addresses can be quite big due to older bugs.

@rdlrt
Author

rdlrt commented Sep 1, 2024

It doesn't really change for bytea, especially since the size of addresses can be quite big due to older bugs.

The bytea fields can still be btree-indexed, as for the block/tx tables, which allows parallel index builds.

@kderme
Contributor

kderme commented Sep 1, 2024

btree-indexed data must not exceed 2704 bytes, but we have:

select max (length (address)) from tx_out;
  max  
-------
 30864

Raw bytes are of course slightly smaller, but still.
Also, hash indexes are much faster to query in cases like this.

@rdlrt
Author

rdlrt commented Sep 1, 2024

Ugh, these 17 records on mainnet are from the Byron era :( and pointer addresses are now retired. We'd have to live with that indexing forever on Cardano? Because currently (5 years of history), that index creation takes 8-12 hours... in coming years it might take multiple days for one index alone.

@rdlrt rdlrt closed this as completed Sep 1, 2024
@rdlrt rdlrt reopened this Sep 1, 2024
@kderme
Contributor

kderme commented Sep 1, 2024

There should be some workaround; I haven't put much thought into it. We could insert a hash of the address when it is over the limit and introduce a new huge_address table. Just thinking out loud; I'm not sure if this would cause too much disturbance for users.
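A minimal sketch of that workaround idea: the huge_address table name and the 2704-byte btree cutoff come from this thread, while the choice of sha256 as the digest and the helper's name/shape are assumptions for illustration only:

```python
import hashlib

BTREE_LIMIT = 2704  # max btree-indexable value size, per the discussion above

def storable_address(raw: bytes) -> tuple[bytes, bool]:
    """Return (value to store in tx_out, needs_huge_address_row).

    Oversized addresses are replaced by their sha256 digest (assumption);
    the full bytes would live in a separate huge_address table keyed by
    that digest, so the indexed column always fits the btree limit.
    """
    if len(raw) <= BTREE_LIMIT:
        return raw, False
    return hashlib.sha256(raw).digest(), True

ok = storable_address(bytes(57))        # typical address: stored as-is
overflow = storable_address(bytes(30864))  # the mainnet outlier size quoted above
assert ok == (bytes(57), False)
assert overflow[1] and len(overflow[0]) == 32  # digest fits well under the limit
```

The trade-off is exactly the "disturbance" mentioned: consumers joining on the address column would need to know when to follow the indirection into huge_address.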

@rdlrt
Author

rdlrt commented Sep 1, 2024

Yeah - that was my preference too (to split them out to a separate table or column, especially with address detail moving out): move those anomalies out and be able to optimise tx_out better.
I didn't find a single hit for those addresses in the query logs for Koios either.

Projects
None yet
Development

No branches or pull requests

4 participants