Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve resilience of Live Share to psql -f #123

Open
thenonameguy opened this issue Nov 1, 2024 · 5 comments
Open

Improve resilience of Live Share to psql -f #123

thenonameguy opened this issue Nov 1, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@thenonameguy
Copy link

thenonameguy commented Nov 1, 2024

Bug report

While building a demo using the Live Share feature of database.build, I've seen many intermittent failures of the connection. Usually fixed by stopping the Live Share, reloading the tab and restarting the Live Share.

Describe the bug

While running many statements (via psql -f) against a Live Share'd non-trivial DB the server/proxy regularly loses sync, while seemingly hallucinating gigabyte-sized messsage sizes (1853122928 bytes or 1667196005 bytes as examples).

psql:schemamap_init.sql:462: lost synchronization with server: got message type "e", length 1667196005
psql:schemamap_init.sql:462: error: connection to server was lost

To Reproduce

Steps to reproduce the behavior, please provide code snippets or a repository:

  1. Open database.build
  2. Run prompt: Create a well-modeled ERP schema, with multi-tenancy, constraints on most columns (especially tenant_id) and mock data.
  3. Enable Live Share, copy URI
  4. Download schemamap_init.txt (sql script)
  5. Run: psql -f ./schemamap_init.txt "$URI"
  6. Expectation: it should succeed, although there are sometimes failures here as well
  7. Try rerunning step 5 a few times, eventually you will a get a "lost synchronization".

Expected behavior

The client does not get out of sync.

System information

  • OS:
    • MacOS: psql (PostgreSQL) 15.8
    • Linux: psql (PostgreSQL) 15.8
  • Browser: Firefox 131.0.3 (aarch64)
  • Version of supabase-js: N/A
  • Version of Node.js: N/A

Additional context

I'm aware the client is not major 16 version, but for these simple statement executions, it should not matter.

Another invocation failed later:

psql:schemamap_init.txt:1361: lost synchronization with server: got message type "m", length 1701737504
psql:schemamap_init.txt:1361: error: connection to server was lost

I haven't been able to reproduce the bug on Google Chrome 130.0.6723.92 (Official Build) (arm64)

@thenonameguy thenonameguy added the bug Something isn't working label Nov 1, 2024
@gregnr
Copy link
Collaborator

gregnr commented Nov 1, 2024

Thanks for reporting @thenonameguy, we'll dig into this. cc @jgoux

@gregnr
Copy link
Collaborator

gregnr commented Nov 1, 2024

@thenonameguy by chance do you have the original migrations create by AI prior to using Live Share to help us closer reproduce this?

@thenonameguy
Copy link
Author

thenonameguy commented Nov 1, 2024

I made something even better. Followed the repro steps and stopped live sharing after receiving:

Failed to install SDK: psql:<stdin>:17: lost synchronization with server: got message type "o", length 1920229740
psql:<stdin>:17: error: connection to server was lost

Then exported the state of the PGLite:
erp-schema-with-multi-tenancy-1730479427654.gz

For completeness, here are the migrations:

-- Migrations will appear here as you chat with AI

create table tenants (
  id bigint primary key generated always as identity,
  name text not null,
  created_at timestamptz not null default now()
);

create table users (
  id bigint primary key generated always as identity,
  tenant_id bigint not null,
  username text not null unique,
  email text not null unique,
  password text not null,
  created_at timestamptz not null default now(),
  foreign key (tenant_id) references tenants (id) on delete cascade
);

create table products (
  id bigint primary key generated always as identity,
  tenant_id bigint not null,
  name text not null,
  description text,
  price numeric(10, 2) not null check (price > 0),
  created_at timestamptz not null default now(),
  foreign key (tenant_id) references tenants (id) on delete cascade
);

create table orders (
  id bigint primary key generated always as identity,
  tenant_id bigint not null,
  user_id bigint not null,
  total numeric(10, 2) not null check (total >= 0),
  status text not null check (status in ('pending', 'completed', 'cancelled')),
  created_at timestamptz not null default now(),
  foreign key (tenant_id) references tenants (id) on delete cascade,
  foreign key (user_id) references users (id) on delete set null
);

create table order_items (
  id bigint primary key generated always as identity,
  order_id bigint not null,
  product_id bigint not null,
  quantity int not null check (quantity > 0),
  price numeric(10, 2) not null check (price > 0),
  foreign key (order_id) references orders (id) on delete cascade,
  foreign key (product_id) references products (id) on delete cascade
);

@jgoux
Copy link
Contributor

jgoux commented Nov 7, 2024

Hello @thenonameguy and thanks for your detailed report!

Unfortunately I wasn't able to reproduce the issue on my end. I tried in Firefox using your initial migration file then enabling Live Sharing and executing your schemamap_init.sql file using psql v15.8. I did like ~20 runs of it and they all succeeded.

Let's wait to see if there are other reports. 👍

Also, we just released the ability to drop .sql files directly into the chat, so you can now run your script directly against your in-browser database without going through Live Share! 🎊

@thenonameguy
Copy link
Author

Thanks for looking into it @jgoux!
My current hunch is that this might be related to the Websocket proxy.

I've had some friends/ex-colleagues try out the demo flow with Live Share in different countries/web browsers, and the issues popped up the same.

The demo is available here, after clicking "Yes".
https://app.schemamap.io/demo

Since I'm using a single VPS to connect to the Websocket proxy, might it be related to hitting some form of rate limits?

I've been also getting portal "p2" does not exist errors as well since.

I'm adding metric & trace collection to connection failures so I can provide a bit more detail on the ticket, please don't close it just yet. Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants