Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Natural join seems to eliminate rows which it shouldn't #977

Open
alexpdp7 opened this issue Oct 15, 2019 · 6 comments
Open

Natural join seems to eliminate rows which it shouldn't #977

alexpdp7 opened this issue Oct 15, 2019 · 6 comments
Labels
blocked Some other issue is blocking this bug Something isn't working

Comments

@alexpdp7
Copy link

alexpdp7 commented Oct 15, 2019

MySQL [gitbase]> select blob_hash, repository_id from blobs natural join repositories where blob_hash in ('93ec5b4525363844ddb1981adf1586ebddbc21c1', 'aad34590345310fe813fd1d9eff868afc4cea10c', 'ed82eb69daf806e521840f4320ea80d4fe0af435');
+------------------------------------------+-------------------------------------+
| blob_hash                                | repository_id                       |
+------------------------------------------+-------------------------------------+
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/javascript-driver |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/enry               |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/python-driver     |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/go-mysql-server    |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/ruby-driver       |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/gitbase            |
+------------------------------------------+-------------------------------------+
6 rows in set (14.90 sec)

MySQL [gitbase]> select blob_hash, repository_id from blobs where blob_hash in ('93ec5b4525363844ddb1981adf1586ebddbc21c1', 'aad34590345310fe813fd1d9eff868afc4cea10c', 'ed82eb69daf806e521840f4320ea80d4fe0af435');
+------------------------------------------+-------------------------------------+
| blob_hash                                | repository_id                       |
+------------------------------------------+-------------------------------------+
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/python-driver     |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/javascript-driver |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/enry               |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/ruby-driver       |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/gitbase            |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/gitbase            |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/go-mysql-server    |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/go-mysql-server    |
+------------------------------------------+-------------------------------------+
8 rows in set (0.13 sec)

also note that removing the natural join makes things go much faster- it was my understanding that normally we want to join with repositories to benefit from some specific optimizations (although I'm guessing that filtering with blob_hash makes those optimizations moot).

@erizocosmico erizocosmico added the bug Something isn't working label Oct 15, 2019
@erizocosmico erizocosmico self-assigned this Oct 15, 2019
@erizocosmico
Copy link
Contributor

Normally we don't want to join with repositories unless there are already joins involved. When querying a single table like blobs, they usually have other optimizations in place.

For example, blobs with a filter like blob_hash IN list only reads the given blobs in each repository. That's why no join it's faster.

As with everything: it depends on the query, depending on what you want, some optimizations may be better than others for performance.

In any case, I reproduced the bug and there's actually an issue. It seems to not return the repeated rows for some reason.

@alexpdp7
Copy link
Author

Yeah, I suspected something like that. Anyway, for my use case lack of duplicated rows is not an issue, so for my this is not high priority.

@erizocosmico
Copy link
Contributor

This bug is really weird. The natural join is the one returning the correct result. If you remove the optimization in blobs table it returns the same. So, there something going on because repo.BlobObjects() doesn't return these blobs, but accessing them directly does

@erizocosmico
Copy link
Contributor

@alexpdp7 are you using siva files got from gitcollector?

I tried with regular repositories and it didn't happen.

@alexpdp7
Copy link
Author

Yup, it's using siva

@erizocosmico
Copy link
Contributor

Narrowed it down to a siva issue and reported it to go-borges: src-d/go-borges#90, so leaving this as blocked until it's solved on their side.

@erizocosmico erizocosmico added the blocked Some other issue is blocking this label Oct 15, 2019
@erizocosmico erizocosmico removed their assignment Oct 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked Some other issue is blocking this bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants