Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unified schema GitHub #45

Merged
merged 8 commits into from
Oct 29, 2019
Merged

Unified schema GitHub #45

merged 8 commits into from
Oct 29, 2019

Conversation

smacker
Copy link
Contributor

@smacker smacker commented Oct 14, 2019

Fix: #35

This breaks backward compatibility. Don't merge yet.
Tests check only provider-specific schema for now.


  • I have updated the CHANGELOG file according to the conventions in keepachangelog.com
  • This PR contains changes that do not require a mention in the CHANGELOG file

Copy link
Contributor

@dpordomingo dpordomingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM plus one suggestion and one question.

if err != nil {
return fmt.Errorf("failed to drop unified view pull_request_reviews: %v", err)
}
// TODO: state value rename COMMENTED to something else?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not add this as an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to fix it here. I just want to hear somebody else's opinion on the new name.
Current statuses are approved/rejected/commented. I would rename it to "viewed" (in one of bitbuckets review actually have a boolean variable "approved", which would map nicely on "viewed" but not "commented")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the benefit of "viewed", because I'd say that the purpose of a review is either approve, reject or comment the PR.
Also, I didn't understand you when mentioning BB reviews; but if it means that BB reviews can only happen to approve or not, I'd map them as approved/commented

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can keep "commented" if you prefer.

Copy link
Contributor

@dpordomingo dpordomingo Oct 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a strong opinion, for sure, but I'd vote for commented if you ask ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smacker could you please recap the values for the 3 different providers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github: APPROVED/CHANGES_REQUESTED/COMMENTED
bbserver: APPROVED/REVIEWED
for bbcould&gitlab there is a boolean field "approved" (according to the table) but maybe if we fetch data from activity endpoints we can get some text values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If have to go for a duality, I'd go for APPROVED/COMMENTED. Simply because a comment could also not be a review, but a review should always be a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVED/COMMENTED sounds good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 63 to 76
_, err := s.DB.ExecContext(ctx, "DROP MATERIALIZED VIEW IF EXISTS owners")
if err != nil {
return fmt.Errorf("failed to drop unified view owners: %v", err)
}
_, err = s.DB.ExecContext(ctx, fmt.Sprintf(`CREATE MATERIALIZED VIEW owners AS
SELECT login, name
FROM github_organizations_versioned WHERE %v = ANY(versions)`, v))
if err != nil {
return fmt.Errorf("failed to create unified view owners: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about something like the example below.
Imo, that way it could be reviewed more easily, and it could be also useful for documentation purposes because it will be more clear the mapping between the view, and its underlying data.

var views map[string]string = map[string]string{
  "owners": `SELECT login, name
	FROM github_organizations_versioned WHERE %v = ANY(versions)`,
  "users": `SELECT login, name
	FROM github_users_versioned WHERE %v = ANY(versions)`,
  "issues": `SELECT repository_owner, repository_name, repository_owner || '/' || repository_name AS repository_full_name,
           number, state, title, body,
           created_at, closed_at, updated_at, comments, user_id, user_login, htmlurl AS html_url, labels
	FROM github_issues_versioned WHERE %v = ANY(versions)`,
}

func createMaterializedVire(viewname string, viewFilter, version int) err {
  _, err := s.DB.ExecContext(ctx, fmt.Printf("DROP MATERIALIZED VIEW IF EXISTS %s", viewname))
  if err != nil {
    return fmt.Errorf("failed to drop unified view %s: %v", viewname, err)
  }

  _, err = s.DB.ExecContext(ctx, fmt.Sprintf(`CREATE MATERIALIZED VIEW %s AS %s`,
    fmt.Printf(viewFilter, version),
    viewname,
  ))
  if err != nil {
    return fmt.Errorf("failed to create unified view %s: %v", viewname, err)
  }

  return nil
}

func (s *DB) SetActiveVersion(ctx context.Context, v int) error {
  for name, query := range views {
    if err := createMaterializedVire(name, query, v); err != nil {
      return err
    }
  }

  return nil
}

Copy link
Contributor

@se7entyse7en se7entyse7en Oct 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, if we can do something like what @dpordomingo proposed I think it would more readable, instead of having a wall of sql statements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I change to a map it would still be a wall of sql queries. Let me think about how to improve it.

Copy link
Contributor Author

@smacker smacker Oct 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't come up with anything better than this: 59c1805

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mhmmm... doesn't change that much in terms of visual perception 😂. I'm ok in both ways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really prefer this other version, because now the SQL code and logic are clearly separated.
thanks!

Copy link
Contributor

@dpordomingo dpordomingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm, can you help me to understand the join?

SELECT c.repository_owner, c.repository_name, c.repository_owner || '/' || c.repository_name AS repository_full_name,
c.issue_number, c.created_at, c.body, c.user_id, c.user_login, c.htmlurl AS html_url
FROM github_issue_comments_versioned AS c
JOIN github_issues_versioned AS i ON
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we use only the content of github_issue_comments_versioned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github_issue_comments_versioned contains both comments for PRs and Issues.
In unified schema issue_comments contains only comments for issues and pull_request_comments only for pull requests. To distinguish them we need this join.

_, err = s.DB.ExecContext(ctx, fmt.Sprintf(`CREATE MATERIALIZED VIEW pull_request_comments AS
SELECT c.repository_owner, c.repository_name, c.repository_owner || '/' || c.repository_name AS repository_full_name,
issue_number AS pull_request_number, c.created_at, c.body, c.user_id, c.user_login, c.htmlurl AS html_url
FROM github_issue_comments_versioned AS c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we use only the content of github_pull_request_comments_versioned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for myself: UNION is missing here, will need to fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was fixed

@dpordomingo dpordomingo self-requested a review October 22, 2019 12:16
@se7entyse7en
Copy link
Contributor

@smacker could you rebase this now that we #57 landed in master?

@smacker
Copy link
Contributor Author

smacker commented Oct 22, 2019

@se7entyse7en Yes, I'll rebase and do something with the tests. But if you take a quick look at the changes to make sure you don't have any radical suggestions for improvements that would require rewriting everything, that would be nice.

@se7entyse7en
Copy link
Contributor

But if you take a quick look at the changes to make sure you don't have any radical suggestions for improvements that would require rewriting everything, that would be nice.

@smacker sure! I'll have a look 👍


_, err := s.DB.ExecContext(ctx, fmt.Sprintf(`CREATE OR REPLACE VIEW organizations AS
_, err := s.DB.ExecContext(ctx, "DROP MATERIALIZED VIEW IF EXISTS owners")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Premature optimization is the root of all evil, but could it be that using https://www.postgresql.org/docs/10/sql-refreshmaterializedview.html could be beneficial in terms of performances?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also once we have multiple providers we need to think about this, as we cannot make each provider's storer run these independently. We would end up with only the data of the last provider that finished if I didn't miss anything. I guess that we would need a single db committer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added drop here to be able to develop first of all. Also, we might want to keep it for the first (maybe unstable) versions to allow changing the views without migration. We will have to improve it at some point but I don't think we should do it right now.

single db committer

at the time of writing this code there was single commiter. It changed only in master. I'll fix it during rebase of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh. maybe you mean distributed mode here, not multiple downloaders that we use in our cli.
yes, there is no support for distributed mode yet. imo better implement it in a separate pr later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, I meant multiple downloaders in terms of multiple storers. If we implement a similar thing for BitBucket, for example, we will have a storer of a downloader for the first provider that would run SetCurrent which drops the materialized views and puts data, then a storer of a downloader for the second provider tha would do the same. So, in the end, we will have only data for the second provider, unless I missed something.

But for sure not something to be addressed now. Just a note for the nex things to do to keep in mind. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will have only data for the second provider

As we agreed in the DD, for now, we support only 1 provider.

For the problem with distributed mode I have created an issue: #86

Comment on lines 63 to 76
_, err := s.DB.ExecContext(ctx, "DROP MATERIALIZED VIEW IF EXISTS owners")
if err != nil {
return fmt.Errorf("failed to drop unified view owners: %v", err)
}
_, err = s.DB.ExecContext(ctx, fmt.Sprintf(`CREATE MATERIALIZED VIEW owners AS
SELECT login, name
FROM github_organizations_versioned WHERE %v = ANY(versions)`, v))
if err != nil {
return fmt.Errorf("failed to create unified view owners: %v", err)
}
Copy link
Contributor

@se7entyse7en se7entyse7en Oct 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, if we can do something like what @dpordomingo proposed I think it would more readable, instead of having a wall of sql statements.

@smacker smacker force-pushed the unified_schema_github branch 3 times, most recently from 682f070 to 36ab9cc Compare October 23, 2019 11:51
@smacker
Copy link
Contributor Author

smacker commented Oct 23, 2019

@kyrcha @lwsanty what should we do with the tests here? I can copy-paste the current test suite and duplicate the oracle. Or do you prefer to refactor them by yourself?

@kyrcha
Copy link
Contributor

kyrcha commented Oct 25, 2019

@smacker I've seen that you've changed the name of the views in the tests. Does it need anything else? If yes I could refactor whatever is needed.

@smacker
Copy link
Contributor Author

smacker commented Oct 25, 2019

We have new views like this one: https://github.com/src-d/metadata-retrieval/pull/45/files#diff-c52e0395f3662ad3a85b0fd2f30ecf39R77
So we need new tests for them (based on the same gh data).

Copy link
Contributor

@dpordomingo dpordomingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@smacker
Copy link
Contributor Author

smacker commented Oct 29, 2019

  1. Rebased on master
  2. PR reviews view with only 2 states
  3. Run this with srcd-ce after simple changes in charts (like removing ids, using uppercase for state) they work correctly.
  4. Integration testing for new views will be implemented by QA team.

Should I merge it now?

@se7entyse7en
Copy link
Contributor

yes let's merge! 🎉

@smacker smacker merged commit fe23146 into master Oct 29, 2019
@dpordomingo dpordomingo deleted the unified_schema_github branch December 17, 2019 06:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change GH metadata schema to fit the common schema
4 participants