Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add vertex attributes #67

Closed
wants to merge 21 commits into from
Closed

Add vertex attributes #67

wants to merge 21 commits into from

Conversation

flx5
Copy link
Contributor

@flx5 flx5 commented Nov 27, 2017

This adds vertex attributes as discussed in #34 .

Usage example:

x.data = ProjectData$new(project.conf = proj.conf)
x = NetworkBuilder$new(project.data = x.data, network.conf = net.conf)

mybins = as.POSIXct(c("2016-07-12 15:00:00", "2016-07-12 16:00:00", "2016-07-12 16:05:00", "2030-01-01 00:00:00"))
 cf.data = split.data.time.based(x.data, bins = mybins)
 my.networks = lapply(cf.data, FUN=function(range.data) {
     y = NetworkBuilder$new(project.data = range.data, network.conf = net.conf)
     return (y$get.author.network())
 })

 x.net.as.list = list("1970-01-01 00:00:00-2030-01-01 00:00:00" = x$get.author.network())

sample = add.commit.count(my.networks, x.data) # "range" omitted as it is the default value 
sample.cumulative = add.commit.count(my.networks, x.data, "cumulative")
sample.entire = add.commit.count(x.net.as.list, x.data)
sample.mail = add.mail.attribute(my.networks, x.data, "all")

Keep in mind that this is an early PR lacking proper documentation.

Copy link
Collaborator

@clhunsen clhunsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current set of patches is a very good start. We need to improve the implementation in various places, but this may take some more time.

start.end = get.range.bounds(range)

if(what == "cumulative" || what == "all") {
start.end[1] = as.POSIXct("1970-01-01 00:00:00", "GMT")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second argument is optional and is also pre-defined in other parts of the code (see also here). So, just omit the second argument here.

@@ -0,0 +1,96 @@
## (c) Felix Prasse, 2017
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename the file to util-networks-covariates.R as we clearly have a relation to the functionality in util-networks.R.

}

if(what == "all") {
start.end[2] = as.POSIXct("9999-01-01 00:00:00", "GMT")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment on Line 22.

#' @param project.data The entire project date
#' @param compute.attr The function to compute the attribute to add. Must return a named list
#' with the names being the name of the vertex.
#' @param cumulative Should the attribute be calculated from project start to end of range?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation is outdated. Please also add documentation for all other functions in this file.


#' Utility function to compute vertex attributes for a list of network
#'
#' Only network ranges with dates work!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Important: This function only works for lists of networks which have timestamps used in their range names."



add.mail.attribute = function(list.of.networks, data, what = c("range", "cumulative", "all")) {
compute.vertex.attributes(list.of.networks, data, "email.addresses", what,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the names of the attributes (in this case, "email.addresses"), we should keep to the names that are already used in the network configuration for the edge attributes or in the reading functions as column headers (e.g., here).

Thus, we would use "author.email" as the name for the vertex attribute here.

add.mail.attribute = function(list.of.networks, data, what = c("range", "cumulative", "all")) {
compute.vertex.attributes(list.of.networks, data, "email.addresses", what,
function(range.data, net) {
author.to.mail = range.data$get.author2commit()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is obsolete.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is not obsolete, but we can use something like get.author2mail instead. In the end, that does not make a difference.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. Simply overlooked this... 🤦‍♂️ Then, we should use range.data$get.author2mail() in Line 54 to keep consistent with other functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I've got this right, get.autor2mail only works if the same author has at least authored a commit and an email on the mailing list. But couldn't it happen (like in the test data) that an author never wrote an email?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. That can definitively happen. However, this problem can also appear the other way round: There are authors who wrote e-mails but never have committed.

Should we you use both data sources together and make a unique union then? That should not be a problem as the e-mail addresses used in the commit data and in the mail data are identical for each author.

One additional remark: Here we can definitely remove the what parameter as the e-mail address of a developer cannot be different for different network ranges due to the disambiguation algorithms of Codeface.

Copy link
Collaborator

@clhunsen clhunsen Nov 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the misunderstanding before. I talked with Thomas, opened an issue for this topic (#69), implemented the changes (254d4de), and, finally, opened a pull request (#71).

I suggest -- and Thomas confirmed -- that we add the attribute independent of the existence of the new email column in the author data (ProjectData$get.authors()): if the column exists, add the real e-mail addresses; if not, add NA as the vertex attribute. So, just wait for #71 to be merged and then rebase your branch and use the functionality.

add.commit.count = function(list.of.networks, data, what = c("range", "cumulative", "all")) {
compute.vertex.attributes(list.of.networks, data, "commit.count", what, function(range.data, net) {
commit.count.df = get.author.commit.count(range.data)[c("author.name", "freq")]
commit.count.list = structure(commit.count.df$freq, names = commit.count.df$author.name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please access columns in data.frames via the square-bracket notation. Here, you would use commit.count.df["freq"] and commit.count.df["author.name"].

Note: I will add a corresponding line to the contribution guide.

add.artifact.count = function(list.of.networks, data, what = c("range", "cumulative", "all")) {
compute.vertex.attributes(list.of.networks, data, "artifact.count", what,
function(range.data, net) {
return (lapply(range.data$get.author2artifact(), length))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bockthom: Do we want to have the amount of unique touched artifacts here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do. The idea is, to use the amount of touched artifacts as a predictor for a kind of "importance". So, basically, we are interested in how many different artifacts a developer has changed, but (currently) not in how often the developer has changed certain artifacts.
@flx5 Please make sure that you have a unique list of artifacts before determining the length.

@clhunsen If one is interested in the individual artifacts changed by a certain developer, we could provide an additional vertex attribute which consists of a list of artifacts and an individual change count for each changed artifact. However, I am not sure whether we should provide something like that, too.


add.author.classification = function(list.of.networks, data, what = c("range", "cumulative", "all"),
type = c("network.degree", "network.eigen", "commit.count", "loc.count")) {
compute.vertex.attributes(list.of.networks, data, "classification", what,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As most classification functions get additional parameters, we need to pass a classification result here and not call the classification function. In this place, we need to discuss how to use the what parameter then.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end, it would be nice to have both: a function that simply adds the attributes according a given classification, but also a wrapper function which takes all the possible parameters, calls the classification function and also calls the further function.

The what parameter can be used in the wrapper function somehow then - but then we have to carefully specify additional options...

function(range.data, net) {
return (lapply(range.data$get.author2commit(),
function(x) {
return ( c(start = min(x$date), end = max(x$date)) )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether we really need this kind of activity-range information. We already have the date of the first activity as an extra attribute. The date of the last activity is not that interesting than the one of the first activity.

However, the 'active ranges' attribute, as it is called in #34 and also the example implementation, should check in which (resp. how much) of the ranges (e.g., threemonth windows) a developer was active at all.

So, I would like to have a boolean vector stating whether a developer was active or not. Depending on the what parameter, this vector covers just the current range (which is always TRUE), the current and the past ranges, or all ranges.

@clhunsen Any additional comments on that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not right now.

@clhunsen
Copy link
Collaborator

clhunsen commented Nov 28, 2017

Thomas and I made a lot of comments, so we should restrict this PR to not get overwhelmed with all the implementation-related comments and problems: I suggest, we start with the basic set of simple count attributes here that are relatively easy to implement and add the additional set of more sophisticated attributes in a later PR.

@flx5, just stick to the following set at first (subset from #34):

  • Authors
    • e-mail address,
    • commit count,
    • artifact count,
    • active ranges,
    • first activity.

Additionally, only implement it for range-level and project-level statistics. We will have a look at the cumulative stuff later. This way, we may have a clearer look for the essentials and potential problems.

If you want, you can open a fresh and clean PR for that. Your decision.

@clhunsen clhunsen added this to the v3.1 milestone Dec 14, 2017
@clhunsen
Copy link
Collaborator

To fix your tests, if I understood correctly, you just need to move the function network.covariates.test.build.expected in your test file above the test_that call where it is used (the one directly above the definition of the function).

Felix Prasse added 21 commits December 21, 2017 17:10
Add methods required to add vertex attributes to the graph.
Add method to calculate commit count as vertex attribute

Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
- Rename aggregation to aggregation.level and all to project
- Rename add.vertex.attributes.* methods to add.vertex.attribute.*
- Rename compute.vertex.attribute.with.n2r to add.vertex.attribute
- Rename compute.vertex.attribute to split.and.add.vertex.attribute
- Move networks.to.ranges to util-split and
  rename it to split.data.by.networks
- Remove explicit return statement where possible
- Add missing documentation

Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Add test for add.vertex.attribute
Add test for split.and.add.vertex.attribute
Add test for add.vertex.attribute.commit.count
Add test for add.vertex.attribute.author.email
Add test for add.vertex.attribute.artifact.count
Add test for add.vertex.attribute.first.activity

Signed-off-by: Felix Prasse <[email protected]>
The tests were referencing to a wrong project data variable

Signed-off-by: Felix Prasse <[email protected]>
Add test for add.vertex.attribute.author.role.simple
Add test for add.vertex.attribute.active.ranges

Signed-off-by: Felix Prasse <[email protected]>
Add artifact attributes
Add commit count based on committer

Signed-off-by: Felix Prasse <[email protected]>
@flx5
Copy link
Contributor Author

flx5 commented Dec 30, 2017

If I didn't forget anything this should be complete.

Copy link
Collaborator

@clhunsen clhunsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional to the fact that much inline documentation should be added and the indentation fixed in as many places, the overall impression is really good. I found only one major bug: two function with the same name, resulting from a copy-paste action, obviously.

@flx5: Do you want to work to resolve my comments? Otherwise, I will take it.

nets.with.attr = split.and.add.vertex.attribute(list.of.networks, project.data, name, aggregation.level, default.value,
function(range.data, net) {
artifact.to.commit = get.key.to.value.from.df(range.data$get.commits.filtered.empty(), "artifact", "hash")
artifact.change.count = lapply(artifact.to.commit, nrow)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When calling get.key.to.value.from.df, we do not call unique on the column indicated by value parameter (here, "hash"). Consequently, we should call unique before nrow in the lapply function here.

Under the assumption that a commit is a cohesive change, a feature should be seen as changed only once although it might been touched in several files within a single commit.

@@ -451,6 +453,50 @@ get.commit.count.threshold = function(range.data) {
return(threshold)
}

#' Get the commit count per comitter in the given range data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the documentation here, we should state that we only count the commits of committers iff they are not the author.

#' @param range.data The data to count on
#' @return A data frame in descending order by the commit count.
get.committer.commit.count = function(range.data) {
logging::logdebug("get.contributer.commit.count: starting.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the tracing comment to: "get.committer.commit.count: starting.".

## Execute a query to get the commit count per author
res = sqldf::sqldf("select *, COUNT(*) as `freq` from `commits.df` group by `committer.name` order by `freq` desc")

logging::logdebug("get.author.commit.count: finished.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the tracing comment to: "get.committer.commit.count: finished.".

}

## Execute a query to get the commit count per author
res = sqldf::sqldf("select *, COUNT(*) as `freq` from `commits.df` group by `committer.name` order by `freq` desc")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should capitalize all SQL keywords as in the other sqldf statements.

start.end = get.range.bounds(range)

if(aggregation.level == "cumulative" || aggregation.level == "project") {
start.end[1] = as.POSIXct("1970-01-01 00:00:00")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should better use list.of.ranges[1] here as the list of networks can be a subset of all networks obtained from the (complete) project.data object.

CF.SELECTION.PROCESS = "testing"
CASESTUDY = "test"
ARTIFACT = "feature"
AGGREGATION.LEVELS = c("range", "cumulative", "project")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we define this somewhere in the main files (and not here) as we may want to use it also somewhere else?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I do not see any advantages in defining the aggregation level somewhere else -- we only use it for the node attributes. Or would you like to avoid the duplicates in the parameters of the vertex-attribute functions? We could do that and use a constant for the aggregation levels, but then the functions' parameters would be less intuitive, I guess.

networks.with.attr = add.vertex.attribute.commit.count.author(networks.and.data[["networks"]],
networks.and.data[["project.data"]],
aggregation.level = level)
actual.attributes = lapply(networks.with.attr, function(net) igraph::vertex_attr(net, "commit.count"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use igraph::get.vertex.attribute, here and anywhere else.

lapply(AGGREGATION.LEVELS, function(level) {
networks.with.attr = add.vertex.attribute.commit.count.author(networks.and.data[["networks"]],
networks.and.data[["project.data"]],
aggregation.level = level)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation messed up.

network.covariates.test.build.expected = function(a,b,c) {
arguments = list(a,b,c)

names(arguments) = c('2016-07-12 15:00:00-2016-07-12 16:00:00',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bins should be derived dynamically. The solution would be to make mybins in Line 37 a public constant and call construct.ranges here.

@flx5
Copy link
Contributor Author

flx5 commented Feb 9, 2018

@clhunsen I can fix those issues after my exams in two to three weeks.

@clhunsen
Copy link
Collaborator

clhunsen commented Feb 9, 2018

That would be great as I will be busy correcting exams. 😄 Thanks.

@clhunsen
Copy link
Collaborator

clhunsen commented Feb 15, 2018

I am working on this right now with the help of Thomas. I will open a separate pull request soon.
See here: https://github.com/flx5/codeface-extraction-r/compare/vertex-attributes...clhunsen:vertex-attributes

This was referenced Feb 16, 2018
@clhunsen
Copy link
Collaborator

I will close here now in favor of issue #92 (open issues regarding vertex attributes) and PR #93 (updated version of this PR).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants