Add vertex attributes #67

flx5 · 2017-11-27T14:08:37Z

This adds vertex attributes as discussed in #34 .

Usage example:

x.data = ProjectData$new(project.conf = proj.conf)
x = NetworkBuilder$new(project.data = x.data, network.conf = net.conf)

mybins = as.POSIXct(c("2016-07-12 15:00:00", "2016-07-12 16:00:00", "2016-07-12 16:05:00", "2030-01-01 00:00:00"))
 cf.data = split.data.time.based(x.data, bins = mybins)
 my.networks = lapply(cf.data, FUN=function(range.data) {
     y = NetworkBuilder$new(project.data = range.data, network.conf = net.conf)
     return (y$get.author.network())
 })

 x.net.as.list = list("1970-01-01 00:00:00-2030-01-01 00:00:00" = x$get.author.network())

sample = add.commit.count(my.networks, x.data) # "range" omitted as it is the default value 
sample.cumulative = add.commit.count(my.networks, x.data, "cumulative")
sample.entire = add.commit.count(x.net.as.list, x.data)
sample.mail = add.mail.attribute(my.networks, x.data, "all")

Keep in mind that this is an early PR lacking proper documentation.

clhunsen

The current set of patches is a very good start. We need to improve the implementation in various places, but this may take some more time.

clhunsen · 2017-11-28T08:47:37Z

util-covariates.R

+        start.end = get.range.bounds(range)
+
+        if(what == "cumulative" || what == "all") {
+            start.end[1] = as.POSIXct("1970-01-01 00:00:00", "GMT")


The second argument is optional and is also pre-defined in other parts of the code (see also here). So, just omit the second argument here.

clhunsen · 2017-11-28T12:37:05Z

util-covariates.R

@@ -0,0 +1,96 @@
+## (c) Felix Prasse, 2017


Please rename the file to util-networks-covariates.R as we clearly have a relation to the functionality in util-networks.R.

clhunsen · 2017-11-28T12:41:34Z

util-covariates.R

+        }
+
+        if(what == "all") {
+            start.end[2] = as.POSIXct("9999-01-01 00:00:00", "GMT")


See comment on Line 22.

clhunsen · 2017-11-28T12:47:27Z

util-covariates.R

+#' @param project.data The entire project date
+#' @param compute.attr The function to compute the attribute to add. Must return a named list
+#'                  with the names being the name of the vertex.
+#' @param cumulative Should the attribute be calculated from project start to end of range?


The documentation is outdated. Please also add documentation for all other functions in this file.

clhunsen · 2017-11-28T12:48:25Z

util-covariates.R

+
+#' Utility function to compute vertex attributes for a list of network
+#'
+#' Only network ranges with dates work!


"Important: This function only works for lists of networks which have timestamps used in their range names."

clhunsen · 2017-11-28T13:12:44Z

util-covariates.R

+
+
+add.mail.attribute = function(list.of.networks, data, what = c("range", "cumulative", "all")) {
+    compute.vertex.attributes(list.of.networks, data, "email.addresses", what,


For the names of the attributes (in this case, "email.addresses"), we should keep to the names that are already used in the network configuration for the edge attributes or in the reading functions as column headers (e.g., here).

Thus, we would use "author.email" as the name for the vertex attribute here.

clhunsen · 2017-11-28T13:13:15Z

util-covariates.R

+add.mail.attribute = function(list.of.networks, data, what = c("range", "cumulative", "all")) {
+    compute.vertex.attributes(list.of.networks, data, "email.addresses", what,
+                              function(range.data, net) {
+                                  author.to.mail = range.data$get.author2commit()


This line is obsolete.

This line is not obsolete, but we can use something like get.author2mail instead. In the end, that does not make a difference.

Ah, yes. Simply overlooked this... 🤦‍♂️ Then, we should use range.data$get.author2mail() in Line 54 to keep consistent with other functions.

If I've got this right, get.autor2mail only works if the same author has at least authored a commit and an email on the mailing list. But couldn't it happen (like in the test data) that an author never wrote an email?

That is a good point. That can definitively happen. However, this problem can also appear the other way round: There are authors who wrote e-mails but never have committed.

Should we you use both data sources together and make a unique union then? That should not be a problem as the e-mail addresses used in the commit data and in the mail data are identical for each author.

One additional remark: Here we can definitely remove the what parameter as the e-mail address of a developer cannot be different for different network ranges due to the disambiguation algorithms of Codeface.

Sorry for the misunderstanding before. I talked with Thomas, opened an issue for this topic (#69), implemented the changes (254d4de), and, finally, opened a pull request (#71).

I suggest -- and Thomas confirmed -- that we add the attribute independent of the existence of the new email column in the author data (ProjectData$get.authors()): if the column exists, add the real e-mail addresses; if not, add NA as the vertex attribute. So, just wait for #71 to be merged and then rebase your branch and use the functionality.

clhunsen · 2017-11-28T13:14:32Z

util-covariates.R

+add.commit.count = function(list.of.networks, data, what = c("range", "cumulative", "all")) {
+    compute.vertex.attributes(list.of.networks, data, "commit.count", what, function(range.data, net) {
+        commit.count.df = get.author.commit.count(range.data)[c("author.name", "freq")]
+        commit.count.list = structure(commit.count.df$freq, names = commit.count.df$author.name)


Please access columns in data.frames via the square-bracket notation. Here, you would use commit.count.df["freq"] and commit.count.df["author.name"].

Note: I will add a corresponding line to the contribution guide.

clhunsen · 2017-11-28T13:16:37Z

util-covariates.R

+add.artifact.count = function(list.of.networks, data, what = c("range", "cumulative", "all")) {
+    compute.vertex.attributes(list.of.networks, data, "artifact.count", what,
+                              function(range.data, net) {
+                                  return (lapply(range.data$get.author2artifact(), length))


@bockthom: Do we want to have the amount of unique touched artifacts here?

Yes, we do. The idea is, to use the amount of touched artifacts as a predictor for a kind of "importance". So, basically, we are interested in how many different artifacts a developer has changed, but (currently) not in how often the developer has changed certain artifacts.
@flx5 Please make sure that you have a unique list of artifacts before determining the length.

@clhunsen If one is interested in the individual artifacts changed by a certain developer, we could provide an additional vertex attribute which consists of a list of artifacts and an individual change count for each changed artifact. However, I am not sure whether we should provide something like that, too.

clhunsen · 2017-11-28T13:19:33Z

util-covariates.R

+
+add.author.classification = function(list.of.networks, data, what = c("range", "cumulative", "all"),
+                                     type = c("network.degree", "network.eigen", "commit.count", "loc.count")) {
+    compute.vertex.attributes(list.of.networks, data, "classification", what,


As most classification functions get additional parameters, we need to pass a classification result here and not call the classification function. In this place, we need to discuss how to use the what parameter then.

In the end, it would be nice to have both: a function that simply adds the attributes according a given classification, but also a wrapper function which takes all the possible parameters, calls the classification function and also calls the further function.

The what parameter can be used in the wrapper function somehow then - but then we have to carefully specify additional options...

bockthom · 2017-11-28T16:05:04Z

util-covariates.R

+                              function(range.data, net) {
+                                  return (lapply(range.data$get.author2commit(),
+                                                 function(x) {
+                                                     return ( c(start = min(x$date), end = max(x$date)) )


I am not sure whether we really need this kind of activity-range information. We already have the date of the first activity as an extra attribute. The date of the last activity is not that interesting than the one of the first activity.

However, the 'active ranges' attribute, as it is called in #34 and also the example implementation, should check in which (resp. how much) of the ranges (e.g., threemonth windows) a developer was active at all.

So, I would like to have a boolean vector stating whether a developer was active or not. Depending on the what parameter, this vector covers just the current range (which is always TRUE), the current and the past ranges, or all ranges.

@clhunsen Any additional comments on that?

Not right now.

clhunsen · 2017-11-28T16:38:57Z

Thomas and I made a lot of comments, so we should restrict this PR to not get overwhelmed with all the implementation-related comments and problems: I suggest, we start with the basic set of simple count attributes here that are relatively easy to implement and add the additional set of more sophisticated attributes in a later PR.

@flx5, just stick to the following set at first (subset from #34):

Authors

e-mail address,

commit count,

artifact count,

active ranges,

first activity.

Additionally, only implement it for range-level and project-level statistics. We will have a look at the cumulative stuff later. This way, we may have a clearer look for the essentials and potential problems.

If you want, you can open a fresh and clean PR for that. Your decision.

clhunsen · 2017-12-19T14:45:03Z

To fix your tests, if I understood correctly, you just need to move the function network.covariates.test.build.expected in your test file above the test_that call where it is used (the one directly above the definition of the function).

Add methods required to add vertex attributes to the graph. Add method to calculate commit count as vertex attribute Signed-off-by: Felix Prasse <[email protected]>

Signed-off-by: Felix Prasse <[email protected]>

- Rename aggregation to aggregation.level and all to project - Rename add.vertex.attributes.* methods to add.vertex.attribute.* - Rename compute.vertex.attribute.with.n2r to add.vertex.attribute - Rename compute.vertex.attribute to split.and.add.vertex.attribute - Move networks.to.ranges to util-split and rename it to split.data.by.networks - Remove explicit return statement where possible - Add missing documentation Signed-off-by: Felix Prasse <[email protected]>

…nd issues Signed-off-by: Felix Prasse <[email protected]>

Signed-off-by: Felix Prasse <[email protected]>

Add test for add.vertex.attribute Add test for split.and.add.vertex.attribute Add test for add.vertex.attribute.commit.count Add test for add.vertex.attribute.author.email Add test for add.vertex.attribute.artifact.count Add test for add.vertex.attribute.first.activity Signed-off-by: Felix Prasse <[email protected]>

The tests were referencing to a wrong project data variable Signed-off-by: Felix Prasse <[email protected]>

Signed-off-by: Felix Prasse <[email protected]>

Add test for add.vertex.attribute.author.role.simple Add test for add.vertex.attribute.active.ranges Signed-off-by: Felix Prasse <[email protected]>

Signed-off-by: Felix Prasse <[email protected]>

Add artifact attributes Add commit count based on committer Signed-off-by: Felix Prasse <[email protected]>

Signed-off-by: Felix Prasse <[email protected]>

flx5 · 2017-12-30T16:21:33Z

If I didn't forget anything this should be complete.

clhunsen

Additional to the fact that much inline documentation should be added and the indentation fixed in as many places, the overall impression is really good. I found only one major bug: two function with the same name, resulting from a copy-paste action, obviously.

@flx5: Do you want to work to resolve my comments? Otherwise, I will take it.

clhunsen · 2018-01-09T09:43:04Z

util-networks-covariates.R

+    nets.with.attr = split.and.add.vertex.attribute(list.of.networks, project.data, name, aggregation.level, default.value,
+                                                    function(range.data, net) {
+                                                        artifact.to.commit = get.key.to.value.from.df(range.data$get.commits.filtered.empty(), "artifact", "hash")
+                                                        artifact.change.count = lapply(artifact.to.commit, nrow)


When calling get.key.to.value.from.df, we do not call unique on the column indicated by value parameter (here, "hash"). Consequently, we should call unique before nrow in the lapply function here.

Under the assumption that a commit is a cohesive change, a feature should be seen as changed only once although it might been touched in several files within a single commit.

clhunsen · 2018-01-18T16:21:04Z

util-core-peripheral.R

@@ -451,6 +453,50 @@ get.commit.count.threshold = function(range.data) {
    return(threshold)
 }

+#' Get the commit count per comitter in the given range data


In the documentation here, we should state that we only count the commits of committers iff they are not the author.

clhunsen · 2018-01-18T16:21:46Z

util-core-peripheral.R

+#' @param range.data The data to count on
+#' @return A data frame in descending order by the commit count.
+get.committer.commit.count = function(range.data) {
+    logging::logdebug("get.contributer.commit.count: starting.")


Change the tracing comment to: "get.committer.commit.count: starting.".

clhunsen · 2018-01-18T16:22:06Z

util-core-peripheral.R

+    ## Execute a query to get the commit count per author
+    res = sqldf::sqldf("select *, COUNT(*) as `freq` from `commits.df` group by `committer.name` order by `freq` desc")
+
+    logging::logdebug("get.author.commit.count: finished.")


Change the tracing comment to: "get.committer.commit.count: finished.".

clhunsen · 2018-01-18T16:23:58Z

util-core-peripheral.R

+    }
+
+    ## Execute a query to get the commit count per author
+    res = sqldf::sqldf("select *, COUNT(*) as `freq` from `commits.df` group by `committer.name` order by `freq` desc")


We should capitalize all SQL keywords as in the other sqldf statements.

clhunsen · 2018-01-22T13:37:36Z

util-split.R

+        start.end = get.range.bounds(range)
+
+        if(aggregation.level == "cumulative" || aggregation.level == "project") {
+            start.end[1] = as.POSIXct("1970-01-01 00:00:00")


We should better use list.of.ranges[1] here as the list of networks can be a subset of all networks obtained from the (complete) project.data object.

clhunsen · 2018-01-22T13:46:52Z

tests/test-network-covariates.R

+CF.SELECTION.PROCESS = "testing"
+CASESTUDY = "test"
+ARTIFACT = "feature"
+AGGREGATION.LEVELS = c("range", "cumulative", "project")


Shall we define this somewhere in the main files (and not here) as we may want to use it also somewhere else?

Currently, I do not see any advantages in defining the aggregation level somewhere else -- we only use it for the node attributes. Or would you like to avoid the duplicates in the parameters of the vertex-attribute functions? We could do that and use a constant for the aggregation levels, but then the functions' parameters would be less intuitive, I guess.

clhunsen · 2018-01-22T15:27:34Z

tests/test-network-covariates.R

+        networks.with.attr = add.vertex.attribute.commit.count.author(networks.and.data[["networks"]],
+                                                               networks.and.data[["project.data"]],
+                                                            aggregation.level = level)
+        actual.attributes = lapply(networks.with.attr, function(net) igraph::vertex_attr(net, "commit.count"))


Use igraph::get.vertex.attribute, here and anywhere else.

clhunsen · 2018-01-22T15:27:58Z

tests/test-network-covariates.R

+    lapply(AGGREGATION.LEVELS, function(level) {
+        networks.with.attr = add.vertex.attribute.commit.count.author(networks.and.data[["networks"]],
+                                                               networks.and.data[["project.data"]],
+                                                            aggregation.level = level)


Indentation messed up.

clhunsen · 2018-01-22T15:35:56Z

tests/test-network-covariates.R

+network.covariates.test.build.expected = function(a,b,c) {
+    arguments = list(a,b,c)
+
+    names(arguments) = c('2016-07-12 15:00:00-2016-07-12 16:00:00',


The bins should be derived dynamically. The solution would be to make mybins in Line 37 a public constant and call construct.ranges here.

flx5 · 2018-02-09T17:02:13Z

@clhunsen I can fix those issues after my exams in two to three weeks.

clhunsen · 2018-02-09T20:18:14Z

That would be great as I will be busy correcting exams. 😄 Thanks.

clhunsen · 2018-02-15T17:20:45Z

I am working on this right now with the help of Thomas. I will open a separate pull request soon.
See here: https://github.com/flx5/codeface-extraction-r/compare/vertex-attributes...clhunsen:vertex-attributes

clhunsen · 2018-02-16T09:39:39Z

I will close here now in favor of issue #92 (open issues regarding vertex attributes) and PR #93 (updated version of this PR).

clhunsen reviewed Nov 28, 2017

View reviewed changes

bockthom requested changes Nov 28, 2017

View reviewed changes

clhunsen added the enhancement label Dec 14, 2017

clhunsen added this to the v3.1 milestone Dec 14, 2017

Felix Prasse added 21 commits December 21, 2017 17:10

Add commit count as vertex attribute

3ca6063

Add methods required to add vertex attributes to the graph. Add method to calculate commit count as vertex attribute Signed-off-by: Felix Prasse <[email protected]>

Add further vertex attributes

cafa3ab

Signed-off-by: Felix Prasse <[email protected]>

Add author classification attribute

32944cb

Signed-off-by: Felix Prasse <[email protected]>

Minor refactoring on vertex attributes

fe2721c

Signed-off-by: Felix Prasse <[email protected]>

Fix implementation of active author range vertex attribute

0371351

Signed-off-by: Felix Prasse <[email protected]>

Rename util-covariates.R to util-networks-covariates.R

39065fd

Signed-off-by: Felix Prasse <[email protected]>

Rename what parameter to aggregation

e3c6510

Signed-off-by: Felix Prasse <[email protected]>

Add comments to vertex attribute methods

cd261ad

Signed-off-by: Felix Prasse <[email protected]>

Extend add.vertex.attribute.first.activity to work on mail, commits a…

288c927

…nd issues Signed-off-by: Felix Prasse <[email protected]>

Add default value for missing vertex attributes

d7d8986

Signed-off-by: Felix Prasse <[email protected]>

Improve author role classification

2549f2b

Signed-off-by: Felix Prasse <[email protected]>

Add explicit return

bca66be

Signed-off-by: Felix Prasse <[email protected]>

Add test for splitting project data

efd9b2d

Signed-off-by: Felix Prasse <[email protected]>

Fix network attribute unit tests

df52ac7

The tests were referencing to a wrong project data variable Signed-off-by: Felix Prasse <[email protected]>

Fix artifact count attribute implementation

8ae1abf

Signed-off-by: Felix Prasse <[email protected]>

Add tests for network vertex attributes

b9f0bd3

Add test for add.vertex.attribute.author.role.simple Add test for add.vertex.attribute.active.ranges Signed-off-by: Felix Prasse <[email protected]>

Comments for network-covariates tests

3e20fb2

Signed-off-by: Felix Prasse <[email protected]>

Add more vertex attributes

9f53875

Add artifact attributes Add commit count based on committer Signed-off-by: Felix Prasse <[email protected]>

Add tests for artifact vertex attributes

a002dfa

Signed-off-by: Felix Prasse <[email protected]>

clhunsen requested changes Jan 22, 2018

View reviewed changes

clhunsen mentioned this pull request Jan 22, 2018

Use lubridate package for all POSIXct conversions #85

Closed

This was referenced Feb 16, 2018

Further vertex attributes #92

Closed

Add vertex attributes (updated) #93

Merged

clhunsen closed this Feb 16, 2018

clhunsen mentioned this pull request Feb 16, 2018

Possibility to build cumulative ranges #61

Closed

clhunsen mentioned this pull request Mar 1, 2018

v3.1 #101

Merged

clhunsen mentioned this pull request Jul 11, 2019

Add further vertex attributes for artifact vertices #170

Closed

10 tasks



		add.mail.attribute = function(list.of.networks, data, what = c("range", "cumulative", "all")) {
		compute.vertex.attributes(list.of.networks, data, "email.addresses", what,

Add vertex attributes #67

Add vertex attributes #67

Conversation

flx5 commented Nov 27, 2017

clhunsen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clhunsen Nov 29, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clhunsen commented Nov 28, 2017 • edited Loading

clhunsen commented Dec 19, 2017

flx5 commented Dec 30, 2017

clhunsen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

flx5 commented Feb 9, 2018

clhunsen commented Feb 9, 2018

clhunsen commented Feb 15, 2018 • edited Loading

clhunsen commented Feb 16, 2018

clhunsen Nov 29, 2017 •

edited

Loading

clhunsen commented Nov 28, 2017 •

edited

Loading

clhunsen commented Feb 15, 2018 •

edited

Loading