
Improve performance for vertex attribute 'author.role' & fix default aggregation levels #102

Closed
bockthom opened this issue Mar 8, 2018 · 4 comments


@bockthom (Collaborator) commented Mar 8, 2018

When adding vertex attributes, we sometimes split the project data several times even though the project data is not needed at all. Therefore, we should further improve on the outcome of PR #93. This is also related to #92.

In the following, I list three issues that occurred when using the add-vertex-attribute functions and provide possible approaches for each. Excuse me for putting so much information into one issue, but the three issues are closely related.


Issue 1: Add already existing classification results as author role

When using the following function to add already existing classification results as network attributes, we do not need to split the project data:

```r
add.vertex.attribute.author.role = function(list.of.networks, project.data, classification.results,
                                            name = "author.role",
                                            aggregation.level = c("range", "cumulative", "all.ranges",
                                                                  "project.cumulative", "project.all.ranges",
                                                                  "complete"),
                                            default.value = NA)
```

In this specific function, we do not need project.data, and aggregation.level does not make sense here either, as we already have the classification results.

Suggestion: Remove the project.data and aggregation.level parameters here (as they are confusing) and, instead of calling split.and.add.vertex.attribute, directly call add.vertex.attribute (with empty data), as sketched below.
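A minimal sketch of what this could look like, assuming a helper add.vertex.attribute exists that attaches given attribute values to each network (its exact signature and parameter names here are assumptions for illustration):

```r
## Hypothetical simplified signature: no 'project.data' and no
## 'aggregation.level', since the classification results already exist.
add.vertex.attribute.author.role = function(list.of.networks, classification.results,
                                            name = "author.role",
                                            default.value = NA) {
    ## attach the pre-computed classification results directly (passing empty
    ## data), instead of splitting the project data via
    ## 'split.and.add.vertex.attribute'
    return(add.vertex.attribute(list.of.networks, data = list(),
                                attributes = classification.results,
                                name = name, default.value = default.value))
}
```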


Issue 2: Repeated splitting for several aggregation levels (complete, all.ranges, project.all.ranges)

In addition, is it somehow possible to improve the performance of the split.data.time.based.by.ranges function when passing identical ranges?

Example: The function split.data.time.based.by.ranges is called when adding vertex attributes to a list of networks. Let aggregation.level = "complete" and let the list have length 60, for instance. Then split.data.time.based.by.ranges splits the data 60 times according to exactly the same ranges, and the log output also contains the following statements 60 times:

```
2018-03-08 12:54:38 INFO::Splitting data 'RangeData<busybox, 2003-01-14 21:44:38-2016-02-18 21:50:25, NA>' into time ranges [2003-01-14 21:44:38, 2016-02-18 21:37:21].
2018-03-08 12:54:38 INFO::Initialized data object RangeData<busybox, 2003-01-14 21:44:38-2016-02-18 21:37:21, NA>
2018-03-08 12:54:38 INFO::Setting raw commit data.
2018-03-08 12:54:38 INFO::Setting e-mail data.
2018-03-08 12:54:38 INFO::Setting issue data.
2018-03-08 12:54:38 INFO::Setting synchronicity data.
2018-03-08 12:54:38 INFO::Getting synchronicity data.
2018-03-08 12:54:38 INFO::Setting author data.
2018-03-08 12:54:38 INFO::Getting author data.
2018-03-08 12:54:38 INFO::Setting PaStA data.
2018-03-08 12:54:38 INFO::Getting PaStA data.
```

To reduce the log output (log this once instead of emitting exactly the same output 60 times) and also to reduce the computation time (compute only once instead of 60 times), it would be helpful to check whether all elements in ranges are identical and, if so, split the data only once and replicate the splitting result.

Nevertheless, that is just an idea to improve the log output and possibly also the computation time -- but it might mess up the code... A rough sketch of the idea follows below.
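A rough sketch of this check (split.data.by.single.range is a hypothetical stand-in for the existing single-split logic, and ranges is assumed to be the list of range strings passed to the function):

```r
split.data.time.based.by.ranges = function(project.data, ranges) {
    if (length(unique(ranges)) == 1) {
        ## all ranges are identical: split (and log) only once ...
        single.split = split.data.by.single.range(project.data, ranges[[1]])
        ## ... and replicate the result, cloning so that each list entry is an
        ## independent data object (whether a deep clone is needed is
        ## discussed below)
        return(lapply(seq_along(ranges), function(i) single.split$clone()))
    }
    ## otherwise, split once per range as before
    return(lapply(ranges, function(range) {
        return(split.data.by.single.range(project.data, range))
    }))
}
```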


Issue 3: Undocumented default aggregation levels and unexpected default aggregation levels

In addition, we should reconsider the default values of the aggregation.level parameters of all these functions. In the documentation of those functions, we did not specify a default value, but the default value is "range" in all cases. So, we should add the default value to the documentation.
In addition, there are some functions where the default value should not be "range" but "complete", as this would make more sense (or is anyone interested in the first activity within the current range?):

  • add.vertex.attribute.first.activity
  • add.vertex.attribute.artifact.first.occurrence

However, if we changed the default value here, the vector of possible aggregation levels would differ between functions, and therefore you could not just copy and paste it any more. Hm...


I just wanted to bring these issues to mind. How Issue 1, Issue 2, and Issue 3 should be addressed is something we should discuss. @clhunsen, what's your opinion here?

@bockthom (Collaborator, Author)

I've already fixed Issue 1 in commit 136c604.

Issue 2 and Issue 3 still need to be addressed somehow.

@clhunsen (Collaborator)

Issue 2

> To reduce the log output (log this once instead of emitting exactly the same output 60 times) and also to reduce the computation time (compute only once instead of 60 times), it would be helpful to check whether all elements in ranges are identical and, if so, split the data only once and replicate the splitting result.

> Nevertheless, that is just an idea to improve the log output and possibly also the computation time -- but it might mess up the code...

That is easily possible, I guess. We just need to do the check here and then clone the object as many times as needed. We may need to think about whether we need a deep clone or not.

And the fix will likely only contain three to five lines, including documentation.


Issue 3

> However, if we changed the default value here, the vector of possible aggregation levels would differ between functions, and therefore you could not just copy and paste it any more. Hm...

That is definitely a problem. How about just adding something like a "recommended aggregation level" to the documentation?

@bockthom (Collaborator, Author)

Let me just document our decisions here:

Issue 2

> We just need to do the check here and then clone the object as many times as needed. We may need to think about whether we need a deep clone or not.

As we have already checked, a shallow clone is enough (it is OK that the cloned data objects hold the same project configuration).
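For illustration, a toy example of the R6 clone semantics at play here (the class layout and field names are simplified stand-ins, not the actual ones): the default clone() copies fields but shares fields that are themselves R6 objects, such as the project configuration, whereas clone(deep = TRUE) clones nested R6 objects as well.

```r
library(R6)

ProjectConf = R6Class("ProjectConf", public = list(project = "busybox"))

RangeData = R6Class("RangeData", public = list(
    project.conf = NULL,  ## configuration (an R6 object, shared on shallow clone)
    commits = NULL,       ## per-range data (copied on every clone)
    initialize = function(project.conf, commits) {
        self$project.conf = project.conf
        self$commits = commits
    }
))

range.data = RangeData$new(ProjectConf$new(), data.frame(hash = c("a1", "b2")))

shallow = range.data$clone()           ## shallow: R6 fields are shared
deep = range.data$clone(deep = TRUE)   ## deep: nested R6 fields are cloned too

identical(shallow$project.conf, range.data$project.conf)  ## TRUE: shared
identical(deep$project.conf, range.data$project.conf)     ## FALSE: own copy
```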


Issue 3

> However, if we changed the default value here, the vector of possible aggregation levels would differ between functions, and therefore you could not just copy and paste it any more. Hm...

> That is definitely a problem. How about just adding something like a "recommended aggregation level" to the documentation?

I've implemented a new function match.arg.or.default, which behaves like match.arg but takes an additional parameter to specify a default value. Hence, we can keep the vector of possible aggregation levels identical everywhere but still vary the default value.
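A sketch of the idea behind such a helper (the actual implementation may differ), together with a usage sketch in which the constant AGGREGATION.LEVELS and the trimmed function signature are made up for illustration:

```r
## behaves like 'match.arg', but falls back to a configurable default when the
## caller left the argument at the full vector of choices
match.arg.or.default = function(arg, choices, default = choices[1]) {
    if (identical(arg, choices)) {
        ## the caller did not narrow the argument down: use the specified
        ## default instead of the first choice, as plain 'match.arg' would
        return(default)
    }
    return(match.arg(arg = arg, choices = choices))
}

## usage sketch: the vector of possible levels stays identical across all
## functions, only the default varies per function
AGGREGATION.LEVELS = c("range", "cumulative", "all.ranges",
                       "project.cumulative", "project.all.ranges", "complete")

add.vertex.attribute.first.activity = function(list.of.networks, project.data,
                                               aggregation.level = AGGREGATION.LEVELS) {
    level = match.arg.or.default(aggregation.level, choices = AGGREGATION.LEVELS,
                                 default = "complete")
    ## ... compute and attach the first-activity attribute using 'level' ...
    return(level)  ## placeholder: actual computation elided
}
```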

@bockthom (Collaborator, Author)

As all three issues are fixed now, I will close this issue.
