
Improve performance for vertex attribute 'author.role' & fix default aggregation levels #102

Closed
bockthom opened this issue Mar 8, 2018 · 4 comments


@bockthom (Collaborator) commented Mar 8, 2018

When adding vertex attributes, we sometimes split the project data several times even though the project data is not needed at all. Therefore, we should further improve on the outcome of PR #93. This is also related to #92.

In the following, I list three issues that occurred when using the add-vertex-attribute functions and provide possible approaches for each. Excuse me for putting so much information into one issue, but the three issues are closely related.


Issue 1: Add already existing classification results as author role

When using the following function to add already existing classification results as network attributes, we do not need to split the project data:

```r
add.vertex.attribute.author.role = function(list.of.networks, project.data, classification.results,
                                            name = "author.role",
                                            aggregation.level = c("range", "cumulative", "all.ranges",
                                                                  "project.cumulative", "project.all.ranges",
                                                                  "complete"),
                                            default.value = NA)
```

In this specific function, we do not need project.data, and aggregation.level does not make sense here either, as we already have the classification results.

Suggestion: Remove the project.data and aggregation.level parameters here (as they are confusing) and, instead of calling split.and.add.vertex.attribute, directly call add.vertex.attribute (with empty data), as sketched below.
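A minimal sketch of what this could look like, assuming a helper add.vertex.attribute exists that attaches given attribute values to each network (its exact signature and parameter names here are assumptions for illustration):

```r
## Hypothetical simplified signature: no 'project.data' and no
## 'aggregation.level', since the classification results already exist.
add.vertex.attribute.author.role = function(list.of.networks, classification.results,
                                            name = "author.role",
                                            default.value = NA) {
    ## attach the pre-computed classification results directly (passing empty
    ## data), instead of splitting the project data via
    ## 'split.and.add.vertex.attribute'
    return(add.vertex.attribute(list.of.networks, data = list(),
                                attributes = classification.results,
                                name = name, default.value = default.value))
}
```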


Issue 2: Repeated splitting for several aggregation levels (complete, all.ranges, project.all.ranges)

In addition, is it somehow possible to improve the performance of the split.data.time.based.by.ranges function when passing identical ranges?

Example: The function split.data.time.based.by.ranges is called when adding vertex attributes to a list of networks. Let aggregation.level = "complete" and let the list have length 60, for instance. Then split.data.time.based.by.ranges splits the data 60 times according to exactly the same ranges, and the log output also contains the following statements 60 times:

```
2018-03-08 12:54:38 INFO::Splitting data 'RangeData<busybox, 2003-01-14 21:44:38-2016-02-18 21:50:25, NA>' into time ranges [2003-01-14 21:44:38, 2016-02-18 21:37:21].
2018-03-08 12:54:38 INFO::Initialized data object RangeData<busybox, 2003-01-14 21:44:38-2016-02-18 21:37:21, NA>
2018-03-08 12:54:38 INFO::Setting raw commit data.
2018-03-08 12:54:38 INFO::Setting e-mail data.
2018-03-08 12:54:38 INFO::Setting issue data.
2018-03-08 12:54:38 INFO::Setting synchronicity data.
2018-03-08 12:54:38 INFO::Getting synchronicity data.
2018-03-08 12:54:38 INFO::Setting author data.
2018-03-08 12:54:38 INFO::Getting author data.
2018-03-08 12:54:38 INFO::Setting PaStA data.
2018-03-08 12:54:38 INFO::Getting PaStA data.
```

To reduce the log output (log this once instead of emitting exactly the same output 60 times) and also to reduce the computation time (compute only once instead of 60 times), it would be helpful to check whether all elements in ranges are identical and, if so, split the data only once and replicate the splitting result.

Nevertheless, that is just an idea to improve the log output and possibly also the computation time -- but it might mess up the code... A rough sketch of the idea follows below.
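A rough sketch of this check (split.data.by.single.range is a hypothetical stand-in for the existing single-split logic, and ranges is assumed to be the list of range strings passed to the function):

```r
split.data.time.based.by.ranges = function(project.data, ranges) {
    if (length(unique(ranges)) == 1) {
        ## all ranges are identical: split (and log) only once ...
        single.split = split.data.by.single.range(project.data, ranges[[1]])
        ## ... and replicate the result, cloning so that each list entry is an
        ## independent data object (whether a deep clone is needed is
        ## discussed below)
        return(lapply(seq_along(ranges), function(i) single.split$clone()))
    }
    ## otherwise, split once per range as before
    return(lapply(ranges, function(range) {
        return(split.data.by.single.range(project.data, range))
    }))
}
```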


Issue 3: Undocumented default aggregation levels and unexpected default aggregation levels

In addition, we should reconsider the default values of the aggregation.level parameters of all these functions. In the documentation of those functions, we did not specify a default value, but the default value is "range" in all cases. So, we should add the default value to the documentation.
In addition, there are some functions where the default value should not be "range" but "complete", as this would make more sense (or is anyone interested in the first activity within the current range?):

  • add.vertex.attribute.first.activity
  • add.vertex.attribute.artifact.first.occurrence

However, if we changed the default value here, the vector of possible aggregation levels would differ between functions, and therefore you could not just copy and paste it any more. Hm...


I just wanted to bring these issues to mind. How Issue 1, Issue 2, and Issue 3 should be addressed is something we should discuss. @clhunsen, what's your opinion here?

@bockthom (Collaborator, Author)

I've already fixed Issue 1 in commit 136c604.

Issue 2 and Issue 3 still need to be addressed somehow.

@clhunsen (Collaborator)

Issue 2

> To reduce the log output (log this once instead of emitting exactly the same output 60 times) and also to reduce the computation time (compute only once instead of 60 times), it would be helpful to check whether all elements in ranges are identical and, if so, split the data only once and replicate the splitting result.

> Nevertheless, that is just an idea to improve the log output and possibly also the computation time -- but it might mess up the code...

That is easily possible, I guess. We just need to do the check here and then clone the object as many times as needed. We may need to think about whether we need a deep clone or not.

And the fix will likely only contain three to five lines, including documentation.


Issue 3

> However, if we changed the default value here, the vector of possible aggregation levels would differ between functions, and therefore you could not just copy and paste it any more. Hm...

That is definitely a problem. How about just adding something like a "recommended aggregation level" to the documentation?

@bockthom (Collaborator, Author)

Let me just document our decisions here:

Issue 2

> We just need to do the check here and then clone the object as many times as needed. We may need to think about whether we need a deep clone or not.

As we have already checked, a shallow clone is enough (it is OK that the cloned data objects hold the same project configuration).
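For illustration, a toy example of the R6 clone semantics at play here (the class layout and field names are simplified stand-ins, not the actual ones): the default clone() copies fields but shares fields that are themselves R6 objects, such as the project configuration, whereas clone(deep = TRUE) clones nested R6 objects as well.

```r
library(R6)

ProjectConf = R6Class("ProjectConf", public = list(project = "busybox"))

RangeData = R6Class("RangeData", public = list(
    project.conf = NULL,  ## configuration (an R6 object, shared on shallow clone)
    commits = NULL,       ## per-range data (copied on every clone)
    initialize = function(project.conf, commits) {
        self$project.conf = project.conf
        self$commits = commits
    }
))

range.data = RangeData$new(ProjectConf$new(), data.frame(hash = c("a1", "b2")))

shallow = range.data$clone()           ## shallow: R6 fields are shared
deep = range.data$clone(deep = TRUE)   ## deep: nested R6 fields are cloned too

identical(shallow$project.conf, range.data$project.conf)  ## TRUE: shared
identical(deep$project.conf, range.data$project.conf)     ## FALSE: own copy
```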


Issue 3

> However, if we changed the default value here, the vector of possible aggregation levels would differ between functions, and therefore you could not just copy and paste it any more. Hm...

> That is definitely a problem. How about just adding something like a "recommended aggregation level" to the documentation?

I've implemented a new function match.arg.or.default, which behaves like match.arg but takes an additional parameter to specify a default value. Hence, we can keep the vector of possible aggregation levels identical everywhere but still vary the default value.
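A sketch of the idea behind such a helper (the actual implementation may differ), together with a usage sketch in which the constant AGGREGATION.LEVELS and the trimmed function signature are made up for illustration:

```r
## behaves like 'match.arg', but falls back to a configurable default when the
## caller left the argument at the full vector of choices
match.arg.or.default = function(arg, choices, default = choices[1]) {
    if (identical(arg, choices)) {
        ## the caller did not narrow the argument down: use the specified
        ## default instead of the first choice, as plain 'match.arg' would
        return(default)
    }
    return(match.arg(arg = arg, choices = choices))
}

## usage sketch: the vector of possible levels stays identical across all
## functions, only the default varies per function
AGGREGATION.LEVELS = c("range", "cumulative", "all.ranges",
                       "project.cumulative", "project.all.ranges", "complete")

add.vertex.attribute.first.activity = function(list.of.networks, project.data,
                                               aggregation.level = AGGREGATION.LEVELS) {
    level = match.arg.or.default(aggregation.level, choices = AGGREGATION.LEVELS,
                                 default = "complete")
    ## ... compute and attach the first-activity attribute using 'level' ...
    return(level)  ## placeholder: actual computation elided
}
```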

@bockthom (Collaborator, Author)

As all three issues are fixed now, I will close this issue.
