Add data transformations for post-processing plot data #226

Merged
merged 35 commits into main from post-processing_data-transform
Dec 19, 2023

Conversation

pineapple-cat
Collaborator

@pineapple-cat pineapple-cat commented Oct 23, 2023

Addresses #183 and #205.

  • Fix categorical sorting.
  • Add unit tests.
  • Update documentation.

@pineapple-cat pineapple-cat requested a review from ilectra December 1, 2023 16:07
@asifsamiarain
Collaborator

asifsamiarain commented Dec 4, 2023

There seems to be an issue after I updated the filters to match this branch; the same setup was at least working on the main branch.

I've attached the perflogs for a single app and tag for your perusal (plotting these on their own seems OK at my end too, @pineapple-cat).
ssne.tar.gz

Here is the config:

title: sphng_single_node
x_axis:
  value: "job_completion_time"
  units:
    custom: null
y_axis:
  value: "elapsed_time_value"
  units:
    column: "elapsed_time_unit"
filters:
  and: [["test_name", "==", "Sphng_Single_Node_evolution"]]
  or: []
series: [[num_tasks_per_node, 1], [num_tasks_per_node, 2], [num_tasks_per_node, 4], [num_tasks_per_node, 8], [num_tasks_per_node, 16], [num_tasks_per_node, 32], [num_tasks_per_node, 64], [num_tasks_per_node, 128]]
column_types: # e.g. str/string/object, int/int64, float/float64, datetime/datetime64
  job_completion_time: "datetime"
  elapsed_time_value: "float"
  elapsed_time_unit: "str"
  test_name: "str"
  num_tasks_per_node: "int"

But when we use the data from all the apps
all_apps.tar.gz
the plot looks like this:

image
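
For readers following the config above, here is a minimal sketch of how the `and` / `or` filter lists could be evaluated against perflog rows. It is not the code from this branch: the semantics (every `and` condition must hold, and at least one `or` condition must hold when any are given), the set of supported operators, and the `Other_App` row are all assumptions made for illustration, and plain dictionaries stand in for dataframe rows.

```python
import operator

# Assumed mapping from the operator strings in the config to comparisons.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}

def row_matches(row, and_filters, or_filters):
    """Assumed semantics: all `and` conditions hold, and if any `or`
    conditions are given, at least one of them holds."""
    def check(cond):
        column, op, value = cond
        return OPS[op](row[column], value)

    and_ok = all(check(c) for c in and_filters)
    or_ok = any(check(c) for c in or_filters) if or_filters else True
    return and_ok and or_ok

# Hypothetical rows; "Other_App" is made up to stand for the other apps' data.
rows = [
    {"test_name": "Sphng_Single_Node_evolution", "num_tasks_per_node": 1},
    {"test_name": "Sphng_Single_Node_evolution", "num_tasks_per_node": 2},
    {"test_name": "Other_App", "num_tasks_per_node": 1},
]

and_filters = [["test_name", "==", "Sphng_Single_Node_evolution"]]
or_filters = []

kept = [r for r in rows if row_matches(r, and_filters, or_filters)]
print(len(kept))  # 2
```

With the filter from the config, only the two `Sphng_Single_Node_evolution` rows survive, which is the behaviour one would expect when the perflogs of all apps are loaded together.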

@tkoskela
Member

tkoskela commented Dec 4, 2023

There might be a bug in series scaling. In SiWeakScaling.log with

title: Si Weak Scaling

x_axis:
  value: "num_cores"
  units:
    custom: null

y_axis:
  value: "Runtime_value"
  units:
    column: "Runtime_unit"

series: [["num_threads",1],["num_threads",2],["num_threads",4],["num_threads",8]]

filters:
  and: []
  or: []

column_types:
  num_cores: "int"
  num_threads: "int"
  Runtime_value: "float"
  Runtime_unit: "str"

I have
image

When I add scaling of the y-axis by the first series,

  scaling:
    column:
      name: "Runtime_value"
      series: 0

If I've understood correctly, the y value at each x should get divided by the corresponding y value in the num threads = 1 series. The scaled result I get is
image

The value of the num threads = 1 series is 1 for all x values which looks like what I'd expect. The other scaled series look incorrect however. For example, the num cores = 8 value in num threads = 2 should be greater than 1.
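
To make the expected behaviour concrete, here is a small sketch of scaling by a reference series. The runtimes are invented (they are not taken from SiWeakScaling.log), the function name `scale_by_series` is hypothetical, and the reference series is selected by its `num_threads` value rather than by the list index used in the config.

```python
# Hypothetical runtimes keyed by (num_threads, num_cores).
runtimes = {
    (1, 8): 10.0, (1, 16): 12.0,
    (2, 8): 15.0, (2, 16): 9.0,
}

def scale_by_series(values, reference_series):
    """Divide each y value by the reference series' y value at the same x."""
    scaled = {}
    for (series, x), y in values.items():
        scaled[(series, x)] = y / values[(reference_series, x)]
    return scaled

scaled = scale_by_series(runtimes, reference_series=1)
print(scaled[(1, 8)], scaled[(2, 8)])  # 1.0 1.5
```

The reference series becomes 1 everywhere, and any point that is slower than the reference at the same `num_cores` ends up above 1.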

@pineapple-cat
Collaborator Author

pineapple-cat commented Dec 4, 2023

I still can't replicate exactly what's going wrong with Asif's example, but the scaling issue Tuomas found was caused by a num_cores mismatch, which I fixed by sorting the dataframe before scaling.

fixed_si_weak_scaling

Edit: I've figured out how to more-or-less replicate the first problem. It appears to also be related to dataframe sorting in some way, so I'll continue to investigate now that I have a lead.

Edit 2: Problem fixed by moving sorting before filtering to avoid filter mask interference.
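
The num_cores mismatch is easy to reproduce in miniature. The sketch below is an illustration of that kind of bug, not the code from this branch: the rows are invented, plain lists stand in for the dataframe, and `scale_positionally` is a hypothetical helper that divides row by row without checking that the x values line up.

```python
# Hypothetical rows as (num_cores, runtime); the reference series is out of order.
reference = [(16, 12.0), (8, 10.0)]  # num_threads = 1
other = [(8, 15.0), (16, 9.0)]       # num_threads = 2

def scale_positionally(series, ref):
    """Divide row i of `series` by row i of `ref`, ignoring num_cores."""
    return [(x, y / ref_y) for (x, y), (_, ref_y) in zip(series, ref)]

# Without sorting, the 8-core row is divided by the 16-core reference row.
wrong = scale_positionally(other, reference)

# Sorting both series by num_cores first makes the rows line up.
right = scale_positionally(sorted(other), sorted(reference))

print(wrong)  # [(8, 1.25), (16, 0.9)]
print(right)  # [(8, 1.5), (16, 0.75)]
```

Sorting first only helps when every series covers the same x values; matching rows on the x column explicitly would be the more defensive choice.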

  • Requested bugfixes.
  • Sorting QoL fixes (default sort, colour assignment, legend label sort).

@asifsamiarain
Copy link
Collaborator

asifsamiarain commented Dec 6, 2023

Thanks @pineapple-cat for the fix. The plot now looks like this:

image

I still wonder how to order by series rather than by x-axis.

Below is an example that gives a quick look at the perflogs data:

import glob
import os

import pandas as pd
from pivottablejs import pivot_ui

# Collect every .log file three directory levels below ./perflogs.
path = os.getcwd()
files = glob.glob(os.path.join(path, "perflogs/*/*/*.log"))

# Perflogs are pipe-delimited; read each file and combine them into one dataframe.
df_list = (pd.read_csv(file, delimiter="|") for file in files)
df = pd.concat(df_list, ignore_index=True)

# Write an interactive pivot table that can be opened in a browser.
pivot_ui(df)

The code above generates a browser-viewable pivottablejs.html file. The same Sphng Single Node data for the 20230707, 20231013 and 20231124 job completion times then looks grouped and ordered as shown below:

image

image

Member

@tkoskela tkoskela left a comment

The documentation makes sense to me. A couple of suggestions:

  • It might be easier to understand if the examples were in figures instead of tables.
  • Instead of A note on X I would just have X as the subtitle.

While playing with Asif's logfiles, I noticed one more bit of odd behaviour. When I use the full datetime as the x axis, the series don't get sorted by num_tasks_per_node (or rather, it looks like they are ordered by the x axis value of the first element in the series).
image
If I drop the time from the datetimes, so that my series get grouped together on the x axis, the series are sorted by the numerical value of num_tasks_per_node in the legend, but in the plot they seem to be sorted by the string representation of num_tasks_per_node (i.e. 8 is the last entry).
image
Is this a bug or a feature?

Should we include some general formatting options in the config file? Something to think about for the future. Things I always end up hacking by hand include

  • orientation of the x axis labels
  • export to png

@pineapple-cat
Collaborator Author

pineapple-cat commented Dec 15, 2023

Bokeh has its own x-axis sorting method that I need to undermine at every step if we want non-string data to be sorted properly on a categorical plot. I've fixed this for (x, series) groupings in the commits below, but this will need to be revisited if we want to expand to (x, series1, series2) groupings. Here's what the plot should look like now:

Ascending

sphng_single_node_ascending

Descending

sphng_single_node_descending
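
For context, the string-versus-number ordering can be shown without Bokeh at all. The sketch below is not the code from the commits: `make_factors` is a hypothetical helper, the (x, series) groups are invented, and it assumes that descending order reverses both the x and the series ordering. The idea is to sort on the original numeric values first and only then convert to the strings that a categorical axis needs.

```python
num_tasks = [1, 2, 4, 8, 16, 32, 64, 128]

# Sorting the string representations puts "8" last.
as_strings = sorted(str(n) for n in num_tasks)
print(as_strings)  # ['1', '128', '16', '2', '32', '4', '64', '8']

def make_factors(groups, descending=False):
    """Sort (x, series) pairs on their original values, then stringify.

    The resulting list of string tuples is the shape that a nested
    categorical range (e.g. Bokeh's FactorRange) takes as factors.
    """
    ordered = sorted(groups, key=lambda g: (g[0], g[1]), reverse=descending)
    return [(x, str(series)) for x, series in ordered]

# Hypothetical (date, num_tasks_per_node) groups in arbitrary order.
groups = [("20231013", 8), ("20230707", 16), ("20230707", 2), ("20231013", 1)]

ascending = make_factors(groups)
print(ascending)
# [('20230707', '2'), ('20230707', '16'), ('20231013', '1'), ('20231013', '8')]
```

Within each x group the series now appear as 2 before 16, where a string sort would have put 16 first.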

Additionally, there's no need to hack anything to produce a PNG of the graph; this feature is available through the 'Save' button in the Bokeh toolbar:

bokeh_toolbar

Of course, we could save someone a button click by including this as a setting in the config, and it's true that having the option of vertical x-axis group labels would also be a good addition.

Edit: Sorting is always done by x-axis first and then by series. If it's preferable to order by series, like in Asif's example with the pivot table, consider whether you could simply swap your series and x-axis columns to achieve the effect you're looking for:

sphng_single_node_series_swapped

@pineapple-cat pineapple-cat merged commit ed45fd2 into main Dec 19, 2023
4 checks passed
@pineapple-cat pineapple-cat deleted the post-processing_data-transform branch December 19, 2023 16:34
Development

Successfully merging this pull request may close these issues.

  • Add OR and AND functionality to filtering
  • Add data transformations in config and high-level script
4 participants