Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python analysis window #120

Merged
merged 47 commits into from
Jul 23, 2024
Merged

Python analysis window #120

merged 47 commits into from
Jul 23, 2024

Conversation

magland
Copy link
Collaborator

@magland magland commented Jul 3, 2024

Merged code from #94

Created "Analysis (Py)" tab.

Next step: provide sampling draws as input to script

@magland
Copy link
Collaborator Author

magland commented Jul 3, 2024

You can now run the following in analysis.py

import pandas as pd

df = pd.DataFrame(sp_sampling.draws, columns=sp_sampling.parameter_names)

print(df)

@magland
Copy link
Collaborator Author

magland commented Jul 3, 2024

@magland
Copy link
Collaborator Author

magland commented Jul 3, 2024

There are a number of things to think about. What should the global variable (now sp_sampling) be called? Right now sp_sampling.draws has all the chains concatenated, and you also get sp_sampling.num_chains. It reflects the internal structure of the draws in the data model. Should that be adjusted? What's most intuitive for the user creating analysis.py?

@WardBrian
Copy link
Collaborator

Right now the UI lets you click over to analysis.py/analysis.R when you have not yet run sampling. Moving those tabs down to the area we only show after sampling has completed seems like a good idea to prevent this

@WardBrian
Copy link
Collaborator

Should that be adjusted? What's most intuitive for the user creating analysis.py?

The most intuitive will depend pretty strongly on the user and their use case, but I think there are 3 main 'types' of uses:

  • the user cares about the markov chains as markov chains, so having a num_samples x num_chains x num_parameters (or the transpose of this) numpy array will be the most useful for them
  • The user is using an existing tool which relies on pandas dataframes or arviz InferenceData, so whatever we provide should be easy to convert to these
  • The user cares about one specific parameter in the model, in which case providing a way to extract that one item as a numpy array of the correct shape incredibly useful. This is what cmdstanpy calls stan_variable and is implemented in stanio. Note that the reason this is tricky is when I say "parameter", I am including things like a parameter declared to be a matrix, not just one column of output

@magland
Copy link
Collaborator Author

magland commented Jul 3, 2024

Right now the UI lets you click over to analysis.py/analysis.R when you have not yet run sampling. Moving those tabs down to the area we only show after sampling has completed seems like a good idea to prevent this

Okay I have reorganized the tabs. I do think it's important to be able to view and even edit the analysis scripts even prior to the completion of sampling. So those tabs are present. But if you try to execute the scripts before sampling is complete, you get a (helpful?) error message.

image

Copy link
Collaborator

@jsoules jsoules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions and suggested areas to tighten.

gui/src/app/Project/ProjectQueryLoading.ts Outdated Show resolved Hide resolved
gui/src/app/pages/HomePage/HomePage.tsx Outdated Show resolved Hide resolved
gui/src/app/pyodide/AnalysisPyFileEditor.tsx Outdated Show resolved Hide resolved
gui/src/app/pyodide/AnalysisPyFileEditor.tsx Outdated Show resolved Hide resolved
gui/src/app/pyodide/AnalysisPyFileEditor.tsx Outdated Show resolved Hide resolved
Comment on lines 96 to 101
if (consoleOutputDiv) {
consoleOutputDiv.innerHTML = "";
}
if (imageOutputDiv) {
imageOutputDiv.innerHTML = "";
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these divs don't exist at this point, isn't that a blocking problem? Like we should prevent the run if there's nowhere for it to report to?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could imagine a case (if this component is being reused somewhere else) where we might only be interested in console output (if the script is not expected to generate images) or only image output (if we don't need to show the console).

Comment on lines 104 to 151
const toolbarItems: ToolbarItem[] = useMemo(() => {
const ret: ToolbarItem[] = [];
const runnable = fileContent === editedFileContent && imageOutputDiv;
if (runnable) {
ret.push({
type: "button",
tooltip: "Run script",
label: "Run",
icon: <PlayArrow />,
onClick: handleRun,
color: "black",
});
}
if (!imageOutputDiv) {
ret.push({
type: "text",
label: "No output window",
color: "red",
});
}
let label: string;
let color: string;
if (status === "loading") {
label = "Loading pyodide...";
color = "blue";
} else if (status === "running") {
label = "Running...";
color = "blue";
} else if (status === "completed") {
label = "Completed";
color = "green";
} else if (status === "failed") {
label = "Failed";
color = "red";
} else {
label = "";
color = "black";
}

if (label) {
ret.push({
type: "text",
label,
color,
});
}
return ret;
}, [fileContent, editedFileContent, handleRun, status, imageOutputDiv]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is also a good candidate for refactoring out and sharing among the different [data | analysis].[py | r] editors.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I think we should hold off until things solidify more, since we don't yet know exactly what will be in common (or different) between the 4 cases.

Comment on lines 60 to 69
export type PydodideWorkerStatus =
| "idle"
| "loading"
| "running"
| "completed"
| "failed";

export const isPydodideWorkerStatus = (x: any): x is PydodideWorkerStatus => {
return ["idle", "loading", "running", "completed", "failed"].includes(x);
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has its own drawbacks, but defining the status as an enum allows you to pull the keys/values of the enum as a list, which might make the includes check here more intuitive. (That said we probably aren't expanding the number of statuses any time soon, so maybe not worth worrying about?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong about this, but the simplicity of string types makes serialization much easier... and in this case we're passing json messages.

@magland
Copy link
Collaborator Author

magland commented Jul 4, 2024

Thinking about the interface for analysis.py to get the draws data.

We want something that can be made backward compatible so that as we make improvements to the interface, existing SP Projects will continue to function, at least to a reasonable degree.

The only way I can think of for doing that is to provide python imports that implement specific interfaces, and the user can select which they want to use.

So here's an example analysis.py

import matplotlib.pyplot as plt

# Import a specific interface
from sp_util import load_draws_v1

# Internally loads the draws from a file
# The actual implementation can change over time, but the API should be steady
draws = load_draws_v1()

# Get a list of dataframes, one for each chain
a = draws.get_dataframes()
print(a[0])

# Plot a histogram of `lp__`
plt.figure()
plt.hist(a[0]['lp__'])
plt.show()

Here, by virtue of this specific import the user gets a particular interface, with the expectation that it will be maintained in a backward-compatible manner.

I prefer this import method compared with injecting a mysterious variable into the mix, because python isn't supposed to work that way. I think analysis.py should be a well-defined python script that does not assume any magic variables, but assumes certain modules (e.g., sp_util) exist.

In the implementation from my latest commit, the sp_util module does not come from pypi... but rather it is hard-coded in pyodide... so it's internal to the stan-playground web app. The actual sampling data (draws, parameter names, etc) are communicated behind-the-scenes by writing to a file that is available to the sp_util module. The load_draws_v1 function first loads the data from that file and then prepares an object that has the get_dataframes() method. It also has a draws.get_dataframe_longform() method.

@WardBrian
Copy link
Collaborator

from sp_util import load_draws_v1

I think doing semantic-versioning-by-symbol-name is pretty un-Pythonic as well. The code we provide to load the draws is the one thing we completely control, so it's the thing I'm least worried about backwards compatibility for, to be honest -- numpy or pandas updating and breaking something seems much more likely than us doing it. Especially if what we provide is an object, we can always just add more methods on it if we want newer functionality, and leave the old ones.

If we ever really needed to do something backwards-incompatible, I think the better way to handle that is something like a version parameter in the meta storage/query parameters.

I prefer this import method compared with injecting a mysterious variable into the mix, because python isn't supposed to work that way.

I think Python has enough magic in the language that this would be fine. I think the bigger argument as for why to do it by providing a global variable is that it makes it easier to implement a nice 'download this for use locally with cmdstanpy' easy -- we wouldn't need to provide a module to be imported or anything, just some simple-ish glue code which leaves behind the variable name we choose

@magland
Copy link
Collaborator Author

magland commented Jul 8, 2024

@WardBrian Are you suggesting the script would instead be the following, where "draws" is the special variable?

import matplotlib.pyplot as plt

# Get a list of dataframes, one for each chain
a = draws.get_dataframes()
print(a[0])

# Plot a histogram of `lp__`
plt.figure()
plt.hist(a[0]['lp__'])
plt.show()

@WardBrian
Copy link
Collaborator

Something like this, yes

@magland magland closed this Jul 8, 2024
@magland magland reopened this Jul 8, 2024
@magland
Copy link
Collaborator Author

magland commented Jul 10, 2024

@WardBrian @jsoules

Implemented draws object for analysis.py based on what we had discussed. Here's a gist to try

http://localhost:3000?project=https://gist.github.com/magland/d811a59035037dd0e19e73900048a8fb

The analysis.py for this example, which illustrates the functionality.

import numpy as np
import matplotlib.pyplot as plt

# Histograms
for pname in ['gamma', 'beta', 'phi_inv', 'lp__']:
    plt.figure()
    plt.hist(draws.get(pname))
    plt.title(pname)
    plt.show()

print('PARAMETERS')
print('==========')
for p in draws.parameter_names:
    print(p)

print('')
print('PARAMETER VALUES')
print('================')
print('gamma:', draws.get('gamma'))
print('beta:', draws.get('beta'))
print('phi_inv:', draws.get('phi_inv'))

print('')
print('OTHER')
print('=====')
print('Mean of lp__:', np.mean(draws.get('lp__')))
print('Shape of y:', draws.get('y').shape)

print('')
print('DATAFRAME')
print('=========')

df = draws.as_dataframe()
print(df)

Note that y is a matrix parameter, but it gets flattened to a vector. Will want to use stanio instead, as noted in the comments in the source code. @WardBrian do you want to work on this? Here's the relevant script that gets loaded into the worker and then into the pyodide environment

https://github.com/flatironinstitute/stan-playground/pull/120/files#diff-356c7427b0952fcfd57ddd11e46493a6624870c92e611c335ee52f142272654d

@magland
Copy link
Collaborator Author

magland commented Jul 10, 2024

Assuming the structure of this looks good... In terms of merging order, it would help a lot to merge a basic version of this to minimize conflicts with other features we are working on (e.g., data generation). And then reopen a new PR with the needed tweaks to this one. What do you think?

Copy link
Collaborator

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things stood out to me in the current version -

gui/src/app/SamplerOutputView/SamplerOutputView.tsx Outdated Show resolved Hide resolved
gui/public/sp_load_draws.py Outdated Show resolved Hide resolved
gui/src/app/pyodide/pyodideWorker/pyodideWorker.ts Outdated Show resolved Hide resolved
gui/src/app/pyodide/pyodideWorker/pyodideWorker.ts Outdated Show resolved Hide resolved
gui/src/app/pyodide/pyodideWorker/pyodideWorker.ts Outdated Show resolved Hide resolved
@WardBrian
Copy link
Collaborator

@jsoules @magland this PR is now updated with respect to main. I've done a basic once-over to make sure everything still works as expected, but it can now be given a more thorough review!

@magland
Copy link
Collaborator Author

magland commented Jul 23, 2024

@WardBrian Nice! I have tested it and everything seems to be working.

As I mentioned I think we should do a separate PR for analysis.r and we can borrow code from #94 . LMK if you want me to work on that -- I can draft an initial version, but I will need help figuring out how to make the R draws object analogous to the python one.

@WardBrian
Copy link
Collaborator

WardBrian commented Jul 23, 2024

I agree that should be a separate PR -- maybe after this is merged?

I'm currently trying to debug an issue with (I think) the TextEditor component - if I flip back and forth between the data.r and data.py tabs, the other text editors on the screen (e.g. main.stan) flicker/re-render -- maybe due to the monaco singleton getting touched? Any ideas @magland?

It's not introduced by this PR I'm pretty sure, just this makes it noticeable

Edit: This has been fixed

@WardBrian WardBrian requested a review from jsoules July 23, 2024 16:11
Copy link
Collaborator

@jsoules jsoules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is overall in good shape. I have a few minor notes but would defer anything more substantial to a later refactor.

We should be able to get this merged in short order I think.

gui/src/app/SamplerOutputView/TracePlotsView.tsx Outdated Show resolved Hide resolved
gui/src/app/FileEditor/TextEditor.tsx Outdated Show resolved Hide resolved
gui/src/app/FileEditor/TextEditor.tsx Outdated Show resolved Hide resolved
gui/src/app/pyodide/pyodideWorker/pyodideWorkerTypes.ts Outdated Show resolved Hide resolved
gui/src/app/pyodide/pyodideWorker/pyodideWorkerTypes.ts Outdated Show resolved Hide resolved
@jsoules
Copy link
Collaborator

jsoules commented Jul 23, 2024

Oh, one thing that would be nice that I didn't mention because it isn't strictly germane to this PR, but it would be good for us to document somewhere the expectations around the global data/memory sharing model that's used to pass data around among the workers/non-TS, non-Stan interpreters and our own app. Like, what's the data structure look like, what's the strategy.

@magland
Copy link
Collaborator Author

magland commented Jul 23, 2024

@WardBrian it looks like we had some simultaneous commits, doing some of the same things. Hopefully everything shakes out properly.

@WardBrian
Copy link
Collaborator

Yep, I resolved any conflicts

@jsoules
Copy link
Collaborator

jsoules commented Jul 23, 2024

Ok, looks to me like we're good to merge this--everybody good with that?

@jsoules jsoules merged commit 0f7c84d into main Jul 23, 2024
2 checks passed
@jsoules jsoules deleted the analysis-py branch July 23, 2024 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integrate analysis.py/analysis.R into UI Integrate data.py and data.r into the UI
3 participants