Paper: How the Scientific Python ecosystem helps answer fundamental questions of the Universe #924
This has been placed in draft mode as I'm still continuing to write. I'm opening the PR now to ensure that it is created before the May 31 deadline.
The
If I'm citing things that don't have DOIs (like software that doesn't have an archive on Zenodo), is there a way to communicate this to Curvenote to make it clear this is intended?
@matthewfeickert you can add citation keys you want to ignore in error_rules:
  - rule: doi-exists
    severity: ignore
    keys:
      - Atr03
      - terradesert
@matthewfeickert can you make this update so we can run checks?
@matthewfeickert Just a reminder that first submissions must be complete by today. Please do what you need to do to get your PR out of the draft state so we can mark it ready for review and assign a reviewer. We assign reviewers to papers all at once and not piecemeal paper by paper.
I don't understand this last comment, but I'll assume that it is not something that affects actionable information on my side. Thanks for the reminder that the deadline is today at 23:59 Pacific. 👍
No worries @matthewfeickert, we assigned two reviewers to your paper :)
The data structure of each event consists of variable length lists of physics objects (e.g. electrons, collections of tracks from charged objects).
To study the properties of the physics objects in a statistical manner, a fixed event analysis procedure is repeated over billions of events.
This has traditionally motivated the use of "event loops" that implicitly construct event-level quantities of interest and leveraged the `C++` compiler to produce efficient iterative code.
This precedent made it difficult to take advantage of array programming paradigms that are common in Scientific Python given NumPy [@numpy] vector operations.
It would be beneficial to the audience to include a sentence explaining what array programming is and why it is useful in this context.
I'm hesitant to have this paper take on the responsibility of explaining these concepts to the audience, given that we already provide a reference to the NumPy paper, "Array programming with NumPy". The NumPy paper doesn't attempt to further cite the concept of array programming or to elaborate on its definition beyond the first sentence of the paper, I would assume because of how widespread these concepts are in scientific computing. We do also mention in the introduction:
The use of dataframes and array programming for data analysis has enhanced the user experience while providing efficient computations without the need of coding optimized low-level routines.
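As a brief illustration of the array programming idea raised in this review thread, here is a minimal sketch contrasting an explicit event loop with the equivalent NumPy array expression. The toy columns `pt` and `eta`, the function names, and the random values are assumptions made up for illustration and are not taken from the paper. Real collider events hold variable-length lists of objects, which plain rectangular NumPy arrays do not represent directly, so this only shows the fixed-size case.

```python
import numpy as np

# Toy columns (made-up values): one muon per event, with transverse
# momentum in GeV and pseudorapidity. Not data from the paper.
rng = np.random.default_rng(seed=0)
pt = rng.uniform(5.0, 100.0, size=10_000)
eta = rng.uniform(-2.5, 2.5, size=10_000)


def pz_event_loop(pt, eta):
    """Event-loop style: one Python-level iteration per event."""
    out = np.empty_like(pt)
    for i in range(len(pt)):
        out[i] = pt[i] * np.sinh(eta[i])
    return out


def pz_array(pt, eta):
    """Array style: a single expression over whole columns.

    The per-event iteration still happens, but inside NumPy's
    compiled routines rather than in the Python interpreter.
    """
    return pt * np.sinh(eta)


pz_loop = pz_event_loop(pt, eta)
pz_vec = pz_array(pt, eta)
```

Both functions compute the same longitudinal momentum, p_z = p_T sinh(eta), for every event. The array version expresses the operation once for the whole column, which is the style the quoted introduction sentence refers to.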
The most famous and revolutionary discovery in particle physics this century is the discovery of the Higgs boson — the particle corresponding to the quantum field that gives mass to fundamental particles through the Brout-Englert-Higgs mechanism — by the ATLAS and CMS experimental collaborations in 2012. [@HIGG-2012-27;@CMS-HIG-12-028]
This discovery work was done using large amounts of customized `C++` software, but in the following decade the state of the PyHEP community has advanced enough that the workflow can now be done using community Python tooling.
To provide an overview of the tooling and functionality, a high level summary of a simplified analysis workflow of a Higgs "decay" to two intermediate $Z$ bosons that decay to charged leptons $(\ell)$ (i.e. electrons ($e$) and muons ($\mu$)), $H \to Z Z^{*} \to 4 \ell$, on ATLAS open data [@ATLAS-open-data] is summarized in this section.
(Optional) Restructure the first paragraph in this section by putting the message of the last sentence first. For example, "In this section we will demonstrate how the scientific Python ecosystem allows for ... with an example of a simplified analysis workflow of a Higgs Boson decay." My reasoning is that during my first read through the paper I missed the subtle transition into the example; it felt like the paper jumped straight into it after giving background on the Higgs Boson phenomenon. Moving the message of the final sentence up should make it stand out more to the reader.
I understand your point, but I don't think we want to restructure that paragraph as this now means awkwardly introducing the Higgs in the middle of another sentence. The focus of the paper is the software, but we still need to motivate it with the science which is the primary driver and so some text needs to be devoted to the physics. Though to try to make the final sentence of the paragraph more clear we've repeated that it is the Pythonic tooling being used. Though if you feel that this still isn't helpful I'm happy to wordsmith this more.
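For readers unfamiliar with the $H \to Z Z^{*} \to 4 \ell$ final state in the quoted paragraph, a quantity such an analysis typically reconstructs is the invariant mass of the four leptons, $m_{4\ell} = \sqrt{(\sum_i E_i)^2 - |\sum_i \vec{p}_i|^2}$ in natural units. The sketch below evaluates it with NumPy for a batch of events. It is not the paper's implementation: the function name `invariant_mass`, the `(n_events, n_leptons)` input layout, and the toy kinematics are all assumptions chosen for illustration.

```python
import numpy as np


def invariant_mass(pt, eta, phi, mass):
    """Invariant mass of the summed four-momenta in each event.

    All inputs are arrays of shape (n_events, n_leptons), with momenta
    and masses in GeV (natural units, c = 1). This layout is an
    assumption for the sketch, not the paper's data model.
    """
    # Convert collider coordinates to Cartesian momentum components.
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    energy = np.sqrt(px**2 + py**2 + pz**2 + mass**2)

    # Sum the four-momenta over the leptons in each event.
    total_e = energy.sum(axis=-1)
    total_p2 = (
        px.sum(axis=-1) ** 2 + py.sum(axis=-1) ** 2 + pz.sum(axis=-1) ** 2
    )

    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(total_e**2 - total_p2, 0.0))


# One made-up event: four massless leptons in two back-to-back pairs,
# so the momenta cancel and the mass is the sum of the energies.
pt = np.full((1, 4), 31.25)
eta = np.zeros((1, 4))
phi = np.array([[0.0, np.pi, 0.5 * np.pi, -0.5 * np.pi]])
mass = np.zeros((1, 4))

m4l = invariant_mass(pt, eta, phi, mass)
```

Because the expression operates on whole columns, the same function handles one event or millions without an explicit loop.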
I believe this paper does a good job of telling the story that complex, large-scale physics research is now feasible in Python due to recent advancements in the scientific Python ecosystem. The background information clearly defined the technological gap that needed to be addressed to make this type of research possible. The authors were thorough in indicating how each library addressed the former challenges, and how such code should be used in an analysis workflow. Finally, I believe the Higgs boson example was a good choice because it is a convincing real-world problem that demonstrated the ease with which this research can now be conducted in Python.
Thanks very much for your review @Marcdh3! We will get to work on incorporating your feedback shortly (hopefully before SciPy starts).
I thought this paper was clear and easy to follow. Some very minor comments:
- Should the abstract be updated to refer to the paper instead of the talk?
- You could include an explanation of why ROOT is preferable to other formats (e.g. Arrow?)
- Typo, "a specific calorimeter subsystems"
- The description of the role of "nanobind" probably doesn't need to be repeated between sections.
- Do all computations currently run on CPUs, and if so, is there interest in exploring GPU implementations?
Thanks @cliffckerr for the review! I will address these posthaste, but am in week 5 of travel for work and teaching at the moment so it probably won't be until next week.
Hi @matthewfeickert, just a friendly reminder that all initial reviews are in. I highly recommend you start responding to the comments soon, since it will take time for the reviewers to respond to the changes. Remember that the open review period ends on Sep 2, and you will not be able to make any changes to the manuscript after that point. If you have any questions, please let me know!
Co-authored-by: Vangelis Kourlitis <[email protected]>
Directive options are parsed line-by-line, so if a directive is split across multiple lines then only the first line will be captured.
Co-authored-by: Franklin Koch <[email protected]>
Co-authored-by: Marcus Hill <[email protected]>
* Add quotes around 'bunches'.
* Reemphasize that the tooling in question is Pythonic.
* Refer to 'paper' instead of 'talk' in the abstract.
* Note ROOT's almost 30 year history of columnar data structures.
* Add 'Awkward Arrays in Python, C++, and Numba' as a reference.
  - c.f. https://inspirehep.net/literature/1776192
* Correct typo of 'subsystems' to 'subsystem'.
* Add citation to IRIS-HEP AGC Zenodo archive.
  - c.f. https://doi.org/10.5281/zenodo.7274936
Done.
I've added a short mention that ROOT has provided columnar data structures with good serialization compression for almost 30 years (since 1997; Arrow wasn't released until 2016), along with an additional reference. The realities are more complex and well outside the scope of this short paper, but I will note that when the field has well over an exabyte of data stored in the same file format, which is used for basically everything, there is considerable inertia against switching regardless of motivation.
Thanks! @cliffckerr I'm sorry that there were so many typos that you and @Marcdh3 had to catch (and this is even with the
Can you elaborate a bit more on what is repeated? "nanobind" only appears in the main text in the following two sentences:
and
The first sentence is introducing the concept of custom Python bindings, and the second is highlighting the fact that
Yes. At the moment the focus has been scaling out these workflows at dedicated "analysis facilities" to achieve required data processing throughput rates for the next iteration of the LHC (reference presentation). There is ongoing work to be able to utilize hardware acceleration across the tooling ecosystem. This work is still very much in the research stage and at varying stages of maturity across the ecosystem, but Awkward Array computations work on GPUs and the statistical libraries support hardware acceleration. (Note: I rebased this branch off of the current HEAD of the |
Thanks @matthewfeickert -- my original comment no longer seems so clear to me 😂 It's probably fine as-is. If you do want to revisit the
But it's really not an issue; for some reason I thought there was a full paragraph or at least a full sentence repeated.
* Apply Cliff Kerr's suggestions of avoiding redundant information on nanobind.
(Sorry for the slow reply @cliffckerr — been doing a week of international work travel so I'm behind on a bit of everything.)
These two come from figure captions, and I generally like them to be as standalone as possible. Though as you do point out, the sentence preceding Figure 1
makes the mention in the Figure 1 caption
a bit redundant as the focus of Figure 1 is the interface, and while the interface decisions enable performance, the performance isn't the focus. So I'm going to remove this line from Figure 1. 👍
@cliffckerr @Marcdh3 Hello reviewers, the author @matthewfeickert has responded to your feedback. If you don't have any further comments, could you please approve the PR? Thanks for the time and effort you put into the review!
Thanks @Marcdh3 @cliffckerr for reviewing the paper!
I don't seem to have the option to re-request review, but I approve!
Thanks @cliffckerr! If you'd like to formally hit the button I've triggered review request from you. :) We will take no action to mean "approve" though. 👍
Looks great, nice work!
Thanks very much to @cliffckerr @Marcdh3 for their hard work as reviewers to help improve the paper — it is truly appreciated. Thanks also to @hongsupshin and the rest of the Proceedings Committee for the heroic amounts of work that they did to once again serve the SciPy community and provide a wonderful service to the conference. 🙏
This PR adds the myst.md paper source for the SciPy 2024 proceedings for the talk "How the Scientific Python ecosystem helps answer fundamental questions of the Universe".
Editor: Hongsup Shin @hongsupshin
Reviewers: