speed up shot loading #29
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd). OK, so I did some digging in the code and I was able to speed up loading files by a factor of 3(!!!) by changing just one line. In update_row() I changed
to
I don't know why this helps, but at least on my test machines it does.
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd). OK, another thing that gives me a factor-of-2 speed improvement: instead of using the filepath to reference the row that needs updating (in update_row), use the dataframe index, as this is twice as fast. The filepath can then be retrieved from the dataframe row.
and
A small improvement in speed was also gained when reordering and rewriting add_files:
To test speed I'm using a feature that I'm currently writing to export and import dataframes. I reload the same dataframe of 2000 shots every time (from one file).
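A minimal sketch of the difference between the two lookups being discussed, assuming a hypothetical dataframe and column names (not lyse's actual code):

```python
import pandas as pd

# Hypothetical stand-in for lyse's shot dataframe.
df = pd.DataFrame({
    'filepath': ['/shots/shot_%04d.h5' % i for i in range(2000)],
    'run_number': range(2000),
})

filepath = '/shots/shot_1234.h5'

# Filepath-based lookup: compares the string against every row in the column.
row_by_filepath = df[df['filepath'] == filepath]

# Index-based lookup: jumps straight to the row; the filepath can then be
# read back out of the row itself, as described in the comment above.
row_by_index = df.iloc[1234]
filepath_from_row = row_by_index['filepath']
```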
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington). Parallelising HDF5 file reading with threads will not achieve anything, as h5py holds the Python GIL for all operations. The comment on line 1623
As you've discovered though, there are lots of gains to be had outside of file reading. We are not particularly efficient with many of our dataframe operations, and I'm sure there are gains to be had in the interaction with the Qt model as well. I'll have a look at your above suggestions!
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington). If you're posting code changes for discussion, it would be nice if you could post diffs for ease of seeing what has changed. You can run
I don't think allowing
However, if
One way to prevent the race condition from biting anyone in the future if they use
This should be a factor of n faster than actually looking up the filepath, whilst ensuring race conditions don't cause invisible bugs in the future. I can see though how the line
could be a problem for performance. If you're calling it for every update, then it searches through the dataframe once for every file, and so updating n files runs in quadratic time in n. When you say a factor of two, do you mean that the
There are other ways to optimise lookups rather than searching. For example, in SQL databases you can "index" a column to speed up searches for rows that have a particular value. I don't know if pandas supports anything like this (googling for "pandas index column" obviously gives many irrelevant results), but we could also do it manually by maintaining a dictionary containing a filepath:row_index mapping that is updated whenever a row is added or removed (or rows are re-ordered, which we don't currently support but might in future). But your solution with the above added check should be entirely sufficient. If there are similar slow searches through dataframes elsewhere in the code that you hit upon, this other approach might be worth considering. Pandas people seem to care a lot about performance, so it's possible that this functionality is built in somewhere if you can work out what to type into Google to find it.
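A minimal sketch of the manual filepath:row_index mapping described above, with hypothetical class and method names (lyse's real bookkeeping would live in its model class):

```python
# Keep a dict mapping filepath -> row index, updated whenever a row is
# added or removed, so lookups are a hash-table access rather than an
# O(n) search through the dataframe.
class FilepathIndex(object):
    def __init__(self):
        self._row_of = {}

    def add(self, filepath, row_index):
        self._row_of[filepath] = row_index

    def remove(self, filepath):
        removed_row = self._row_of.pop(filepath)
        # Rows after the removed one shift up by one:
        for path, row in list(self._row_of.items()):
            if row > removed_row:
                self._row_of[path] = row - 1

    def lookup(self, filepath):
        # Single dictionary lookup instead of a full-column string comparison.
        return self._row_of[filepath]
```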
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington). Regarding the optimisation of
I was confused at first though, since you turn filepaths into a set, but then later get it as a list from the dataframe again (presumably to get the correct order, since sets don't preserve order). Could you give the set a different name (even as silly as appending
Anyway, it's all looking good. These are welcome changes and you should make a pull request about them, though I'm guessing you're waiting on your other pull request to be merged first.
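For illustration, a small sketch of the set-versus-ordered-list pattern being discussed, with made-up names (not the actual add_files code):

```python
import pandas as pd

# The set is used only for fast membership tests / de-duplication and is
# named distinctly, since sets don't preserve order; a stable order is
# recovered from the dataframe itself afterwards.
df = pd.DataFrame({'filepath': ['/shots/a.h5', '/shots/b.h5', '/shots/c.h5']})
requested = ['/shots/c.h5', '/shots/a.h5', '/shots/c.h5']

requested_set = set(requested)  # de-duplicated; order not preserved
ordered_filepaths = [p for p in df['filepath'] if p in requested_set]
```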
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington). Another thing you might look at is in
Of course, profiling is the only way to know what bits of code should be looked at!
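The profiling helper Chris refers to elsewhere in this thread isn't reproduced here; as a stand-in, a generic standard-library sketch of profiling a call such as add_files (the function and argument names are hypothetical placeholders):

```python
import cProfile
import pstats

def profile_call(func, *args, **kwargs):
    # Run func under cProfile and print the 20 most expensive calls by
    # cumulative time, e.g. profile_call(filebox.add_files, filepaths).
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)
    return result
```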
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd). Will definitely do diffs in the future (or create a branch on my fork). Well, update_row is currently called in two places: add_files and the analysis loop. So I don't think this is too bad. Yes, as I wrote already, I'm importing/exporting 2000-shot dataframes as a whole (one file) and see a speed increase of a factor of 2 there. And update_row is the only thing I'm editing currently (besides add_files), so I guess update_row becomes 2 times faster as a whole. I'm not quite sure how this affects loading 2000 files, but I would assume a speedup as well, though a smaller one. Pandas supports indexing (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html) and, as I've read, this also speeds up searches. You then reference rows in loc by that index, like df.loc[filepath, column]. But I'm not that great with pandas, so that's for someone else to do. The main slowness here, I think, is the string comparison though. Yes, the set is unordered, and well, order seems to be better than chaos, so that's why I'm later using a list. With the other changes I'm trying to reduce the number of loops, as they were not really needed. Yes, I'm waiting on the Update Dataframe pull request in particular, as they both touch the update_row function and I don't really want to end up with conflicts again. Also, I'm sure there is more speed that can be gained, so I'm not in a rush. The incoming buffer I looked at, but nothing really stuck out. I already wrote a workaround with the dataframe files that can be exported and imported. Is this something that could be of interest to others? Or is the goal rather to improve loading of individual files?
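A minimal sketch of the set_index approach Jan links to, with hypothetical column names (not lyse's dataframe layout):

```python
import pandas as pd

# Set 'filepath' as the dataframe index so row lookups go through .loc
# rather than a full-column string comparison.
df = pd.DataFrame({
    'filepath': ['/shots/shot_%04d.h5' % i for i in range(2000)],
    'run_number': range(2000),
})
df = df.set_index('filepath')

# Lookup by filepath, as in df.loc[filepath, column]:
run_number = df.loc['/shots/shot_1234.h5', 'run_number']
```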
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington). Hm, I wonder if that dataframe indexing helps. Yes, the string comparison is slow, but I think only because it is being done once for every row in the dataframe. Whereas with the indexing, if it's implemented well, it should be something like a dictionary lookup, which is a hash table and only does one string comparison (well, if you're lucky it does only one - it might do two or three, but definitely not n). Definitely something I'll keep in mind for future performance improvements!
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington). As for the exporting/importing thing, I'm not sure. What does your workaround do? It's for loading files in and out of lyse? I suppose "load the exact state you had before" can be cached without much difficulty, but how would you handle it if a file has changed on disk, or if different files are to be loaded? I'm open to all sorts of caches, but it's hard to get things like this right. Oh, by the way, one more thing. At the moment we have some code running that makes sure we are not making calls to Qt outside the main thread. This is very important during development, as missing even a single call that is outside the main thread leads to extremely hard-to-debug bugs. However, this code runs every single time a Qt function is called, and so it runs a lot during
You can disable it by putting
somewhere at the top level of
Would be interesting to see if it makes a difference.
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington). Oh, actually, reading the code for the
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd). Currently I'm not checking whether the files still exist/have changed etc. I'm just splitting the dataframe by sequences and exporting it into an HDF5 file on press of a button. To avoid confusion with shot files, the HDF5 file is saved with the file ending .df. I'm using DataFrame.to_hdf and pandas.read_hdf for this. This is mainly useful when measuring lots of sequences overnight. We then export the dataframe(s) and load in the individual sequences one after the other, as this is a lot faster than loading the files. Also, after things are done, we overwrite the old dataframe with an updated version. Also, here is the result of your profiler running on add_files:
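The profiler output mentioned above isn't reproduced here. For the export/import workaround itself, a minimal sketch of the DataFrame.to_hdf / pandas.read_hdf round trip, with made-up file and key names:

```python
import pandas as pd

# Hypothetical dataframe standing in for a night's worth of shots.
df = pd.DataFrame({
    'filepath': ['/shots/shot_%04d.h5' % i for i in range(2000)],
    'x': range(2000),
})

# Export the dataframe for a sequence to a single file with a .df extension:
df.to_hdf('overnight_sequence.df', key='dataframe', mode='w')

# Later, re-import it in one go instead of re-reading thousands of shot files:
df_reloaded = pd.read_hdf('overnight_sequence.df', key='dataframe')
```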
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd). If someone's interested, my changes are now over at File-loading-Performance. I've improved update_row, add_files and getmodel_row_by_filepath so far. I think there is still a lot to gain with scientific_notation and update_row, so I'll be waiting until that is done before creating a pull request.
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington). There is also about 30% of
So a more radical change would be to look into how Qt models work and subclass one to be backed by the dataframe itself, with the tooltips and scientific notation all being computed lazily, only when the method to get that data is called (indicating the column was actually visible or the user had actually moused over for a tooltip), and possibly cached. This was originally suggested by @philipstarkey when we began porting to Qt, and I don't recall why I didn't go for it. In any case the code is organised well enough that the change would be isolated to the
That could be something to look into for the future, but these optimisations should be included now anyway! I'm a little skeptical about the profiling results - surprised not to see much in the way of Qt calls in there. Maybe Qt is just really fast, but it's also the case that sometimes the profiler doesn't catch certain things - like I'm not sure if the creation of constants like the dictionaries and string constants in
I think you should feel free to make a pull request even if you think something isn't "ready" yet. Pull requests are more visible in the Bitbucket UI and are a nice way to see what things are "in progress" from other people's forks. Pull requests have the diff and list of commits visible and have nested comment threads (which issue threads frustratingly lack). They also have the source branch listed, so people can see where to pull from for testing without you having to tell us. Just mention in the pull request that you're still looking for feedback and we won't merge - even if you don't mention anything, most things won't be merged without some comments anyway about testing etc. We can always just reject a pull request if an approach is abandoned.
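A rough sketch of what a dataframe-backed model with lazy formatting could look like, using PyQt5 names for illustration; this is not lyse's implementation, and scientific_notation here is a stand-in for the real formatting code:

```python
import pandas as pd
from PyQt5 import QtCore

def scientific_notation(value):
    # Stand-in for lyse's real formatting function: format numbers in
    # scientific notation, pass everything else through as a string.
    try:
        return '%.3e' % value
    except TypeError:
        return str(value)

class DataFrameModel(QtCore.QAbstractTableModel):
    """Model backed directly by a pandas DataFrame; display strings and
    tooltips are only computed when the view asks for them."""

    def __init__(self, dataframe, parent=None):
        QtCore.QAbstractTableModel.__init__(self, parent)
        self.dataframe = dataframe

    def rowCount(self, parent=QtCore.QModelIndex()):
        return len(self.dataframe)

    def columnCount(self, parent=QtCore.QModelIndex()):
        return len(self.dataframe.columns)

    def data(self, index, role=QtCore.Qt.DisplayRole):
        # Only format the cell when Qt actually requests it (visible cell
        # or mouse-over for a tooltip), rather than eagerly for every shot.
        if role in (QtCore.Qt.DisplayRole, QtCore.Qt.ToolTipRole):
            value = self.dataframe.iat[index.row(), index.column()]
            return scientific_notation(value)
        return None
```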
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd). Moving the variables to the global namespace didn't do much, and neither did removing the try/except and replacing it with if/else.
I guess scientific notation is just that dominant, as it's called lots of times. A "significant" gain can, however, be achieved by removing resizeRowToContents in add_files, but the gain isn't great enough to justify removing it (1 second). The main time-consuming factor in update_row seems to be the creation of all the QtGui.QStandardItems. This leads me to believe that rewriting Qt's model would most likely yield a much larger increase in speed. However, pandas and Qt are both not my strong suit, as I had my first contact with them last month. Luckily, performance problems with many shots are confined to lyse for the moment. But your profiling tool is/was definitely a big help in finding lines that need improvement. I'll wait for the Update Dataframe pull request to go through and then I'll create the pull request. The repos have been quite inactive in the last few weeks, so I'm not really expecting much response at the moment anyway.
Original report (archived issue) by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
When we want to look at data acquired in the past, it takes a really long time to load the shots into lyse (commonly this can be 2000-10000 shots and take up to half an hour).
I looked at the code and this all seems to be done in one thread. I think the file reading process can be parallelized with a pool of worker threads. This should bring a speed improvement.
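A minimal sketch of the proposed approach, with hypothetical function names and HDF5 layout; note the caveat raised earlier in the thread that h5py holds the Python GIL, so a thread pool may not speed up the reads themselves:

```python
import h5py
from concurrent.futures import ThreadPoolExecutor

def read_shot(filepath):
    # Hypothetical per-shot read: pull some attributes out of the shot file.
    with h5py.File(filepath, 'r') as f:
        return dict(f['globals'].attrs)

def read_shots_parallel(filepaths, max_workers=4):
    # Hand each shot file to a worker in a thread pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_shot, filepaths))
```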