speed up shot loading #29

Open · philipstarkey opened this issue Jul 12, 2017 · 15 comments

Labels: enhancement, major

Comments


Original report (archived issue) by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).


When we want to look at data acquired in the past, it takes a really long time to load the shots into lyse (commonly this can be 2000-10000 shots and take up to half an hour).
I looked at the code and this seems to all be done in one thread. I think the file reading process could be parallelized with a pool of worker threads, which should bring a speed improvement.
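
Roughly the kind of thing I have in mind (just a sketch; load_shot here is a made-up stand-in for whatever lyse currently does to read a single shot file):

#!python

# Sketch only: parallelise per-file reading with a pool of worker threads.
# load_shot() is a stand-in for the existing single-file reading code.
from concurrent.futures import ThreadPoolExecutor

import h5py

def load_shot(filepath):
    with h5py.File(filepath, 'r') as f:
        # Placeholder: just list the top-level groups of the shot file.
        return list(f.keys())

def load_shots(filepaths, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_shot, filepaths))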


Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).


OK, so I did some digging in the code and was able to speed up file loading by a factor of 3(!!!) by changing just one line. In update_row() I changed

#!python

dataframe_row = dict(self.dataframe.iloc[df_row_index])

to

#!python

dataframe_row = self.dataframe.iloc[df_row_index].to_dict()

I don't know why this helps, but at least on my test machines it does.
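
For anyone who wants to reproduce this, here is a standalone timing comparison (a made-up dataframe of random numbers, not lyse's actual one):

#!python

# Standalone timing sketch: compare the two ways of turning a dataframe row
# into a dict. The dataframe here is random data, not lyse's real dataframe.
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(2000, 50))

print(timeit.timeit(lambda: dict(df.iloc[100]), number=1000))
print(timeit.timeit(lambda: df.iloc[100].to_dict(), number=1000))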


Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).


  • changed title from "Workers for shot loading" to "speed up shot loading"


Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).


OK, another thing that gives me a factor of 2 speed improvement: instead of using the filepath to reference the row that needs updating in update_row(), use the dataframe index, as this is 2 times faster. The filepath can then be retrieved from the dataframe row.

#!python

    @inmain_decorator()
    def update_row(self, filepath=None, dataframe_already_updated=False, status_percent=None, new_row_data=None, updated_row_data=None, df_row_index=None):
        """"Updates a row in the dataframe and Qt model
        to the data in the HDF5 file for that shot. Also sets the percent done, if specified"""
        # Update the row in the dataframe first:
        if df_row_index is None:
            if filepath is None:
                raise ValueError('Either df_row_index or filepath must be provided!')
            df_row_index = np.where(self.dataframe['filepath'].values == filepath)
            try:
                df_row_index = df_row_index[0][0]
            except IndexError:
                # Row has been deleted, nothing to do here:
                return

and

#!python

        # Update the data in the Qt model:
        dataframe_row = self.dataframe.iloc[df_row_index].to_dict()
        if filepath is None:
            filepath = dataframe_row['filepath', '']
        model_row_number = self.get_model_row_by_filepath(filepath)

A small speed improvement was also gained by reordering and rewriting add_files:

#!python

    @inmain_decorator()
    def add_files(self, filepaths, new_row_data):
        """Add files to the dataframe model. New_row_data should be a
        dataframe containing the new rows."""
        filepaths = set(filepaths)
        duplicates = set(self.dataframe['filepath'].values)-(set(self.dataframe['filepath'].values)-filepaths)
        for filepath in duplicates:
            app.output_box.output('Warning: Ignoring duplicate shot %s\n' % filepath, red=True)
            if new_row_data is not None:
                df_row_index = np.where(new_row_data['filepath'].values == filepath)
                new_row_data = new_row_data.drop(df_row_index[0])
                new_row_data.index = pandas.Index(range(len(new_row_data)))

        df_len = len(self.dataframe)
        self.dataframe = concat_with_padding(self.dataframe, new_row_data)
        self.update_column_levels()
        filepaths = new_row_data["filepath"].tolist()
        for i, filepath in enumerate(filepaths):
            # Add the new rows to the model:
            self._model.appendRow(self.new_row(filepath))
            vert_header_item = QtGui.QStandardItem('...loading...')
            self._model.setVerticalHeaderItem(self._model.rowCount() - 1, vert_header_item)
            self._view.resizeRowToContents(self._model.rowCount() - 1)
            self.update_row(dataframe_already_updated=True, df_row_index=i+df_len)
        self.renumber_rows()

To test speed I'm using a feature that I'm currently writing to export and import dataframes; I reload the same dataframe of 2000 shots every time (from one file).
At the beginning this took 45 s; with all the changes I've made I'm down to about 6 s.
I'm trying to reach 2 s, maybe 1 s.
Any more suggestions for improvement?
If this feature is of interest I could also contribute it to the main branch.


Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Parallelising HDF5 file reading with threads will not achieve anything, as h5py holds the Python GIL for all operations. The comment on line 1623 ("We open the HDF5 files here outside the GUI thread so as not to hang the GUI") is not (yet) true. h5py plan to remove the GIL holding at some point, but have not yet done so to my knowledge. They will also most likely replace it with their own lock, which means the GUI won't hang, but parallel HDF5 read operations likely won't yield much either. Furthermore, whilst the files are open, I suspect the bottleneck is likely to be disk and network, not Python code execution speed, meaning parallelisation won't yield much unless the files are on different disks or something. So even if the reading were parallelised with multiple processes rather than threads, I think it would still not result in much speedup.

As you've discovered though, there are lots of gains to be had outside of file reading. We are not particularly efficient with many of our dataframe operations, and I'm sure there are gains to be had in the interaction with the Qt model as well.

I'll have a look at your above suggestions!


Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


If you're posting code changes for discussion, it would be nice if you could post diffs for ease of seeing what has changed. You can run hg diff somefile.py from a terminal to get the diff text and post it here in a code environment with 'diff' as the language.

I don't think allowing update_row() to take a df_row_index argument is a good idea in general. The reason is that since update_row() can be called from outside the main thread, any thread computing a row index and then calling update_row() runs the risk of the row index having changed before update_row() actually runs. By accepting the filepath instead, update_row ensures that it never has out of date information, as the filepath cannot change.

However, if add_files() is the only caller using the df_row_index argument then that's fine, as add_files() itself runs in the main thread and so can be sure that nothing has changed in between its determination of the row number and the call to update_row. And performance optimisations are often compromises on making interfaces as pure as you'd like them to be, so I think I'm happy to accept this as a performance optimisation if it really is a significant speedup.

One way to prevent the race condition from biting anyone in the future if they use df_row_index argument from a non-main thread is to confirm that the filepath matches. You could have add_files() still pass in the filepath argument (rather than None), and then have update_row() have a check like:

#!python

# If df_row_index was used, check the row actually matches the filepath.
# This might be the case if the caller was not in the main thread, which
# is a situation vulnerable to race conditions:
assert filepath == dataframe_row['filepath', '']

This should be a factor of n faster than actually looking up the filepath, whilst ensuring race conditions don't cause invisible bugs in the future.

I can see though how the line

#!python

df_row_index = np.where(self.dataframe['filepath'].values == filepath)

could be a problem for performance. If you're calling it for every update, then it searches through the dataframe once for every file and so updating n files runs in quadratic time in n.

When you say a factor of two, do you mean that the update_row function is two times faster without this search than with it? If so that's quite a speedup!

There are other ways to optimise lookups rather than searching. For example in sql databases you can "index" a column to speed up searches for rows that have a particular value. I don't know if pandas supports anything like this (googling for "pandas index column" obviously gives many irrelevant results), but we could also do that manually by maintaining a dictionary containing a filepath:row_index mapping that is updated whenever a row is added or removed (or rows re-ordered which we don't currently support but might in future).
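
Something like this is what I mean by maintaining the mapping manually (a sketch only, not wired into the actual DataFrameModel code):

#!python

# Sketch: keep a filepath -> row index dictionary alongside the dataframe so
# that looking up a row is a hash lookup rather than a linear search.
class RowIndexCache(object):
    def __init__(self):
        self._row_by_filepath = {}

    def rebuild(self, dataframe):
        # Call whenever rows are added, removed or reordered:
        filepaths = dataframe['filepath'].values
        self._row_by_filepath = {fp: i for i, fp in enumerate(filepaths)}

    def lookup(self, filepath):
        # Returns None if there is no row for this filepath:
        return self._row_by_filepath.get(filepath)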

But your solution with the above added check should be entirely sufficient. If you hit upon similar slow searches through dataframes elsewhere in the code, though, this other approach might be worth considering. Pandas people seem to care a lot about performance, so it's possible that this functionality is built in somewhere if you can work out what to type into Google to find it.


Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Regarding the optimisation of add_files, using a set instead of searching individually is a great idea to speed things up, since set membership can be checked in constant time.

I was confused at first though, since you turn filepaths into a set, but then later get it as a list from the dataframe again (presumably to get the correct order, since sets don't preserve order). Could you give the set a different name (even as silly as appending _set to its name) to make it clear they have different purposes? I think the rest of that function looks fine; it's not clear to me why the other changes would result in any speedup, but they are harmless. The usage of a set I can definitely see prevents us from having to search the whole dataframe for each shot.
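
For example, just illustrating the naming (same logic, standalone with plain lists standing in for the dataframe column and the argument):

#!python

# Same duplicate check, with the set given a name that makes its purpose
# (fast membership testing) obvious. Plain lists stand in for the real data.
existing_filepaths = ['a.h5', 'b.h5', 'c.h5']  # stands in for self.dataframe['filepath'].values
incoming_filepaths = ['b.h5', 'd.h5']          # stands in for the filepaths argument

incoming_filepaths_set = set(incoming_filepaths)
duplicates = set(existing_filepaths) & incoming_filepaths_set
print(duplicates)  # {'b.h5'}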

Anyway, it's all looking good. These are welcome changes and you should make a pull request about them, though I'm guessing you're waiting on your other pull request to be merged first.


Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Another thing you might look at is in FileBox.incoming_buffer_loop(), where we batch-process HDF5 files in order to minimise the number of large dataframe concatenations. At the moment it is hard-coded to process 5 shots at a time, but you might want to experiment with making that larger, or a function of the number of shots already loaded, or something like that. 5 seemed to be the sweet spot for me when I was experimenting, but if dataframe concatenation is slow for very large numbers of shots then increasing this might be of benefit.
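
For instance, something along these lines (the name choose_batch_size and the numbers are made up, just to show what "a function of the number of shots already loaded" could look like):

#!python

# Hypothetical batch-size heuristic: keep small batches while the dataframe is
# small so shots appear promptly, grow the batch as concatenation gets costly.
def choose_batch_size(n_shots_loaded, minimum=5, maximum=100):
    return min(maximum, max(minimum, n_shots_loaded // 20))

print(choose_batch_size(0))      # 5
print(choose_batch_size(1000))   # 50
print(choose_batch_size(10000))  # 100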

Of course, profiling is the only way to know what bits of code should be looked at!


Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).


Will definitely do diffs in the future (or create a branch on my fork).

Well, update_row is currently called in two places, add_files and the analysis loop, so I don't think this is too bad.
A check is probably a good idea; I'll add that.

Yes, as I wrote already, I'm importing/exporting 2000-shot dataframes as a whole (one file) and see a speed increase of a factor of 2 there. And update_row is the only thing I'm editing currently (besides add_files), so I guess update_row as a whole becomes 2 times faster. I'm not quite sure how this affects loading 2000 individual files, but I would assume a speedup there as well, just smaller.

Pandas supports indexing (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html) and, as I've read, this also speeds up searches. You then reference rows via loc by that index, like df.loc[filepath, column]. But I'm not that great with pandas, so that's for someone else to do.
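
Something like this is what the docs describe (a standalone sketch, not lyse's actual dataframe with its multi-level columns):

#!python

# Standalone sketch: index the dataframe by filepath so that row lookup is a
# .loc access on the index rather than a linear search over a column.
import pandas as pd

df = pd.DataFrame({'filepath': ['a.h5', 'b.h5', 'c.h5'], 'x': [1, 2, 3]})
df = df.set_index('filepath', drop=False)

print(df.loc['b.h5', 'x'])  # 2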

The main slowness here, I think, is the string comparison though.

Yes, the set is unordered, and order seems better than chaos, so that's why I'm later using a list.

With the other changes I'm trying to reduce the number of loops, as they were not really needed.

Yes, I'm waiting on the Update Dataframe pull request in particular, as they both touch the update_row function and I don't really want to end up with conflicts again. Also, I'm sure there is more speed to be gained, so I'm not in a rush.

I looked at the incoming buffer but nothing really stuck out. I already wrote a workaround with the dataframe files that can be exported and imported. Is this something that could be of interest to others, or is the goal rather to improve loading of individual files?


Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Hm, I wonder if that dataframe indexing helps. Yes, the string comparison is slow, but I think only because it is being done once for every row in the dataframe. Whereas with indexing, if it's implemented well, it should be something like a dictionary lookup, which is a hash table and only does one string comparison (well, if you're lucky it does only one; it might do two or three, but definitely not n). Definitely something I'll keep in mind for future performance improvements!


Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


As for the exporting/importing thing, I'm not sure. What does your workaround do? It's for loading files in and out of lyse? I suppose "load the exact state you had before" can be cached without much difficulty, but how would you handle if a file has changed on disk or if different files are to be loaded? I'm open to all sorts of caches, but it's hard to get things like this right.

Oh by the way one more thing. At the moment we have some code running that makes sure we are not making calls to Qt outside the main thread. This is very important during development, as missing even a single call that is outside the main thread leads to extremely hard to debug bugs. However, this code runs every single time a Qt function is called, and so it runs a lot during update_row().

You can disable it by putting somewhere at the top level of __main__.py:

#!python

from qtutils import qtlock
qtlock.enforce(False)

Would be interesting to see if it makes a difference.


Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Oh, actually, reading the code for the qtlock thing, it looks like the checks only happen in non-main threads, so it won't speed up anything in update_row() since update_row() is in the main thread. It might speed up other code though, but other code is not the bottleneck!


Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).


Currently I'm not checking whether the files still exist/have changed etc. I'm just splitting the dataframe by sequence and exporting it into an HDF5 file at the press of a button. To avoid confusion with shot files, the HDF5 file is saved with the file ending .df. I'm using DataFrame.to_hdf and pandas.read_hdf for this.

This is mainly useful when measuring lots of sequences overnight. We then export the dataframe(s) and load in the individual sequences one after the other, as this is a lot faster than loading the shot files. Also, after things are done, we overwrite the old dataframe with an updated version.
As loading a dataframe like this only takes about 5 seconds, one can take a quick look at old data or quickly run some new routines without the hassle of waiting for up to 15 minutes.
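
The core of it is not much more than this (a trimmed-down sketch; the .df extension is just our own convention and 'dataframe' is just the key I happen to use):

#!python

# Trimmed-down sketch of the export/import round trip using pandas' HDF5 I/O
# (requires PyTables). 'dataframe' is just the key used within the file.
import pandas as pd

def export_dataframe(df, path):
    # path ends in .df so it isn't mistaken for a shot file:
    df.to_hdf(path, key='dataframe', mode='w')

def import_dataframe(path):
    return pd.read_hdf(path, key='dataframe')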

Also here is the result of your profiler running on add_files:
[example.png: BProfile output for add_files]
I haven't really looked at it much, but if I interpret it correctly, scientific_notation and get_model_row_by_filepath seem to be slow.


Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).


If someone's interested, my changes are now over at File-loading-Performance. I've improved update_row, add_files and get_model_row_by_filepath so far.

I think there is still a lot to gain with scientific_notation and update_row, so I'll wait until that is done before creating a pull request.


Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


There is also about 30% of update_row() missing in the percentages, which makes me think that that 30% is made up of many small things, none of which is above 5% of run time (BProfile has a default threshold of 5%, but you can set it smaller to see more detail). That, as well as the visible places it is spending time, makes me think that unfortunately we're running out of places to optimise without a more radical change. So chipping away at scientific_notation() could be good (possible low-hanging fruit: move the constants to the global namespace rather than defining them within the function?), but there are diminishing returns in optimising things more as they are.
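
By "move the constants out" I mean this general pattern (a generic illustration only, not lyse's actual scientific_notation):

#!python

# Generic illustration of hoisting constants out of a hot function, not the
# real scientific_notation: the prefix table is built once at import time
# rather than on every call.
import math

_PREFIXES = {-9: 'n', -6: 'u', -3: 'm', 0: '', 3: 'k', 6: 'M', 9: 'G'}

def si_notation(value):
    if value == 0:
        return '0'
    exponent = int(math.floor(math.log10(abs(value)) / 3.0)) * 3
    exponent = min(9, max(-9, exponent))
    return '%g %s' % (value / 10.0 ** exponent, _PREFIXES[exponent])

print(si_notation(0.0042))   # 4.2 m
print(si_notation(12500.0))  # 12.5 k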

So a more radical change would be to look into how Qt models work and subclass it to be backed by the dataframe itself, with the tooltips and scientific notation all being computed lazily only when the method to get that data was called (indicating the column was actually visible or the user had actually moused-over for a tooltip) and possibly cached. This was originally suggested by @philipstarkey when we began porting to Qt, and I don't recall why I didn't go for it. In any case the code is organised well enough that the change would be isolated to the DataFrameModel class (the name maybe hints that I was initially going to go with phil's suggestion but gave up because it looked too tricky?).
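
Very roughly, I mean something like this (a bare-bones sketch using a read-only QAbstractTableModel; the real DataFrameModel would also need editing, sorting, tooltips and so on, and I'm importing PyQt5 here purely for the sketch):

#!python

# Bare-bones sketch of a Qt model backed directly by the dataframe, formatting
# values lazily in data() instead of pre-building QStandardItems for every cell.
from PyQt5 import QtCore

class DataFrameBackedModel(QtCore.QAbstractTableModel):
    def __init__(self, dataframe, parent=None):
        QtCore.QAbstractTableModel.__init__(self, parent)
        self._df = dataframe

    def rowCount(self, parent=QtCore.QModelIndex()):
        return len(self._df)

    def columnCount(self, parent=QtCore.QModelIndex()):
        return len(self._df.columns)

    def data(self, index, role=QtCore.Qt.DisplayRole):
        # Only format the cells the view actually asks for:
        if index.isValid() and role == QtCore.Qt.DisplayRole:
            value = self._df.iat[index.row(), index.column()]
            return str(value)
        return None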

That could be something to look into for the future, but these optimisations should be included now anyway!

I'm a little skeptical about the profiling results, surprised to not see much in the way of Qt calls in there. Maybe Qt is just really fast, but it's also the case that sometimes the profiler doesn't catch certain things - like I'm not sure if the creation of constants like the dictionaries and string constants in scientific_notation would appear in the profiling results. I have another profiling tool that measures what percentage of the time is spent on each line of a given file rather than profiling in terms of function calls, which I have found to be super useful, but at the moment this tool is a pile of platform-specific hacks rather than something I could distribute. But I'll think about getting it into shape so we can do some better profiling since performance is so important to people with so many shots.

I think you should feel free to make a pull request even if you think something isn't "ready" yet. Pull requests are more visible in the bitbucket UI and are a nice way to see what things are "in progress" from other people's forks. Pull requests have the diff and list of commits visible and have nested comment threads (which issue threads frustratingly lack). They also have the source branch listed so people can see where to pull from for testing without you having to tell us. Just mention in the pull request that you're still looking for feedback and we won't merge - even if you don't mention anything, most things won't be merged without some comments anyway about testing etc. We can always just reject a pull request if an approach is abandoned.


Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).


Moving the variables to the global namespace didn't do much, nor did removing the try/except and replacing it with an if/else:

#!python

if exponent % 3 == 0 and -25 < exponent < 25:
    ...
else:
    ...

I guess scientific_notation is just that dominant, as it's called lots of times.

A "significant" gain can, however, be achieved by removing resizeRowToContents in add_files, but the gain (about 1 second) isn't great enough to justify removing it.

The main time-consuming factor in update_row seems to be the creation of all the QtGui.QStandardItems. This leads me to believe that rewriting the Qt model would most likely yield a much larger increase in speed. However, pandas and Qt are both not my strong suit, as I only first came into contact with them last month.

Luckily, performance problems with many shots are confined to lyse for the moment. But your profiling tool is/was definitely a big help in finding lines that need improvement.

I'll wait for the update dataframe pull request to go through and then create the pull request. The repos have been quite inactive in the last few weeks, so I'm not really expecting much response at the moment anyway.

philipstarkey added the enhancement and major labels on Apr 5, 2020