I use KQLMagic to train ML models, and for this I need the data, preferably as a pandas DataFrame.
My dataset has ~15 columns and ~2 million rows, and it takes about 2x longer to convert the already-retrieved data to pandas than it took to retrieve it.
Typical timings are:
query execution to completion (which includes the data fetch): ~2 min
result.to_dataframe(): ~5 min
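For reference, here is a minimal way these timings might be reproduced in a notebook cell (a sketch only; result stands for the KQLMagic query result object referenced above):

```python
import time

t0 = time.perf_counter()
df = result.to_dataframe()  # built-in conversion; ~5 min on 15 cols x 2M rows here
print(f"to_dataframe(): {time.perf_counter() - t0:.1f} s")
```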
(I am also working to grow the dataset to ~45-60 columns, and to get better precision I will most likely need to increase the row count to 3-5 million.) This makes to_dataframe() prohibitively expensive.
However, if instead of the built-in to_dataframe() I use pd.DataFrame.from_dict(results.to_dict()), it is significantly faster:
on the 15-column, 2-million-row dataset it completes in about 40 s (i.e., ~7-8x faster, and now comfortably faster than the data retrieval itself, as one would expect).
I see similar gains on the larger dataset, though that is still a work in progress; I have spent a few days figuring out how to speed up the cost-prohibitive conversion to a pandas DataFrame. A sketch of the workaround follows below.
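For clarity, a minimal sketch of the workaround described above (assuming results is the same KQLMagic result object; to_dict() and to_dataframe() are the methods named in this report):

```python
import pandas as pd

# Workaround reported above: serialize the raw result to a dict first,
# then build the DataFrame from it (~40 s vs ~5 min on the same dataset).
df = pd.DataFrame.from_dict(results.to_dict())
```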
It would be helpful to optimize the built-in to_dataframe() to be at least on par with the pd.DataFrame.from_dict(results.to_dict()) version.
Feel free to reach out to me internally on MS channels (vtsyganok); I can share notebooks and code examples internally.
Thank you,
Vadim