I use KQLMagic to train ML models, and for this I need the data, preferably as a pandas DataFrame.
My dataset has ~15 columns and ~2 million rows, and it takes about 2x longer to convert the already-retrieved data to pandas than it took to retrieve it.
Typical timings are:
query execution to completion (which includes the data fetch): ~2 min
result.to_dataframe(): ~5 min
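For reference, here is a minimal way these timings might be reproduced in a notebook cell (a sketch only; result stands for the KQLMagic query result object referenced above):

```python
import time

t0 = time.perf_counter()
df = result.to_dataframe()  # built-in conversion; ~5 min on 15 cols x 2M rows here
print(f"to_dataframe(): {time.perf_counter() - t0:.1f} s")
```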
(I am also working to grow the dataset to ~45-60 columns, and to get better precision I will most likely need to increase the row count to 3-5 million.) This makes to_dataframe() prohibitively expensive.
However, if instead of the built-in to_dataframe() I use pd.DataFrame.from_dict(results.to_dict()), it is significantly faster:
on the 15-column, 2-million-row dataset it completes in about 40 s (i.e., ~7-8x faster, and now comfortably faster than the data retrieval itself, as one would expect).
I see similar gains on the larger dataset, though that is still a work in progress; I have spent a few days figuring out how to speed up the cost-prohibitive conversion to a pandas DataFrame. A sketch of the workaround follows below.
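For clarity, a minimal sketch of the workaround described above (assuming results is the same KQLMagic result object; to_dict() and to_dataframe() are the methods named in this report):

```python
import pandas as pd

# Workaround reported above: serialize the raw result to a dict first,
# then build the DataFrame from it (~40 s vs ~5 min on the same dataset).
df = pd.DataFrame.from_dict(results.to_dict())
```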
It would be helpful to optimize the built-in to_dataframe() to be at least on par with the pd.DataFrame.from_dict(results.to_dict()) version.
Feel free to reach out to me internally on MS channels (vtsyganok); I can share notebooks and code examples internally.
Thank you,
Vadim