Pivot seems to not respect lazy evaluation #163

alberto-i · 2023-04-10T13:52:52Z

Hello, is this the expected behavior?

I'm running the code below, which compares a composition of groupBy, select and inflate against a single pivot call; both return the same result. Building the groupBy chain takes 0.235 ms, while the pivot call takes 146.8 ms, roughly 625 times slower (about 62,000%). The first call to toArray then takes 51.27 ms for the groupBy version and 34.456 ms for the pivot version, so the groupBy version takes about 49% longer at that step.

The dataset is a 1.5 MB CSV file containing 27k rows.

const dataForge = require('data-forge');
require('data-forge-fs');

let start = process.hrtime();

const elapsed_time = function(note) {
    const precision = 3; // 3 decimal places
    const [seconds, nanoseconds] = process.hrtime(start); // read the clock once so both parts agree
    const elapsed = nanoseconds / 1000000; // divide by a million to convert nanoseconds to milliseconds
    console.log(seconds + " s, " + elapsed.toFixed(precision) + " ms - " + note); // print time + message
    start = process.hrtime(); // reset the timer
}

const df = dataForge
    .readFileSync('./data.csv')
    .parseCSV({ dynamicTyping: true })
    .withIndex((row) => `${row.meeting_id}_${row.item_id}_${row.user_id}_${row.source_id}`)

elapsed_time('parsecsv')

const sintetico = df
    .groupBy((row) => `${row.meeting_id}_${row.item_id}_${row.vote}`)
    .select((group) => ({
        meeting_id: group.first().meeting_id,
        item_id: group.first().item_id,
        vote: group.first().vote,
        stock: group.deflate(row => row.stock).sum(),
    }))
    .inflate()

elapsed_time('groupBy, select, inflate')

const sinteticoPivot = df.pivot(['meeting_id', 'item_id', 'vote'], {
    stock: dataForge.Series.sum
})

elapsed_time('pivot')

const data = sintetico.head(5).toArray()

elapsed_time('groupBy, select, inflate => toArray')

const data2 = sintetico.head(5).toArray()

elapsed_time('groupBy, select, inflate => toArray again')

const data3 = sinteticoPivot.head(5).toArray()

elapsed_time('pivot => toArray')

const data4 = sinteticoPivot.head(5).toArray()

elapsed_time('pivot => toArray again')

These are the outputs:

0 s, 183.236 ms - parsecsv
0 s, 0.235 ms - groupBy, select, inflate
0 s, 146.789 ms - pivot
0 s, 51.270 ms - groupBy, select, inflate => toArray
0 s, 1.200 ms - groupBy, select, inflate => toArray again
0 s, 34.456 ms - pivot => toArray
0 s, 13.261 ms - pivot => toArray again

Is this intended? Should I dig deeper to fix it and make a pull request?

Thanks,


alberto-i · 2023-04-10T14:40:54Z

The performance problem seems to be in the orderBy block inside pivot. My original groupBy version does no ordering.

With the orderBy block in the pivot method bypassed, these are the timings:

0 s, 157.930 ms - parsecsv
0 s, 0.131 ms - groupBy, select, inflate
0 s, 0.715 ms - pivot
0 s, 65.004 ms - groupBy, select, inflate => toArray
0 s, 2.217 ms - groupBy, select, inflate => toArray again
0 s, 98.662 ms - pivot => toArray
0 s, 6.909 ms - pivot => toArray again
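
Because lazy pipelines defer their work, the build time alone can be misleading; adding the build time to the first toArray time gives a fairer total per approach. The snippet below is only a small arithmetic sketch over the numbers reported in this thread. The object layout and the names `runs`, `total` and `totals` are made up for illustration and are not part of data-forge.

```javascript
// Timings in milliseconds, copied from the two runs reported in this thread.
const runs = {
    original: {
        groupBy: { build: 0.235, firstToArray: 51.270 },
        pivot: { build: 146.789, firstToArray: 34.456 },
    },
    orderByBypassed: {
        groupBy: { build: 0.131, firstToArray: 65.004 },
        pivot: { build: 0.715, firstToArray: 98.662 },
    },
};

// Total cost = time to build the pipeline + time to materialize it the first time.
function total(t) {
    return t.build + t.firstToArray;
}

const totals = {};
for (const [run, approaches] of Object.entries(runs)) {
    totals[run] = {};
    for (const [name, t] of Object.entries(approaches)) {
        totals[run][name] = total(t).toFixed(3);
        console.log(`${run} / ${name}: ${totals[run][name]} ms total`);
    }
}
```

On these single, unrepeated measurements, bypassing orderBy appears to move most of the pivot cost from the pivot call into the first toArray call rather than removing it entirely, so averaging over several runs would be needed before drawing firm conclusions.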

ashleydavis · 2023-04-16T00:48:00Z

orderBy forces the entire data set to be evaluated, so it can be quite slow.

If you can improve its performance, I'm happy to accept a PR.
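
To illustrate why a sort step defeats lazy evaluation in general, here is a minimal sketch using plain JavaScript generators. It is not data-forge's implementation, and the helpers `source`, `lazyMap`, `sorted` and `take` are hypothetical names invented for this example. It only shows that a row-by-row transform can stop after five rows, while a sort has to read every row before it can emit the first one.

```javascript
// Counts how many source rows are actually pulled from the generator.
let pulled = 0;

// Hypothetical data source: yields n rows one at a time.
function* source(n) {
    for (let i = 0; i < n; i++) {
        pulled++;
        yield { id: i, stock: (i * 7) % 10 };
    }
}

// Lazy step: transforms rows one at a time, only when asked.
function* lazyMap(iterable, fn) {
    for (const row of iterable) {
        yield fn(row);
    }
}

// Sorting step: must read every row before it can yield the first one.
function* sorted(iterable, keyFn) {
    const all = Array.from(iterable); // full evaluation happens here
    all.sort((a, b) => keyFn(a) - keyFn(b));
    yield* all;
}

// Pulls at most `count` rows, similar in spirit to head(count).toArray().
function take(iterable, count) {
    const out = [];
    if (count <= 0) return out;
    for (const row of iterable) {
        out.push(row);
        if (out.length >= count) break;
    }
    return out;
}

pulled = 0;
take(lazyMap(source(27000), (row) => row), 5);
const pulledLazy = pulled; // only the first 5 rows were read

pulled = 0;
take(sorted(source(27000), (row) => row.stock), 5);
const pulledSorted = pulled; // all 27000 rows were read

console.log({ pulledLazy, pulledSorted });
```

Any fix along these lines would presumably need to either defer the sort until the result is first enumerated or avoid sorting when the caller does not need ordered output; which of these fits data-forge's design is left open here.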
