batch bulk insert gaps #263

Open

Gabilabrebi opened this issue Nov 30, 2023 · 13 comments

@Gabilabrebi

Gabilabrebi commented Nov 30, 2023

When using batch to bulk insert, gaps appear in the AUTO_INCREMENT IDs once the batch reaches 50+ inserts, even with innodb_autoinc_lock_mode set to 1 (consecutive).

The MySQL documentation says:

With innodb_autoinc_lock_mode set to 0 (“traditional”) or 1 (“consecutive”), the auto-increment values generated by any given statement are consecutive, without gaps, because the table-level AUTO-INC lock is held until the end of the statement, and only one such statement can execute at a time.

So I'm not sure that's intended. Since insertId only returns one ID, I think it's very important not to have gaps in the IDs so we can deduce the inserted ones.
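For context, a minimal sketch of what deducing the inserted IDs from the single insertId would look like, assuming gap-free AUTO_INCREMENT values and a BigInt insertId; the table and connection settings are made up for illustration:

    const mariadb = require('mariadb');

    async function demo() {
      const conn = await mariadb.createConnection({ host: 'localhost', user: 'app', database: 'test' });
      // hypothetical table: items(id INT AUTO_INCREMENT PRIMARY KEY, val VARCHAR(20))
      const rows = [['a'], ['b'], ['c']];
      const res = await conn.batch('INSERT INTO items(val) VALUE (?)', rows);

      // res.insertId reports only one id (the first one generated).
      // If the AUTO_INCREMENT values were guaranteed to be consecutive,
      // the remaining ids could be deduced from it:
      const ids = rows.map((_, i) => res.insertId + BigInt(i));

      await conn.end();
      return ids; // gaps created by the bulk insert break this assumption
    }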

@rusher
Collaborator

rusher commented Dec 5, 2023

Actually, I'm not sure reconstructing the insert IDs is the way to go.
If the IDs are needed, it is better to disable bulk by setting the option bulk to false. This still allows all the IDs to be retrieved, even if it is slower (it remains faster than one normal INSERT per set of parameters, because of pipelining).
In the near future, this will also be possible with bulk, see https://jira.mariadb.org/browse/MDEV-30366
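A minimal sketch of what that might look like with the connector's connection-level bulk option (host, credentials, and table names are illustrative; per the comment above, disabling bulk makes batch() fall back to pipelined individual inserts so all the IDs can be retrieved):

    const mariadb = require('mariadb');

    async function insertWithoutBulk(rows) {
      // bulk: false disables the bulk protocol for batch(); the connector then
      // pipelines one INSERT per parameter set, slower but with ids reported per row.
      const conn = await mariadb.createConnection({
        host: 'localhost',
        user: 'app',
        database: 'test',
        bulk: false
      });
      try {
        // hypothetical table: orders_contents(id AUTO_INCREMENT, val VARCHAR(20))
        return await conn.batch('INSERT INTO orders_contents(val) VALUE (?)', rows);
      } finally {
        await conn.end();
      }
    }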

@crushton-qg

@rusher Hi, I am currently investigating an issue in our DB where we have gaps in the inserted auto-increment IDs when using the batch function. We have managed to recreate the bug twice; the attached image shows that we ran two separate batch queries with the same 5 rows of data, and on both occasions the same ID gaps occurred at the same point.

I just want to check whether this is a bug in the batch function itself, or whether I need to investigate this further inside our DB / code config.

(Image attached: Screenshot 2024-04-17 at 16 00 31)

@crushton-qg

@rusher Forgot to add: in the table shown in the image above we currently have about 400k records, but the highest ID is over 500k, meaning roughly a 20% difference between the row count and the IDs.

@crushton-qg

@Gabilabrebi did you manage to find a fix / resolution for this?

@Gabilabrebi
Author

Gabilabrebi commented Apr 18, 2024

@crushton-qg Yeah, I just stopped using batch.

const contentInserts = [], contentParameters = [];
for (let i = 0; i < order.content.length; i++) {
  const x = order.content[i];
  contentInserts.push("(DEFAULT, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 0)");
  contentParameters.push(place, orderId, i, x.item, x.version, x.comment, x.quantity, x.price, x.a, x.b, x.c);
}

db.query(`INSERT INTO orders_contents VALUES ${contentInserts.join(",")}`, contentParameters);

Something like this basically achieves the same thing for me, but without any ID issues. It's still one query, and performance is pretty much the same as far as I tested.

@crushton-qg

crushton-qg commented Apr 18, 2024

@Gabilabrebi Sorry to be a pain, can you provide a full code sample please? I have tried to replicate what you have but I cannot get it to work. I am struggling to follow your sample where you are providing multiple (DEFAULT, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 0) in your query string.

I have an array of arrays, each inner array containing the row data. I have changed the table name and columns to generic names/data.

I have tried adapting my query to:

            var contentInserts = [], contentParameters = [];
            for (const row of row_array)
            {
                contentInserts.push("(?,?,?,?,?,?,?,?,?)");
                contentParameters.push(row);
            }
            var dlo_rule_log_insert = await conn.query(`INSERT INTO table VALUES ${contentInserts.join(",")}`, contentParameters);

but I am getting an error every time:
{
"name": "SqlError",
"sqlMessage": "Parameter at position 3619 is not set",
"sql": "INSERT INTO table VALUES (?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(...",
"fatal": false,
"errno": 45016,
"sqlState": "HY000",
"code": "ER_MISSING_PARAMETER"
}

@Gabilabrebi
Author

@crushton-qg I guess you just need to change

contentParameters.push(row);
to
contentParameters.push(...row);
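To spell out why: the parameters must end up as a single flat array whose length matches the number of ? placeholders, and spreading each row appends its values individually. A sketch using the generic names from the earlier snippet:

    const contentInserts = [];
    const contentParameters = [];
    for (const row of row_array) {
      contentInserts.push('(?,?,?,?,?,?,?,?,?)');
      // spread the row so its 9 values are pushed one by one,
      // keeping contentParameters flat and aligned with the placeholders
      contentParameters.push(...row);
    }
    await conn.query(`INSERT INTO table VALUES ${contentInserts.join(',')}`, contentParameters);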

@crushton-qg

@Gabilabrebi that was it, brilliant thanks for the help!

@Gabilabrebi
Author

@rusher Does using bulk = false today ensure there are no gaps between IDs?

Is there any documentation on this option and what exactly it does?

I'm trying to use .batch again, but I still need contiguous IDs.

Regards

@rusher
Collaborator

rusher commented May 29, 2024

This is a little more complex than that. I wouldn't rely on IDs without gaps: servers ensure uniqueness, not strict "sequencing".

Servers have an auto_increment_increment setting, defaulting to 1, that indicates the step between consecutive IDs.
For a single server, after each INSERT command, IDs will increment by one.
This is not always the case: for Galera servers, for example (multi-master), each server has an increment offset, so the servers don't have to synchronize the current ID value on each INSERT command. For example, with 3 Galera servers (increment of 3),
2 INSERTs on server A (offset 0), 3 on server B (offset +1) and 1 on server C (offset +2):
IDs created on server A will be 1 and 4.
IDs created on server B will be 2, 5 and 8.
IDs created on server C will be 3.
So you'll have the resulting IDs 1, 2, 3, 4, 5, 8. Once synchronized, the next ID on server A will be 10.

The bulk option corresponds to a real server-side batch command. I'm not completely sure how this is implemented for bulk; the server folks would be better placed to help with the details, but what I suppose is that, in order to deal with multiple connections inserting into the same table, the server can reserve a block of IDs when using bulk, and that might result in gaps in the IDs.
For example, the server might reserve a block of 128 IDs while the bulk finally results in only 100 inserts, and during that time other INSERTs might have been done. This is always for the sake of not blocking other connections when there is no need.

So, yes, if using only one server, no Galera, and without bulk, you will normally not have any gaps, but I would clearly not rely on that.
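If in doubt about what a particular server will do, the relevant increment settings can be inspected from the same connection (a minimal sketch; these are the standard MariaDB system variables):

    // auto_increment_increment: step between generated ids (3 in the Galera example above)
    // auto_increment_offset: per-server starting offset
    // innodb_autoinc_lock_mode: the lock mode mentioned at the top of this issue
    const vars = await conn.query(
      'SELECT @@auto_increment_increment AS inc, @@auto_increment_offset AS off, @@innodb_autoinc_lock_mode AS lock_mode'
    );
    console.log(vars[0]); // e.g. { inc: 1, off: 1, lock_mode: 1 } on a typical standalone server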

@Gabilabrebi
Author

@rusher Thank you for your answer.

In my case I have tables A, B, and C.
I insert many rows in A, many rows in B for each A with a reference to A.id, and many rows in C for each B with a reference to B.id.

It's currently 3 queries: one for A, one for B, and one for C, but yes, I rely on contiguous IDs.

If I shouldn't rely on contiguous IDs, do I have to make tons of single queries in loops to ensure the proper relations? Or is there a better alternative?

@rusher
Collaborator

rusher commented May 30, 2024

There are various solutions, but all of them rely on knowing the real insert IDs.
For example, you can use a sequence to handle the IDs yourself; another solution is simply to take the auto-increment IDs into account.

For batch, the fullResult option permits retrieving all the IDs.
Here is an example (it could be simplified, I just wrote it quickly):

    await shareConn.query('DROP TABLE IF EXISTS B');
    await shareConn.query('DROP TABLE IF EXISTS A');

    const jsonVal = {
      Alan: ['Al'],
      Archibald: ['Archie', 'Baldo'],
      Benjamin: ['Ben', 'Bennie', 'Benjy', 'Jamie']
    };

    await shareConn.query('CREATE TABLE A(id int not null primary key auto_increment, val varchar(20))');
    await shareConn.query(
      'create table B (id int not null primary key auto_increment, nickname varchar(20), id_refA int, foreign key (id_refA) references A(id))'
    );

    const keys = Object.keys(jsonVal).map((x) => [x]);
    //keys : [ [ 'Alan' ], [ 'Archibald' ], [ 'Benjamin' ] ]

    const res = await shareConn.batch({ sql: 'INSERT INTO A(val) value (?)', fullResult: true }, keys);
    // res= [
    //   OkPacket { affectedRows: 1, insertId: 1n, warningStatus: 0 },
    //   OkPacket { affectedRows: 1, insertId: 2n, warningStatus: 0 },
    //   OkPacket { affectedRows: 1, insertId: 3n, warningStatus: 0 }
    // ]
    const aliases = keys
      .map((x, idx) => {
        const key = x[0];
        return jsonVal[key].map((alias) => [alias, res[idx].insertId]);
      })
      .flat();
    console.log(aliases);
    // aliases = [
    //   [ 'Al', 1n ],
    //   [ 'Archie', 2n ],
    //   [ 'Baldo', 2n ],
    //   [ 'Ben', 3n ],
    //   [ 'Bennie', 3n ],
    //   [ 'Benjy', 3n ],
    //   [ 'Jamie', 3n ]
    // ]

    const res2 = await shareConn.batch(
      { sql: 'INSERT INTO B(nickname, id_refA) value (?, ?)', fullResult: true },
      aliases
    );

@Gabilabrebi
Author

Oh, I didn't know about this fullResult: true option.

This is exactly what I was looking for. With this, there is no more need for contiguous IDs. I just add the option when I want the IDs, and can still benefit from the full speed of batch when I don't.

Thanks a ton!
