Clean md store objects #8700

Open: vh05 wants to merge 1 commit into master
Conversation

@vh05 vh05 (Contributor) commented Jan 20, 2025

Clean md store objects once the number of deleted objects reaches the limit.

Fixes: https://issues.redhat.com/browse/DFBUGS-1339

Signed-off-by: Vinayakswami Hariharmath <[email protected]>
@@ -58,23 +58,39 @@ async function clean_md_store(last_date_to_remove) {
${total_objects_count} objects - Skipping...`);
return;
}
const objects_to_remove = await clean_md_store_objects(last_date_to_remove, config.DB_CLEANER_DOCS_LIMIT);
Contributor:

Did you rearrange the same code into smaller functions? I'm not sure it's needed.

Contributor Author (@vh05):

Yes. We are looking to clean only the deleted md objects, and the other calls in the function are not needed, so I divided the function into 3 sub-functions.

dbg.log2('DB_CLEANER: list objects:', objects_to_remove);
if (objects_to_remove.length) {
if (objects_to_remove.length > config.MD_STORE_MAX_DELETED_OBJECTS_LIMIT) {
Contributor:

Why are we adding this limit? Why not delete less?

Contributor Author (@vh05):

What would be a good number?

Contributor:

I think the number is arbitrary, but we can go with 10 or 20, for example.

@@ -42,6 +43,9 @@ class ObjectsReclaimer {
if (has_errors) {
return config.OBJECT_RECLAIMER_ERROR_DELAY;
}

await clean_md_store_objects(Date.now());
@jackyalbo jackyalbo (Contributor) commented Jan 28, 2025:

I don't think this belongs here... why should we run db_cleaner inside object_reclaimer? db_cleaner will run on objects with no regard to the list that was reclaimed (and of course, we don't want to completely db-delete objects that were just marked as deleted!). I think what we wanted here is new functionality that, for each of the objects, will check how many deleted objects we have with the same key and, if there are more than X, delete the older copies. @dannyzaken to keep me honest here...

Contributor Author (@vh05):

The Jira issue (https://issues.redhat.com/browse/DFBUGS-1339) specifically mentioned that we can clean the deleted objects from the md store in the object reclaimer, so I thought this is the place. @dannyzaken please correct me here.

Member:

@vh05 I didn't intend for the description in Jira to be specific instructions on how to fix it. I wanted to provide some general context.
I don't see a reason to call the db_cleaner from the object_reclaimer. These are two separate bg_workers.

@dannyzaken (Member):

@vh05. I think this PR does not do what it is supposed to do. We want to keep up to SOME_CONSTANT deleted object_md per key, not keep SOME_CONSTANT of total deleted object_mds.
This is mainly to handle the example of overwriting an object. Let's say you upload an object with the key foo every few seconds. You will have many rows with the same key in a short time (all except one should be deleted). We want to limit the number of these rows (I think 100 is a reasonable limit to start with).

In your implementation, you delete any object_md that is deleted and only keep 100 in total. This is a bit too aggressive, in my opinion.

Another thing to consider is that you probably want to ignore objectmds that are not marked as reclaimed.

I strongly suggest creating a similar case on a local deployment. You can easily produce a dataset with >100 overwrites of the same key, so you can test your code.
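To make the suggestion concrete, below is a minimal sketch of the per-key cleanup described above. The function and helper names (clean_overwritten_object_mds, aggregate_deleted_objects_per_key, db_delete_object_mds) and the constant are assumptions for illustration only, not existing md_store APIs.

// A minimal sketch of the "keep up to N deleted object_mds per key" idea.
// All helper names and the constant below are hypothetical, for illustration only.
const MAX_DELETED_MDS_PER_KEY = 100;

async function clean_overwritten_object_mds(md_store, system_id) {
    // Hypothetical aggregation: group soft-deleted (and reclaimed) object_mds
    // by (bucket, key) and return their _ids sorted oldest-first per group.
    const groups = await md_store.aggregate_deleted_objects_per_key(system_id);
    for (const group of groups) {
        if (group.ids_oldest_first.length <= MAX_DELETED_MDS_PER_KEY) continue;
        // Keep the newest MAX_DELETED_MDS_PER_KEY rows for this key, hard-delete the rest.
        const ids_to_remove = group.ids_oldest_first.slice(0, group.ids_oldest_first.length - MAX_DELETED_MDS_PER_KEY);
        await md_store.db_delete_object_mds(ids_to_remove);
    }
}

The aggregation itself is the hard part; the count_objects_per_bucket example in the next comment shows the general shape such an md_store query would take.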

@dannyzaken (Member):

You will probably need to implement new functions in md_store. You can take this function as an example:

    async count_objects_per_bucket(system_id) {
        // TODO check which index is needed to cover this aggregation
        const res = await this._objects.groupBy({
            system: system_id,
            deleted: null,
            delete_marker: null,
            version_past: null
        }, {
            _id: '$bucket',
            count: {
                $sum: 1
            }
        });
        const buckets = {};
        let total_count = 0;
        _.forEach(res, r => {
            buckets[r._id] = r.count;
            total_count += r.count;
        });
        buckets[''] = total_count;
        return buckets;
    }

It is probably not sufficient, and you need to aggregate more data than this (for example, you will need to keep track of the _ids of objects to delete/keep).
Notice that the md_store is written in Mongo query language and is translated to Postgres JSONB queries in the postgres_client:

    async groupBy(match, group) {
        const WHERE = mongo_to_pg('data', encode_json(this.schema, match), { disableContainmentQuery: true });
        const P_GROUP = this._prepare_aggregate_group_query(group);
        try {
            const res = await this.single_query(`SELECT ${P_GROUP.SELECT} FROM ${this.name} WHERE ${WHERE} GROUP BY ${P_GROUP.GROUP_BY}`);
            return res.rows.map(row => { // this is a temp fix, as all the keys are supposed to be ints except _id
                const new_row = {};
                for (const key of Object.keys(row)) {
                    if (key === '_id') {
                        new_row._id = new mongodb.ObjectID(row[key]);
                    } else {
                        new_row[key] = parseInt(row[key], 10);
                    }
                }
                return new_row;
            });
        } catch (err) {
            dbg.error('groupBy failed', match, group, WHERE, P_GROUP, err);
            throw err;
        }
    }

This translation is far from perfect, and we should verify that the generated queries work as expected and perform reasonably well.
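As a rough illustration of the kind of md_store aggregation this bug might need, here is a sketch in the same style as count_objects_per_bucket above. The method name, the filter on deleted/reclaimed, and the compound _id grouping are all assumptions; in particular, the groupBy translation shown above assumes _id is an ObjectId and every other returned column is an integer, so a compound key (or a $push of _ids) would need changes in postgres_client.

    async count_deleted_objects_per_key(system_id) {
        // Hypothetical: count soft-deleted, reclaimed object_mds per (bucket, key).
        const res = await this._objects.groupBy({
            system: system_id,
            deleted: { $ne: null },   // only soft-deleted rows (assumed filter shape)
            reclaimed: { $ne: null }  // ignore rows not yet reclaimed (assumed field name)
        }, {
            _id: { bucket: '$bucket', key: '$key' },
            count: { $sum: 1 }
        });
        // Expected shape, if the translation supports it: [{ _id: { bucket, key }, count }, ...]
        return res;
    }

An alternative to collecting _ids inside the aggregation is to run the groupBy only for counts and then issue a second, ordinary find per offending key to fetch the _ids to delete.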

@vh05 vh05 (Contributor) commented Jan 30, 2025:

> @vh05. I think this PR does not do what it is supposed to do. We want to keep up to SOME_CONSTANT deleted object_md per key, not keep SOME_CONSTANT of total deleted object_mds. This is mainly to handle the example of overwriting an object. Let's say you upload an object with the key foo every few seconds. You will have many rows with the same key in a short time (all except one should be deleted). We want to limit the number of these rows (I think 100 is a reasonable limit to start with).
>
> In your implementation, you delete any object_md that is deleted and only keep 100 in total. This is a bit too aggressive, in my opinion.
>
> Another thing to consider is that you probably want to ignore objectmds that are not marked as reclaimed.
>
> I strongly suggest creating a similar case on a local deployment. You can easily produce a dataset with >100 overwrites of the same key, so you can test your code.

  1. If we delete the objects that are already deleted (keeping only reclaimed), would that harm us, or is there any reason we keep the deleted objects in the md_store?

  2. If the number of deleted objects is unusually high on a regular basis, there is a high chance of overwrites on a particular key, and if we delete them after a certain limit, we are actually deleting the deleted objects that resulted from overwrites. Isn't it?

  3. As you said, deleting the object_mds of a particular key requires tracking the bucket and then the object. Isn't that too costly an operation?

@dannyzaken (Member):

> 1. If we delete the objects that are already deleted (keeping only reclaimed), would that harm us, or is there any reason we keep the deleted objects in the md_store?

We perform a soft delete (mark the deleted timestamp) since it can help debug or fix issues with deleted objects. The data is still in the DB and not gone forever.
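For context, a minimal sketch of what the soft delete amounts to conceptually; the update call and field layout here are assumptions, not the actual md_store code:

// Soft delete sketch: the row stays in the objectmds collection and only gains
// a 'deleted' timestamp; db_cleaner later hard-deletes old rows by date.
async function soft_delete_object_md(md_store_objects, obj_id) {
    await md_store_objects.updateOne(
        { _id: obj_id },
        { $set: { deleted: new Date() } }
    );
}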

> 2. If the number of deleted objects is unusually high on a regular basis, there is a high chance of overwrites on a particular key, and if we delete them after a certain limit, we are actually deleting the deleted objects that resulted from overwrites. Isn't it?

I'm not sure I understand your point. We don't want to remove deleted rows arbitrarily after reaching some limit. We specifically want to handle this overwrite use case, where a single object is frequently overwritten.
For the general cleanup of deleted rows, the db_cleaner deletes them by date, which is good enough for now, and we can tweak it if necessary. This is not in the scope of this bug.

> 3. As you said, deleting the object_mds of a particular key requires tracking the bucket and then the object. Isn't that too costly an operation?

Of course, we need to profile it once the code is ready. We can run EXPLAIN queries once we have a working query and analyze the performance.
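For the profiling step, one option is to wrap the generated SQL with EXPLAIN. A sketch, assuming access to the same helpers used in the groupBy above (mongo_to_pg, encode_json, _prepare_aggregate_group_query, single_query); the wrapper function itself is hypothetical:

    async explain_group_by(match, group) {
        const WHERE = mongo_to_pg('data', encode_json(this.schema, match), { disableContainmentQuery: true });
        const P_GROUP = this._prepare_aggregate_group_query(group);
        const sql = `SELECT ${P_GROUP.SELECT} FROM ${this.name} WHERE ${WHERE} GROUP BY ${P_GROUP.GROUP_BY}`;
        // EXPLAIN (ANALYZE, BUFFERS) actually runs the query and reports timing,
        // buffer usage, and whether the JSONB predicates hit an index or a seq scan.
        const res = await this.single_query(`EXPLAIN (ANALYZE, BUFFERS) ${sql}`);
        for (const row of res.rows) dbg.log2('explain_group_by:', row['QUERY PLAN']);
    }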

@vh05 vh05 (Contributor) commented Jan 31, 2025:


Sure, Danny. That clarifies my queries.
