precondition failed for splits #5431

Open
sergeyignatov opened this issue Sep 17, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@sergeyignatov

sergeyignatov commented Sep 17, 2024

Hi,
We have 4 indexes in Quickwit; all of them are stored in GCS, Quickwit version 0.8.2.
We recently started seeing an issue with the biggest index, fluentbit-logs-2024-09-12, where splits older than 1h are automatically removed. The other indexes are fine. The issue did not go away after recreating the index.

The affected index has far more Staged splits than the other indexes.

select count(*), split_state, index_uid from splits  group by index_uid, split_state order by index_uid;
 count |    split_state    |                          index_uid
-------+-------------------+--------------------------------------------------------------
  1941 | Staged            | fluentbit-logs-2024-09-12:01J7MNHZMJSJJ80B2JN40CE12J
   485 | Published         | fluentbit-logs-2024-09-12:01J7MNHZMJSJJ80B2JN40CE12J
  9851 | MarkedForDeletion | fluentbit-logs-2024-09-12:01J7MNHZMJSJJ80B2JN40CE12J
  8349 | MarkedForDeletion | fluentbit-logs-connect-2024-02-05:01HSA1Z6D9N74GPZ0DX9B3RVYV
   902 | Published         | fluentbit-logs-connect-2024-02-05:01HSA1Z6D9N74GPZ0DX9B3RVYV
    16 | Staged            | fluentbit-logs-connect-2024-02-05:01HSA1Z6D9N74GPZ0DX9B3RVYV
  9156 | MarkedForDeletion | fluentbit-logs-sk-2024-01-30:01HNDX8JAKT5A6A06BFVZF8TC2
  1020 | Published         | fluentbit-logs-sk-2024-01-30:01HNDX8JAKT5A6A06BFVZF8TC2
    15 | Staged            | fluentbit-logs-sk-2024-01-30:01HNDX8JAKT5A6A06BFVZF8TC2

The only errors in the indexer logs are:

2024-09-16T07:13:19Z 2024-09-16T07:13:19.603Z DEBUG merge{merge_split_id=01J7WTQKDHE1DSZJCC8HZZBM6V split_ids=["01J7WTQ9BFBHCHRXT9MXZ265DA", "01J7WTKJSX9DZ5GACD8GGK3E5T", "01J7WTFVCS7S8NBTFRCN0ZNWAB", "01J7WTCEM5PXZ6BCCN8CP9YG2B", "01J7WT8X7M3YNXWTYB5SVQM6X9", "01J7WT5NG5PNZ571V6PGBMH5TZ", "01J7WT2D80WANG4V0EN4FVZHKF", "01J7WSZ07FDD1DB8SE1M8ZX2ZT", "01J7WSVJXY8G25A2Y99V580GNC", "01J7WSQS4WYMB8V98SJSA89BP4"] typ=Merge}:uploader:upload{split=01J7WTQKDHE1DSZJCC8HZZBM6V}:store_split: quickwit_indexing::split_store::indexing_split_store: store-split-remote-success split_size_in_megabytes=2336.5374 num_docs=5074840 elapsed_secs=34.93994 throughput_mb_s=66.87296 is_mature=false
2024-09-16T07:13:19Z 2024-09-16T07:13:19.603Z DEBUG merge{merge_split_id=01J7WTQKDHE1DSZJCC8HZZBM6V split_ids=["01J7WTQ9BFBHCHRXT9MXZ265DA", "01J7WTKJSX9DZ5GACD8GGK3E5T", "01J7WTFVCS7S8NBTFRCN0ZNWAB", "01J7WTCEM5PXZ6BCCN8CP9YG2B", "01J7WT8X7M3YNXWTYB5SVQM6X9", "01J7WT5NG5PNZ571V6PGBMH5TZ", "01J7WT2D80WANG4V0EN4FVZHKF", "01J7WSZ07FDD1DB8SE1M8ZX2ZT", "01J7WSVJXY8G25A2Y99V580GNC", "01J7WSQS4WYMB8V98SJSA89BP4"] typ=Merge}:uploader:upload{split=01J7WTQKDHE1DSZJCC8HZZBM6V}:store_split: quickwit_indexing::split_store::indexing_split_store: store-in-cache
2024-09-16T07:13:19Z 2024-09-16T07:13:19.604Z DEBUG merge{merge_split_id=01J7WTQKDHE1DSZJCC8HZZBM6V split_ids=["01J7WTQ9BFBHCHRXT9MXZ265DA", "01J7WTKJSX9DZ5GACD8GGK3E5T", "01J7WTFVCS7S8NBTFRCN0ZNWAB", "01J7WTCEM5PXZ6BCCN8CP9YG2B", "01J7WT8X7M3YNXWTYB5SVQM6X9", "01J7WT5NG5PNZ571V6PGBMH5TZ", "01J7WT2D80WANG4V0EN4FVZHKF", "01J7WSZ07FDD1DB8SE1M8ZX2ZT", "01J7WSVJXY8G25A2Y99V580GNC", "01J7WSQS4WYMB8V98SJSA89BP4"] typ=Merge}:uploader: quickwit_actors::actor_context: from=MergeUploader-ancient-MtKx send=MergePublisher-cold-XqZ7 msg=SplitsUpdate { index_id: "fluentbit-logs-2024-09-12", new_splits: "01J7WTQKDHE1DSZJCC8HZZBM6V", checkpoint_delta: None }
2024-09-16T07:13:19Z 2024-09-16T07:13:19.635Z ERROR quickwit_actors::spawn_builder: actor-failure cause=failed to publish splits
2024-09-16T07:13:19Z
2024-09-16T07:13:19Z Caused by:
2024-09-16T07:13:19Z precondition failed for splits `01J7WT2D80WANG4V0EN4FVZHKF, 01J7WSQS4WYMB8V98SJSA89BP4, 01J7WSVJXY8G25A2Y99V580GNC, 01J7WSZ07FDD1DB8SE1M8ZX2ZT`: splits are not marked for deletion exit_status=Failure(failed to publish splits
2024-09-16T07:13:19Z
2024-09-16T07:13:19Z Caused by:
2024-09-16T07:13:19Z precondition failed for splits `01J7WT2D80WANG4V0EN4FVZHKF, 01J7WSQS4WYMB8V98SJSA89BP4, 01J7WSVJXY8G25A2Y99V580GNC, 01J7WSZ07FDD1DB8SE1M8ZX2ZT`: splits are not marked for deletion)
2024-09-16T07:13:19Z 2024-09-16T07:13:19.635Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=MergePublisher-cold-XqZ7 exit_status=failure(cause=failed to publish splits
2024-09-16T07:13:19Z
2024-09-16T07:13:19Z Caused by:
2024-09-16T07:13:19Z precondition failed for splits `01J7WT2D80WANG4V0EN4FVZHKF, 01J7WSQS4WYMB8V98SJSA89BP4, 01J7WSVJXY8G25A2Y99V580GNC, 01J7WSZ07FDD1DB8SE1M8ZX2ZT`: splits are not marked for deletion)
2024-09-16T07:13:19Z 2024-09-16T07:13:19.635Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=MergePublisher-cold-XqZ7 exit_status=Failure(failed to publish splits
2024-09-16T07:13:19Z
2024-09-16T07:13:19Z Caused by:
2024-09-16T07:13:19Z precondition failed for splits `01J7WT2D80WANG4V0EN4FVZHKF, 01J7WSQS4WYMB8V98SJSA89BP4, 01J7WSVJXY8G25A2Y99V580GNC, 01J7WSZ07FDD1DB8SE1M8ZX2ZT`: splits are not marked for deletion)
2024-09-16T07:13:19Z 2024-09-16T07:13:19.635Z DEBUG quickwit_common::kill_switch: kill-switch-activated
2024-09-16T07:13:19Z 2024-09-16T07:13:19.813Z DEBUG quickwit_indexing::actors::merge_pipeline: Merge pipeline running. pipeline_id=IndexingPipelineId { node_id: "quickwit-indexer-27", index_uid: IndexUid { index_id: "otel-logs-v0_7", incarnation_id: Ulid(2063959136266216495463690505829921250) }, source_id: "_ingest-api-source", pipeline_uid: Pipeline(01J7WN5H0HPQKHF24X7WR6ZCWG) } generation=1 healthy_actors=["MergePlanner-spring-SN2A", "MergeSplitDownloader-throbbing-dUH8", "MergeExecutor-dawn-pAvm", "MergePackager-twilight-VTk7", "MergeUploader-black-JdUK", "MergePublisher-old-iQPj"] failed_or_unhealthy_actors=[] success_actors=[]
2024-09-16T07:13:19Z 2024-09-16T07:13:19.813Z DEBUG quickwit_indexing::actors::merge_pipeline: Merge pipeline running. pipeline_id=IndexingPipelineId { node_id: "quickwit-indexer-27", index_uid: IndexUid { index_id: "query-log-2024-06-17", incarnation_id: Ulid(2077716558178011118253688979804328215) }, source_id: "_ingest-api-source", pipeline_uid: Pipeline(01J7WN5H0HD2FE5DTTAE559ZNY) } generation=1 healthy_actors=["MergePlanner-broken-d4ur", "MergeSplitDownloader-black-YjDS", "MergeExecutor-nameless-ti3k", "MergePackager-ancient-RdhO", "MergeUploader-sparkling-RbOn", "MergePublisher-falling-k4Ls"] failed_or_unhealthy_actors=[] success_actors=[]

index config

  - version: 0.8
    index_id: fluentbit-logs-2024-09-12
    doc_mapping:
      mode: dynamic
      field_mappings:
      - name: timestamp
        type: datetime
        input_formats:
        - iso8601
        output_format: unix_timestamp_millis
        precision: milliseconds
        fast: true
      - name: message
        type: text
        tokenizer: default
        record: position
        stored: true
      - name: log
        type: text
        tokenizer: default
        record: position
        stored: true

      timestamp_field: timestamp
    indexing_settings:
      commit_timeout_secs: 10
    search_settings:
      default_search_fields:
      - message
      - log
    retention:
      period: 5 days
      schedule: hourly
sergeyignatov added the bug (Something isn't working) label on Sep 17, 2024
@fulmicoton
Contributor

@trinity-1686a any idea what happened here?

@trinity-1686a
Contributor

The error message is a bit misleading. If I understand correctly, this error happens when a split was expected to be Published but actually is not.
The only situation where I can imagine a split being Staged according to the metastore while also being a merge candidate for an indexer is if the metastore was rolled back. That can happen easily with the S3/GCS metastore (if multiple metastores are running and fighting each other), but not so much with Postgres.
The more likely scenario is that the splits are already MarkedForDeletion, which could happen due to the same kind of issue, but also for other reasons. The only persistent cause that comes to mind would be two indexers sharing the same node_id, so that they try to merge each other's splits.

Do you know if some of your indexers have the same node_id (either explicitly configured, or the same short hostname)? If you search for 01J7WT2D80WANG4V0EN4FVZHKF in all your logs, do you get logs from more than one indexer?
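
As a concrete starting point (assuming the Postgres metastore's splits table has a split_id column next to the split_state and index_uid columns used in your query above, which I haven't double-checked), you could look up the current state of the splits named in the error:

-- hypothetical check; column names assumed from the query earlier in this issue
select split_id, split_state, index_uid
from splits
where split_id in ('01J7WT2D80WANG4V0EN4FVZHKF', '01J7WSQS4WYMB8V98SJSA89BP4',
                   '01J7WSVJXY8G25A2Y99V580GNC', '01J7WSZ07FDD1DB8SE1M8ZX2ZT');

If they come back as MarkedForDeletion (or Staged) rather than Published, that matches the precondition failure above.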

@sergeyignatov
Author

The split 01J7WT2D80WANG4V0EN4FVZHKF is present only in the quickwit-indexer-27 and metastore logs.


2024-09-16 07:13:19.534 | 2024-09-16T07:13:19.533Z  INFO stage_splits{split_ids="[\"01J7WTV5YEJ4KCD5B27SZ2XMXS\"]"}: quickwit_metastore::metastore::postgres::metastore: staged `1` splits successfully index_id=fluentbit-logs-2024-09-12
2024-09-16 07:13:19.534 | 2024-09-16T07:13:19.534Z  INFO stage_splits{split_ids="[\"01J7WTV5JQ2HW1XD7XEWB5CBFG\"]"}: quickwit_metastore::metastore::postgres::metastore: staged `1` splits successfully index_id=fluentbit-logs-2024-09-12
2024-09-16 07:13:19.638 | 2024-09-16T07:13:19.634Z  WARN publish_splits{request=PublishSplitsRequest { index_uid: Some(IndexUid { index_id: "fluentbit-logs-2024-09-12", incarnation_id: Ulid(2086843892575080035060981602672116818) }), staged_split_ids: ["01J7WTQKDHE1DSZJCC8HZZBM6V"], replaced_split_ids: ["01J7WSVJXY8G25A2Y99V580GNC", "01J7WT2D80WANG4V0EN4FVZHKF", "01J7WT5NG5PNZ571V6PGBMH5TZ", "01J7WTQ9BFBHCHRXT9MXZ265DA", "01J7WTKJSX9DZ5GACD8GGK3E5T", "01J7WTCEM5PXZ6BCCN8CP9YG2B", "01J7WSQS4WYMB8V98SJSA89BP4", "01J7WT8X7M3YNXWTYB5SVQM6X9", "01J7WTFVCS7S8NBTFRCN0ZNWAB", "01J7WSZ07FDD1DB8SE1M8ZX2ZT"], index_checkpoint_delta_json_opt: None, publish_token_opt: None }}: quickwit_metastore::metastore::postgres::metastore: rollback
2024-09-16 07:13:19.638 | 2024-09-16T07:13:19.636Z  INFO publish_splits{request=PublishSplitsRequest { index_uid: Some(IndexUid { index_id: "fluentbit-logs-connect-2024-02-05", incarnation_id: Ulid(2068240375732922833404597870727557083) }), staged_split_ids: ["01J7WTV646WZ63XE6NXYJVYVD1"], replaced_split_ids: [], index_checkpoint_delta_json_opt: Some("{\"source_id\":\"_ingest-api-source\",\"source_delta\":{\"per_partition\":{\"ingest_partition_01J7WNDBAVYDR8BBZB14NBGMA4\":{\"from\":\"00000000000000449363\",\"to\":\"00000000000000458417\"}}}}"), publish_token_opt: None }}: quickwit_metastore::metastore::postgres::metastore: published 1 splits and marked 0 for deletion successfully index_id=fluentbit-logs-connect-2024-02-05
2024-09-16 07:13:19.757 | 2024-09-16T07:13:19.755Z  INFO stage_splits{split_ids="[\"01J7WTV66RWB63VWTKNGH999D4\"]"}: quickwit_metastore::metastore::postgres::metastore: staged `1` splits successfully index_id=fluentbit-logs-2024-09-12


@sergeyignatov
Author

Also, in the affected index the "Timestamp range start" is not equal to "Timestamp range end" minus the retention period:

General Information
 --------------------------------------------+-------------------------------------------------------------------------------------
  Index ID                                   | fluentbit-logs-2024-09-12
  Number of published documents              | 357.045612 M (357,045,612)
  Size of published documents (uncompressed) | 499.8 GB
  Number of published splits                 | 822
  Size of published splits                   | 149.7 GB
  Timestamp field                            | "timestamp"
  Timestamp range start                      | 2024-09-06 06:11:43 (Timestamp: 1725603103)
  Timestamp range end                        | 2024-09-17 18:15:21 (Timestamp: 1726596921)
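
With a 5 day retention I would expect the range start to be no older than about 2024-09-12 18:15:21 (1726596921 - 5 * 86400 = 1726164921), but it is 2024-09-06 06:11:43 (1725603103), roughly 6.5 days earlier than that.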

@sergeyignatov
Author

How likely is it that logs with a timestamp older than now() - "retention days" break the retention logic?

@trinity-1686a
Contributor

The retention logic looks at the timestamp range end of individual splits.
If Quickwit receives a batch of documents that are older than the retention period and builds a split out of only those (with no recent documents), the retention policy will mark that split for deletion on its next round. If the indexer then tries to merge it, it will probably get the error you're seeing.
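
As a rough illustration only (the time_range_end column name, and the assumption that it stores epoch seconds, are guesses about the metastore schema rather than verified details), the hourly retention round for this index boils down to something like:

-- illustrative sketch, not the actual retention query
select split_id
from splits
where index_uid = 'fluentbit-logs-2024-09-12:01J7MNHZMJSJJ80B2JN40CE12J'
  and split_state = 'Published'
  and to_timestamp(time_range_end) < now() - interval '5 days';

So a split built entirely from old documents becomes eligible for deletion right away, even though it was just created.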

@sergeyignatov
Author

We started dropping logs with timestamps far in the past on the fluent-bit side, and the issue has been resolved.
It would be good to have such protection on the Quickwit ingestion side.
