
[BUG] Reading data from OpenSearch is pulling all the data and not pushing down the aggregates #302

Open
raviprakashshahi opened this issue Jul 7, 2023 · 1 comment
Labels
bug Something isn't working

Comments


raviprakashshahi commented Jul 7, 2023

What is the bug?

I am querying OpenSearch 2.3 from a Glue job. The job does not push down the aggregate step; instead it pulls all the data from OpenSearch after applying the filter and performs the groupBy and count in memory. The data size is very large, so I expect the aggregation to be pushed down to OpenSearch instead.

Filter pushdown is working but aggregate pushdown is not working.

The relevant configs are enabled:

```scala
.option("opensearch.pushdown.aggregation.enabled", "true")
.option("opensearch.internal.spark.sql.pushdown", "true")
```

```scala
val result2 = sparkSession.read.format("opensearch").load(indexName)
  .filter(col("status").equalTo("active"))
  .groupBy("id")
  .agg(count("*").alias("count"))
```

I also tried passing the query directly instead of using Spark aggregate functions; this likewise pulls all the data from OpenSearch and doesn't honor the query.

```scala
conf.set("opensearch.query",
  """{"size": 0, "query": {"bool": {"filter": [{"term": {"status": "active"}}]}}, "aggs": {"id": {"terms": {"field": "id", "size": 20}}}}""")

sparkSession.read.format("opensearch").load(indexName)
```
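As a side note, the request body above can be sanity-checked outside Spark before handing it to the connector. A minimal Python sketch (the field and index details are taken from the snippet above; note that the OpenSearch `_search` API expects `"size": 0`, not `"search": 0`, to suppress raw hits and return only aggregation buckets):

```python
import json

# The same request body as in the snippet above: a filtered query plus a
# terms aggregation on "id", with "size": 0 so no raw hits are returned.
body = {
    "size": 0,
    "query": {"bool": {"filter": [{"term": {"status": "active"}}]}},
    "aggs": {"id": {"terms": {"field": "id", "size": 20}}},
}

# Serialize and re-parse to confirm the body is well-formed JSON
# before passing it via conf.set("opensearch.query", ...).
parsed = json.loads(json.dumps(body))
print(parsed["aggs"]["id"]["terms"]["field"])  # -> id
```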

The Spark physical plan being generated:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[id#11], functions=[count(1)], output=[id#11, count#65L])
   +- Exchange hashpartitioning(id#11, 36), ENSURE_REQUIREMENTS, [id=#14]
      +- HashAggregate(keys=[id#11], functions=[partial_count(1)], output=[id#11, count#69L])
         +- Project [id#11]
            +- Filter (isnotnull(status#14) AND (status#14 = active))
               +- Scan OpenSearchRelation(Map(opensearch.resource -> ****),org.apache.spark.sql.SQLContext@27c827ec,None) [id#11,status#14] PushedFilters: [IsNotNull(status), EqualTo(status,active)], ReadSchema: struct<id:string,status:string>
```

What is the expected behavior?

I expect the aggregate function to be pushed down to OpenSearch instead of all matching rows being pulled and aggregated in the Spark executors.
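For reference, the server-side terms aggregation being requested is semantically the same group-by-count that Spark currently performs in memory; the difference is only where the work happens. A small self-contained Python illustration over invented sample documents (the document values here are made up for the example):

```python
from collections import Counter

# Invented sample documents standing in for the index contents.
docs = [
    {"id": "a", "status": "active"},
    {"id": "a", "status": "active"},
    {"id": "b", "status": "active"},
    {"id": "b", "status": "inactive"},
    {"id": "c", "status": "active"},
]

# Equivalent of: filter(status = active), then a terms aggregation
# (bucket-and-count) on "id". With pushdown, OpenSearch would compute
# these per-id counts and return only the buckets, not the raw documents.
counts = Counter(d["id"] for d in docs if d["status"] == "active")
print(dict(counts))  # {'a': 2, 'b': 1, 'c': 1}
```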

What is your host/environment?

Glue: 3 (Spark 3.0, Scala 2.12)
OpenSearch: 2.3
opensearch-hadoop connector: opensearch-spark-30_2.12-3.0.0-SNAPSHOT.jar

@raviprakashshahi raviprakashshahi added bug Something isn't working untriaged labels Jul 7, 2023
@wbeckler

This makes sense. If anyone is interested in making this change, I think it would be welcome.
