You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Trying to query OpenSearch 2.3 from a Glue job. The job is not pushing down the aggregate step and rather pulls all the data from OpenSearch after applying the filter and does groupBy and count in memory. The data size is too huge, I am expecting it to push the aggregate function to the OpenSearch instead.
Filter pushdown is working but aggregate pushdown is not working.
val result2 = sparkSession.read.format("opensearch").load(indexName) .filter(col("status").equalTo("active")) .groupBy("id") .agg(count("*").alias("count"))
I also tried passing the query directly instead of using Spark aggregate function, this just pulls all the data from OpenSearch and doesn't honor the query.
What is the bug?
Trying to query OpenSearch 2.3 from a Glue job. The job is not pushing down the
aggregate step
and rather pulls all the data from OpenSearch after applying the filter and doesgroupBy
andcount
in memory. The data size is too huge, I am expecting it to push the aggregate function to the OpenSearch instead.Filter pushdown is working but aggregate pushdown is not working.
The relevant configs are enabled -
.option("opensearch.pushdown.aggregation.enabled", "true") .option("opensearch.internal.spark.sql.pushdown", "true")
val result2 = sparkSession.read.format("opensearch").load(indexName) .filter(col("status").equalTo("active")) .groupBy("id") .agg(count("*").alias("count"))
I also tried passing the query directly instead of using Spark aggregate function, this just pulls all the data from OpenSearch and doesn't honor the query.
`conf.set("opensearch.query", """{"search": 0, "query": {"bool":{"filter":[{"term":{"status":"active"}}]}}, "aggs": {"id": { "terms": { "field": "id", "size": 20 }}}}""".stripMargin)
sparkSession.read.format("opensearch").load(indexName)
`
Spark Physical plan being generated -
== Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[id#11], functions=[count(1)], output=[id#11, count#65L]) +- Exchange hashpartitioning(id#11, 36), ENSURE_REQUIREMENTS, [id=#14] +- HashAggregate(keys=[id#11], functions=[partial_count(1)], output=[id#11, count#69L]) +- Project [id#11] +- Filter (isnotnull(status#14) AND (status#14 = active)) +- Scan OpenSearchRelation(Map(opensearch.resource -> ****),org.apache.spark.sql.SQLContext@27c827ec,None) [id#11,status#14] PushedFilters: [IsNotNull(status), EqualTo(status,active)], ReadSchema: struct<id:string,status:string>
What is the expected behavior?
I am expecting it to push the aggregate function to OpenSearch instead of pulling the data and aggregating it in the spark executor.
What is your host/environment?
Glue: 3 (Spark 3.0, scala: 2)
OpenSearch: 2.3
opensearch-hadoop connector: opensearch-spark-30_2.12-3.0.0-SNAPSHOT.jar
The text was updated successfully, but these errors were encountered: