[FEA][Audit] [follow-up] Default values in schema caused Spark RAPIDS failure #12031

Open · res-life opened this issue Jan 26, 2025 · 0 comments
Labels: audit_4.0.0 (Audit related tasks for 4.0.0) · feature request (New feature or request) · P2 (Not required for release)
Is your feature request related to a problem? Please describe.
Apache Spark posted a commit to fix a bug related to default values in schema. Spark-Rapids currently does not support default values in schema, so this kind of bug does not occur in Spark-Rapids. This issue is just a follow-up in case Spark-Rapids supports default values in schema in the future.
This issue is marked P2 because Spark-Rapids does not currently trigger the bug.
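For context, Spark records an existence default in the column's StructField metadata; the withExistenceDefaultValue call in the repro below is shorthand for attaching that metadata. A minimal sketch of the equivalent, assuming the Spark 3.4+ metadata key name EXISTS_DEFAULT:

import org.apache.spark.sql.types._

// Rough equivalent of StructField(...).withExistenceDefaultValue("42.0"):
// attach the EXISTS_DEFAULT key to the field's metadata
// (key name is an assumption based on Spark 3.4+ sources).
val meta = new MetadataBuilder()
  .putString("EXISTS_DEFAULT", "42.0")
  .build()
val f2 = StructField("f2", DecimalType(32, 10), nullable = true, meta)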

How to trigger the bug on Spark without Spark-Rapids:

// Disable Spark-Rapids; run on Spark 3.5.
spark.conf.set("spark.rapids.sql.enabled", false)

import org.apache.spark.sql.types._
import org.apache.spark.sql._
import scala.collection.JavaConverters._

val data = Seq(Row(Decimal("13.0")))
// A decimal this wide does not fit in an int or a long.
val wideDecimal = DecimalType(32, 10)
val initialSchema = StructType(Seq(
  StructField("f1", wideDecimal)
))
// The evolved schema adds a column with an existence default value.
val evolvedSchemaWithDefaultValue = StructType(Seq(
  StructField("f1", wideDecimal),
  StructField("f2", wideDecimal).withExistenceDefaultValue("42.0")
))
val path = "/tmp/a.parquet"
// Write with the initial schema, then read back with the evolved schema,
// so f2 must be filled from its default value.
val df = spark.createDataFrame(data.asJava, initialSchema)
df.write.mode("overwrite").parquet(path)
val res = spark.read.schema(evolvedSchemaWithDefaultValue).parquet(path).collect()
assert(res.length == 1)
assert(res(0).getDecimal(0).toBigInteger.longValueExact() == 13)
assert(res(0).getDecimal(1).toBigInteger.longValueExact() == 42)

Reports:

java.lang.NullPointerException: Cannot store to long array because "this.longData" is null
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:367)
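The NPE is consistent with how OnHeapColumnVector backs decimal columns: only decimals narrow enough to fit in a long get a longData buffer, while wider decimals are array-backed, so filling the new column's default through putLongs dereferences a null longData. A quick sketch of why DecimalType(32, 10) takes the array-backed path, using Spark's Decimal.MAX_LONG_DIGITS constant:

import org.apache.spark.sql.types._

// Precision 32 exceeds the 18 digits an unscaled long can hold, so the
// column vector stores this type in byte arrays rather than longData.
assert(DecimalType(32, 10).precision > Decimal.MAX_LONG_DIGITS)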

Currently Spark-Rapids does not support default values in schema, so enabling the plugin and running the same code falls back to the CPU:

// Enable Spark-Rapids, then run the above code again.
spark.conf.set("spark.rapids.sql.enabled", true)

Reports:

!Exec <FileSourceScanExec> cannot run on GPU because GpuParquetScan does not support default values in schema
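As a side note, messages like the one above come from the plugin's explain output; to make sure they are logged, the explain level can be set explicitly. A small sketch, assuming the spark.rapids.sql.explain config from the spark-rapids documentation:

// Ask the plugin to report which operators cannot run on the GPU and why.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")
spark.read.schema(evolvedSchemaWithDefaultValue).parquet(path).collect()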

Describe the solution you'd like
No fix or development is required currently.
When Spark-Rapids supports default values in schema in the future, this case should be added as a regression test to check the GPU behavior; a sketch follows below.
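A minimal sketch of such a regression test, reusing the repro above; the suite name and harness are illustrative, not the plugin's actual test framework:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import org.scalatest.funsuite.AnyFunSuite
import scala.collection.JavaConverters._

// Illustrative suite; wire into the real spark-rapids test harness once
// default values in schema are supported.
class SchemaDefaultValueSuite extends AnyFunSuite {
  test("GPU read of evolved schema with existence default") {
    val spark = SparkSession.builder().master("local[1]")
      .config("spark.rapids.sql.enabled", "true").getOrCreate()
    val wide = DecimalType(32, 10)
    val initial = StructType(Seq(StructField("f1", wide)))
    val evolved = StructType(Seq(
      StructField("f1", wide),
      StructField("f2", wide).withExistenceDefaultValue("42.0")))
    val path = "/tmp/default_value_regression.parquet" // hypothetical path
    spark.createDataFrame(Seq(Row(Decimal("13.0"))).asJava, initial)
      .write.mode("overwrite").parquet(path)
    val res = spark.read.schema(evolved).parquet(path).collect()
    // Expected behavior once supported: the default is filled on the GPU.
    assert(res(0).getDecimal(1).toBigInteger.longValueExact() == 42)
  }
}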
