[BUG] - ModuleNotFoundError when calling a function that uses a User Defined Function (UDF) #975

Open
pietrodantuono opened this issue Dec 14, 2023 · 3 comments

System information

  • Runtime: Databricks-VSCode (Databricks Runtime 13.3.x Scala 2.12)
  • PySpark version: 3.4.2
  • Python version: 3.10.1
  • Operating system: Windows 10 Build 19045

Code structure

repo/
├── helper/
│   ├── __init__.py
│   ├── helper_module.py
│   └── ...
├── notebooks/
│   ├── notebook.ipynb
│   └── ...
└── pyproject.toml

Code sample

# helper_module.py

# From the Python Standard Library
import struct
# From PySpark
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import DataFrame



def str_hex_to_numeric(
    hex_value: str,
    data_type_name: str
) -> float:
    """Convert a hex string to a numeric value."""
    if data_type_name == "Float":
        return struct.unpack('!f', bytes.fromhex(hex_value))[0]
    raise ValueError(f"Unknown data type: {data_type_name}")

def value_col_hex_to_numeric(
    df: DataFrame,
    value_col: str = "VALUE",
    data_type_name_col: str = "DATA_TYPE_NAME"
) -> DataFrame:
    """Convert the hex strings in `value_col` to numeric values via a UDF."""
    return df.withColumn(
        value_col,
        F.udf(
            str_hex_to_numeric, T.FloatType()
        )(F.col(value_col), F.col(data_type_name_col))
    )

# notebook.ipynb
# Navigate to the repo root directory and install the helper module
%pip install -e .

# Import the helper module
from helper import helper_module

# Create a Spark DataFrame
df = spark.createDataFrame([("1", "Float", "3f800000"), ("2", "Float", "40000000"),
                            ("3", "Float", "40400000"), ("4", "Float", "40800000")],
                            ["INDEX", "DATA_TYPE_NAME", "VALUE"])

# Convert the hex string to a numeric value
df = helper_module.value_col_hex_to_numeric(df)

# Display the DataFrame
df.show()

# -- Databricks Connect returns the following error --
# ModuleNotFoundError: No module named 'helper'
# 
# -- While Azure Databricks returns the expected output --
# +-----+--------------+----------+
# |INDEX|DATA_TYPE_NAME|     VALUE|
# +-----+--------------+----------+
# |    1|         Float|       1.0|
# |    2|         Float|       2.0|
# |    3|         Float|       3.0|
# |    4|         Float|       4.0|
# +-----+--------------+----------+ 
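
As a sanity check, the conversion logic can be run without Spark to confirm that the failure lies in how the UDF reaches the cluster and not in the function itself. This is a minimal sketch that copies `str_hex_to_numeric` and the four sample rows from the report above; the `rows` and `converted` names are only for illustration.

```python
import struct


def str_hex_to_numeric(hex_value: str, data_type_name: str) -> float:
    """Convert a big-endian hex string to a numeric value (same logic as the report)."""
    if data_type_name == "Float":
        return struct.unpack("!f", bytes.fromhex(hex_value))[0]
    raise ValueError(f"Unknown data type: {data_type_name}")


# Same sample data as the notebook, as plain tuples instead of a Spark DataFrame
rows = [
    ("1", "Float", "3f800000"),
    ("2", "Float", "40000000"),
    ("3", "Float", "40400000"),
    ("4", "Float", "40800000"),
]

converted = [str_hex_to_numeric(value, type_name) for _, type_name, value in rows]
print(converted)  # [1.0, 2.0, 3.0, 4.0]
```

The values match the VALUE column that Azure Databricks returns, so the `ModuleNotFoundError` is not caused by the conversion code.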
@pietrodantuono pietrodantuono added the bug Something isn't working label Dec 14, 2023
@kartikgupta-db kartikgupta-db self-assigned this Dec 18, 2023

AlexWehning commented Feb 8, 2024

I have the same issue.

Maybe to add to this:
When I use the @udf decorator, or wrap the str_hex_to_numeric function inside another function, it works for me!

@udf
def str_hex_to_numeric(hex_value: str, data_type_name: str) -> float:
    ...

or

def udf_wrapper():
    def str_hex_to_numeric(hex_value: str, data_type_name: str) -> float:
        ...
    return udf(str_hex_to_numeric, FloatType())

Referencing things from outside the function's scope, such as constants, also doesn't work.
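
One way to combine the wrapper workaround with constants is to define the constants inside the wrapper, so the inner function carries them in its closure. This is a sketch only, not verified against Databricks Connect: `make_str_hex_to_numeric` and `struct_formats` are hypothetical names, and PySpark is imported lazily so the plain function can be tested without a Spark session.

```python
import struct


def make_str_hex_to_numeric():
    """Build the conversion function with its constants defined locally."""
    # Hypothetical lookup table; it lives inside the factory instead of at
    # module level, so the inner function does not reach outside its scope.
    struct_formats = {"Float": "!f"}

    def str_hex_to_numeric(hex_value: str, data_type_name: str) -> float:
        fmt = struct_formats.get(data_type_name)
        if fmt is None:
            raise ValueError(f"Unknown data type: {data_type_name}")
        return struct.unpack(fmt, bytes.fromhex(hex_value))[0]

    return str_hex_to_numeric


def udf_wrapper():
    """Wrap the locally defined function as a Spark UDF (requires PySpark)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    return udf(make_str_hex_to_numeric(), FloatType())
```

Calling `make_str_hex_to_numeric()("40400000", "Float")` returns `3.0` locally; whether `udf_wrapper()` then behaves on a given Databricks Connect version still needs to be checked on a cluster.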


sebastus commented Feb 28, 2024

I have the same issue. I found it with this use case:

df = df.withColumn('result', my_udf(col('some_data')))

where my_udf is in a helper module.

The only solution I've found so far is to package the helper as a wheel, install the wheel on the cluster, and then run my notebook from the Databricks workspace rather than from VS Code.


tplatenburg commented Oct 31, 2024

Had the same issue with databricks-connect==15.3.0. With the @udf decorator it indeed works! Thanks!

Full resolution code sample with the decorator:

# helper_module.py

# From the Python Standard Library
import struct
# From PySpark
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import DataFrame
from pyspark.sql.functions import udf


@udf(T.FloatType())
def str_hex_to_numeric(
    hex_value: str,
    data_type_name: str
) -> float:
    """Convert a hex string to a numeric value."""
    if data_type_name == "Float":
        return struct.unpack('!f', bytes.fromhex(hex_value))[0]
    raise ValueError(f"Unknown data type: {data_type_name}")

def value_col_hex_to_numeric(
    df: DataFrame,
    value_col: str = "VALUE",
    data_type_name_col: str = "DATA_TYPE_NAME"
) -> DataFrame:
    """Convert the hex strings in `value_col` to numeric values via the UDF."""
    return df.withColumn(
        value_col,
        str_hex_to_numeric(F.col(value_col), F.col(data_type_name_col))
    )
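
A possible explanation for why the decorator helps, offered as an assumption that nobody in this thread has confirmed: serializers such as cloudpickle pickle an importable module-level function by reference (module name plus function name), which requires the `helper` package to be importable on the cluster. Once a decorator rebinds the name to a wrapper object, the raw function is no longer reachable under its own name and has to be shipped by value. The sketch below only illustrates that rebinding in plain Python; `fake_udf` is a hypothetical stand-in and not PySpark's `udf`.

```python
import functools


def fake_udf(func):
    """Hypothetical stand-in for a UDF decorator; no Spark involved."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def plain(hex_value: str) -> str:
    return hex_value.lower()


@fake_udf
def decorated(hex_value: str) -> str:
    return hex_value.lower()


def reachable_by_name(func, bound_object) -> bool:
    """True if the object bound to the function's name is the function itself."""
    return bound_object is func


# The name `plain` still points at the raw function.
plain_by_name = reachable_by_name(plain, plain)

# The name `decorated` now points at the wrapper; functools.wraps keeps the
# raw function available as `__wrapped__`, but not under its own name.
decorated_by_name = reachable_by_name(decorated.__wrapped__, decorated)

print(plain_by_name, decorated_by_name)  # True False
```

If this is what happens, it would also be consistent with the nested-function workaround above, since a function defined inside another function cannot be looked up by module and name either.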
