[WIP] [Documentation] UDF Guide #416
base: main
@@ -0,0 +1,45 @@
# User-Defined Functions - C#
This document contains user-defined function (UDF) examples. It shows how to define UDFs and how to use them with `Row` objects.

## Prerequisites
To execute a C# UDF, Spark needs to know how to launch the .NET CLR. Microsoft.Spark.Worker provides a collection of classes to Spark that enable this functionality, so you need to [install Microsoft.Spark.Worker](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started#5-install-net-for-apache-spark).

Additionally, [you may need to configure certain environment variables and parameters](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/deploy-worker-udf-binaries) to deploy worker and UDF binaries when submitting your Spark app.

## UDF that takes in Row objects
Why are we only showing examples of UDFs with Row objects? It seems like it'd be valuable to have this document explain how to write any UDF and show examples of all (or at least more types) of UDFs? Or is the goal of this doc to only show Row-based UDFs (in this case, we should change the title and intro of the doc to reflect that, because right now it seems like it should explain all UDFs)?

I think the purpose of this doc is

We can start with UDFs with

Sounds good!

Can we add some sentences providing context to this example? For instance, as a reader, I have the following questions:

Thanks for your suggestion! I was looking at UDF docs here. I am not sure how much detail we want to go with this intro guide. Should we just consider this as a

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Create DataFrame which will also be used in the following examples.
// Assumes `spark` is an existing SparkSession.
DataFrame df = spark.Range(0, 5).WithColumn("structId", Struct("id"));

// Define UDF that takes in Row objects
Func<Column, Column> udf1 = Udf<Row, int>(
    row => row.GetAs<int>(0) + 100);

// Use UDF with DataFrames
df.Select(udf1(df["structId"])).Show();
```
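
The UDF above reads the struct field by position. As a minimal sketch, assuming `Row.GetAs<T>` also accepts a column name and that `Struct("id")` keeps the field name `id`, the same lookup can be written by name. The class name, the app name, and `addHundred` are hypothetical, and the program assumes it is launched through `spark-submit` with Microsoft.Spark.Worker installed.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Hypothetical example class; not part of the original guide.
public static class RowFieldByNameExample
{
    public static void Main()
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("udf-row-by-name")
            .GetOrCreate();

        // Same shape as the DataFrame above: "structId" is a struct
        // wrapping the "id" column.
        DataFrame df = spark.Range(0, 5).WithColumn("structId", Struct("id"));

        // Assumption: the struct field is named "id", so it can be read
        // by name instead of by index 0.
        Func<Column, Column> addHundred = Udf<Row, int>(
            row => row.GetAs<int>("id") + 100);

        df.Select(addHundred(df["structId"])).Show();

        spark.Stop();
    }
}
```

Reading by name keeps the UDF valid if fields are later added to the struct in a different order, at the cost of coupling it to the field name.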
|
||
## UDF that returns Row objects
Please note that `GenericRow` objects need to be used here.

Same questions as above, so I think it'd be great to provide some additional context here. Also, why does

```csharp
using Microsoft.Spark.Sql.Types;

// Define UDF that returns Row objects.
// The schema describes the fields of the returned row, in order.
var schema = new StructType(new[]
{
    new StructField("col1", new IntegerType()),
    new StructField("col2", new StringType())
});
Func<Column, Column> udf2 = Udf<int>(
    id => new GenericRow(new object[] { 1, "abc" }), schema);

// Use UDF with DataFrames
df.Select(udf2(df["id"])).Show();
```
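
A UDF that returns a row produces a single struct column. As a hedged sketch of how the fields of that struct might be read back, the program below aliases the column and selects its fields with dotted paths. The class name, the app name, `toRow`, and the alias `result` are hypothetical, and the program assumes it is launched through `spark-submit` with Microsoft.Spark.Worker installed.

```csharp
using System;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using static Microsoft.Spark.Sql.Functions;

// Hypothetical example class; not part of the original guide.
public static class ExpandReturnedRowExample
{
    public static void Main()
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("udf-returns-row")
            .GetOrCreate();

        DataFrame df = spark.Range(0, 5);

        // Schema of the struct the UDF returns; the GenericRow values
        // must line up with these fields in order and type.
        var schema = new StructType(new[]
        {
            new StructField("col1", new IntegerType()),
            new StructField("col2", new StringType())
        });

        Func<Column, Column> toRow = Udf<int>(
            id => new GenericRow(new object[] { 1, "abc" }), schema);

        // Alias the struct column, then select its fields by dotted path.
        DataFrame structs = df.Select(toRow(df["id"]).As("result"));
        structs.PrintSchema();
        structs.Select("result.col1", "result.col2").Show();

        spark.Stop();
    }
}
```

`PrintSchema()` is included so the struct layout declared in `schema` can be compared against what Spark reports.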

## Chained UDF with Row objects
As above, it would be great to add some context/explanation. What is a scenario when I'd need to chain UDFs? What does this code do?

```csharp
// Chained UDF using udf1 and udf2 defined above.
// udf2 builds a Row for each id; udf1 then reads that Row's first field and adds 100.
df.Select(udf1(udf2(df["id"]))).Show();
```
Adding a

Are UDFs applicable to any C# app, or just .NET for Spark apps? If they're just used in .NET for Spark apps, I'd add a sentence or two explaining how UDFs apply to/are useful in .NET for Spark.

Is there a reason we're focusing on Row object examples? Could we include other examples and then make this intro more general (i.e. "...This document explains how to construct UDFs in C# and includes example functions.")?

I think we talk about UDF used within .NET for Spark here.