[WIP] [Documentation] UDF Guide #416

Open
wants to merge 5 commits into
base: main
45 changes: 45 additions & 0 deletions docs/user-defined-functions-c#.md
@@ -0,0 +1,45 @@
# User-Defined Functions - C#
This documentation contains user-defined function (UDF) examples. It shows how to define UDFs and how to use UDFs with Row objects as examples.
Member

Suggested change
This documentation contains user-defined function (UDF) examples. It shows how to define UDFs and how to use UDFs with Row objects as examples.
A user-defined function, or UDF, is a routine that can take in parameters, perform some sort of calculation, and then return a result. This document explains how to construct UDFs in C# and includes example functions, such as how to use UDFs with Row objects.

Member

Are UDFs applicable to any C# app, or just .NET for Spark apps? If they're just used in .NET for Spark apps, I'd add a sentence or two explaining how UDFs apply to/are useful in .NET for Spark.

Member

Is there a reason we're focusing on Row object examples? Could we include other examples and then make this intro more general (i.e. "...This document explains how to construct UDFs in C# and includes example functions.")?

Contributor Author

> Are UDFs applicable to any C# app, or just .NET for Spark apps? If they're just used in .NET for Spark apps, I'd add a sentence or two explaining how UDFs apply to/are useful in .NET for Spark.

I think we are only talking about UDFs used within .NET for Spark here.


## Prerequisites
When you want to execute a C# UDF, Spark needs to understand how to launch the .NET CLR to execute this UDF. Microsoft.Spark.Worker provides a collection of classes to Spark that enable this functionality. Thus, you need to [install the Microsoft.Spark.Worker](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started#5-install-net-for-apache-spark).

Additionally, [you may need to configure certain environment variables and parameters](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/deploy-worker-udf-binaries) to deploy worker and UDF binaries when submitting your Spark app.

## UDF that takes in Row objects
Member

Why are we only showing examples of UDFs with Row objects? It seems like it'd be valuable to have this document explain how to write any UDF and show examples of all (or at least more types) of UDFs?

Or is the goal of this doc to only show Row-based UDFs (in this case, we should change the title and intro of the doc to reflect that, because right now it seems like it should explain all UDFs)?

Contributor Author

I think the purpose of this doc is to serve as a readme for using UDFs with Row objects. It goes with the recent PR that exposes UDFs that return Row objects. I think we can add more types later. @imback82 what do you think?

Contributor

We can start with UDFs with Row since there are a few gotchas with them, and we can expand this.

Contributor Author

@elvaliuliuliu elvaliuliuliu Feb 6, 2020

> We can start with UDFs with Row since there are a few gotchas with them, and we can expand this.

Sounds good!


Member

Can we add some sentences providing context to this example?

For instance, as a reader, I have the following questions:

  • When would I use a UDF that takes in Row objects (as opposed to other types of UDFs)?
  • Do all UDFs just take in or return Row objects (since that's all that is shown in this doc)?
  • What is the goal of this code? What calculation or filtering is it performing and why?
  • What would be the output of this code?
  • Is this the only way to define UDFs (using Func<> myUdf = Udf<>(...))? What about spark.Udf().Register<>...?

Contributor Author

Thanks for your suggestion! I was looking at UDF docs here. I am not sure how much detail we want to go into in this intro guide. Should we treat this as a readme for using UDFs with Row objects, or as a full UDF tutorial? This relates to your previous question as well.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Create DataFrame which will also be used in the following examples.
// `spark` is an existing SparkSession.
DataFrame df = spark.Range(0, 5).WithColumn("structId", Struct("id"));

// Define UDF that takes in Row objects: read the first field and add 100.
Func<Column, Column> udf1 = Udf<Row, int>(
    row => row.GetAs<int>(0) + 100);

// Use UDF with DataFrames
df.Select(udf1(df["structId"])).Show();
```
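
A UDF can also be registered by name and called from Spark SQL instead of being applied through the DataFrame API. The following is only a minimal sketch and is not taken from this PR: the view name `people`, the UDF name `shout`, and the app name are hypothetical, and it assumes the Microsoft.Spark.Worker setup described under the prerequisites is already in place.

```csharp
using Microsoft.Spark.Sql;

class RegisterUdfSketch
{
    static void Main()
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("register-udf-sketch") // hypothetical app name
            .GetOrCreate();

        // Hypothetical one-row, one-column view to run the UDF against.
        spark.Sql("SELECT 'abc' AS name").CreateOrReplaceTempView("people");

        // Register a string -> string UDF under a name that SQL queries can use.
        spark.Udf().Register<string, string>("shout", s => s.ToUpper());

        // Call the registered UDF by name from a SQL query.
        spark.Sql("SELECT shout(name) AS shouted FROM people").Show();

        spark.Stop();
    }
}
```

Registering by name is mainly useful when queries are written as SQL strings; the `Udf<>` form shown above fits code that composes `Column` expressions directly.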

## UDF that returns Row objects
Please note that `GenericRow` objects need to be used here.
Member

Same questions as above, so I think it'd be great to provide some additional context here. Also, why does GenericRow need to be used here?


```csharp
// Define UDF that returns Row objects.
// StructType and the field types come from Microsoft.Spark.Sql.Types.
var schema = new StructType(new[]
{
    new StructField("col1", new IntegerType()),
    new StructField("col2", new StringType())
});
Func<Column, Column> udf2 = Udf<int>(
    id => new GenericRow(new object[] { 1, "abc" }), schema);

// Use UDF with DataFrames
df.Select(udf2(df["id"])).Show();
```

## Chained UDF with Row objects
Member

As above, it would be great to add some context/explanation. What is a scenario when I'd need to chain UDFs? What does this code do?


```csharp
// Chained UDF using udf1 and udf2 defined above.
df.Select(udf1(udf2(df["id"]))).Show();
```
Member

Adding a Next Steps or Resources or Wrap Up section at the end could be really helpful. i.e., "If you'd like to see more examples of UDFs in action, check out our XYZ examples in the .NET for Apache Spark GitHub repo."