Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join vignette #6478

Open
wants to merge 15 commits into
base: master
Choose a base branch
from
Open

Conversation

AngelFelizR
Copy link

This pull request is my solution to issue #2181

)
```

In this vignette you will learn how to perform any join operation using next resources available in the `data.table` syntax.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete "next"

set.seed(5415)

ProductSales = data.table(
id = 1:10,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here there are ten rows which may be overly complex for a first demonstration example
could it be reduced to two rows?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be challenging, as I would need to modify the examples below, since they are related. However, I do not expect the user to pay much attention to that detail.

We can consider avoiding the table display.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is ok to keep all ten rows if necessary, but smaller is better in terms of learning examples usually.

```
x[i, on, nomatch]
| | | |
| | | ----> If NULL only returns rows linked in x and i tables
Copy link
Member

@tdhock tdhock Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use \ and _ instead of - ?

|
\__

| | | ----> If NULL only returns rows linked in x and i tables
| | ----> a character vector o list defining match logict
| ----> principal data.table, list or data.frame
----> secundary data.table
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

secondary

| | | |
| | | ----> If NULL only returns rows linked in x and i tables
| | ----> a character vector o list defining match logict
| ----> principal data.table, list or data.frame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

primary instead of principal?

also where do these names, principal/secondary come from? I don't see them in ?data.table so I would suggest to keep the standard/docmented names, i and x

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Toby, thanks for pointing this out, I will change it from principal to primary.

I am trying to keep using the documented reference.

I just thought it could be useful for a new user to understand that the i table is more important in defining the number of rows to return, regardless of whether they are trying to make a left or right join.

----> secundary data.table
```

An important difference between the regular `data.table` syntax is that _the only argument you can pass by position is the `i` argument_, the rest as you will see are going to be passed by name, so **feel free to change the argument order any time using the argument names** if it seems more convenient.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not true. In R you can pass any argument by name or position. on is 15th so it is difficult, but possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your point.

Here is the new text

Please keep in mind that the standard argument order in data.table is dt[i, j, by]. For join operations, it is recommended to pass the on and nomatch arguments by name to avoid using j and by when they are not needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand "to avoid using j and by when they are not needed." can you please clarify?

Here is my attempt to clarify. Is this similar to what you meant? "For join operations using the data table square brackets, the first argument (i=table or names) is often specified as a positional argument, and the on and nomatch arguments are often specified by name, for example: dt1[dt2, on="variable", nomatch=0L]."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is a demonstration that it is possible to use on argument by position instead of name (I don't think this is a great idea to do in practice though haha)

> dt1=data.table(x=1:2, y=3:4)
> dt2=data.table(x=2)
> dt1[dt2,,,,,,,,,,,,,,"x"]
       x     y
   <int> <int>
1:     2     4


### 3.1. Right join

Use this method if you need to combine columns from 2 tables based on one or more references but ***keeping all rows present in the table located on the right***.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change right to i ? or at least mention i in addition to right

Copy link
Member

@tdhock tdhock Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "table in the square brackets" or "i argument" should be mentioned in addition to "table located on the right"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change right to i ? or at least mention i in addition to right

I think it can be confusing as the left join it's also an i join, just exchanging the position of each table


```{r}
Products[ProductReceived,
on = c("id" = "product_id")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"id" quotes not necessary

Our recommendation is to use the second alternative if possible, as it is **faster** and uses **less memory** than the first one.


##### 3.1.3.1. Managing shared column Names with the j argument
Copy link
Member

@tdhock tdhock Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

four levels of section numbering (3.1.3.1) is difficult to understand. can it be kept to 2 or 3 please? (3.1) or (3.1.3)

Comment on lines 257 to 261
dt1 =
ProductReceived[Products,
on = c("product_id" = "id"),
by = .EACHI,
j = .(total_value_received = sum(price * count))]
Copy link
Member

@tdhock tdhock Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please consider alternative whitespace formatting below.
when each argument of square brackets is on its own line, each argument can be commented easily

dt1 = ProductReceived[
  Products,
  on = c("product_id" = "id"),
  by = .EACHI,
  j = .(total_value_received  = sum(price * count))
]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this format.


```r
DT[ ...
][ ...
Copy link
Member

@tdhock tdhock Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DT[
  ...
][
  ...
][
  ...
]

]
```

So far, if after applying all that operations **we want to join new columns without removing any row**, we would need to stop the chaining process, save a temporal table and later apply the join operation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temporal -> temporary

allow.cartesian = TRUE]
```

> `allow.cartesian` is defaulted to FALSE as this joins can can lead to a very large number of rows in the result.For example, if Table A has 100 rows and Table B has 50 rows, their Cartesian product would result in 5000 rows (100 * 50). This can quickly become memory-intensive for large datasets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add space after period (.)

Copy link
Member

@tdhock tdhock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the highly significant contribution!
looks great overall, I left a few comments to think about.

@ChristianWia
Copy link

is this vignette candidate too for translations ?

@AngelFelizR
Copy link
Author

After getting the approval, I can make a version in Spanish as it is my native language.

@tdhock
Copy link
Member

tdhock commented Oct 4, 2024

Hi @AngelFelizR this is a highly non-trivial contribution of documentation, so can you please add your name to DESCRIPTION as contributor? After that I will merge. Thanks for your revisions!

@AngelFelizR
Copy link
Author

Hi @AngelFelizR this is a highly non-trivial contribution of documentation, so can you please add your name to DESCRIPTION as contributor? After that I will merge. Thanks for your revisions!

Thanks Toby

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants