Join vignette #6478

AngelFelizR · 2024-09-06T05:46:13Z

This pull request is my solution to issue #2181

tdhock · 2024-09-06T13:43:29Z

vignettes/datatable-joins.Rmd

+)
+```
+
+In this vignette you will learn how to perform any join operation using next resources available in the `data.table` syntax.


delete "next"

tdhock · 2024-09-06T13:46:28Z

vignettes/datatable-joins.Rmd

+set.seed(5415)
+
+ProductSales = data.table(
+  id = 1:10,


Here there are ten rows, which may be overly complex for a first demonstration example.
Could it be reduced to two rows?

That would be challenging, as I would need to modify the examples below, since they are related. However, I do not expect the user to pay much attention to that detail.

We can consider avoiding the table display.

It is OK to keep all ten rows if necessary, but smaller is usually better for learning examples.

tdhock · 2024-09-06T13:47:01Z

vignettes/datatable-joins.Rmd

+```
+x[i, on, nomatch]
+| |  |   |
+| |  |   ----> If NULL only returns rows linked in x and i tables


maybe use \ and _ instead of - ?

| \__

tdhock · 2024-09-06T13:47:43Z

vignettes/datatable-joins.Rmd

+| |  |   ----> If NULL only returns rows linked in x and i tables
+| |  ----> a character vector o list defining match logict
+| ----> principal data.table, list or data.frame
+----> secundary data.table


tdhock · 2024-09-06T13:50:16Z

vignettes/datatable-joins.Rmd

+| |  |   |
+| |  |   ----> If NULL only returns rows linked in x and i tables
+| |  ----> a character vector o list defining match logict
+| ----> principal data.table, list or data.frame


primary instead of principal?

Also, where do these names (principal/secondary) come from? I don't see them in ?data.table, so I would suggest keeping the standard, documented names, i and x.

Toby, thanks for pointing this out, I will change it from principal to primary.

I am trying to keep using the documented reference.

I just thought it could be useful for a new user to understand that the i table is more important in defining the number of rows to return, regardless of whether they are trying to make a left or right join.
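
To make the point about the `i` table concrete, here is a minimal sketch of how the rows of `i` drive the result and what `nomatch` changes. The tables `x_dt` and `i_dt` are made up for illustration and do not come from the vignette.

```r
library(data.table)

# Hypothetical tables, not the vignette's data
x_dt = data.table(id = 1:3, a = c("a", "b", "c"))
i_dt = data.table(id = c(2L, 4L), b = c("B", "D"))

# Default nomatch = NA: one row per row of i_dt, with NA where x_dt has no match
all_i_rows = x_dt[i_dt, on = "id"]

# nomatch = NULL: keep only the rows linked in both tables
matched_only = x_dt[i_dt, on = "id", nomatch = NULL]
```

In this sketch `id = 4` exists only in `i_dt`, so it survives the first join with an `NA` in column `a` and is dropped by the second.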

tdhock · 2024-09-06T13:52:25Z

vignettes/datatable-joins.Rmd

+----> secundary data.table
+```
+
+An important difference between the regular `data.table` syntax is that _the only argument you can pass by position is the `i` argument_, the rest as you will see are going to be passed by name, so **feel free to change the argument order any time using the argument names** if it seems more convenient.


this is not true. In R you can pass any argument by name or position. on is 15th so it is difficult, but possible.

I understand your point.

Here is the new text

Please keep in mind that the standard argument order in data.table is dt[i, j, by]. For join operations, it is recommended to pass the on and nomatch arguments by name to avoid using j and by when they are not needed.

I'm not sure I understand "to avoid using j and by when they are not needed." can you please clarify?

Here is my attempt to clarify. Is this similar to what you meant? "For join operations using the data table square brackets, the first argument (i=table or names) is often specified as a positional argument, and the on and nomatch arguments are often specified by name, for example: dt1[dt2, on="variable", nomatch=0L]."

Here is a demonstration that it is possible to pass the on argument by position instead of by name (I don't think this is a great idea in practice, though, haha):

> dt1=data.table(x=1:2, y=3:4)
> dt2=data.table(x=2)
> dt1[dt2,,,,,,,,,,,,,,"x"]
       x     y
   <int> <int>
1:     2     4
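
For contrast, a small sketch of the by-name style discussed in this thread, reusing the `dt1` and `dt2` names from the demonstration (with `x = 2L` written as an integer here, which is an assumption made for type clarity). Because `on` and `nomatch` are named, their order should not matter.

```r
library(data.table)

dt1 = data.table(x = 1:2, y = 3:4)
dt2 = data.table(x = 2L)  # integer literal chosen for this sketch

# i is passed by position; on and nomatch are passed by name, in either order
by_name_1 = dt1[dt2, on = "x", nomatch = NULL]
by_name_2 = dt1[dt2, nomatch = NULL, on = "x"]
```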

tdhock · 2024-09-06T13:53:48Z

vignettes/datatable-joins.Rmd

+
+### 3.1. Right join
+
+Use this method if you need to combine columns from 2 tables based on one or more references but ***keeping all rows present in the table located on the right***.


change right to i ? or at least mention i in addition to right

I think "table in the square brackets" or "i argument" should be mentioned in addition to "table located on the right"

change right to i ? or at least mention i in addition to right

I think it can be confusing, as the left join is also an i join, just with the positions of the two tables exchanged.
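
A minimal sketch of the point in this thread: in both cases the table inside the square brackets (the `i` argument) is the one whose rows are all kept. The table names follow the vignette excerpt, but the rows and the `price` and `count` values below are made up.

```r
library(data.table)

# Made-up data; only the table and key names follow the vignette excerpt
Products = data.table(id = 1:3, price = c(10, 20, 30))
ProductReceived = data.table(product_id = c(1L, 1L, 4L), count = c(5L, 2L, 4L))

# "Right join": every row of the i table (ProductReceived) is kept
right_join = Products[ProductReceived, on = c(id = "product_id")]

# Exchanging the tables keeps every row of Products instead
left_join = ProductReceived[Products, on = c(product_id = "id")]
```

Here `product_id = 4` has no match in `Products`, so its `price` is `NA` in `right_join`, while products 2 and 3 have no receipts and get `NA` counts in `left_join`.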

tdhock · 2024-09-06T13:56:11Z

vignettes/datatable-joins.Rmd

+
+```{r}
+Products[ProductReceived,
+         on = c("id" = "product_id")]


"id" quotes not necessary

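A quick sketch of the quoting remark, under the assumption that the names in `c()` are ordinary R argument names, so quoting a syntactic name such as `id` is optional. The data are made up.

```r
library(data.table)

# Made-up data for illustration
Products = data.table(id = 1:3, price = c(10, 20, 30))
ProductReceived = data.table(product_id = c(1L, 3L), count = c(5L, 2L))

# The name on the left of "=" may be quoted or not
quoted   = Products[ProductReceived, on = c("id" = "product_id")]
unquoted = Products[ProductReceived, on = c(id = "product_id")]
```
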
tdhock · 2024-09-06T13:57:42Z

vignettes/datatable-joins.Rmd

+Our recommendation is to use the second alternative if possible, as it is **faster** and uses **less memory** than the first one.
+
+
+##### 3.1.3.1. Managing shared column Names with the j argument


Four levels of section numbering (3.1.3.1) are difficult to follow. Can it be kept to 2 or 3 levels, please? For example, (3.1) or (3.1.3).

tdhock · 2024-09-06T14:00:56Z

vignettes/datatable-joins.Rmd

+dt1 =
+  ProductReceived[Products,
+                  on = c("product_id" = "id"),
+                  by = .EACHI,
+                  j = .(total_value_received  = sum(price * count))]


Please consider the alternative whitespace formatting below.
When each argument of the square brackets is on its own line, each argument can be commented easily.

dt1 = ProductReceived[
  Products,
  on = c("product_id" = "id"),
  by = .EACHI,
  j = .(total_value_received = sum(price * count))
]

I like this format.
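
As an illustration of why this layout helps, here is a runnable sketch with one comment per argument. The table and column names come from the hunk above; the rows are made up, and the comments describe my understanding of `by = .EACHI` rather than text from the vignette.

```r
library(data.table)

# Made-up data; column names follow the hunk above
Products = data.table(id = 1:3, price = c(10, 20, 30))
ProductReceived = data.table(product_id = c(1L, 1L, 2L), count = c(5L, 2L, 4L))

dt1 = ProductReceived[
  Products,                     # i: one result group per row of Products
  on = c("product_id" = "id"),  # match ProductReceived$product_id to Products$id
  by = .EACHI,                  # evaluate j once for each row of i
  j = .(total_value_received = sum(price * count))  # price from i, count from x
]
```

With these rows, product 1 gets 10 * (5 + 2) = 70, product 2 gets 20 * 4 = 80, and product 3, which has no receipts, gets `NA`.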

tdhock · 2024-09-06T14:04:34Z

vignettes/datatable-joins.Rmd

+
+```r
+DT[ ...
+   ][ ...


DT[
  ...
][
  ...
][
  ...
]

tdhock · 2024-09-06T14:04:58Z

vignettes/datatable-joins.Rmd

+       ]
+```
+
+So far, if after applying all that operations **we want to join new columns without removing any row**, we would need to stop the chaining process, save a temporal table and later apply the join operation.


temporal -> temporary

tdhock · 2024-09-06T14:05:41Z

vignettes/datatable-joins.Rmd

+                allow.cartesian = TRUE]
+```
+
+> `allow.cartesian` is defaulted to FALSE as this joins can can lead to a very large number of rows in the result.For example, if Table A has 100 rows and Table B has 50 rows, their Cartesian product would result in 5000 rows (100 * 50). This can quickly become memory-intensive for large datasets.


Add space after period (.)
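
The 100 by 50 arithmetic in the quoted note can be checked with a small sketch. The tables `A` and `B` and the column `key_col` are hypothetical; every row shares the same key so that each row of `B` matches every row of `A`.

```r
library(data.table)

# Hypothetical tables in which every row has the same key value
A = data.table(key_col = rep(1L, 100), a = seq_len(100))
B = data.table(key_col = rep(1L, 50), b = seq_len(50))

# With the default allow.cartesian = FALSE this join is expected to stop
# with an error, because the result would exceed nrow(A) + nrow(B) rows
res_default = tryCatch(
  A[B, on = "key_col"],
  error = function(e) "error"
)

# Opting in explicitly returns the full 100 * 50 combination
res_cartesian = A[B, on = "key_col", allow.cartesian = TRUE]
n_rows = nrow(res_cartesian)
```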

tdhock

thanks for the highly significant contribution!
looks great overall, I left a few comments to think about.

ChristianWia · 2024-09-07T12:47:48Z

Is this vignette also a candidate for translation?

AngelFelizR · 2024-09-07T13:09:09Z

After getting the approval, I can make a version in Spanish as it is my native language.

Some tweaks. Related: Rdatatable#6478 Rdatatable#2181

tdhock · 2024-10-04T13:09:13Z

Hi @AngelFelizR this is a highly non-trivial contribution of documentation, so can you please add your name to DESCRIPTION as contributor? After that I will merge. Thanks for your revisions!

AngelFelizR · 2024-10-05T05:34:52Z

Thanks Toby

AngelFelizR added 8 commits June 6, 2024 07:26

adding joining vignette

26dc8ce

adding links to the datatable-intro

7711ed0

adding vignette links to datatable-keys-fast-subset

728dc5a

adding vignette links to datatable-reference-semantics

464c38d

adding vignette links to datatable-sd-usage

96f2b77

adding vignette links to datatable-secondary-indices-and-auto-indexing

0a02a19

starting non-equi join example

8b265d2

join vignette complete

87c4200

AngelFelizR requested a review from MichaelChirico as a code owner September 6, 2024 05:46

Merge branch 'master' into join_vignette

ac49cd4

tdhock reviewed Sep 6, 2024

tdhock requested changes Sep 6, 2024

AngelFelizR added 2 commits September 6, 2024 17:16

making changes due Toby review

f943398

Merge remote-tracking branch 'origin/join_vignette' into join_vignette

72fbf84

rikivillalba and others added 2 commits September 7, 2024 11:12

Tweaks on datatable-joins.Rmd

3abc491

Some tweaks. Related: Rdatatable#6478 Rdatatable#2181

solving lint-r problems

6c9e14a

changing ouputformat

63f1db4

adding Angel Feliz as contributor

cf22514
