Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parent array dropped when child array is empty #117

Open
msinjin opened this issue Dec 15, 2019 · 5 comments
Open

Parent array dropped when child array is empty #117

msinjin opened this issue Dec 15, 2019 · 5 comments

Comments

@msinjin
Copy link

msinjin commented Dec 15, 2019

[cross posted on stack overflow, getting no love]

It seems as though tidyjson uses an inner-join-like behaviour on nested arrays, thus dropping records with empty child arrays. Is there a way to get left-join-like behaviour instead, filling with NAs?

For example, these fake data have one record with a populated nested array (middles) and two records where middles is empty:

library(tidyjson)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

From these data I wish to have a dataframe with all the parent records retained and missing child-array information filled with NAs (~ left join) as such:

# A tibble: 3 x 5
    age first last  middle1 middle2
  <dbl> <chr> <chr> <chr>   <chr>  
1    32 Bob   Smith John    Rick   
2    54 Susan Doe   NA      NA
3    18 Ann   Jones NA      NA

However, extracting a nested array with some empty child arrays causes loss of their parent's information (~ inner join):

people %>% 
    spread_all() %>%  
    enter_object("name") %>% gather_array() %>% 
    spread_all() %>% select(-document.id,-array.index) %>% 
    enter_object("middles") %>% gather_array %>% 
    spread_all()  %>% select(-array.index) %>% 
    tbl_df()
# A tibble: 1 x 5
    age first last  middle1 middle2
  <dbl> <chr> <chr> <chr>   <chr>  
1    32 Bob   Smith John    Rick   

Is there a way to avoid this; i.e., to retain all the rows even when child arrays are empty?

@colearendt
Copy link
Owner

Thanks for reporting, and apologies for missing the SO post 😄

I think this is related to #95 and possibly others. enter_object() is the problem here, as it discards records that do not have that object (And in this case, empty is treated as "does not have").

There is a short-term solution in splitting up the pipeline and doing the left_join() yourself. Not ideal I don't think, but functional for now until we have a bit more time to think on this one 😄

library(dplyr)
library(tidyjson)


people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

# does not work as we want
people %>% 
  spread_all() %>%  
  enter_object("name") %>% gather_array() %>% 
  spread_all() %>% select(-document.id,-array.index) %>% 
  enter_object("middles") %>% gather_array %>% 
  spread_all()  %>% select(-array.index) %>% 
  tbl_df()
#> # A tibble: 1 x 5
#>     age first last  middle1 middle2
#>   <dbl> <chr> <chr> <chr>   <chr>  
#> 1    32 Bob   Smith John    Rick

# prepare the JSON
prep <- people %>% 
  spread_all() %>% 
  enter_object("name") %>% 
  gather_array() %>% 
  spread_all() %>% 
  select(-array.index)

# extract the sub-object
middles <- prep %>% 
  enter_object("middles") %>%
  gather_array() %>% 
  spread_all() %>% 
  select("document.id", starts_with("middle")) %>%
  as_tibble()

# join together
prep %>% 
  as_tibble() %>%
  left_join(
    middles, by = c("document.id")
    )
#> # A tibble: 3 x 6
#>   document.id   age first last  middle1 middle2
#>         <int> <dbl> <chr> <chr> <chr>   <chr>  
#> 1           1    32 Bob   Smith John    Rick   
#> 2           2    54 Susan Doe   <NA>    <NA>   
#> 3           3    18 Ann   Jones <NA>    <NA>

@colearendt
Copy link
Owner

And another approach that gets close (going tall first and avoiding enter_object()... but we'd really rather have a "list" column that you can then explore, rather than a character).

library(dplyr)
library(tidyr)
library(tidyjson)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')
people %>%
  spread_all() %>%
  enter_object("name") %>%
  gather_array() %>%
  gather_object() %>%
  append_values_string() %>%
  as_tibble() %>%
  spread(name, string)
#> # A tibble: 3 x 6
#>   document.id   age array.index first last  middles                             
#>         <int> <dbl>       <int> <chr> <chr> <chr>                               
#> 1           1    32           1 Bob   Smith "list(middle1 = \"John\", middle2 =…
#> 2           2    54           1 Susan Doe   NA                                  
#> 3           3    18           1 Ann   Jones NA

Created on 2019-12-15 by the reprex package (v0.2.1)

@colearendt
Copy link
Owner

colearendt commented Dec 15, 2019

I had a few minutes, so I tried adding a simple append_values_list() that makes the latter a bit more plausible... enter_object() is a bit more complex to dig into and will need more thought. Column types are still a little weird here with list-cols.

devtools::install_github("colearendt/tidyjson@add_append_values")
library(dplyr)
library(tidyr)
library(purrr)
library(tidyjson)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

structured <- people %>%
  spread_all() %>%
  enter_object("name") %>%
  gather_array() %>%
  gather_object() %>%
  append_values_list() %>% 
  as_tibble()

structured
#> # A tibble: 9 x 5
#>   document.id   age array.index name    list            
#>         <int> <dbl>       <int> <chr>   <list>          
#> 1           1    32           1 first   <chr [1]>       
#> 2           1    32           1 last    <chr [1]>       
#> 3           1    32           1 middles <named list [2]>
#> 4           2    54           1 first   <chr [1]>       
#> 5           2    54           1 last    <chr [1]>       
#> 6           2    54           1 middles <lgl [1]>       
#> 7           3    18           1 first   <chr [1]>       
#> 8           3    18           1 last    <chr [1]>       
#> 9           3    18           1 middles <lgl [1]>

spread_it_out <- structured %>% spread(name, list)
spread_it_out
#> # A tibble: 3 x 6
#>   document.id   age array.index first     last      middles         
#>         <int> <dbl>       <int> <list>    <list>    <list>          
#> 1           1    32           1 <chr [1]> <chr [1]> <named list [2]>
#> 2           2    54           1 <chr [1]> <chr [1]> <lgl [1]>       
#> 3           3    18           1 <chr [1]> <chr [1]> <lgl [1]>

spread_it_out %>% mutate(first = flatten_chr(first), last = flatten_chr(last))
#> # A tibble: 3 x 6
#>   document.id   age array.index first last  middles         
#>         <int> <dbl>       <int> <chr> <chr> <list>          
#> 1           1    32           1 Bob   Smith <named list [2]>
#> 2           2    54           1 Susan Doe   <lgl [1]>       
#> 3           3    18           1 Ann   Jones <lgl [1]>

Created on 2019-12-15 by the reprex package (v0.2.1)

@msinjin
Copy link
Author

msinjin commented Dec 16, 2019

Thanks for reporting, and apologies for missing the SO post 😄

All good. I really wasn't sure if it was a bug or a feature!

I think this is related to #95 and possibly others. enter_object() is the problem here, as it discards records that do not have that object (And in this case, empty is treated as "does not have").

Interesting. I had considered issue #95 while researching a solution, so that makes sense.

There is a short-term solution in splitting up the pipeline and doing the left_join() yourself. Not ideal I don't think, but functional for now until we have a bit more time to think on this one 😄

I had come up with a similar left_join() work around, which I included in the SO post. It gets the job done but doubles the execution time which isn't great because I'm dealing with GBs of JSON.

I'll try out your other ideas and see how it goes.

Thanks for the quick answer!

@colearendt
Copy link
Owner

Another possible pattern I just added to the SO post.

library(tidyjson)
library(dplyr)
library(tidyr)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

as_tbl_json(people) %>% spread_all() %>% enter_object("name") %>% gather_array("nameid") %>%
  gather_object("key") %>% 
  {bind_rows(
    filter(., key != "middles") %>% append_values_string("value"),
    filter(., key == "middles") %>% gather_array("middleid") %>% 
      select(-key) %>%
      gather_object("key") %>%
      append_values_string("value")
  )} %>%
  # drop tbl_json
  as_tibble() %>%
  select(-document.id, -nameid, -middleid) %>%
  # could also use tidyr::pivot_wider
  tidyr::spread(key, value)
#> # A tibble: 3 x 5
#>     age first last  middle1 middle2
#>   <dbl> <chr> <chr> <chr>   <chr>  
#> 1    18 Ann   Jones <NA>    <NA>   
#> 2    32 Bob   Smith John    Rick   
#> 3    54 Susan Doe   <NA>    <NA>

Created on 2020-06-28 by the reprex package (v0.3.0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants