NDJSON test data doesn't contain variable names #49

Open
nicholas-masel opened this issue Sep 6, 2024 · 3 comments

@nicholas-masel
Collaborator

The NDJSON data doesn't contain variable names in each row, only values.

For example:

With variable names:
{"name": "Leandro","lastName": "Shokida"}
{"name": "Mariano","lastName": "De Achaval"}

Without variable names:
["Leandro", "Shokida"]
["Mariano", "De Achaval"]

From what I can tell we can:

  1. Update the NDJSON test data so that each row contains variable names. We can then stream this directly to a data frame.
  2. If this is intentional, we can read this in as a list of lists, bind the rows and convert to a data frame.
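
Option 2 could look roughly like the sketch below. It uses base R only, and the `rows` and `col_names` objects are hypothetical stand-ins for the parsed NDJSON rows and the variable names, which would have to come from somewhere other than the rows themselves.

```r
# Hypothetical rows, already parsed into a list of unnamed lists
rows <- list(
  list("Leandro", "Shokida"),
  list("Mariano", "De Achaval")
)

# Hypothetical variable names (not present in the row-level data)
col_names <- c("name", "lastName")

# Turn each row into a one-row data frame, then bind the rows together
row_dfs <- lapply(rows, function(r) {
  as.data.frame(stats::setNames(r, col_names), stringsAsFactors = FALSE)
})
df <- do.call(rbind, row_dfs)
```

Binding one-row data frames is simple but can be slow for large files, and a row containing a JSON null would need extra handling before `as.data.frame()`.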
@nicholas-masel
Collaborator Author

@mstackhouse Do you know, or could you check with Sam or Lex, whether the NDJSON test data is valid?

@mstackhouse
Contributor

@nicholas-masel are you talking about the row-level data itself (the data records), or the variable-level metadata? I ask because the non-NDJSON data has no variable names in its rows either:

From here

{
  "datasetJSONCreationDateTime": "2023-06-28T15:38:43",
  "datasetJSONVersion": "1.1.0",
  "fileOID": "www.sponsor.xyz.org.project123.final",
  "dbLastModifiedDateTime": "2023-05-31T00:00:00",
  "originator": "Sponsor XYZ",
  "sourceSystem": {
      "name": "Software ABC",
      "version": "1.0.0"
  },
  "studyOID": "cdisc.com.CDISCPILOT01",
  "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7",
  "metaDataRef": "https://metadata.location.org/CDISCPILOT01/define.xml",
  "itemGroupOID": "IG.DM",
  "isReferenceData": false,
  "records": 18,
  "name": "DM",
  "label": "Demographics",
  "columns": [
      {"itemOID": "ITEMGROUPDATASEQ", "name": "ITEMGROUPDATASEQ", "label": "Record Identifier", "dataType": "integer"},
      {"itemOID": "IT.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1},
      {"itemOID": "IT.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2},
      {"itemOID": "IT.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2},
      {"itemOID": "IT.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}
  ],
  "rows": [
      [1, "CDISCPILOT01", "DM", "CDISC001", 84],
      [2, "CDISCPILOT01", "DM", "CDISC002", 76],
      [3, "CDISCPILOT01", "DM", "CDISC003", 61],
      ...
  ]
}

The only change for NDJSON is that the elements of rows each go on their own line of the file:

{
  "datasetJSONCreationDateTime": "2023-06-28T15:38:43",
  "datasetJSONVersion": "1.1.0",
  "fileOID": "www.sponsor.xyz.org.project123.final",
  "dbLastModifiedDateTime": "2023-05-31T00:00:00",
  "originator": "Sponsor XYZ",
  "sourceSystem": {
      "name": "Software ABC",
      "version": "1.0.0"
  },
  "studyOID": "cdisc.com.CDISCPILOT01",
  "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7",
  "metaDataRef": "https://metadata.location.org/CDISCPILOT01/define.xml",
  "itemGroupOID": "IG.DM",
  "isReferenceData": false,
  "records": 18,
  "name": "DM",
  "label": "Demographics",
  "columns": [
      {"itemOID": "ITEMGROUPDATASEQ", "name": "ITEMGROUPDATASEQ", "label": "Record Identifier", "dataType": "integer"},
      {"itemOID": "IT.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1},
      {"itemOID": "IT.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2},
      {"itemOID": "IT.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2},
      {"itemOID": "IT.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}
  ]
}
[1, "CDISCPILOT01", "DM", "CDISC001", 84]
[2, "CDISCPILOT01", "DM", "CDISC002", 76]
[3, "CDISCPILOT01", "DM", "CDISC003", 61]
...
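
Since the variable names live in the `columns` metadata, a reader can recover them from there and apply them to the positional rows. The sketch below assumes the metadata object sits on the first line of the file and each later line is one row array. It uses `jsonlite` rather than `yyjsonr`, and the `ndjson_lines` input is a trimmed, hypothetical version of the example above.

```r
library(jsonlite)

# Hypothetical NDJSON content: metadata on line 1, one row array per later line
ndjson_lines <- c(
  '{"name": "DM", "columns": [{"name": "ITEMGROUPDATASEQ", "dataType": "integer"}, {"name": "STUDYID", "dataType": "string"}, {"name": "AGE", "dataType": "integer"}]}',
  '[1, "CDISCPILOT01", 84]',
  '[2, "CDISCPILOT01", 76]'
)

# Parse the metadata line and pull the variable names out of "columns"
metadata <- fromJSON(ndjson_lines[1], simplifyVector = FALSE)
col_names <- vapply(metadata$columns, function(col) col$name, character(1))

# Parse every remaining line into an unnamed list of values
rows <- lapply(ndjson_lines[-1], fromJSON, simplifyVector = FALSE)

# Build each column by position, then attach the names from the metadata
# (unlist() drops NULLs, so JSON nulls would need handling first)
columns <- lapply(seq_along(col_names), function(i) {
  unlist(lapply(rows, function(r) r[[i]]))
})
df <- as.data.frame(stats::setNames(columns, col_names), stringsAsFactors = FALSE)
```

Building column by column keeps each variable as a single atomic vector, which avoids binding many one-row data frames.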

@nicholas-masel
Collaborator Author

Yeah, I was talking about variable names on the row-level data. I reached out to Sam and he confirmed they were left out to keep the file size down.

I am trying out reading the rows as a list instead of a df. It seems to work, but it causes some type issues downstream that didn't appear when reading directly into a df.

yyjsonr::read_ndjson_str(
  file,
  type = "list",
  nskip = 1,
  opts = json_opts
)
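
The thread does not say what the downstream type issues were. If they come from columns not matching the `dataType` declared in the metadata, one possible fix is to cast each column explicitly after building the data frame. The sketch below is an assumption, not the package's approach: `df`, `data_types`, and `casters` are hypothetical names, and only the two `dataType` values shown in this thread are covered.

```r
# Hypothetical data frame in which every column came through as character
df <- data.frame(
  ITEMGROUPDATASEQ = c("1", "2"),
  USUBJID = c("CDISC001", "CDISC002"),
  AGE = c("84", "76"),
  stringsAsFactors = FALSE
)

# Declared types, as they would appear in the "columns" metadata
data_types <- c(ITEMGROUPDATASEQ = "integer", USUBJID = "string", AGE = "integer")

# Map each dataType value to a base R coercion function
casters <- list(integer = as.integer, string = as.character)

# Cast every column to its declared type
for (col in names(data_types)) {
  df[[col]] <- casters[[data_types[[col]]]](df[[col]])
}
```

Other `dataType` values would need their own entries in `casters` before this could be used on real data.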
