Structures cause error #32

Open
SergeNov opened this issue Aug 23, 2016 · 3 comments

Comments

@SergeNov

Hi Joe and others,
I am trying to use your module to read a Parquet file, and I ran into a problem here:
schema.py, line 21:
assert len(self.schema_elements) == len(self.schema_elements_by_name)
This assertion fails, apparently because my schema has multiple fields with the same name. The module works correctly if you comment out this line, though.
Originally these files were used by Hive, and here is the list of fields in the table:

fileid bigint,
version bigint,
ip_geocode struct<countrycode:string,regionname:string,city:string,postalcode:string,metrocode:string,dmacode:string>,
timestamp bigint,
region bigint,
pixel bigint,
uuid bigint,
uuid_exists boolean,
referingurl string,
useragent string,
ip string,
querystring string,
campaignsinfo array<struct<campaign_id:bigint,media_types:array<bigint>,advertiser_id:bigint,funnel_step_id:bigint,funnel_step_value:bigint,track_conversion:boolean>>,
opted_out boolean,
event_id string

Here is the list of fields that the module sees:

name=u'hive_schema', field_id=None, repetition_type=None, type_length=None, precision=None, num_children=17, converted_type=None, type=None
name=u'fileid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'version', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'ip_geocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None
name=u'countrycode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'regionname', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'city', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'postalcode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'metrocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'dmacode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'timestamp', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'region', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'pixel', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'uuid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'uuid_exists', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'referingurl', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'useragent', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'ip', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'querystring', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'campaignsinfo', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None
name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None
name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None
name=u'campaign_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'media_types', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None
name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None
name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'advertiser_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'funnel_step_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'funnel_step_value', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'track_conversion', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'opted_out', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'event_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'dt', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1
name=u'hr', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1

Apparently there are 2 elements named 'array_element' and 2 named 'bag'. I assume these fields just come with the structures.
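
To make the collision concrete, here is a minimal sketch that reproduces the failing comparison on a shortened copy of the dump above and shows that path-qualified names stay unique. It is not the module's code: `SCHEMA` is a hand-copied subset (the root's `num_children` is reduced to match), and `schema_paths` is a hypothetical helper that only relies on the depth-first ordering and the `num_children` values shown in the listing.

```python
from collections import Counter

# (name, num_children) pairs in the flattened, depth-first order of the dump.
# Subset for illustration: the root is given 3 children instead of 17.
SCHEMA = [
    ("hive_schema", 3),
    ("fileid", None),
    ("campaignsinfo", 1),
    ("bag", 1),
    ("array_element", 6),
    ("campaign_id", None),
    ("media_types", 1),
    ("bag", 1),
    ("array_element", None),
    ("advertiser_id", None),
    ("funnel_step_id", None),
    ("funnel_step_value", None),
    ("track_conversion", None),
    ("opted_out", None),
]


def schema_paths(elements):
    """Return the dotted path of every element below the root (hypothetical helper)."""
    paths = []
    stack = []  # each entry: [path_tuple, children_still_expected]
    for index, (name, num_children) in enumerate(elements):
        if index == 0:
            # The root groups everything but contributes no path segment.
            stack.append([(), num_children or 0])
            continue
        # Drop groups whose children have all been consumed.
        while stack and stack[-1][1] == 0:
            stack.pop()
        parent_path = stack[-1][0]
        stack[-1][1] -= 1
        path = parent_path + (name,)
        paths.append(".".join(path))
        if num_children:
            stack.append([path, num_children])
    return paths


names = [name for name, _ in SCHEMA]
by_name = {name: name for name in names}  # keyed by bare name, so duplicates collapse
duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
paths = schema_paths(SCHEMA)

print(len(names), len(by_name))       # the two lengths the assertion compares
print(duplicates)                     # names that occur more than once
print(len(paths), len(set(paths)))    # path-qualified names do not collide
```

In this subset the repeated names are `bag` and `array_element`, once under `campaignsinfo` and once under `media_types`, which is why a lookup keyed on the bare name ends up shorter than the element list.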

@jcrobak
Owner

jcrobak commented Aug 28, 2016

@SergeNov thanks for the report. I'll attempt to reproduce and fix the issue.

jcrobak added a commit that referenced this issue Oct 1, 2016
Rather than flattening schemas and adding a '.' between paths
in the schema (e.g. `foo.bar`), support the schema path as a
first-class object (for schema operations, at least). This
is an experimental implementation and likely has bugs. But it
supports some simple cases.

This implementation changes behavior. Specifically:
 * `DictReader()` now has a `flatten` argument that defaults to
   `False`. If `flatten` is false, DictReader will read nested data
   as `{'foo': {'bar': 1}}` instead of as `{'foo.bar': 1}`.
 * Likewise, this is the new default behavior for the command-line
   tool with `--format json`. This can be changed with `--flatten`.

Known issues:
 * Repetition-levels still aren't supported. A file with arrays
   will break.
 * Nulls aren't interpreted at the correct level (e.g. `{"foo": null}`
   will be interpreted as `{"foo": {"bar": null}}` if `foo` has
   a child of `bar`).

Refs: #32
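
The difference between the two row shapes described in this commit message can be sketched as follows. This is an illustration only, not the code from the commit: `nest` and `flatten` are hypothetical helpers that convert between dotted keys and nested dicts, assuming `.` is the path separator and that field names do not themselves contain a `.`.

```python
def nest(flat_row, sep="."):
    """Turn {'foo.bar': 1} into {'foo': {'bar': 1}} (hypothetical helper)."""
    nested = {}
    for key, value in flat_row.items():
        parts = key.split(sep)
        node = nested
        # Walk or create the intermediate dicts for every parent segment.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


def flatten(nested_row, sep=".", prefix=""):
    """Turn {'foo': {'bar': 1}} into {'foo.bar': 1} (hypothetical helper)."""
    flat = {}
    for key, value in nested_row.items():
        full_key = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, sep, full_key + sep))
        else:
            flat[full_key] = value
    return flat


nested_row = nest({"foo.bar": 1, "id": 7})
flat_row = flatten(nested_row)
null_row = nest({"foo.bar": None})
```

The last line mirrors the second known issue: when only leaf column values are available, a null leaf nests as `{'foo': {'bar': None}}`, and nothing in the flat row says whether `foo` itself was null.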
@jcrobak
Owner

jcrobak commented Oct 1, 2016

@SergeNov I've started to work on support for schemas like these. The first step is in #45, if you want to give it a try. Unfortunately, I don't think your schema is fully supported yet because it includes an array.

@halfak

halfak commented Jan 10, 2022

Still experiencing this issue in version 1.3.1.
