
Dataframe is all NaT and None after loading #127

Open
Sondos-Omar opened this issue Feb 3, 2023 · 6 comments
Sondos-Omar commented Feb 3, 2023

I was trying PyMongoArrow to load a dataset from MongoDB. It loads only the selected columns, which saves space, but the resulting dataframe contains nothing but NaT and None values. Is this a common issue, and how can I fix it?
Thanks in advance

df = collection.find_pandas_all(
    {"prop.Start": {"$gte": start_date, "$lte": end_date}},
    schema=Schema({
        "prop.Start": datetime,
        "prop.Name": str,
        "_id.objectId": str,
    }),
)

@ShaneHarvey (Collaborator) commented:

Hi @Sondos-Omar, could you please provide some sample input and output of the current behavior vs the behavior you would expect?

@Sondos-Omar (Author) commented:

Hi @ShaneHarvey, thank you for your reply; here is a sample of the output. Loading this data using collection.find and appending the results to a pandas dataframe in batches works and loads the expected data (the dates, names, and ids), but it takes a lot of time and space. I double-checked the schema in MongoDB Compass: prop is an object, prop.Start is a datetime, and the rest are strings.
[screenshot: the resulting dataframe contains only NaT and None values]
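For reference, the batch-append workaround described above would look roughly like the sketch below. This is a hypothetical reconstruction, not the poster's actual code; the `flatten_batch` helper and the field names are assumptions based on the schema in the first comment.

```python
# Sketch of the plain-PyMongo workaround: iterate the cursor and flatten
# the nested "prop" sub-document by hand before building a dataframe.
# The helper name and field names are assumptions, not the original code.

def flatten_batch(docs):
    """Turn nested {"prop": {...}} documents into flat rows."""
    return [
        {
            "_id": str(doc.get("_id")),
            "Start": doc.get("prop", {}).get("Start"),
            "Name": doc.get("prop", {}).get("Name"),
        }
        for doc in docs
    ]

# With a live collection this would be used as:
# cursor = collection.find({"prop.Start": {"$gte": start_date, "$lte": end_date}})
# df = pd.DataFrame(flatten_batch(cursor))
```

This round-trips every document through Python objects, which is exactly the time and space overhead that PyMongoArrow is meant to avoid.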

@juliusgeo (Contributor) commented Feb 9, 2023

Hi! Thank you for raising this issue. Unfortunately, this happens because Schemas do not support the "dot" notation used in MongoDB projections. At this time, the best workaround is to flatten the data with an aggregation pipeline before ingesting it into PyMongoArrow. I have opened a ticket for the implementation of a real solution to this issue, and a ticket here for us to update our documentation with the correct workaround.

@juliusgeo (Contributor) commented Feb 9, 2023

An example aggregation pipeline you can use would be:

db.collection.aggregate([
  {
    "$project": {
      "Name": "$prop.Name",
      "Start": "$prop.Start"
    }
  },
])

Which would yield something that looks like this:

[
  {
    "Name": "foo",
    "Start": <datetime>,
    "_id": ObjectId("5a934e000102030405000000")
  }
]
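To stay entirely inside PyMongoArrow, the same flattening can be driven through `aggregate_pandas_all`. The sketch below assumes the field layout from the schema in the first comment; `collection` stands in for a live `pymongo` collection.

```python
from datetime import datetime

# The $project stage flattens the nested "prop" fields into top-level names
# that a PyMongoArrow Schema can describe (Schemas cannot use dot notation).
pipeline = [
    {
        "$project": {
            "Name": "$prop.Name",
            "Start": "$prop.Start",
        }
    }
]

# With a live collection this becomes (pymongoarrow required):
# from pymongoarrow.api import Schema, aggregate_pandas_all
# df = aggregate_pandas_all(
#     collection,
#     pipeline,
#     schema=Schema({"Start": datetime, "Name": str}),
# )
```

Because `_id` is passed through by `$project` unless explicitly excluded, it can still be added to the Schema as an `ObjectIdType()` field if needed.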

@juliusgeo (Contributor) commented Feb 10, 2023

@Sondos-Omar here is a more detailed example:

from datetime import datetime

from bson import datetime_ms
from pymongo import MongoClient
from pymongoarrow.api import Schema, find_pandas_all
from pymongoarrow.types import ObjectIdType

coll = MongoClient(username="user", password="password").db.coll
coll.drop()
start_date = datetime_ms._millis_to_datetime(0, coll.codec_options)
end_date = datetime_ms._millis_to_datetime(10, coll.codec_options)
coll.insert_many([
    {"prop": {"Name": "foo", "Start": start_date}},
    {"prop": {"Name": "foo", "Start": end_date}},
])

# The code below is likely all that you will need; the code above just sets up
# the database so it contains the right kind of data.
df = find_pandas_all(
    coll,
    {"prop.Start": {"$gte": start_date, "$lte": end_date}},
    projection={
        "Name": "$prop.Name",
        "Start": "$prop.Start",
    },
    schema=Schema({"_id": ObjectIdType(), "Start": datetime, "Name": str}),
)
print(df)
>>>
                                             _id                   Start Name
0  b'c\xe6\xd1\xf3\xd3\xea\xac\xf5\x04\xd6\x95c' 1970-01-01 00:00:00.000  foo
1  b'c\xe6\xd1\xf3\xd3\xea\xac\xf5\x04\xd6\x95d' 1970-01-01 00:00:00.010  foo

@juliusgeo (Contributor) commented:
Hi @Sondos-Omar! We have updated our documentation to show more examples of working with nested data in this PR: #130
