feat: add support for modin #907

Merged
merged 9 commits into from
Feb 6, 2024
Changes from 7 commits
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -22,7 +22,7 @@ jobs:
- name: Install poetry
run: pip install poetry==1.4.2
- name: Install dependencies
run: poetry install --all-extras
run: poetry install --all-extras --with dev
- name: Lint with ruff
run: |
make format_diff
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -22,7 +22,7 @@ We use `poetry` as our package manager. You can install poetry by following the
Please DO NOT use pip or conda to install the dependencies. Instead, use poetry:

```bash
poetry install --all-extras
poetry install --all-extras --with dev
```

### 📌 Pre-commit
37 changes: 37 additions & 0 deletions docs/examples.md
@@ -114,6 +114,43 @@ print(response)

Remember that at the moment, you need to make sure that the Google Sheet is public.

## Working with Modin dataframes

Example of using PandasAI with a Modin DataFrame. In order to use Modin dataframes as a data source, you need to install the `pandasai[modin]` extra dependency.

```console
pip install pandasai[modin]
```

Then, you can use PandasAI with a Modin DataFrame as follows:

```python
import pandasai
from pandasai import SmartDataframe
import modin.pandas as pd
from pandasai.llm import OpenAI

llm = OpenAI(api_token="YOUR_API_TOKEN")

# You can instantiate a SmartDataframe with a Modin DataFrame

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia",
                "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504,
            1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

pandasai.set_pd_engine("modin")
df = SmartDataframe(df, config={"llm": llm})
response = df.chat("Which are the 5 happiest countries?")
print(response)

# you can switch back to pandas using
# pandasai.set_pd_engine("pandas")
```

## Working with Polars dataframes

Example of using PandasAI with a Polars DataFrame (still in beta). In order to use Polars dataframes as a data source, you need to install the `pandasai[polars]` extra dependency.
1 change: 1 addition & 0 deletions docs/getting-started.md
@@ -27,6 +27,7 @@ You can replace `extra-dependency-name` with any of the following:
- `google-aip`: this extra dependency is required if you want to use Google PaLM as a language model.
- `google-sheet`: this extra dependency is required if you want to use Google Sheets as a data source.
- `excel`: this extra dependency is required if you want to use Excel files as a data source.
- `modin`: this extra dependency is required if you want to use Modin dataframes as a data source.
- `polars`: this extra dependency is required if you want to use Polars dataframes as a data source.
- `langchain`: this extra dependency is required if you want to support the LangChain LLMs.
- `numpy`: this extra dependency is required if you want to support numpy.
3 changes: 3 additions & 0 deletions pandasai/__init__.py
@@ -5,6 +5,7 @@
import importlib.metadata

from .agent import Agent
from .engine import set_pd_engine
from .helpers.cache import Cache
from .skills import skill
from .smart_dataframe import SmartDataframe
@@ -25,4 +26,6 @@ def clear_cache(filename: str = None):
"Agent",
"clear_cache",
"skill",
"set_pd_engine",
"pandas",
]
3 changes: 2 additions & 1 deletion pandasai/connectors/airtable.py
@@ -8,9 +8,10 @@
from functools import cache, cached_property
from typing import Optional, Union

import pandas as pd
import requests

import pandasai.pandas as pd

from ..exceptions import InvalidRequestError
from ..helpers.path import find_project_root
from .base import AirtableConnectorConfig, BaseConnector, BaseConnectorConfig
3 changes: 2 additions & 1 deletion pandasai/connectors/snowflake.py
@@ -5,9 +5,10 @@
from functools import cache
from typing import Union

import pandas as pd
from sqlalchemy import create_engine

import pandasai.pandas as pd

from .base import BaseConnectorConfig, SnowFlakeConnectorConfig
from .sql import SQLConnector

2 changes: 1 addition & 1 deletion pandasai/connectors/sql.py
@@ -9,10 +9,10 @@
from functools import cache, cached_property
from typing import Union

import pandas as pd
from sqlalchemy import asc, create_engine, select, text
from sqlalchemy.engine import Connection

import pandasai.pandas as pd
from pandasai.exceptions import MaliciousQueryError

from ..constants import DEFAULT_FILE_PERMISSIONS
2 changes: 1 addition & 1 deletion pandasai/connectors/yahoo_finance.py
@@ -3,7 +3,7 @@
import time
from typing import Optional, Union

import pandas as pd
import pandasai.pandas as pd

from ..constants import DEFAULT_FILE_PERMISSIONS
from ..helpers.path import find_project_root
1 change: 1 addition & 0 deletions pandasai/constants.py
@@ -94,4 +94,5 @@
"base64",
"scipy",
"streamlit",
"modin",
]
24 changes: 24 additions & 0 deletions pandasai/engine.py
@@ -0,0 +1,24 @@
import threading
from importlib import reload

_engine = "pandas"
_lock: threading.RLock = threading.RLock()


def set_pd_engine(engine: str = "pandas"):
    global _engine
    if engine.lower() not in ("modin", "pandas"):
        raise ValueError(
            f"Unknown engine {engine}. Valid options are ('modin', 'pandas')"
        )

    if engine != _engine:
        with _lock:
            _engine = engine
            _reload_pd()


def _reload_pd():
    import pandasai

    reload(pandasai.pandas)
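
As written in this revision, `set_pd_engine` lowercases the name only for the membership check and then stores and compares the string exactly as passed, so a call such as `set_pd_engine("Modin")` would pass validation but store a value that the `_engine == "modin"` test in `pandasai/pandas/__init__.py` does not match. The following is a minimal sketch of one way to normalise first, not the project's implementation: `set_engine`, `get_engine`, and the `on_change` callback (standing in for the module reload) are hypothetical names used only for illustration.

```python
import threading
from typing import Callable, Optional

_VALID_ENGINES = ("modin", "pandas")
_engine = "pandas"
_lock = threading.RLock()


def set_engine(
    engine: str = "pandas",
    on_change: Optional[Callable[[str], None]] = None,
) -> str:
    """Store a normalised engine name; call on_change only when it changes."""
    global _engine
    normalised = engine.lower()
    if normalised not in _VALID_ENGINES:
        raise ValueError(
            f"Unknown engine {engine}. Valid options are {_VALID_ENGINES}"
        )
    # Compare and assign under the lock so concurrent callers agree on the value.
    with _lock:
        if normalised != _engine:
            _engine = normalised
            if on_change is not None:
                on_change(normalised)  # hypothetical hook replacing the reload
    return _engine


def get_engine() -> str:
    return _engine


changes = []
set_engine("Modin", on_change=changes.append)   # stored as "modin"
set_engine("modin", on_change=changes.append)   # unchanged, hook not called
set_engine("pandas", on_change=changes.append)  # switches back
print(changes, get_engine())
```

Moving the comparison inside the lock also avoids two threads both deciding that a change is needed, at the cost of taking the lock on every call.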
2 changes: 1 addition & 1 deletion pandasai/helpers/anonymizer.py
@@ -7,7 +7,7 @@
import re
import string

import pandas as pd
import pandasai.pandas as pd


class Anonymizer:
2 changes: 1 addition & 1 deletion pandasai/helpers/code_manager.py
@@ -7,8 +7,8 @@
from typing import Any, Generator, List, Union

import astor
import pandas as pd

import pandasai.pandas as pd
from pandasai.helpers.path import find_project_root
from pandasai.helpers.skills_manager import SkillsManager
from pandasai.helpers.sql import extract_table_names
3 changes: 2 additions & 1 deletion pandasai/helpers/data_sampler.py
@@ -10,7 +10,8 @@
import random

import numpy as np
import pandas as pd

import pandasai.pandas as pd
Collaborator

So if I understand correctly, to make it work with modin we need to import pandas like this. And if we instead use `import pandas as pd`, it will still work but simply bypass the logic that optionally switches to modin, right?

Collaborator

@mspronesti if one instead writes, for example, `import pandas as pd`, it will still work, right? What was the issue with `import modin.pandas as pd`?

Contributor Author

@mspronesti Feb 6, 2024

@gventuri From the user's perspective, pandasai is still meant to be used with pandas or modin, i.e. one just imports one of the two in their snippet and uses pandasai as before.

Internally, however, this import is needed because `pandasai.pandas` contains the logic that selects the engine, which is triggered whenever `pandasai.set_pd_engine("modin")` or `pandasai.set_pd_engine("pandas")` is called.

To sum up, from the user's perspective:

  • to use pandas: no changes
  • to use modin: call `pandasai.set_pd_engine("modin")` and that's it
  • to use both in the same snippet but with different dataframes:
    ...
    pandasai.set_pd_engine("modin")   # or "pandas"
    # the selected engine will be used
    ...
    pandasai.set_pd_engine("pandas")  # or "modin"
    # now the new engine will be used
    ...

By default the engine is pandas, so if `set_pd_engine` is never called, pandas is used. This keeps pandasai fully backward compatible.
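
The switching behaviour described in this comment rests on re-executing a facade module whose star import depends on a stored engine name. A minimal sketch of that mechanism using only the standard library follows. It is an illustration under stated assumptions, not pandasai code: `math` stands in for pandas, `cmath` for modin, and the module names `demo_engine_state` and `demo_pd_facade` are hypothetical throwaway files written to a temporary directory.

```python
import cmath
import importlib
import math
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-ins: `math` plays the role of pandas, `cmath` of modin.
ENGINE_SOURCE = '_engine = "math"\n'
FACADE_SOURCE = '''\
from demo_engine_state import _engine

if _engine == "cmath":
    from cmath import *
    backend = "cmath"
else:
    from math import *
    backend = "math"
'''

# Write both throwaway modules to a temporary directory and make it importable.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "demo_engine_state.py").write_text(ENGINE_SOURCE)
(module_dir / "demo_pd_facade.py").write_text(FACADE_SOURCE)
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

import demo_engine_state
import demo_pd_facade as pd_like

sqrt_before = pd_like.sqrt        # bound while the "math" engine is active
result_before = pd_like.sqrt(4)   # real-valued square root

# Switch the engine, then re-execute the facade so its star import runs again.
demo_engine_state._engine = "cmath"
importlib.reload(pd_like)

result_after = pd_like.sqrt(4)    # complex-valued square root

print(type(result_before).__name__, type(result_after).__name__)
print(sqrt_before is math.sqrt, pd_like.sqrt is cmath.sqrt)
```

In this sketch, a name captured before the switch (`sqrt_before`) keeps pointing at the old backend, while attribute lookups through the module object see the new one. If `pandasai.pandas` behaves the same way, code that ran `from pandasai.pandas import DataFrame` before a switch would presumably keep the old class, which is one reason to access names through the `pd` alias.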

Collaborator

@mspronesti thanks a lot for clarifying, makes a lot of sense! I'll have a final review later today and merge it! Thanks a lot for the great improvement, super helpful for larger datasets!


from .anonymizer import Anonymizer
from .df_info import DataFrameType, df_type
2 changes: 1 addition & 1 deletion pandasai/helpers/df_config_manager.py
@@ -78,7 +78,7 @@ def _get_import_path(self):

# Save df if pandas or polar
dataframe_type = df_type(self.original_import)
if dataframe_type == "pandas":
if dataframe_type in ("pandas", "modin"):
file_path = self._create_save_path()
self._sdf.dataframe.to_parquet(file_path)
elif dataframe_type == "polars":
42 changes: 38 additions & 4 deletions pandasai/helpers/df_info.py
@@ -2,28 +2,62 @@

import pandas as pd


def _import_modin():
    try:
        import modin.pandas as pd
    except ImportError as e:
        raise ImportError(
            "Could not import modin, please install with " "`pip install modin`."
        ) from e
    return pd


def _import_polars():
    try:
        import polars as pl
    except ImportError as e:
        raise ImportError(
            "Could not import polars, please install with " "`pip install polars`."
        ) from e
    return pl


DataFrameType = Union[pd.DataFrame, str]

polars_imported = False
modin_imported = False
try:
import polars as pl
pl = _import_polars()

polars_imported = True
DataFrameType = Union[pd.DataFrame, pl.DataFrame, str]
DataFrameType = Union[DataFrameType, pl.DataFrame]
except ImportError:
pass

try:
mpd = _import_modin()

modin_imported = True
DataFrameType = Union[DataFrameType, mpd.DataFrame]
except ImportError:
DataFrameType = Union[pd.DataFrame, str]
pass


def df_type(df: DataFrameType) -> Union[str, None]:
"""
Returns the type of the dataframe.

Args:
df (DataFrameType): Pandas or Polars dataframe
df (DataFrameType): Pandas, Modin or Polars dataframe

Returns:
str: Type of the dataframe
"""
if polars_imported and isinstance(df, pl.DataFrame):
return "polars"
elif modin_imported and isinstance(df, mpd.DataFrame):
return "modin"
elif isinstance(df, pd.DataFrame):
return "pandas"
else:
2 changes: 1 addition & 1 deletion pandasai/helpers/df_validator.py
@@ -101,7 +101,7 @@ def _df_to_list_of_dict(self, df: DataFrameType, dataframe_type: str) -> List[Di
Returns:
list of dict of dataframe rows
"""
if dataframe_type == "pandas":
if dataframe_type in ("pandas", "modin"):
return df.to_dict(orient="records")
elif dataframe_type == "polars":
return df.to_dicts()
5 changes: 3 additions & 2 deletions pandasai/helpers/from_google_sheets.py
@@ -1,8 +1,9 @@
import re

import pandas as pd
import requests

import pandasai.pandas as pd


def get_google_sheet(src) -> list:
"""
@@ -49,7 +50,7 @@ def sheet_to_df(sheet) -> list:
"""

# A dataframe starts when a header is found
# A header is a the first instance of a set of contiguous alphanumeric columns
# A header is the first instance of a set of contiguous alphanumeric columns
# A dataframe ends when a blank row is found or an empty column is found

num = 0 # The number of the dataframe
15 changes: 15 additions & 0 deletions pandasai/pandas/__init__.py
@@ -0,0 +1,15 @@
from pandasai.engine import _engine

if _engine == "modin":
    try:
        from modin.pandas import *

        __name__ = "modin.pandas"
    except ImportError as e:
        raise ImportError(
            "Could not import modin. Please install with `pip install modin[ray]`."
        ) from e
else:
    from pandas import *

    __name__ = "pandas"
2 changes: 1 addition & 1 deletion pandasai/prompts/clarification_questions_prompt.py
@@ -18,7 +18,7 @@
import json
from typing import List

import pandas as pd
import pandasai.pandas as pd

from .file_based_prompt import FileBasedPrompt

4 changes: 3 additions & 1 deletion pandasai/prompts/generate_python_code.py
@@ -17,6 +17,8 @@
- return the updated analyze_data function wrapped within ```python ```""" # noqa: E501


import pandasai.pandas as pd

from .file_based_prompt import FileBasedPrompt


@@ -70,7 +72,7 @@ def setup(self, **kwargs) -> None:
self.set_var("prev_conversation", kwargs.pop("prev_conversation", ""))

def on_prompt_generation(self) -> None:
default_import = "import pandas as pd"
default_import = f"import {pd.__name__} as pd"
engine_df_name = "pd.DataFrame"

self.set_var("default_import", default_import)
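
Because `pandasai/pandas/__init__.py` overwrites its own `__name__`, the import line placed in the prompt follows whichever engine is active. A minimal sketch of that idea, using stand-in module objects built with `types.ModuleType` rather than the real packages (the helper `default_import_line` is a hypothetical name):

```python
from types import ModuleType


def default_import_line(module: ModuleType, alias: str = "pd") -> str:
    """Build the import statement shown to the LLM from the module's name."""
    return f"import {module.__name__} as {alias}"


# Stand-ins for the facade after each engine has been selected.
pandas_like = ModuleType("pandas")
modin_like = ModuleType("modin.pandas")

lines = [default_import_line(pandas_like), default_import_line(modin_like)]
print(lines)
```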
2 changes: 1 addition & 1 deletion pandasai/prompts/rephase_query_prompt.py
@@ -10,7 +10,7 @@
"""
from typing import List

import pandas as pd
import pandasai.pandas as pd

from .file_based_prompt import FileBasedPrompt

3 changes: 1 addition & 2 deletions pandasai/responses/streamlit_response.py
@@ -1,7 +1,6 @@
from typing import Any

import pandas as pd

import pandasai.pandas as pd
from pandasai.responses.response_parser import ResponseParser

