Decimals are not supported properly #1602

Open
3 tasks done
ibobak opened this issue Jun 7, 2024 · 1 comment
Labels
bug 🐛 Something isn't working spark ⚡ PySpark features!

Comments

ibobak commented Jun 7, 2024

Current Behaviour

Spark DataFrame schema:

root
 |-- device_id: string (nullable = true)
 |-- device_install_date: date (nullable = true)
 |-- max_device_event_date: date (nullable = true)
 |-- distinct_play_days: long (nullable = true)
 |-- sessions: long (nullable = true)
 |-- playtime_sec_total: decimal(38,6) (nullable = true)
 |-- intersession_sec_sum: decimal(38,6) (nullable = true)
 |-- playtime_sec_per_session: decimal(38,6) (nullable = true)
 |-- playtime_sec_per_playing_day: decimal(38,6) (nullable = true)
 |-- days_since_install: long (nullable = true)
 |-- avg_ses_between_sessions: decimal(38,6) (nullable = true)
 |-- loyalty_index: double (nullable = true)
 |-- install_date: date (nullable = true)

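Code:

from ydata_profiling import ProfileReport

# df_basic_features_3 is the Spark DataFrame with the schema shown above;
# app_code is a string used for the report title and the output file name.
report = ProfileReport(df_basic_features_3, minimal=True, title=app_code)
report.to_file(f"profiling/{app_code}_features_3.html")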

Look at the distribution it produced for playtime_sec_total:
[image]

I then converted this DataFrame to a pandas DataFrame, and here is what the data actually looks like:
[image]
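
For anyone reproducing this comparison: a decimal(38,6) Spark column typically arrives in pandas as Python decimal.Decimal objects (object dtype), so the distribution can be checked independently of the profiler. The following is only a minimal sketch. The helper name summarize and the sample values are made up for illustration and are not the data from this report.

```python
from decimal import Decimal
from statistics import median


def summarize(values):
    """Return count, min, median, max and mean of Decimal values, skipping None (nulls)."""
    present = sorted(v for v in values if v is not None)
    return {
        "count": len(present),
        "min": present[0],
        "median": median(present),
        "max": present[-1],
        "mean": sum(present) / len(present),
    }


# Illustrative values only; not taken from the reported dataset.
sample = [Decimal("12.500000"), None, Decimal("3600.000000"), Decimal("0.250000")]
stats = summarize(sample)
print(stats)
```

Comparing these numbers with the ones shown in the report for the same column would show whether the report's statistics, and not only the plot, are off.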

So the conclusion is this: the product is totally buggy with fields of this type, and I don't trust it any more.

Expected Behaviour

You need to fix the handling of decimal fields.
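
Until that happens, one possible workaround is to cast the decimal columns to double before building the report. This is an untested sketch, not a confirmed fix: whether it changes the histogram shown above has not been verified, and the helper names decimal_columns and cast_decimals_to_double are hypothetical. It assumes the type strings that Spark's DataFrame.dtypes reports, such as "decimal(38,6)".

```python
import re

# Matches Spark type strings such as "decimal(38,6)" as listed by DataFrame.dtypes.
_DECIMAL_TYPE = re.compile(r"^decimal\(\d+,\s*\d+\)$")


def decimal_columns(dtypes):
    """Return the names of columns whose Spark type string is a decimal."""
    return [name for name, type_name in dtypes if _DECIMAL_TYPE.match(type_name)]


def cast_decimals_to_double(df):
    """Return a Spark DataFrame with every decimal column cast to double.

    Casting loses any precision beyond what a 64-bit float can hold.
    """
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    for name in decimal_columns(df.dtypes):
        df = df.withColumn(name, F.col(name).cast("double"))
    return df


# A few columns from the schema above, written the way DataFrame.dtypes lists them.
schema = [
    ("device_id", "string"),
    ("sessions", "bigint"),
    ("playtime_sec_total", "decimal(38,6)"),
    ("loyalty_index", "double"),
]
print(decimal_columns(schema))
```

With a live Spark session, the cast DataFrame would be passed to ProfileReport in place of df_basic_features_3, for example ProfileReport(cast_decimals_to_double(df_basic_features_3), minimal=True, title=app_code).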

Data Description

see above

Code that reproduces the bug

see above

pandas-profiling version

ydata-profiling==4.8.3

Dependencies

a2wsgi==1.10.4
aiohttp==3.9.5
aiosignal==1.3.1
alembic==1.13.1
altair==5.3.0
annotated-types==0.6.0
anyio==4.3.0
apache-airflow==2.7.1
apache-airflow-providers-common-sql==1.13.0
apache-airflow-providers-ftp==3.9.0
apache-airflow-providers-http==4.11.0
apache-airflow-providers-imap==3.6.0
apache-airflow-providers-sqlite==3.8.0
apispec==6.6.1
argcomplete==3.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
arviz==0.16.1
asgiref==3.8.1
asn1crypto==1.5.1
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
Babel==2.15.0
backcall==0.2.0
backoff==2.2.1
bcrypt==4.1.3
beautifulsoup4==4.12.3
bleach==6.1.0
blinker==1.8.2
boto3==1.28.29
botocore==1.31.85
build==1.2.1
cachelib==0.9.0
cachetools==5.3.3
cattrs==23.2.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
chroma-hnswlib==0.7.3
chromadb==0.4.24
click==8.1.7
cloudpickle==2.2.1
colorama==0.4.6
coloredlogs==15.0.1
colorlog==4.8.0
comm==0.2.2
ConfigUpdater==3.2
connexion==3.0.6
cons==0.4.6
contourpy==1.2.1
cron-descriptor==1.4.3
croniter==2.0.5
cryptography==42.0.7
cycler==0.12.1
dacite==1.8.1
databricks-cli==0.18.0
dataclasses-json==0.6.6
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.8
dnspython==2.6.1
docker==6.1.3
docutils==0.21.2
email-validator==1.3.1
entrypoints==0.4
et-xmlfile==1.1.0
etuples==0.3.9
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastjsonschema==2.19.1
fastprogress==1.0.3
filelock==3.14.0
Flask==2.2.5
Flask-AppBuilder==4.3.6
Flask-Babel==2.0.0
Flask-Caching==2.3.0
Flask-JWT-Extended==4.6.0
Flask-Limiter==3.7.0
Flask-Login==0.6.3
Flask-Session==0.8.0
Flask-SQLAlchemy==2.5.1
Flask-WTF==1.2.1
flatbuffers==24.3.25
fonttools==4.51.0
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.5.0
gitdb==4.0.11
GitPython==3.1.43
google-auth==2.29.0
google-re2==1.1.20240501
googleapis-common-protos==1.63.0
graphviz==0.20.3
greenlet==3.0.3
grpcio==1.64.0
gunicorn==20.1.0
h11==0.14.0
h5netcdf==1.3.0
h5py==3.11.0
htmlmin==0.1.12
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.2
humanfriendly==10.0
idna==3.7
ImageHash==4.3.1
importlib-metadata==6.11.0
importlib_resources==6.4.0
inflection==0.5.1
ipykernel==6.19.2
ipynb-py-convert==0.4.6
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==7.6.5
isoduration==20.11.0
itsdangerous==2.2.0
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
json5==0.9.25
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter-contrib-core==0.4.2
jupyter-contrib-nbextensions==0.7.0
jupyter-events==0.10.0
jupyter-highlight-selected-word==0.2.0
jupyter-lsp==2.2.5
jupyter-nbextensions-configurator==0.6.3
jupyter_client==7.4.4
jupyter_core==5.7.2
jupyter_server==2.14.0
jupyter_server_terminals==0.5.3
jupyterlab==4.2.1
jupyterlab-execute-time==3.1.2
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.2
jupyterlab_widgets==3.0.10
kiwisolver==1.4.5
kubernetes==29.0.0
langchain==0.1.13
langchain-community==0.0.38
langchain-core==0.1.52
langchain-text-splitters==0.0.2
langsmith==0.1.67
lazy-object-proxy==1.10.0
lazyprofiler==0.1.1
limits==3.12.0
linkify-it-py==2.0.3
llvmlite==0.42.0
lockfile==0.12.2
logical-unification==0.4.6
lxml==5.2.2
Mako==1.3.5
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.2
marshmallow-oneofschema==3.1.1
marshmallow-sqlalchemy==0.26.1
matplotlib==3.8.4
matplotlib-inline==0.1.7
mdit-py-plugins==0.4.1
mdurl==0.1.2
miniKanren==1.0.3
mistune==3.0.2
mlflow==2.5.0
mmh3==4.1.0
more-itertools==10.2.0
mpmath==1.3.0
msgspec==0.18.6
multidict==6.0.5
multimethod==1.11.2
multipledispatch==1.0.0
mypy-extensions==1.0.0
nbclassic==1.0.0
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
notebook==7.2.0
notebook_shim==0.2.4
numba==0.59.1
numpy==1.23.5
oauthlib==3.2.2
onnx==1.15.0
onnxconverter-common==1.14.0
onnxmltools==1.12.0
onnxruntime==1.17.1
openai==1.22.0
openpyxl==3.1.2
opentelemetry-api==1.24.0
opentelemetry-exporter-otlp==1.24.0
opentelemetry-exporter-otlp-proto-common==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
opentelemetry-instrumentation==0.46b0
opentelemetry-instrumentation-asgi==0.46b0
opentelemetry-instrumentation-fastapi==0.46b0
opentelemetry-proto==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-semantic-conventions==0.45b0
opentelemetry-util-http==0.46b0
optuna==3.5.0
optuna-fast-fanova==0.0.4
ordered-set==4.1.0
orjson==3.10.3
overrides==7.7.0
packaging==23.2
pandas==1.5.3
pandas-datareader==0.10.0
pandasql==0.7.3
pandocfilters==1.5.1
parso==0.8.4
pathspec==0.12.1
patsy==0.5.6
pendulum==2.1.2
pexpect==4.9.0
pgcopy==1.6.0
phik==0.12.4
pickleshare==0.7.5
pillow==10.3.0
platformdirs==4.2.2
plotly==5.22.0
pluggy==1.5.0
posthog==3.5.0
prison==0.2.1
prometheus_client==0.20.0
prompt-toolkit==3.0.43
protobuf==3.20.2
psutil==5.9.8
psycopg2==2.9.9
psycopg2-binary==2.9.7
ptyprocess==0.7.0
pulsar-client==3.5.0
pure-eval==0.2.2
pyarrow==12.0.1
pyasn1_modules==0.4.0
pycountry==23.12.11
pycparser==2.22
pydantic==2.7.0
pydantic_core==2.18.1
pydeck==0.9.1
Pygments==2.18.0
PyJWT==2.8.0
pymc==5.6.0
pyparsing==3.1.2
pypdf==4.1.0
PyPika==0.48.9
pyproject_hooks==1.1.0
pytensor==2.12.3
python-daemon==3.0.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.9
python-nvd3==0.16.0
python-slugify==8.0.4
pytz==2023.4
pytzdata==2020.1
PyWavelets==1.6.0
PyYAML==6.0.1
pyzmq==26.0.3
querystring-parser==1.2.4
redshift-connector==2.0.911
referencing==0.35.1
requests==2.32.3
requests-oauthlib==2.0.0
requests-toolbelt==1.0.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rich-argparse==1.4.0
rpds-py==0.18.1
s3transfer==0.6.2
scikit-learn==1.3.2
scipy==1.12.0
scramp==1.4.5
seaborn==0.12.2
Send2Trash==1.8.3
setproctitle==1.3.3
shap==0.42.1
shellingham==1.5.4
six==1.16.0
skl2onnx==1.16.0
slicer==0.0.7
smart-open==6.3.0
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.5
spark_framework @ git+https://github.com/ibobak/spark_framework.git@8dcf0f5b29e71721d4d6069a76ae4fde1e7e7bde
SQLAlchemy==1.4.49
SQLAlchemy-JSONField==1.0.2
SQLAlchemy-Utils==0.41.2
sqlparse==0.5.0
stack-data==0.6.3
starlette==0.37.2
statsmodels==0.14.2
streamlit==1.32.2
sympy==1.12
tabulate==0.9.0
tenacity==8.0.1
termcolor==2.4.0
terminado==0.18.1
text-unidecode==1.3
threadpoolctl==3.5.0
tinycss2==1.3.0
tokenizers==0.19.1
tomli==2.0.1
toolz==0.12.1
tornado==6.2
tqdm==4.66.2
traitlets==5.9.0
typeguard==4.3.0
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing-inspect==0.9.0
typing_extensions==4.12.0
tzdata==2024.1
uc-micro-py==1.0.3
ujson==5.10.0
unicodecsv==0.14.1
uri-template==1.3.0
urllib3==2.0.7
uvicorn==0.30.0
uvloop==0.19.0
visions==0.7.6
watchdog==4.0.1
watchfiles==0.22.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.8.0
websockets==12.0
Werkzeug==3.0.3
widgetsnbextension==3.5.2
wordcloud==1.9.3
wrapt==1.16.0
WTForms==3.1.2
xarray==2024.3.0
xarray-einstats==0.7.0
xgboost==2.0.2
XlsxWriter==3.2.0
yarl==1.9.4
ydata-profiling==4.8.3
zipp==3.18.2

OS

Ubuntu 22.04

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.

fabclmnt (Contributor) commented Jul 9, 2024

Hi @ibobak,

Thank you for reporting the issue. Regarding ydata-profiling for Spark, to be clear, we have only launched an initial version, which not only covers a small set of functionality but also has some known issues.

We are looking for contributors who are willing to keep evolving the Spark integration, as this was initiated by the community. If you're open to it, feel free to check the issues labelled with the spark tag.

fabclmnt added the bug 🐛 and spark ⚡ labels and removed the needs-triage label on Jul 9, 2024
3 participants