Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib-parquet: how to convert numeric dates #1093

Closed
sergio-paternoster opened this issue Jun 9, 2024 · 6 comments
Closed

lib-parquet: how to convert numeric dates #1093

sergio-paternoster opened this issue Jun 9, 2024 · 6 comments
Assignees
Labels
Developer Experience Resolving this issue should improve development experience for the library users. Help Request lib-parquet

Comments

@sergio-paternoster
Copy link

Not really an issue but more a simple request for clarifications.
I have a parquet file that contains several date columns (in Y-m-d format). When I loop through the rows, I see these dates coming up as integer values (for example, 16227 instead of 2014-06-06).
Is there a way to convert them automatically? In case not, how can I correctly visualize the dates when I read the file? I tried the following but it looks not efficient at all (especially for files with millions of records). Thank you in advance!

$reader          = new Reader();
$file            = $reader->read("0000_part_00.parquet");
$schema         = $file->schema();

$column_metadata = [];
foreach ($schema->columnsFlat() as $column)
{
    $column_name                                            = $column->flatPath();
    $column_metadata[$column_name]['is_date']               = $column->convertedType()->value === ConvertedType::DATE->value;
}

// convert numeric date to formatted date
function convert_date(int $numeric_date, $format = 'Y-m-d') : string
{
    $timestamp  = $numeric_date * 86400;
    $date       = new DateTime("@$timestamp");
    return     $date->format($format);
}


// loop through all rows in the parquet file
foreach ($file->values() as $row)
{
    foreach($row as $column_name => $column_value)
    {
        echo "\n" . $column_name . " -> " . ($column_metadata[$column_name]['is_date'] ? convert_date($column_value) : $column_value);
    }
}

@norberttech
Copy link
Member

hey! DateTimes should be automatically converted by the library to DateTimeImmutable objects, integers are just how parquet stores them.
Let me ask you few questions first:

  • was this file generated by this library or something else (if something else then what was it?)
  • could you add composer require flow-php/parquet-viewer to your project and then use it in a following way:
bin/parquet.php read:metadata path/to_your/file.parquet --columns --page-headers

and past me the output here?

  • could you send me a subset of this file (like literally few rows if possible) so I can look into it?

This should give me a better view of your file structure so I might be able to identify the issue.

@sergio-paternoster
Copy link
Author

I apologize, I had to specify it before. The parquet file comes from Amazon Redshift through the UNLOAD command.
FYI, the mukunku/ParquetViewer sees the dates correctly formatted.

Please, find here attached the whole parquet file. It's very small. I use PHP 8.2.16 (cli) with snappy on Linux Ubuntu Server 22.0.4.
Thank you!

0000_part_00.zip

@norberttech
Copy link
Member

Hey, so it seems that Amazon Redshift is saving datatetimes just as INT32 with converted type Flow\Parquet\ParquetFile\Schema\ConvertedType::DATE.

Support for Converted Type according to parquet format is deprecated and afaik when I was working on this implementation I wasn't sure if this will be needed or not.

Give me a day or two and I will bring a support for converted type date to recognize them as date time objects properly.

Thanks for bringing this up and providing everything I needed to quickly identify the issue, it's a great contribution to this library!

@norberttech norberttech added the Developer Experience Resolving this issue should improve development experience for the library users. label Jun 9, 2024
@norberttech norberttech self-assigned this Jun 9, 2024
@norberttech norberttech moved this from Todo to In Progress in Roadmap Jun 9, 2024
@norberttech
Copy link
Member

It was easier than I expected, once this is merged it should properly return you DateTimeImmutable objects

@norberttech
Copy link
Member

It seems to be working properly now:

<?php

use Flow\Parquet\Reader;

require __DIR__ . '/../../vendor/autoload.php';

$reader = new Reader();

foreach ($reader->read(__DIR__ . '/0000_part_00.parquet')->values() as $row) {
    var_dump(\json_encode($row));die;
    die;
}

output (in json)

{"year":2014,"yearreportdate":{"date":"2014-06-06 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"prioryearreportdate":{"date":"2013-06-08 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"yeardayoffset":4294967294,"ltmstartdate":{"date":"2013-06-09 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"ltmenddate":{"date":"2014-06-06 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalyear":2024,"currentfiscalquarter":2,"currentfiscalquarterlabel":"24 Q2","currentfiscalquarterstartdate":{"date":"2024-03-31 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalquarterenddate":{"date":"2024-06-29 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalmonth":6,"currentfiscalquarterday":69,"currentfiscalquarterweek":10,"currentfiscalweek":23,"currentreportdate":{"date":"2024-06-07 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentdayofyear":159,"yesterdaydate":{"date":"2024-06-07 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"yesterdaydayofyear":159,"priorfiscalyear":2023,"priorfiscalyearquarterlabel":"23 Q2","priorfiscalquarterlabel":"24 Q1","nextfiscalquarterlabel":"24 Q3","cmrogflag":"N","currentforecastscenario":""}"

I'm closing this issue, in case of any problems don't hesitate to reach out again!

@github-project-automation github-project-automation bot moved this from In Progress to Done in Roadmap Jun 9, 2024
@sergio-paternoster
Copy link
Author

It works! You're a fine gentleman, thank you! I'll keep testing Amazon Parquet files and come back in case of other questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Developer Experience Resolving this issue should improve development experience for the library users. Help Request lib-parquet
Projects
Archived in project
Development

No branches or pull requests

2 participants