Releases: mjakubowski84/parquet4s

v1.9.0

26 Apr 11:54

This version contains many changes. It lays the foundation for version 2.x, which will bring many changes both internally and in the API.

Parquet schema migrated to version 1.11+

  • Parquet4S no longer uses deprecated schema types internally.
  • Defining custom types with the old schema types is deprecated in favour of the new ones.
  • A simplified API for defining custom types is introduced - check the README and examples.
  • The deprecated API was slightly modified but should remain backwards compatible. Users who rely on it very extensively may encounter minor compilation issues after upgrading.
  • Please migrate to the new API, as the deprecated one will be removed in version 2.x.

Revamped Filtering

  • Filter by UDP - a user-defined predicate - write your own filter predicate!
  • The internal API was rewritten and simplified.
  • The old API was mostly private, so no backward incompatibility is expected for most users.
  • The new Filter API is public, so you can now define filters for your own custom types, too. Just implement FilterCodec[T] for your type T.
  • Filtering by Array[Byte] is now allowed.
  • The vararg version of the in predicate now requires at least one argument, which enforces proper usage at compilation time. A usage sketch follows this list.
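
A minimal sketch of reading with filters in the core module; the case class and file name are illustrative, and the full operator set (including UDPs and Array[Byte]) is documented in the README:

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader}

case class User(id: Long, email: String)

// Filters are pushed down to Parquet. `in` now demands at least one
// argument at compile time; UDP and Array[Byte] filters follow the same
// Col-based syntax (see the README).
val users = ParquetReader.read[User](
  "users.parquet",
  filter = Col("id").in(1L, 2L, 3L) && Col("email") !== "spam@example.com"
)
try users.foreach(println)
finally users.close()
```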

Bugfixes

  • FS2's viaParquet processed partitionBy parameters improperly; in effect, one could not partition by a non-root field. This is now fixed.

v1.8.3

17 Apr 13:27

Multiple small improvements added to the code:

  • Fixed various compilation warnings
  • Fixed several Scaladoc links
  • Reorganised build.sbt
  • CI/CD improvements
  • Created basic benchmarks for testing future code changes
  • Started using Blocker when creating the writer in FS2's single-file write.

v1.8.2

16 Apr 06:09

A critical bug introduced in version 1.8.0 made the core module usable with nothing but the local file system. This bugfix release addresses it.

It was a simple mistake caused by improper usage of the hadoop-client API: in effect, the file:// scheme was enforced when listing files on a path while preparing the Stats component. Moreover, in order to delay this premature file listing, initialisation of the Stats component is now lazy.

v1.8.1

14 Apr 19:10

This release contains a fatal bug in the core module. Please use version 1.8.2 or higher.

@malonsocasas reported and fixed an old issue with reading one of the many legacy list formats. The fix ships with this release.

v1.8.0

12 Apr 15:52

Release 1.8.0 introduces new functionalities and improvements in the core library. Besides that, each module receives multiple upgrades of Scala, Parquet and other dependencies.

This release contains a fatal bug in the core module. Please use version 1.8.2 or higher.

New features

  • From now on, calling size on ParquetIterable does not iterate over all records; instead, Parquet4S tries to leverage the file's metadata and statistics. This is especially fast for unfiltered files, but it is also quite fast when reading with a filter, as Parquet4S omits row groups which, according to their statistics, are known not to contain matching values.
  • ParquetIterable also receives min and max functions that provide the smallest and the greatest value of a chosen column. Similarly to the new implementation of size, Parquet4S leverages file metadata, and it works for both filtered and unfiltered files.
  • You can also access the aforementioned functions by calling Stats directly. A usage sketch follows this list.
  • Added custom errors for unresolved implicits, giving better feedback on how to use Parquet4S with custom types.
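
A hedged sketch of the metadata-backed statistics; the exact signatures of min and max below are assumptions for illustration, so consult the README:

```scala
import com.github.mjakubowski84.parquet4s.ParquetReader

case class User(id: Long, email: String)

val users = ParquetReader.read[User]("users.parquet")
try {
  // resolved from row-group metadata when possible - no full scan
  val count = users.size
  // smallest and greatest value of a chosen column; these signatures are
  // assumptions for illustration - Stats can also be called directly
  val lowestId  = users.min[Long]("id")
  val highestId = users.max[Long]("id")
  println(s"$count records, ids from $lowestId to $highestId")
} finally users.close()
```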

Upgrades

  • Scala 2.12 is upgraded to 2.12.13 and Scala 2.13 to 2.13.5
  • Parquet is upgraded to 1.12.0. Please note a change that is not breaking for interoperability with older versions of Parquet4S and Spark, but might (however, shouldn't) be breaking for other systems: from now on, Map is saved internally using the key_value field instead of map.
  • FS2 upgraded to 2.5.4
  • Shapeless upgraded to 2.3.4

v1.7.0

29 Nov 11:41

This is the next maintenance release; it improves the stability and functionality of the integrations with Akka Streams and FS2.

Akka Streams:

  • Thanks to @dkwi, viaParquet receives new functionality: withPostWriteHandler allows you to monitor, flush files, or take any other action based on the current state of the ParquetWriter.
  • Further fixes to resource cleanup in viaParquet. From now on, writers are properly closed also on internal and downstream errors.

FS2:

  • viaParquet receives a PostWriteHandler similar to the one in Akka Streams.
  • Redundant synchronisation in viaParquet is removed for better performance.

v1.6.0

13 Oct 16:47

Release 1.6.0 brings an important feature of Parquet that was missing so far: the ability to read a subset of columns from Parquet files. This is called schema projection, and it is now available in every module of Parquet4S. Check the updated README for more.

The new feature required a redesign of the API. The core library just got a new function pointing to the new reader, but in the Akka and FS2 modules new reader builders are introduced, and those deprecate the old readers.
Moreover, the recently introduced FS2 module received several API fixes that may be breaking for some. Unfortunately, those were required for consistency.

Full list of changes:

  • core:
    • ParquetReader.withProjection[YourSchema] points to a reader with schema projection applied (a sketch follows the list)
  • akka:
    • ParquetStreams.fromParquet[YourSchema](path, options, filter) is deprecated in favour of a builder with the same name: ParquetStreams.fromParquet[YourSchema]
  • fs2:
    • the parquet.read function is deprecated in favour of the parquet.fromParquet builder
    • the Builder trait used in the API of parquet.viaParquet is moved to the rotatingWriter package
    • withPreWriteTransformation in parquet.viaParquet is replaced by preWriteTransformation for consistency
    • the redundant dependency of parquet.viaParquet on an implicit Sync[F] is removed, as it already depends on Concurrent[F]
    • parquet.writeSingleFile now returns Stream[F, fs2.INothing] instead of Stream[F, Unit] in order to emphasise that it doesn't emit anything
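
To illustrate the headline feature, a minimal core-module sketch of schema projection; the case classes and file name are illustrative, and it assumes the projecting reader exposes the same read method as the regular one:

```scala
import com.github.mjakubowski84.parquet4s.ParquetReader

// Full schema of the file on disk...
case class User(id: Long, email: String, address: String)

// ...and the subset of columns we actually want to read.
case class UserIdAndEmail(id: Long, email: String)

// Only the projected columns are read from the file.
val users = ParquetReader.withProjection[UserIdAndEmail].read("users.parquet")
try users.foreach(println)
finally users.close()
```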

v1.5.1

10 Oct 17:01

This release contains a bug fix for the Akka module. toParquetSingleFile and viaParquet now close the underlying file writers in case of stream failure. Thanks to that, all writes executed so far are flushed, and the probability of data loss is minimised.

v1.5.0

27 Sep 18:17

Release 1.5.0 introduces an integration of Parquet4S with FS2. It provides functionality similar to the integration with Akka Streams:

  • read a file, a directory or a partitioned directory, with an optional filter (a reading sketch follows the list)
  • write a single file
  • write an indefinite stream, optionally partitioned
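
A hedged sketch of reading with the new FS2 module (cats-effect 2 era, hence the Blocker); the parameter order of parquet.read is an assumption here, so see the examples shipped with the fs2 module for the exact API:

```scala
import cats.effect.{Blocker, ExitCode, IO, IOApp}
import com.github.mjakubowski84.parquet4s.parquet
import fs2.Stream

case class User(id: Long, email: String)

// Streams all records of users.parquet and prints them.
object ReadUsersApp extends IOApp {
  def run(args: List[String]): IO[ExitCode] =
    Stream
      .resource(Blocker[IO])
      .flatMap(blocker => parquet.read[IO, User](blocker, "users.parquet"))
      .evalMap(user => IO(println(user)))
      .compile
      .drain
      .as(ExitCode.Success)
}
```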

Additionally:

  • Scala 2.12 is upgraded to 2.12.12
  • Scala-collection-compat is upgraded to 2.2.0

v1.4.0

14 Jul 19:05

Two main features come with this release:

  • Thanks to @mac01021, schema names are now by default determined from the canonical class name. In the case of generic records, the schema name comes from the provided original schema. Thanks to that, schemas are more descriptive, and files created with Parquet4S are more compliant with Avro readers. The Parquet4S signature is now written into the file metadata (instead of into the schema name, as before).
  • The viaParquet Akka flow now gets the ability to write generic RowParquetRecords, like the other writers.