Datasets are abstracted; you can provide your own storage implementation or use an existing one, e.g. the ApifyClient if you want to run on the Apify platform. So as long as you run things on your own infrastructure and always want to store JSON, it's quite similar. The main difference is locking support, which is implemented when you use the storage methods, but since a dataset is in general append-only, locking does not really matter there. We also wait for async storage methods to finish before resolving the crawler, but as long as you use a sync method for writing the file, it should work fine.
This feels weird, could you debug this a bit, or ideally tell me how to reproduce? Storing 2.5 MB in a dataset definitely won't "spread" to 100 MB this way. Maybe you are using named storages? You need to clean those up yourself; they are persistent.
-
I'm playing around with Crawlee and Dataset and love it!
I'm wondering, what are the benefits of using Dataset vs. just writing to a JSON file? In my example, the Dataset folder is 100 MB, but writing the same information to a JSON file is only 2.5 MB.
When should I choose Dataset, and when should I just write to a JSON file?