change post date-time timezone #416

Closed
Prabesh01 opened this issue Jul 31, 2021 · 24 comments

Comments

@Prabesh01

Lame question, but how do I change the timezone of the timestamp returned by post["time"]?

If I use it as is, the post time shows tomorrow's date lol

@kevinzg
Owner

kevinzg commented Aug 1, 2021

You can do

import pytz
post['time'] = post['time'].replace(tzinfo=pytz.utc).astimezone(pytz.timezone('America/Lima'))

for example.

But I'm not completely sure that all the dates the scraper extracts are in UTC, so you might want to double-check that.

@neon-ninja
Collaborator

The timestamp is local time, based on the timezone of your system. So, check the timezone set on your system.
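
If post["time"] is a naive datetime in the system's local time, as described above, one way to move it into another timezone is sketched below. This is only an illustration: the helper name and the sample post dict are hypothetical, and a fixed UTC-5 offset stands in for America/Lima (an assumption that holds only while that zone has no daylight saving time). For named zones you would use pytz or zoneinfo instead.

```python
from datetime import datetime, timedelta, timezone


def local_naive_to_zone(naive, target):
    """Treat a naive datetime as system-local time and convert it to `target`."""
    # Calling astimezone() on a naive datetime assumes the system's local timezone.
    aware_local = naive.astimezone()
    return aware_local.astimezone(target)


# Fixed offset standing in for America/Lima (assumption: no DST in that zone).
LIMA = timezone(timedelta(hours=-5), "UTC-5")

# Hypothetical stand-in for a scraped post; real posts come from get_posts().
post = {"time": datetime(2021, 7, 31, 23, 30)}

converted = local_naive_to_zone(post["time"], LIMA)
print(converted.isoformat())  # wall-clock value depends on the system timezone
```

The conversion changes only how the instant is displayed; the underlying moment in time stays the same.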

@Prabesh01
Author

Thank you :)

@TowardMyth

@neon-ninja Sorry to reopen a closed issue. When pulling posts, is there any way for Facebook Scraper to get the post timestamp's timezone when scraping FB's site?

Issue: I'm assuming that the facebook scraper's timestamps are pulled directly from the FB post, and the timestamps on the FB post are based on the user's IP address. If this is true, then using proxies would mean that the timestamps may not match your system's timezone.

@neon-ninja
Collaborator

neon-ninja commented Aug 10, 2021

@TowardMyth depending on how you're scraping, you might get the post time in a slightly different way. Unauthenticated requests can actually receive a UNIX timestamp. In that case, a debug log message "Got exact timestamp from publish_time" would be printed. See #273 for example. The UNIX timestamp is defined as the number of seconds since 00:00:00 UTC on 1 January 1970. So it is the same regardless of timezone. The timestamp served by https://m.facebook.com/story.php?story_fbid=306426417518211&id=100044525645708&_rdr is 1619377200, which equates to Sun Apr 25 2021 19:00:00 GMT+0000. The scraper uses datetime.fromtimestamp(timestamp) which will convert this to local time.
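
To make the difference concrete, here is a small standard-library sketch using the timestamp quoted above. It shows the same UNIX timestamp rendered as an aware UTC datetime and as the naive local datetime that datetime.fromtimestamp(timestamp) produces. The variable names are illustrative only.

```python
from datetime import datetime, timezone

timestamp = 1619377200  # the value quoted in the comment above

# Aware datetime pinned to UTC: the same on every machine.
utc_time = datetime.fromtimestamp(timestamp, tz=timezone.utc)

# Naive datetime in the system's local timezone: differs between machines.
local_naive = datetime.fromtimestamp(timestamp)

print(utc_time.isoformat())     # 2021-04-25T19:00:00+00:00
print(local_naive.isoformat())  # depends on the system timezone
```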

@TowardMyth

@neon-ninja Do authenticated requests also receive a unix timestamp? If not, does it at least receive a timezone-aware datetime from FB (or just a naive non-timezone aware one)?

@neon-ninja
Collaborator

neon-ninja commented Aug 10, 2021

Sometimes. I tested by adding a field to indicate whether a timestamp was exact or not, and the following code:

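from facebook_scraper import get_posts

# time_exact is a local, one-off test field, not part of the released library
for post in get_posts("Nintendo", cookies="cookies.txt", options={"allow_extra_requests": False}):
    print(post["post_id"], post["time"], post.get("time_exact"))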

And these were the results (GMT+12):

4377850938965992 2021-08-07 10:09:16 True
4377380082346411 2021-08-07 06:12:37 True
4377121662372253 2021-08-07 04:23:11 True
4371727619578324 2021-08-05 09:33:45 True
4365624383521981 2021-08-03 08:09:04 True
4365050470246039 2021-08-03 04:00:19 True
4329977513753335 2021-07-21 15:51:00 None
4314796888604731 2021-07-16 10:00:00 None
4314503055300781 2021-07-16 07:54:00 None
4311927432225010 2021-07-15 11:00:00 None
4306850432732710 2021-07-15 05:00:05 True
4294380853979668 2021-07-10 04:37:24 True
4291314330952987 2021-07-09 04:27:44 True
4290937374324016 2021-07-09 02:02:31 True
4288189624598791 2021-07-08 03:30:03 True
4284852408265846 2021-07-07 01:09:41 True
4279695485448205 2021-07-05 05:00:01 True
4274331009317986 2021-07-03 10:00:03 True
4273772932707127 2021-07-02 08:34:00 None
4270449279706159 2021-07-01 06:08:00 None
4268327196585034 2021-06-30 11:30:00 None
4268132489937838 2021-06-30 10:03:00 None
4253713764713044 2021-06-25 09:11:00 None
4244168652334222 2021-06-22 04:01:00 None
4239326416151779 2021-06-20 09:00:00 None
4230186287065792 2021-06-17 01:20:00 None
4226613690756385 2021-06-15 18:00:00 None
4226242920793462 2021-06-15 14:25:00 None
4225498680867886 2021-06-15 08:30:00 None
4222851827799238 2021-06-14 09:09:00 None
4217919734959114 2021-06-13 09:37:33 True
4217774244973663 2021-06-13 08:17:58 True
4214518608632560 2021-06-12 05:11:49 True
4214033132014441 2021-06-12 02:00:01 True
4194934713924283 2021-06-04 17:33:00 None
4193752747375813 2021-06-04 08:11:00 None
4191173794300375 2021-06-03 11:00:00 None
4188174317933656 2021-06-02 12:01:00 None

In the absence of a UNIX timestamp, the scraper tries to make a guess by parsing text similar to the examples given in https://github.com/kevinzg/facebook-scraper/blob/master/tests/test_parse_date.py. Note the lack of seconds in that case.

Tweak the code a bit to:

for post in get_posts("Nintendo", pages=2, cookies="cookies.txt", options={"allow_extra_requests": False, "posts_per_page": 50}):
    print(post["post_id"], post["time"], post.get("time_exact"))

and then you get:

4377850938965992 2021-08-07 10:09:16 True
4377380082346411 2021-08-07 06:12:37 True
4377121662372253 2021-08-07 04:23:11 True
4371727619578324 2021-08-05 09:33:45 True
4365624383521981 2021-08-03 08:09:04 True
4365050470246039 2021-08-03 04:00:19 True
4329977513753335 2021-07-22 10:51:17 True
4314796888604731 2021-07-17 05:00:02 True
4314503055300781 2021-07-17 02:54:30 True
4311927432225010 2021-07-16 06:00:03 True
4306850432732710 2021-07-15 05:00:05 True
4294380853979668 2021-07-10 04:37:24 True
4291314330952987 2021-07-09 04:27:44 True
4290937374324016 2021-07-09 02:02:31 True
4288189624598791 2021-07-08 03:30:03 True
4284852408265846 2021-07-07 01:09:41 True
4279695485448205 2021-07-05 05:00:01 True
4274331009317986 2021-07-03 10:00:03 True
4273772932707127 2021-07-03 03:34:57 True
4270449279706159 2021-07-02 01:08:47 True
4268327196585034 2021-07-01 06:30:03 True
4268132489937838 2021-07-01 05:03:44 True
4253713764713044 2021-06-26 04:11:55 True
4244168652334222 2021-06-22 23:01:20 True
4239326416151779 2021-06-21 04:00:20 True
4230186287065792 2021-06-17 20:20:18 True
4226613690756385 2021-06-16 13:00:00 True
4226242920793462 2021-06-16 09:25:25 True
4225498680867886 2021-06-16 03:30:00 True
4222851827799238 2021-06-15 04:09:22 True
4217919734959114 2021-06-13 09:37:33 True
4217774244973663 2021-06-13 08:17:58 True
4214518608632560 2021-06-12 05:11:49 True
4214033132014441 2021-06-12 02:00:01 True
4194934713924283 2021-06-05 12:33:32 True
4193752747375813 2021-06-05 03:11:22 True
4191173794300375 2021-06-04 06:00:35 True
4188174317933656 2021-06-03 07:01:46 True
4188053201279101 2021-06-03 06:00:31 True
4187481031336318 2021-06-03 02:00:01 True
4168993589851729 2021-05-28 01:05:00 True
4151037621647326 2021-05-22 05:29:47 True
4148574141893674 2021-05-21 10:31:31 True
4144965658921189 2021-05-20 04:30:08 True
4141878672563221 2021-05-19 05:00:02 True
4128969113854177 2021-05-15 10:00:01 True
4126036390814116 2021-05-14 08:00:00 True
4112996902118065 2021-05-10 05:00:01 True
4106669112750844 2021-05-08 10:00:38 True
4088623351222087 2021-05-02 05:04:28 True
4085738151510607 2021-05-01 05:20:30 True
4085430424874713 2021-05-01 03:37:20 True

So maybe it's determined by whether the page includes a certain type of post

@TowardMyth

TowardMyth commented Aug 10, 2021

@neon-ninja thanks for helping to debug. I've spent the past few hours playing around with this too, and like you, I found that sometimes, FB will return a unix timestamp whereas other times, it won't.

  1. Is there any way to force the scraper to get a unix timestamp for every post, retrying a post until it gets one?

  2. Do you have any further insight around when unix timestamps will be returned by the scraper, and when they won't? (From my experimentation, it seems like this is random, but you may know more).

@neon-ninja
Collaborator

In my test above, increasing the posts_per_page improved the chance of getting unix timestamps. Do you not get the same? Which page/profile/group doesn't serve unix timestamps for you?

@TowardMyth

TowardMyth commented Aug 11, 2021

Here's my command:

for post in get_posts("nintendo", pages=2, options={"allow_extra_requests": False, "posts_per_page": 10}):
    print(post["post_id"], post["time"], post.get("time_exact"))

Results:

sys:1: UserWarning: A low page limit (<=2) might return no results, try increasing the limit
4388957874521965 2021-08-10 09:00:01 None
4377850938965992 2021-08-06 18:09:16 None
4377380082346411 2021-08-06 14:12:37 None
4377121662372253 2021-08-06 12:23:11 None
4371727619578324 2021-08-04 17:33:45 None
4365624383521981 2021-08-02 16:09:04 None
4365050470246039 2021-08-02 12:00:19 None
4329977513753335 2021-07-21 18:51:17 None
4314796888604731 2021-07-16 13:00:02 None
4314503055300781 2021-07-16 10:54:30 None
4311927432225010 2021-07-15 14:00:03 None
4306850432732710 2021-07-14 13:00:05 None

Interestingly, even when unix timestamp is not being returned, I get the seconds.

I've tried different combinations: with and without cookies; setting posts_per_page to 10/50/other numbers; pages=2/50/100. All to no avail: none of the timestamps are unix timestamps.

@neon-ninja
Collaborator

time_exact isn't part of the library, it was a one-off modification I did locally to test with. If you're getting seconds, you're getting UNIX timestamps.

@TowardMyth

TowardMyth commented Aug 11, 2021

Further to my above comment: I am getting an error running the code I pasted above. Not sure if related?

sys:1: UserWarning: A low page limit (<=2) might return no results, try increasing the limit
4388957874521965 2021-08-10 09:00:01 None
4377850938965992 2021-08-06 18:09:16 None
/home/xxx/.local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py:440: UserWarning: Facebook served mbasic/noscript content unexpectedly on https://m.facebook.com/page_content_list_view/more/?page_id=119240841493711&start_cursor=%7B%22timeline_cursor%22:%22AQHRnfQjPoQy0di1sKI1zb49Fl2ip0D6sejyMvw4tMHgKUbDTIojIHkctHtInzTppBj46e3Sf0miBLIpc0p3oS368acKad9gdeTyj8r8jYioN0czrqwnrJ7zQXKR8T8LfMrW%22,%22timeline_section_cursor%22:null,%22has_next_page%22:true%7D&num_to_fetch=10&surface_type=posts_tab
  warnings.warn(
4377380082346411 2021-08-06 14:12:37 None
4377121662372253 2021-08-06 12:23:11 None
4371727619578324 2021-08-04 17:33:45 None
4365624383521981 2021-08-02 16:09:04 None
4365050470246039 2021-08-02 12:00:19 None
4329977513753335 2021-07-21 18:51:17 None
4314796888604731 2021-07-16 13:00:02 None
4314503055300781 2021-07-16 10:54:30 None
4311927432225010 2021-07-15 14:00:03 None
4306850432732710 2021-07-14 13:00:05 None

@neon-ninja
Collaborator

That's just a warning, not an error. Probably safe to ignore.

@TowardMyth

TowardMyth commented Aug 11, 2021

Thanks.

post["time"] returns a human-readable timestamp / a datetime.datetime object, which I think is not timezone aware. Is there a way to get the actual unix timestamp (i.e. 1628651893) so I can manipulate it easier?

@neon-ninja
Collaborator

Sure, d4be429 should do it
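
On a version without that commit, a rough equivalent can be recovered from post["time"] itself, provided it came from a UNIX timestamp rather than from parsed text. A minimal sketch, assuming post["time"] is a naive datetime in the system's local time (the helper name is hypothetical):

```python
from datetime import datetime


def to_unix_seconds(naive_local):
    """Convert a naive, system-local datetime back to whole UNIX seconds."""
    # datetime.timestamp() interprets a naive datetime as local time,
    # which makes it the inverse of datetime.fromtimestamp().
    return int(naive_local.timestamp())


# Simulate what the scraper would hold for the example value in the question.
example_time = datetime.fromtimestamp(1628651893)
print(to_unix_seconds(example_time))  # 1628651893
```

This only gives a trustworthy value when the time was exact to begin with. A time guessed from text would still convert, but to an approximate instant.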

@TowardMyth

Many many many thanks! One last question if you don't mind (I'm still a beginner): is this change automatically released to the pip package (i.e. can I just do pip install facebook-scraper --upgrade to get it)?

@neon-ninja
Collaborator

No. You'd need to install it from github. Like so: pip install git+https://github.com/kevinzg/facebook-scraper.git

@TowardMyth

Got it to work with your pip install git command. Thank you!

@TowardMyth

Will post["timestamp"] return anything if the scraper could not find a unix timestamp? As well, is there some way to easily check whether a unix timestamp was returned for a post?

@neon-ninja
Collaborator

No, it would be None in that case. You can check if it is None.
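
A small sketch of that check, building a timezone-aware UTC datetime only when a UNIX timestamp is present. The post dicts here are made-up stand-ins rather than real scraper output, and the helper name is hypothetical:

```python
from datetime import datetime, timezone


def exact_time_utc(post):
    """Return an aware UTC datetime, or None if no UNIX timestamp was found."""
    ts = post.get("timestamp")
    if ts is None:
        return None
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Hypothetical sample data standing in for posts yielded by get_posts().
posts = [
    {"post_id": "a", "timestamp": 1619377200},
    {"post_id": "b", "timestamp": None},
]

for post in posts:
    print(post["post_id"], exact_time_utc(post))
```

Posts for which this returns None only have the less precise post["time"] guessed from text.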

@TowardMyth

One more q: is there any difference in what is returned if you are authenticated vs non-authenticated? I know that if you are non-authenticated you get the unix timestamps, but if you are authenticated you are less likely to get them. If I only scrape without authentication going forward, is there any content/values/etc I would miss?

@neon-ninja
Collaborator

You're more likely to run into a LoginRequired exception if you scrape unauthenticated

@TowardMyth

If I use authenticated: is there any chance that any personal data related to the authenticated cookie/FB user will be printed onto some output (ex: maybe the userid of that FB user will be in the posts array somewhere?)

@neon-ninja
Collaborator

Probably not
