Userprofile fixes (taspinar#173)
* move scripts with example code to get twitter user data to examples folder

* user.py:

* user.py: change default values of __init__

* query.py: remove redundant return

* update setup.py, changelog, LICENSE, version number

* README: add section for userprofile scraping info

* main.py: add command line argument for user profile scraping
taspinar authored Jun 15, 2019
1 parent 911f682 commit 0b37442
Showing 11 changed files with 71 additions and 18 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2016 by Ahmet Taspinar ([email protected])
Copyright (c) 2016-2019 by Ahmet Taspinar ([email protected])

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
39 changes: 33 additions & 6 deletions README.rst
@@ -37,9 +37,27 @@ access Tweets written in the **past 7 days**. This is a major bottleneck
for anyone looking for older past data to make a model from. With
TwitterScraper there is no such limitation.

Per Tweet it scrapes the following information: + Username and Full Name
+ Tweet-id + Tweet-url + Tweet text + Tweet html + Tweet timestamp + No. of likes +
No. of replies + No. of retweets
Per Tweet it scrapes the following information:
+ Tweet-id
+ Tweet-url
+ Tweet text
+ Tweet html
+ Tweet timestamp
+ Tweet No. of likes
+ Tweet No. of replies
+ Tweet No. of retweets
+ Username
+ User Full Name
+ User ID
+ Date user joined
+ User location (if filled in)
+ User blog (if filled in)
+ User No. of tweets
+ User No. of following
+ User No. of followers
+ User No. of likes
+ User No. of lists


2. Installation and Usage
=========================
@@ -118,7 +136,6 @@ Below is an example of how twitterscraper can be used:

``twitterscraper Trump -l 100 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json``

``twitterscraper realDonaldTrump -u -o tweets_username.json``


2.2.2 Examples of advanced queries
@@ -149,14 +166,17 @@ Also see `Twitter's Standard operators <https://developer.twitter.com/en/docs/tw
2.2.3 Examples of scraping user pages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also scraped all tweets written by retweetet by a specific user. This can be done by adding the boolean argument ``-u / --user`` argument to the query.
You can also scrape all tweets written or retweeted by a specific user. This can be done by adding the boolean ``-u / --user`` argument to the query.
If this argument is used, the query should be equal to the username.

Here is an example of scraping a specific user:

``twitterscraper realDonaldTrump -u -o tweets_username.json``

This does not work in combination with ``-p``, ``-bd``, or ``-ed`` but it is the only way to scrape for retweets.
This does not work in combination with ``-p``, ``-bd``, or ``-ed``.

The main difference with the example "search for tweets from a specific user" in section 2.2.2 is that this method really scrapes
all tweets from a profile page (including retweets). The example in 2.2.2 scrapes the results from the search page (excluding retweets).


2.3 From within Python
@@ -188,6 +208,13 @@ You can easily use TwitterScraper from within python:
A regular search within Twitter will not show you any retweets. The output of twitterscraper therefore does not contain any retweets. To give an example: if user1 has written a tweet containing ``#trump2020`` and user2 has retweeted this tweet, a search for ``#trump2020`` will only show the original tweet. The only way you can scrape retweets is to scrape all tweets of a specific user with the ``-u / --user`` argument.


2.5 Scraping for User Profile information
-----------------------------------------
By adding the argument ``--profiles``, twitterscraper will, in addition to the tweets, also scrape the profile information of the users who have written these tweets.
The results will be saved in the file "userprofiles_<filename>".
Try not to use this argument too often. If you have already scraped profile information for a set of users, there is no need to do it again :)
It is also possible to scrape profile information without scraping tweets. Examples of this can be found in the examples folder.
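
The deduplicate-then-fetch flow that ``--profiles`` triggers can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: ``collect_profiles``, ``profiles_filename`` and ``fake_fetch`` are hypothetical names, and ``fake_fetch`` stands in for a real lookup (such as ``query_user_info``) so that the sketch runs without network access. Skipping failed lookups (``None``) is an assumption of this sketch.

```python
import json


def collect_profiles(tweet_users, fetch_profile):
    """Fetch one profile per distinct username, skipping failed lookups."""
    profiles = []
    # set() removes repeated usernames; sorted() makes the order predictable.
    for username in sorted(set(tweet_users)):
        info = fetch_profile(username)
        if info is not None:
            profiles.append(info)
    return profiles


def profiles_filename(output):
    """Name of the profile file that accompanies a given tweet output file."""
    return "userprofiles_" + output


def fake_fetch(username):
    # Hypothetical stand-in for a real profile lookup; returns made-up data.
    return {"user": username, "followers": len(username)}


profiles = collect_profiles(["alice", "bob", "alice"], fake_fetch)
serialized = json.dumps(profiles)  # what would be written to the profile file
```

Because each distinct user is fetched only once, the number of profile requests depends on the number of authors rather than the number of tweets, which is also why re-running the scrape for already known users is unnecessary.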


3. Output
=========
8 changes: 8 additions & 0 deletions changelog.txt
@@ -1,5 +1,13 @@
# twitterscraper changelog

# 1.0.0 ( 2019-02-04 )
### Added
- PR #159: scrapes user profile pages for additional information.
### Fixed:
- Moved example scripts demonstrating use of get_user_info() functionality to examples folder
- removed screenshot demonstrating get_user_info() works
- Added command line argument to main.py which calls get_user_info() for all users in list of scraped tweets.

# 0.9.3 ( 2018-11-04 )
### Fixed
- PR #143: cancels query if end-date is earlier than begin-date.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion setup.py
@@ -8,7 +8,7 @@

setup(
name='twitterscraper',
version='0.9.3',
version='1.0.0',
description='Tool for scraping Tweets',
url='https://github.com/taspinar/twitterscraper',
author=['Ahmet Taspinar', 'Lasse Schuirmann'],
7 changes: 5 additions & 2 deletions twitterscraper/__init__.py
@@ -1,15 +1,18 @@
# TwitterScraper
# Copyright 2016-2018 Ahmet Taspinar
# Copyright 2016-2019 Ahmet Taspinar
# See LICENSE for details.
"""
Twitter Scraper tool
"""

__version__ = '0.9.3'
__version__ = '1.0.0'
__author__ = 'Ahmet Taspinar'
__license__ = 'MIT'


from twitterscraper.query import query_tweets
from twitterscraper.query import query_tweets_from_user
from twitterscraper.query import query_user_info
from twitterscraper.tweet import Tweet
from twitterscraper.user import User
from twitterscraper.ts_logger import logger as ts_logger
16 changes: 14 additions & 2 deletions twitterscraper/main.py
@@ -7,7 +7,9 @@
import collections
import datetime as dt
from os.path import isfile
from twitterscraper.query import query_tweets, query_tweets_from_user
from twitterscraper.query import query_tweets
from twitterscraper.query import query_tweets_from_user
from twitterscraper.query import query_user_info
from twitterscraper.ts_logger import logger


@@ -58,6 +60,10 @@ def main():
parser.add_argument("-u", "--user", action='store_true',
help="Set this flag if you want to scrape tweets from a specific user. "
"The query should then consist of the profile name you want to scrape, without the @.")
parser.add_argument("--profiles", action='store_true',
help="Set this flag if you want to scrape profile info of all the users whose tweets "
"you have scraped. After all of the tweets have been scraped, it will start "
"a new process of scraping profile pages.")
parser.add_argument("--lang", type=str, default=None,
help="Set this flag if you want to query tweets in \na specific language. You can choose from:\n"
"en (English)\nar (Arabic)\nbn (Bengali)\n"
@@ -112,5 +118,11 @@ def main():
x.text, x.html])
else:
json.dump(tweets, output, cls=JSONEncoder)
if args.profiles and tweets:
list_users = list(set([tweet.user for tweet in tweets]))
list_users_info = [query_user_info(elem) for elem in list_users]
filename = 'userprofiles_' + args.output
with open(filename, "w", encoding="utf-8") as output:
json.dump(list_users_info, output, cls=JSONEncoder)
except KeyboardInterrupt:
logger.info("Program interrupted by user. Quitting...")
logger.info("Program interrupted by user. Quitting...")
7 changes: 5 additions & 2 deletions twitterscraper/query.py
@@ -180,6 +180,10 @@ def query_tweets_once(*args, **kwargs):

def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21), enddate=dt.date.today(), poolsize=20, lang=''):
no_days = (enddate - begindate).days

if(no_days < 0):
sys.exit('Begin date must occur before end date.')

if poolsize > no_days:
# Since we are assigning each pool a range of dates to query,
# the number of pools should not exceed the number of dates.
@@ -253,8 +257,7 @@ def query_user_page(url, retry=10):
response = requests.get(url, headers=HEADER)
html = response.text or ''

user = User()
user_info = user.from_html(html)
user_info = User.from_html(html)
if not user_info:
return None

Binary file not shown.
8 changes: 4 additions & 4 deletions twitterscraper/user.py
@@ -2,7 +2,7 @@


class User:
def __init__(self, user=None, full_name="", location="", blog="", date_joined=None, id=None, tweets=0,
def __init__(self, user="", full_name="", location="", blog="", date_joined="", id="", tweets=0,
following=0, followers=0, likes=0, lists=0):
self.user = user
self.full_name = full_name
@@ -15,8 +15,8 @@ def __init__(self, user=None, full_name="", location="", blog="", date_joined=No
self.followers = followers
self.likes = likes
self.lists = lists


@classmethod
def from_soup(self, tag_prof_header, tag_prof_nav):
"""
Returns the scraped user data from a twitter user page.
@@ -85,7 +85,7 @@ def from_soup(self, tag_prof_header, tag_prof_nav):
self.lists = int(lists)
return(self)


@classmethod
def from_html(self, html):
soup = BeautifulSoup(html, "lxml")
user_profile_header = soup.find("div", {"class":'ProfileHeaderCard'})
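
For context on the @classmethod change: it lets callers write User.from_html(html) without first creating a User, as query.py now does. A minimal sketch of that alternate-constructor pattern is shown below, under stated assumptions: Profile and from_text are hypothetical names, and a made-up "user|full name|followers" string replaces the real BeautifulSoup parsing. The sketch follows the usual convention of naming the first parameter cls and returning a fresh instance, so that separate calls do not share state through the class.

```python
class Profile:
    """Hypothetical stand-in for a scraped user record (not twitterscraper's User)."""

    def __init__(self, user="", full_name="", followers=0):
        self.user = user
        self.full_name = full_name
        self.followers = followers

    @classmethod
    def from_text(cls, text):
        """Build a Profile from 'user|full name|followers'; None if malformed."""
        parts = text.split("|")
        if len(parts) != 3:
            return None
        user, full_name, followers = parts
        # Return a new instance rather than setting attributes on the class.
        return cls(user=user, full_name=full_name, followers=int(followers))


first = Profile.from_text("alice|Alice Example|42")
second = Profile.from_text("bob|Bob Example|7")
```

Here first and second are independent objects; parsing the second string does not change the fields of the first.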
