PyGrab is a library built on top of the requests library that adds functionality such as asynchronous requests, JavaScript rendering, local file access, and routing through the Tor network. It is fully interoperable with requests and has virtually identical syntax for the shared functions, methods, and classes.
This module implements the primary developer interface for pygrab.
Parameters
url (str)
: The URL or IP address to get.
retries (int, optional)
: The number of times to retry the request if it fails. Defaults to 5.
enable_js (bool, optional)
: Enable JavaScript on the request. Defaults to False.
*args
: Variable length argument list passed to requests.get.
**kwargs
: Arbitrary keyword arguments passed to requests.get.
Returns
requests.Response
: The response from the server.
Raises
TypeError
: If any of the arguments are not of the desired data type.
ValueError
: If the URL or IP address is invalid. Use get_local() for local requests.
Exceptions
RuntimeError
: If an error occurs during runtime, such as a connection error.
Notes
- For handling JavaScript-enabled sites, the enable_js parameter can be set to True.
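The retries parameter described above can be pictured as a plain retry loop. The sketch below is not PyGrab's implementation: fetch_with_retries and flaky_fetch are hypothetical names, and the fetch callable stands in for a real HTTP call. It assumes that only connection errors are retried and that a RuntimeError is raised once every attempt has failed.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[], T], retries: int = 5) -> T:
    """Call fetch() until it succeeds or `retries` attempts have failed."""
    if not isinstance(retries, int) or isinstance(retries, bool):
        raise TypeError("retries must be an int")
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        try:
            return fetch()
        except ConnectionError as err:  # assumption: only connection errors are retried
            last_error = err
    raise RuntimeError(f"request failed after {retries} attempts") from last_error


# Hypothetical stand-in for a network call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_fetch() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated connection error")
    return "ok"


result = fetch_with_retries(flaky_fetch, retries=5)
```

Here the third attempt succeeds, so the remaining two retries are never used.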
Description Gets multiple URLs asynchronously. This function sends HTTP requests to a list of URLs in separate threads, allowing for concurrent HTTP requests. The function returns a dictionary that maps each grabbed URL to its response. For each request that had a connection error, a warning will be printed to the console.
Parameters
urls (list)
: A list of URLs to grab.
retries (int, optional)
: The number of times to retry the HTTP request in case of failure. Defaults to 5.
thread_limit (int, optional)
: The maximum number of threads that will be spawned at one time. Defaults to 800.
time_rest (int, optional)
: The time in seconds to wait between starting each thread. Defaults to 0.
*args
: Variable length argument list to pass to the get function.
**kwargs
: Arbitrary keyword arguments to pass to the get function.
Returns
dict
: A dictionary of responses with the grabbed URLs as keys and their respective responses as values.
Raises
TypeError
: If any of the arguments are not of the desired data type.
Notes
- This function removes duplicate entries from the urls list before sending any requests, in order to prevent accidental DoS attacks.
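The two behaviours described above, dropping repeated URLs and capping the number of concurrent threads, can be sketched with the standard library alone. This is an illustration rather than PyGrab's code: grab_all, dedupe_urls, and fake_fetch are hypothetical names, the fetch callable stands in for a real request, and a ThreadPoolExecutor is used in place of hand-managed threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def dedupe_urls(urls: Iterable[str]) -> List[str]:
    """Drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))


def grab_all(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    thread_limit: int = 8,  # illustrative default, not PyGrab's
) -> Dict[str, object]:
    """Fetch each unique URL in a worker thread and map URL -> response."""
    unique = dedupe_urls(urls)
    if not unique:
        return {}
    workers = max(1, min(thread_limit, len(unique)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        responses = list(pool.map(fetch, unique))  # results keep input order
    return dict(zip(unique, responses))


# Hypothetical stand-in for an HTTP request.
def fake_fetch(url: str) -> str:
    return f"response for {url}"


results = grab_all(
    ["https://a.example", "https://b.example", "https://a.example"],
    fake_fetch,
)
```

The repeated URL is requested only once, and the result is keyed by URL as in the Returns section above.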
Description Reads the contents of a file and returns it to the user.
Parameters
filename (str)
: The file to read from.
local_read_type (str, optional)
: The read type, 'r' or 'rb' for example. Defaults to 'r'.
encoding (str, optional)
: Encoding, 'utf-8' or 'ascii' for example. Defaults to 'utf-8'.
Returns
data
: The contents of the file as a string.
Raises
TypeError
: If any of the arguments are not of the desired data type.
Notes
- This function reads the contents of a file and returns it to the user.
Description Downloads a file from a given URL and saves it locally. This function retrieves a file from a specified URL and saves it to a local directory. The file will be saved with the filename from the URL if no local filename is specified.
Parameters
url (str)
: The URL of the file to be downloaded. Must include a file extension.
local_filename (str, recommended)
: The name to be used when saving the file locally. If none is provided, the function uses the filename from the URL. Must include a file extension if provided.
retries (int, optional)
: The number of retry attempts for the download in case of failure. Defaults to 5.
Returns
None
Raises
TypeError
: If any of the arguments are not of the desired data type.
ValueError
: If 'local_filename' is specified but does not contain a file extension.
Notes
- If a local file name is not specified, it will attempt to download the file under its name on the web.
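The filename rules above (fall back to the name in the URL, and require a file extension) can be sketched as follows. This is an assumption about how such a check might look, not PyGrab's actual logic; resolve_local_filename is a hypothetical helper.

```python
import os
import posixpath
from typing import Optional
from urllib.parse import unquote, urlparse


def resolve_local_filename(url: str, local_filename: Optional[str] = None) -> str:
    """Pick the name to save under, falling back to the last URL path segment."""
    if not isinstance(url, str):
        raise TypeError("url must be a str")
    if local_filename is not None and not isinstance(local_filename, str):
        raise TypeError("local_filename must be a str")
    if local_filename is not None:
        name = local_filename
    else:
        # URL paths always use forward slashes, hence posixpath.
        name = posixpath.basename(unquote(urlparse(url).path))
    _, extension = os.path.splitext(name)
    if not extension:
        raise ValueError(f"{name!r} does not include a file extension")
    return name


from_url = resolve_local_filename("https://example.com/files/report.pdf")
explicit = resolve_local_filename("https://example.com/files/report.pdf", "copy.pdf")
```

Passing an explicit local_filename avoids surprises when the URL ends in a query string or a generic name, which is presumably why the parameter is marked as recommended.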
Description Executes multiple file downloads asynchronously from a list of given URLs and saves them locally. This function uses threading to download multiple files simultaneously. Each file is saved with a filename from the list of local filenames, if provided. If no local filename is provided, the function uses the filename from the corresponding URL.
Parameters
urls (list of str)
: The URLs of the files to be downloaded. Each URL must include a file extension.
local_filenames (list of str, recommended)
: A list of names to be used when saving the files locally. If none is provided, the function uses the filename from each corresponding URL. Each filename must include a file extension if provided. Must be of the same length as 'urls' if provided. Defaults to None.
retries (int, optional)
: The number of retry attempts for the downloads in case of failure. Defaults to 5.
thread_limit (int, optional)
: The maximum number of threads that will be spawned. Defaults to 500.
time_rest (int, optional)
: The amount of time to rest between the start of each download thread. Defaults to 0 seconds.
Returns
None
Raises
TypeError
: If any of the arguments are not of the desired data type.
ValueError
: If an entry in 'local_filenames' is specified but does not contain a file extension.
Description
Essentially a carbon copy of requests.head(), except that the request can be routed through the Tor network.
Description
Essentially a carbon copy of requests.post(), except that the request can be routed through the Tor network.
Description Writes data to a local file. This function is used to write or append data to a local file. It can be used in various scenarios such as saving request data, logging, or other local storage needs.
Parameters
filepath (str)
: The path to the file where the data will be written. If the file does not exist, it will be created.
data (str)
: The data that will be written to the file.
local_save_type (str, optional)
: The mode in which the file is opened. Defaults to 'w' (write mode), and can also be set to 'a' (append mode) or any other valid file mode.
encoding (str, optional)
: The encoding to be used when opening the file. Defaults to 'utf-8'.
Returns
None
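The difference between the 'w' and 'a' modes described above is easiest to see side by side. The sketch below uses only the built-in open(); save_text is a hypothetical helper that mirrors the documented parameter names, not PyGrab's function, and it assumes a text mode (binary modes do not accept an encoding).

```python
import os
import tempfile


def save_text(filepath: str, data: str, local_save_type: str = "w",
              encoding: str = "utf-8") -> None:
    """Write or append text to a file, creating it if needed."""
    with open(filepath, local_save_type, encoding=encoding) as handle:
        handle.write(data)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "log.txt")
    save_text(path, "first line\n")        # 'w' creates (or overwrites) the file
    save_text(path, "second line\n", "a")  # 'a' keeps existing content
    with open(path, encoding="utf-8") as handle:
        contents = handle.read()
```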
Description
Essentially a carbon copy of requests.put(), except that the request can be routed through the Tor network.
Description
Essentially a carbon copy of requests.patch(), except that the request can be routed through the Tor network.
Description
Essentially a carbon copy of requests.delete(), except that the request can be routed through the Tor network.
Description
Essentially a carbon copy of requests.options(), except that the request can be routed through the Tor network.
Description Returns True if the tor service is running and False if otherwise.
Parameters
- N/A
Returns
bool
: True if the Tor service is running and False otherwise.
Description Prints out data regarding your Tor connection. This includes whether the Tor service is enabled, as well as your public IP address, country, region, and city.
Parameters
- N/A
Returns
None
Description
Starts the Tor service if it isn't already running. Configures the library to reconnect to the Tor network every num_req_per_rotation requests. This rotates your IP address and provides greater anonymity.
Parameters
num_req_per_rotation (int)
: The maximum number of requests that can be sent before a rotation.
Returns
None
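The rotation rule above amounts to counting requests and triggering a reconnect whenever the count reaches num_req_per_rotation. The sketch below shows that bookkeeping only; RotationCounter is a hypothetical class, and the rotate callback stands in for whatever PyGrab does to obtain a new Tor circuit.

```python
from typing import Callable


class RotationCounter:
    """Invoke a rotate callback after every `num_req_per_rotation` requests."""

    def __init__(self, num_req_per_rotation: int, rotate: Callable[[], None]) -> None:
        if not isinstance(num_req_per_rotation, int) or isinstance(num_req_per_rotation, bool):
            raise TypeError("num_req_per_rotation must be an int")
        if num_req_per_rotation < 1:
            raise ValueError("num_req_per_rotation must be at least 1")
        self._limit = num_req_per_rotation
        self._rotate = rotate
        self._count = 0

    def record_request(self) -> None:
        """Count one request and rotate once the limit is reached."""
        self._count += 1
        if self._count >= self._limit:
            self._rotate()
            self._count = 0


# Hypothetical callback that just records each rotation.
rotations = []
counter = RotationCounter(3, lambda: rotations.append("rotated"))
for _ in range(7):
    counter.record_request()
```

With a limit of 3, seven requests trigger a rotation after the third and sixth request; the seventh starts a new count.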
Description
Ends the configuration to rotate Tor connections. Note that this does not end the Tor service itself (run pygrab.Tor.end_tor() to do that).
Parameters
- N/A
Returns
None
Description
Changes the warning configuration of the entire library. All warnings are shown by default, and calling pygrab.warn_settings(False) will turn off all warnings.
Parameters
warn (bool)
: Boolean value to enable or disable warnings.
Returns
None
This module implements the primary developer interface for connecting pygrab to the Tor network.
Description Starts the Tor service and configures all requests to be routed through the Tor network.
Parameters
verbose (int, optional)
: 0 by default. If set to 1, the logs from starting Tor will be displayed.
force_start (bool, optional)
: False by default. When False, the program will crash if a service is already running on 127.0.0.1:9050. If this service is an instance of Tor, you can set force_start to True and have the program run as normal.
Returns
None
Notes
- If force_start is set to True, the program will crash if the service running on 127.0.0.1:9050 is not Tor.
- Additionally, even if the service running on 127.0.0.1:9050 is Tor, some functionality may be lost (such as rotating Tor connections). It is recommended to have only one instance of Tor running at any one time.
- If you get an error message stating that you are missing the tor.exe dependency, it's probably because you are :). If you're on Windows, download it from "https://www.torproject.org/download/tor/". Then copy the path to the .tar.gz file and enter it when prompted by pygrab.Tor.start_tor(). If you're on Linux, simply run sudo apt-get install tor.
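Before deciding whether force_start is needed, it can help to check whether anything is already listening on 127.0.0.1:9050. The sketch below does that with a plain TCP connection attempt; port_in_use is a hypothetical helper, not part of PyGrab, and a successful connection only shows that some service holds the port, not that the service is Tor.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 9050, timeout: float = 0.5) -> bool:
    """Return True if a TCP service accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use():
        print("A service is already listening on 127.0.0.1:9050.")
    else:
        print("Nothing is listening on 127.0.0.1:9050.")
```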
Description Ends the tor service.
Parameters
- N/A
Returns
None
Description Returns True if the tor status is active and False if otherwise.
Parameters
- N/A
Returns
bool
: True if the tor status is active and False if otherwise.
Description Returns True if a second instance of tor has been overridden and False if otherwise. Note that IP/Tor rotations will not be available if an instance of tor is overridden.
Parameters
- N/A
Returns
bool
: True if a second instance of tor has been overridden and False if otherwise.
Description
Loads the tor.exe dependency from the .tar.gz file that is downloaded from "https://www.torproject.org/download/tor/".
Parameters
filepath (str)
: The path to the .tar.gz file downloaded from Tor's website.
Returns
None
Notes
- This function is for Windows only. If you're on Linux, run sudo apt-get install tor.
- Running pygrab.Tor.start_tor() without the tor.exe dependency installed will prompt you for the path to the .tar.gz file. This method is merely an alternative to that.
This module implements the primary developer interface for Session handling. These sessions also include all of the functionality provided by the main pygrab module with the exception of IP/Tor rotations (coming soon!).
Description
Constructor for the pygrab.Session object.
Parameters
use_tor (bool, optional)
: The Tor service will be enabled for the session if True. Defaults to True if the Tor service is running and False if it isn't.
Returns
None
Notes
- Setting use_tor to True will start the Tor service if it isn't already running.
- Setting use_tor to False prevents the session from using the Tor network, even if the Tor service is already running.
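The use_tor default described above (follow the state of the Tor service unless told otherwise) can be sketched as a small decision function. This is a guess at the shape of the logic, not PyGrab's code: session_proxies is a hypothetical helper, None is assumed to mean "not specified", and the returned mapping uses the socks5h proxy scheme that requests accepts when its optional SOCKS support is installed. Unlike the real constructor, the sketch does not start Tor.

```python
from typing import Dict, Optional

# Address of the local Tor SOCKS port mentioned in this document.
TOR_SOCKS_ADDRESS = "127.0.0.1:9050"


def session_proxies(use_tor: Optional[bool], tor_running: bool) -> Dict[str, str]:
    """Return a requests-style proxies mapping for a session."""
    if use_tor is None:
        # Assumed default: use Tor only if the service is already running.
        use_tor = tor_running
    if not use_tor:
        return {}
    proxy = f"socks5h://{TOR_SOCKS_ADDRESS}"  # socks5h also resolves DNS via the proxy
    return {"http": proxy, "https": proxy}


default_when_running = session_proxies(None, tor_running=True)
default_when_stopped = session_proxies(None, tor_running=False)
explicitly_disabled = session_proxies(False, tor_running=True)
```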
Description Enables use of the tor service for the session object. If the tor service is not running, this method will start the service.
Parameters
- N/A
Returns
None
Description
Disables use of the Tor service for the session object. However, this method does not end the Tor service itself (run pygrab.Tor.end_tor() to do that).
Parameters
- N/A
Returns
None
Parameters
url (str)
: The URL to get.
enable_js (bool, optional)
: Enable JavaScript on the request. Defaults to False.
**kwargs
: Arbitrary keyword arguments passed to requests.get.
Returns
requests.Response
: The response from the server.
Raises
TypeError
: If any of the arguments are not of the desired data type.
ValueError
: If the URL doesn't start with http. Use get_local() for local requests.
Exception
: If the URL is invalid.
Notes
- For handling JavaScript-enabled sites, the enable_js parameter can be set to True.
Description Gets multiple URLs asynchronously. This function sends HTTP requests to a list of URLs in separate threads, allowing for concurrent HTTP requests. The function returns a dictionary that maps each grabbed URL to its response. For each request that had a connection error, a warning will be printed to the console.
Parameters
urls (list)
: A list of URLs to grab.
thread_limit (int, optional)
: The maximum number of threads that will be spawned at one time. Defaults to 800.
time_rest (int, optional)
: The time in seconds to wait between starting each thread. Defaults to 0.
*args
: Variable length argument list to pass to the get function.
**kwargs
: Arbitrary keyword arguments to pass to the get function.
Returns
dict
: A dictionary of responses with the grabbed URLs as keys and their respective responses as values.
Raises
TypeError
: If any of the arguments are not of the desired data type.
Notes
- This function removes duplicate entries from the urls list before sending any requests, in order to prevent accidental DoS attacks.
Description Reads the contents of a file and returns it to the user.
Parameters
filename (str)
: The file to read from.
local_read_type (str, optional)
: The read type, 'r' or 'rb' for example. Defaults to 'r'.
encoding (str, optional)
: Encoding, 'utf-8' or 'ascii' for example. Defaults to 'utf-8'.
Returns
data
: The contents of the file as a string.
Raises
TypeError
: If any of the arguments are not of the desired data type.
Notes
- This function reads the contents of a file and returns it to the user.
Description
Essentially a carbon copy of requests.Session().head(), except that the request can be routed through the Tor network.
Description
Essentially a carbon copy of requests.Session().post(), except that the request can be routed through the Tor network.
Description Writes data to a local file. This function is used to write or append data to a local file. It can be used in various scenarios such as saving request data, logging, or other local storage needs.
Parameters
filepath (str)
: The path to the file where the data will be written. If the file does not exist, it will be created.
data (str)
: The data that will be written to the file.
local_save_type (str, optional)
: The mode in which the file is opened. Defaults to 'w' (write mode), and can also be set to 'a' (append mode) or any other valid file mode.
encoding (str, optional)
: The encoding to be used when opening the file. Defaults to 'utf-8'.
Returns
None
Description
Essentially a carbon copy of requests.Session().put(), except that the request can be routed through the Tor network.
Description
Essentially a carbon copy of requests.Session().patch(), except that the request can be routed through the Tor network.
Description
Essentially a carbon copy of requests.Session().delete(), except that the request can be routed through the Tor network.
Description
Essentially a carbon copy of requests.Session().options(), except that the request can be routed through the Tor network.
Description Downloads a file from a given URL and saves it locally. This function retrieves a file from a specified URL and saves it to a local directory. The file will be saved with the filename from the URL if no local filename is specified.
Parameters
url (str)
: The URL of the file to be downloaded. Must include a file extension.
local_filename (str, recommended)
: The name to be used when saving the file locally. If none is provided, the function uses the filename from the URL. Must include a file extension if provided.
Returns
None
Raises
TypeError
: If any of the arguments are not of the desired data type.
ValueError
: If 'local_filename' is specified but does not contain a file extension.
Notes
- If a local file name is not specified, it will attempt to download the file under its name on the web.
Description Executes multiple file downloads asynchronously from a list of given URLs and saves them locally. This function uses threading to download multiple files simultaneously. Each file is saved with a filename from the list of local filenames, if provided. If no local filename is provided, the function uses the filename from the corresponding URL.
Parameters
urls (list of str)
: The URLs of the files to be downloaded. Each URL must include a file extension.
local_filenames (list of str, recommended)
: A list of names to be used when saving the files locally. If none is provided, the function uses the filename from each corresponding URL. Each filename must include a file extension if provided. Must be of the same length as 'urls' if provided. Defaults to None.
thread_limit (int, optional)
: The maximum number of threads that will be spawned. Defaults to 500.
time_rest (int, optional)
: The amount of time to rest between the start of each download thread. Defaults to 0 seconds.
Returns
None
Raises
TypeError
: If any of the arguments are not of the desired data type.
Description Returns True if tor is enabled for the session object and returns False if otherwise.
Parameters
- N/A
Returns
bool
: True if Tor is enabled for the session object and False otherwise.
Description Prints out data regarding your Tor connection. This includes whether the Tor service is enabled, as well as your public IP address, country, region, and city.
Parameters
- N/A
Returns
None
This module is a carbon copy of the requests.Request object.
This module is a carbon copy of the requests.Response object.
The PyGrab library provides a comprehensive set of tools for web scraping, including session management, Tor integration, and IP rotation. By utilizing these functionalities, developers can efficiently extract data from websites while maintaining anonymity and flexibility. If this library was helpful to you, make my day and leave a star at PyGrab GitHub!