PyGrab Library Documentation

PyGrab builds on top of the requests library and adds Tor routing, JavaScript-enabled requests, threaded batch requests and downloads, and local file helpers. It is fully interoperable with requests and has virtually identical syntax for the shared functions, methods, and classes.





pygrab Module

This module implements the primary developer interface for pygrab.

pygrab.get()

Parameters

  • url (str): The URL or IP address to get.
  • retries (int, optional): The number of times to retry the request if it fails. Defaults to 5.
  • enable_js (bool, optional): Enable JavaScript on the request. Defaults to False.
  • *args: Variable length argument list passed to requests.get.
  • **kwargs: Arbitrary keyword arguments passed to requests.get.

Returns

  • requests.Response: The response from the server.

Raises

  • TypeError: If any of the arguments are not of the desired data type.
  • ValueError: If the URL or IP address is invalid. Use get_local() for local requests.
  • RuntimeError: If an error occurs during runtime, such as a connection error.

Notes

  • To handle sites that require JavaScript, set the enable_js parameter to True.
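The documented error behavior of pygrab.get() could be handled along the following lines. This is a minimal sketch: fetch_text is a hypothetical helper that is not part of PyGrab, and it receives the get callable as an argument so that pygrab.get (or any stand-in with the same documented url and retries parameters) can be passed in.

```python
def fetch_text(url, get, retries=5):
    """Return the response body for `url`, or None if the request fails at runtime.

    `get` is any callable with the documented signature, for example pygrab.get.
    fetch_text itself is a hypothetical helper and not part of PyGrab.
    """
    try:
        response = get(url, retries=retries)
    except RuntimeError:
        # Documented as raised for runtime failures such as connection errors.
        return None
    # TypeError and ValueError (wrong argument types, invalid URL) are left to
    # propagate, since they point to a mistake in the caller, not the network.
    return response.text
```

With PyGrab installed, this would be called as fetch_text("https://example.com", pygrab.get).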


pygrab.get_async()

Description Gets multiple URLs asynchronously. This function sends HTTP requests to a list of URLs in separate threads, allowing for concurrent HTTP requests, and returns the responses from the grabbed URLs keyed by URL (see Returns). For each request that had a connection error, a warning will be printed to the console.

Parameters

  • urls (list): A list of URLs to grab.
  • retries (int, optional): The number of times to retry the HTTP request in case of failure. Defaults to 5.
  • thread_limit (int, optional): The maximum number of threads that will be spawned at one time. Defaults to 800.
  • time_rest (int, optional): The time in seconds to wait between starting each thread. Defaults to 0.
  • *args: Variable length argument list to pass to the get function.
  • **kwargs: Arbitrary keyword arguments to pass to the get function.

Returns

  • dict: A dictionary of responses with the grabbed URLs as keys and their respective responses as values.

Raises

  • TypeError: If any of the arguments are not of the desired data type.

Notes

  • This function removes duplicate entries from the urls list in order to prevent accidental DoS attacks.
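The behavior described above (duplicates removed, a bounded number of worker threads, results keyed by URL) can be illustrated with a short sketch. This is not PyGrab's implementation: get_many is a hypothetical name, ThreadPoolExecutor stands in for the library's own thread handling, and the get callable is passed in so the sketch runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor


def get_many(urls, get, thread_limit=8):
    """Fetch each distinct URL once and return {url: result}.

    Hypothetical illustration only; `get` is any one-argument callable.
    """
    # dict.fromkeys drops repeated URLs while keeping first-seen order.
    unique_urls = list(dict.fromkeys(urls))
    # max_workers caps how many requests are in flight at the same time.
    with ThreadPoolExecutor(max_workers=thread_limit) as pool:
        results = list(pool.map(get, unique_urls))
    return dict(zip(unique_urls, results))
```

Removing duplicates before dispatch means a URL that appears several times in the input is requested only once.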


pygrab.get_local()

Description Reads the contents of a file and returns it to the user.

Parameters

  • filename (str): The file to read from.
  • local_read_type (str, optional): The read type, 'r' or 'rb' for example. Defaults to 'r'.
  • encoding (str, optional): Encoding, 'utf-8' or 'ascii' for example. Defaults to 'utf-8'.

Returns

  • data: The contents of the file as a string.

Raises

  • TypeError: If any of the arguments are not of the desired data type.

Notes

  • This function reads the contents of a file and returns it to the user.
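For reference, the two read types mentioned above behave differently in plain Python: 'r' decodes the file into a str using the given encoding, while 'rb' returns raw bytes and does not accept an encoding. The sketch below shows that distinction with the built-in open(); read_local is a hypothetical stand-in, and whether pygrab.get_local() handles binary mode the same way is an assumption.

```python
def read_local(filename, local_read_type="r", encoding="utf-8"):
    """Read a file in text or binary mode (hypothetical stand-in for get_local)."""
    if "b" in local_read_type:
        # Binary mode: open() rejects an encoding argument, so none is passed.
        with open(filename, local_read_type) as handle:
            return handle.read()
    # Text mode: the bytes on disk are decoded with the requested encoding.
    with open(filename, local_read_type, encoding=encoding) as handle:
        return handle.read()
```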


pygrab.download()

Description Downloads a file from a given URL and saves it locally. This function retrieves a file from a specified URL and saves it to a local directory. The file will be saved with the filename from the URL if no local filename is specified.

Parameters

  • url (str): The URL of the file to be downloaded. Must include a file extension.
  • local_filename (str, recommended): The name to be used when saving the file locally. If none is provided, the function uses the filename from the URL. Must include a file extension if provided.
  • retries (int, optional): The number of retry attempts for the download in case of failure. Defaults to 5.

Returns

  • None

Raises

  • TypeError: If any of the arguments are not of the desired data type.
  • ValueError: If 'local_filename' is specified but does not contain a file extension.

Notes

  • If a local file name is not specified, it will attempt to download the file under its name on the web.
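One plausible way to derive a local filename from a URL, and to enforce the file-extension requirement, is sketched below. filename_from_url is a hypothetical helper built on the standard library; it is not taken from PyGrab's source, and the library may derive names differently.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of `url`, requiring a file extension."""
    # urlparse separates the path from the query string and fragment.
    path = urlparse(url).path
    name = posixpath.basename(path)
    if not posixpath.splitext(name)[1]:
        raise ValueError(f"URL does not include a file extension: {url!r}")
    return name
```

Passing an explicit local_filename avoids relying on the URL's final segment, which is presumably why that parameter is marked as recommended.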


pygrab.download_async()

Description Executes multiple file downloads asynchronously from a list of given URLs and saves them locally. This function uses threading to download multiple files simultaneously. Each file is saved with a filename from the list of local filenames, if provided. If no local filename is provided, the function uses the filename from the corresponding URL.

Parameters

  • urls (list of str): The URLs of the files to be downloaded. Each URL must include a file extension.
  • local_filenames (list of str, recommended): A list of names to be used when saving the files locally. If none is provided, the function uses the filename from each corresponding URL. Each filename must include a file extension if provided. Must be of same length as 'urls' if provided. Defaults to None.
  • retries (int, optional): The number of retry attempts for the downloads in case of failure. Defaults to 5.
  • thread_limit (int, optional): The maximum number of threads that will be spawned. Defaults to 500.
  • time_rest (int, optional): The amount of time to rest between the start of each download thread. Defaults to 0 seconds.

Returns

  • None

Raises

  • TypeError: If any of the arguments are not of the desired data type.
  • ValueError: If 'local_filenames' is specified but one of its entries does not contain a file extension.
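The constraints on urls and local_filenames can be checked up front before any download starts. The following is a sketch under assumptions: pair_downloads is a hypothetical helper, and raising ValueError for a length mismatch is a choice made here, since the documentation does not say which exception that case produces.

```python
import os


def pair_downloads(urls, local_filenames=None):
    """Validate the two lists and return (url, filename) pairs.

    A filename of None means "use the name from the URL".
    """
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise TypeError("urls must be a list of str")
    if local_filenames is None:
        return [(url, None) for url in urls]
    if not isinstance(local_filenames, list) or not all(
        isinstance(name, str) for name in local_filenames
    ):
        raise TypeError("local_filenames must be a list of str")
    if len(local_filenames) != len(urls):
        # Assumption: the documentation does not name the exception for this case.
        raise ValueError("local_filenames must be the same length as urls")
    for name in local_filenames:
        if not os.path.splitext(name)[1]:
            raise ValueError(f"filename has no file extension: {name!r}")
    return list(zip(urls, local_filenames))
```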


pygrab.head()

Description Essentially a carbon copy of requests.head() with the exception of the ability to route the request through the Tor network.



pygrab.post()

Description Essentially a carbon copy of requests.post() with the exception of the ability to route the request through the Tor network.



pygrab.post_local()

Description Writes data to a local file. This function is used to write or append data to a local file. It can be used in various scenarios such as saving request data, logging, or other local storage needs.

Parameters

  • filepath (str): The path to the file where the data will be written. If the file does not exist, it will be created.
  • data (str): The data that will be written to the file.
  • local_save_type (str, optional): The mode in which the file is opened. Defaults to 'w' (write mode), and can also be set to 'a' (append mode) or any other valid file mode.
  • encoding (str, optional): The encoding to be used when opening the file. Defaults to 'utf-8'.

Returns

  • None
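The difference between the 'w' and 'a' modes is the standard Python one: 'w' truncates the file before writing, while 'a' adds to the end. A minimal sketch with the built-in open() follows; write_local is a hypothetical stand-in that mirrors the documented parameters, not PyGrab's own code.

```python
def write_local(filepath, data, local_save_type="w", encoding="utf-8"):
    """Write `data` to `filepath`, creating the file if it does not exist."""
    # 'w' replaces any existing content; 'a' appends to it.
    with open(filepath, local_save_type, encoding=encoding) as handle:
        handle.write(data)
```

Calling it once with the default mode and once with local_save_type="a" leaves both pieces of data in the file, in order.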


pygrab.put()

Description Essentially a carbon copy of requests.put() with the exception of the ability to route the request through the Tor network.



pygrab.patch()

Description Essentially a carbon copy of requests.patch() with the exception of the ability to route the request through the Tor network.



pygrab.delete()

Description Essentially a carbon copy of requests.delete() with the exception of the ability to route the request through the Tor network.



pygrab.options()

Description Essentially a carbon copy of requests.options() with the exception of the ability to route the request through the Tor network.



pygrab.tor_status()

Description Returns True if the tor service is running and False if otherwise.

Parameters

  • N/A

Returns

  • bool: True if the tor service is running and False otherwise.


pygrab.display_tor_status()

Description Prints out data regarding your tor connection. This includes a boolean value that's True if the tor service is enabled, in addition to your public IP address, country, region, and city.

Parameters

  • N/A

Returns

  • None


pygrab.rotate_tor()

Description Starts the tor service if it isn't already running. Configures the library to reconnect to the network after every num_req_per_rotation requests. This rotates your IP address and grants even greater anonymity.

Parameters

  • num_req_per_rotation (int): The maximum number of requests that can be sent before a rotation.

Returns

  • None
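The counting logic behind "rotate after every num_req_per_rotation requests" can be pictured as follows. This is a sketch of the idea only: RotationCounter and its rotate callback are hypothetical, and how PyGrab actually tracks requests or reconnects to Tor is not shown in this documentation.

```python
class RotationCounter:
    """Call `rotate` once the request budget for the current connection is used up."""

    def __init__(self, num_req_per_rotation, rotate):
        if not isinstance(num_req_per_rotation, int) or num_req_per_rotation < 1:
            raise ValueError("num_req_per_rotation must be a positive int")
        self.limit = num_req_per_rotation
        self.rotate = rotate  # hypothetical callback that reconnects to Tor
        self.count = 0

    def before_request(self):
        """Invoke before each request; rotates first if the budget is exhausted."""
        if self.count >= self.limit:
            self.rotate()
            self.count = 0
        self.count += 1
```

With a limit of 2, five calls to before_request() trigger two rotations: one before the third request and one before the fifth.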


pygrab.end_rotate_tor()

Description Ends the configuration to rotate tor connections. Note that this does not end the tor service itself (run pygrab.Tor.end_tor() to do that).

Parameters

  • N/A

Returns

  • None


pygrab.warn_settings()

Description Changes the warning configuration of the entire library. All warnings are shown by default, and calling pygrab.warn_settings(False) will turn off all warnings.

Parameters

  • warn (bool): Boolean value to enable or disable warnings.

Returns

  • None




Tor Object

This module implements the primary developer interface for connecting pygrab to the Tor network.

pygrab.Tor.start_tor()

Description Starts the Tor network and configures all requests to be routed through the tor network.

Parameters

  • verbose (int, optional): 0 by default. If set to 1, the logs from starting the tor network will be displayed.
  • force_start (bool, optional): False by default. With the default, the program will crash if a service is already running on 127.0.0.1:9050. If that service is an instance of Tor, you can set force_start to True and have the program run as normal.

Returns

  • None

Notes

  • If force_start is set equal to True, then the program will crash if the service running on 127.0.0.1:9050 is not Tor.
  • Additionally, even if the service running on 127.0.0.1:9050 is Tor, some functionality may be lost (such as rotating tor connections). It is recommended to only have one instance of tor running at any one time.
  • If you get an error message stating that you are missing the tor.exe dependency, it's probably because you are :). If you're on Windows, download it from "https://www.torproject.org/download/tor/". Then copy the path to the .tar.gz file and enter it when prompted by pygrab.Tor.start_tor(). If you're on Linux, simply run sudo apt-get install tor.
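Before deciding whether force_start is needed, it can help to check whether anything is already listening on 127.0.0.1:9050. The sketch below uses only the standard socket module; port_in_use is a hypothetical helper, and it can only tell that some service accepts connections on the port, not that the service is Tor.

```python
import socket


def port_in_use(host="127.0.0.1", port=9050, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

If this returns True and the listener is known to be Tor, passing force_start=True to pygrab.Tor.start_tor() is the documented way to proceed.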


pygrab.Tor.end_tor()

Description Ends the tor service.

Parameters

  • N/A

Returns

  • None


pygrab.Tor.tor_status()

Description Returns True if the tor status is active and False if otherwise.

Parameters

  • N/A

Returns

  • bool: True if the tor status is active and False if otherwise.


pygrab.Tor.override_status()

Description Returns True if a second instance of tor has been overridden and False if otherwise. Note that IP/Tor rotations will not be available if an instance of tor is overridden.

Parameters

  • N/A

Returns

  • bool: True if a second instance of tor has been overridden and False if otherwise.


pygrab.Tor.load_tor_dependencies()

Description Loads the tor.exe dependency from the .tar.gz file that is downloaded from "https://www.torproject.org/download/tor/".

Parameters

  • filepath (str): The path to the .tar.gz file downloaded from Tor's website.

Returns

  • None

Notes

  • This function is for Windows only. If you're on Linux, run sudo apt-get install tor.
  • Running pygrab.Tor.start_tor() without the tor.exe dependency installed will prompt you for the path to the .tar.gz file. This method is merely an alternative to that.
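To confirm that a downloaded archive actually contains tor.exe before handing its path to PyGrab, the archive can be inspected with the standard tarfile module. find_tor_exe below is a hypothetical helper for that check; it does not reproduce what load_tor_dependencies() does internally, and the layout of the official archive is not assumed beyond it containing a file named tor.exe.

```python
import posixpath
import tarfile


def find_tor_exe(archive_path):
    """Return the member name of tor.exe inside a .tar.gz archive, or None."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Tar member names use forward slashes regardless of platform.
            if member.isfile() and posixpath.basename(member.name) == "tor.exe":
                return member.name
    return None
```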




Session Object

This module implements the primary developer interface for Session handling. These sessions also include all of the functionality provided by the main pygrab module with the exception of IP/Tor rotations (coming soon!).

s_obj = pygrab.Session()

Description Constructor for the pygrab.Session object.

Parameters

  • use_tor (bool, optional): Tor service will be enabled for the session if True. Defaults to True if the tor service is running and False if it isn't.

Returns

  • None

Notes

  • Setting use_tor to True will start the tor service if it isn't already running.
  • Setting use_tor to False prevents the session from using the tor network, even if the tor service is already running.
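The default described for use_tor amounts to a three-way decision, which the sketch below spells out. resolve_use_tor is a hypothetical function, and using None as the "not specified" value is an assumption; the documentation does not state what sentinel the constructor uses.

```python
def resolve_use_tor(use_tor=None, tor_running=False):
    """Decide whether a new session should route requests through Tor."""
    if use_tor is None:
        # Not specified: follow the current state of the tor service.
        return tor_running
    # Explicit True or False always wins over the service state.
    return use_tor
```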


s_obj.start_tor()

Description Enables use of the tor service for the session object. If the tor service is not running, this method will start the service.

Parameters

  • N/A

Returns

  • None


s_obj.end_tor()

Description Disables use of the tor service for the session object. However, this method does not end the tor service itself (run pygrab.Tor.end_tor() to do that).

Parameters

  • N/A

Returns

  • None


s_obj.get()

Parameters

  • url (str): The URL to get.
  • enable_js (bool, optional): Enable JavaScript on the request. Defaults to False.
  • **kwargs: Arbitrary keyword arguments passed to requests.get.

Returns

  • requests.Response: The response from the server.

Raises

  • TypeError: If any of the arguments are not of the desired data type.
  • ValueError: If the URL doesn't start with http. Use get_local() for local requests.
  • Exception: If the URL is invalid.

Notes

  • To handle sites that require JavaScript, set the enable_js parameter to True.
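A session is typically reused across several requests. The sketch below shows that pattern using only the documented url and enable_js parameters of s_obj.get() and the status_code attribute of requests.Response. fetch_status_codes is a hypothetical helper, and the session is passed in so that a pygrab.Session instance or any object with a compatible get method can be used.

```python
def fetch_status_codes(session, urls, enable_js=False):
    """Return {url: HTTP status code} using one shared session."""
    codes = {}
    for url in urls:
        # The same session object handles every request in the loop.
        response = session.get(url, enable_js=enable_js)
        codes[url] = response.status_code
    return codes
```

With PyGrab installed, this would be called as fetch_status_codes(pygrab.Session(), ["https://example.com"]).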


s_obj.get_async()

Description Gets multiple URLs asynchronously. This function sends HTTP requests to a list of URLs in separate threads, allowing for concurrent HTTP requests, and returns the responses from the grabbed URLs keyed by URL (see Returns). For each request that had a connection error, a warning will be printed to the console.

Parameters

  • urls (list): A list of URLs to grab.
  • thread_limit (int, optional): The maximum number of threads that will be spawned at one time. Defaults to 800.
  • time_rest (int, optional): The time in seconds to wait between starting each thread. Defaults to 0.
  • *args: Variable length argument list to pass to the get function.
  • **kwargs: Arbitrary keyword arguments to pass to the get function.

Returns

  • dict: A dictionary of responses with the grabbed URLs as keys and their respective responses as values.

Raises

  • TypeError: If any of the arguments are not of the desired data type.

Notes

  • This function removes duplicate entries from the urls list in order to prevent accidental DoS attacks.


s_obj.get_local()

Description Reads the contents of a file and returns it to the user.

Parameters

  • filename (str): The file to read from.
  • local_read_type (str, optional): The read type, 'r' or 'rb' for example. Defaults to 'r'.
  • encoding (str, optional): Encoding, 'utf-8' or 'ascii' for example. Defaults to 'utf-8'.

Returns

  • data: The contents of the file as a string.

Raises

  • TypeError: If any of the arguments are not of the desired data type.

Notes

  • This function reads the contents of a file and returns it to the user.


s_obj.head()

Description Essentially a carbon copy of requests.Session().head() with the exception of the ability to route the request through the Tor network.



s_obj.post()

Description Essentially a carbon copy of requests.Session().post() with the exception of the ability to route the request through the Tor network.



s_obj.post_local()

Description Writes data to a local file. This function is used to write or append data to a local file. It can be used in various scenarios such as saving request data, logging, or other local storage needs.

Parameters

  • filepath (str): The path to the file where the data will be written. If the file does not exist, it will be created.
  • data (str): The data that will be written to the file.
  • local_save_type (str, optional): The mode in which the file is opened. Defaults to 'w' (write mode), and can also be set to 'a' (append mode) or any other valid file mode.
  • encoding (str, optional): The encoding to be used when opening the file. Defaults to 'utf-8'.

Returns

  • None


s_obj.put()

Description Essentially a carbon copy of requests.Session().put() with the exception of the ability to route the request through the Tor network.



s_obj.patch()

Description Essentially a carbon copy of requests.Session().patch() with the exception of the ability to route the request through the Tor network.



s_obj.delete()

Description Essentially a carbon copy of requests.Session().delete() with the exception of the ability to route the request through the Tor network.



s_obj.options()

Description Essentially a carbon copy of requests.Session().options() with the exception of the ability to route the request through the Tor network.



s_obj.download()

Description Downloads a file from a given URL and saves it locally. This function retrieves a file from a specified URL and saves it to a local directory. The file will be saved with the filename from the URL if no local filename is specified.

Parameters

  • url (str): The URL of the file to be downloaded. Must include a file extension.
  • local_filename (str, recommended): The name to be used when saving the file locally. If none is provided, the function uses the filename from the URL. Must include a file extension if provided.

Returns

  • None

Raises

  • TypeError: If any of the arguments are not of the desired data type.
  • ValueError: If 'local_filename' is specified but does not contain a file extension.

Notes

  • If a local file name is not specified, it will attempt to download the file under its name on the web.


s_obj.download_async()

Description Executes multiple file downloads asynchronously from a list of given URLs and saves them locally. This function uses threading to download multiple files simultaneously. Each file is saved with a filename from the list of local filenames, if provided. If no local filename is provided, the function uses the filename from the corresponding URL.

Parameters

  • urls (list of str): The URLs of the files to be downloaded. Each URL must include a file extension.
  • local_filenames (list of str, recommended): A list of names to be used when saving the files locally. If none is provided, the function uses the filename from each corresponding URL. Each filename must include a file extension if provided. Must be of the same length as 'urls' if provided. Defaults to None.
  • thread_limit (int, optional): The maximum number of threads that will be spawned. Defaults to 500.
  • time_rest (int, optional): The amount of time to rest between the start of each download thread. Defaults to 0 seconds.

Returns

  • None

Raises

  • TypeError: If any of the arguments are not of the desired data type.


s_obj.tor_status()

Description Returns True if tor is enabled for the session object and returns False if otherwise.

Parameters

  • N/A

Returns

  • bool: True if tor is enabled for the session object and False otherwise.


s_obj.display_tor_status()

Description Prints out data regarding your tor connection. This includes a boolean value that's True if the tor service is enabled, in addition to your public IP address, country, region, and city.

Parameters

  • N/A

Returns

  • None




Request Object

This module is a carbon copy of the requests.Request object.





Response Object

This module is a carbon copy of the requests.Response object.





Conclusion

The PyGrab library provides a comprehensive set of tools for web scraping, including session management, Tor integration, and IP rotation. By utilizing these functionalities, developers can efficiently extract data from websites while maintaining anonymity and flexibility. If this library was helpful to you, make my day and leave a star at PyGrab GitHub!