- Added mechanism to use all available options in the FollowRedirects Faraday middleware, #355 thanks to @bruno-b-martins and @miguelrod
- Several dependency updates, including Addressable 2.8.1, which fixes an invalid_byte_sequence exception.
- Remove support for #feed that was deprecated in 5.9
- Add support for Ruby 3.1
- Update dependencies: rubocop, nokogiri
- Support Ruby 3.0
- Relax dependencies to allow minor releases.
- Upgrade to Nokogiri 1.11.0.
- Upgrade to Faraday 1.1.
- Fix for empty base_href. Makes relative links work when base_href is not nil but empty ("").
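The empty base_href case can be sketched as follows. This is only an illustration of the idea, not the gem's actual code; `absolutify` and its parameters are hypothetical names.

```ruby
require 'uri'

# Hypothetical helper: resolve a link against the page's <base href>,
# falling back to the page URL when the base is nil or an empty string.
def absolutify(link, page_url, base_href)
  base = if base_href.nil? || base_href.empty?
           page_url
         else
           URI.join(page_url, base_href).to_s
         end
  URI.join(base, link).to_s
end

absolutify('/about', 'http://example.com/blog/', '')   # resolved against the page URL
absolutify('post', 'http://example.com/blog/', '/docs/') # resolved against the base href
```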
- Drop support for Ruby 2.4, add support for Ruby 2.7.
- Upgrade to Faraday 1.0.
- Added #feeds method to retrieve all feeds of a page.
- Adds deprecation warning on #feed method.
- Added h1..h6 support.
- New feature: :encoding option to force the encoding of a parsed document.
- Improvement: make best_title and best_author work by order of preference, rather than length.
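The difference between picking by preference and picking by length can be sketched like this. The helper name and the candidate order are assumptions for illustration; the gem's real candidate list may differ.

```ruby
# Hypothetical helper: return the first non-empty candidate, in the order given.
def best_by_preference(candidates)
  candidates.map { |c| c.to_s.strip }.find { |c| !c.empty? }
end

# Candidates listed from most to least preferred (assumed order).
candidates = ['', 'Short OG title', 'A much longer title taken from the first h1 element']

by_preference = best_by_preference(candidates)     # first usable candidate
by_length     = candidates.max_by { |c| c.length } # the older, length-based choice
```

With this sample data the two strategies disagree: preference keeps the short title that appears earlier, while length picks the h1 text.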
- New feature: adds author, best_author.
- Bugfix: adds presence validation for empty string on meta tag image values.
- Improves spider and links checker examples.
- Uses WebMock instead of FakeWeb in tests.
- Supports Gzipped responses.
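As a rough illustration of what handling a gzipped body involves, here is a standard-library sketch. It is not how the gem does it internally; `decode_body` is a hypothetical helper.

```ruby
require 'zlib'

# Hypothetical helper: inflate the body only when the server says it is gzipped.
def decode_body(body, content_encoding)
  content_encoding.to_s.downcase == 'gzip' ? Zlib.gunzip(body) : body
end

compressed = Zlib.gzip('<html><title>Hello</title></html>')
html = decode_body(compressed, 'gzip')
```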
- Adds method best_description and makes description return just the meta description.
- Removes support for Ruby 2.0.0 and adds support for 2.4.0.
- Returns secondary description if meta description is empty.
- Adds a custom timeout on top of the ones for Faraday, and sets defaults for timeouts.
- Eliminates possible NULL char in HTML which breaks nokogiri.
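A minimal sketch of the NULL character cleanup, assuming it happens on the raw string before parsing; `strip_null_chars` is a hypothetical name.

```ruby
# Hypothetical helper: strip NULL characters before handing HTML to a parser.
def strip_null_chars(html)
  html.delete("\u0000")
end

clean = strip_null_chars("<p>a\u0000b</p>")
```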
- Removes the deprecated html_content_only option, and replaces it by allow_non_html_content, by default false.
- Deprecates the html_content_only option, and turns it on by default.
- Removes the ExceptionLog, all exceptions are now encapsulated in our own exception classes and always raised.
- MetaInspector can be configured to use Faraday::HttpCache to cache page responses. For that you should pass the faraday_http_cache option with at least the :store key, for example:
cache = ActiveSupport::Cache.lookup_store(:file_store, '/tmp/cache')
page = MetaInspector.new('http://example.com', faraday_http_cache: { store: cache })
- Bugfixes:
  - Parsing of the document is done as soon as it is initialized (just like we do with the request), so that parsing errors will be caught earlier.
  - Rescues from Faraday::SSLError.
- Faraday can be passed options via :faraday_options. This is useful when we need to customize the way we request the page, for example to disable SSL verification:
MetaInspector.new('https://example.com')
# Faraday::SSLError: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed
MetaInspector.new('https://example.com', faraday_options: { ssl: { verify: false } })
# Now we can access the page
- The Document API now includes access to head/link elements:
  - page.head_links returns an array of hashes of all head/links.
  - page.stylesheets returns head/links where rel='stylesheet'
  - page.canonicals returns head/links where rel='canonical'
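To show how the filtered views relate to the full list, here is a sketch over sample data. The symbol keys and the sample values are assumptions; only the array-of-hashes shape comes from the entry above.

```ruby
# Sample data: one hash per <link> element found in <head> (made-up values).
head_links = [
  { rel: 'stylesheet', href: '/css/site.css' },
  { rel: 'canonical',  href: 'http://example.com/' },
  { rel: 'alternate',  href: '/feed.xml', type: 'application/rss+xml' }
]

# The two filtered views are plain selections on the rel attribute.
stylesheets = head_links.select { |link| link[:rel] == 'stylesheet' }
canonicals  = head_links.select { |link| link[:rel] == 'canonical' }
```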
- The URL API can remove common tracking parameters from the querystring:
  - url.tracked? will tell you if the url contains known tracking parameters
  - url.untracked_url will return the url with known tracking parameters removed
  - url.untrack! will remove the tracking parameters from the url
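The idea behind tracking-parameter removal can be sketched with the standard library alone. The parameter list below is an assumption for illustration and may not match the gem's list; the functions are standalone stand-ins, not the gem's methods.

```ruby
require 'uri'

# Assumed list for illustration; the gem's real list of tracking parameters may differ.
TRACKING_PARAMS = %w[utm_source utm_medium utm_term utm_content utm_campaign].freeze

# True when the query string contains at least one known tracking parameter.
def tracked?(url)
  query = URI.parse(url).query
  return false if query.nil?

  URI.decode_www_form(query).any? { |key, _| TRACKING_PARAMS.include?(key) }
end

# Returns the URL with known tracking parameters removed, keeping the rest.
def untracked_url(url)
  uri = URI.parse(url)
  return url if uri.query.nil?

  kept = URI.decode_www_form(uri.query).reject { |key, _| TRACKING_PARAMS.include?(key) }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

untracked_url('http://example.com/page?id=1&utm_source=news')
```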
- The images API has been extended:
  - page.images.with_size returns a sorted array (by descending area) of [image_url, width, height]
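The ordering described above, sorting [image_url, width, height] triples by descending area, looks like this on made-up sample data:

```ruby
# Sample [image_url, width, height] triples; the values are made up.
images = [
  ['http://example.com/icon.png',    32,  32],
  ['http://example.com/hero.jpg',  1200, 600],
  ['http://example.com/thumb.jpg',  300, 200]
]

# Negating the area turns sort_by's ascending order into descending.
by_area = images.sort_by { |_url, width, height| -(width * height) }
```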
- The default headers now include 'Accept-Encoding' => 'identity' to minimize trouble with servers that respond with malformed compressed responses, as explained here.
- The Document API has been extended with one new method, page.best_title, that returns the longest text available from a selection of candidates.
- to_hash now includes scheme, host, root_url, best_title and description.
- The images API has been extended, with two new methods:
  - page.images.owner_suggested returns the OG or Twitter image, or nil if neither are present.
  - page.images.largest returns the largest image found in the page. This uses the HTML height and width attributes as well as the fastimage gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
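The selection rule for the largest image can be sketched as below, assuming the sizes are already known. `largest_image` is a hypothetical helper, and the gem's real implementation (which may fetch sizes with fastimage) is not reproduced here.

```ruby
# Hypothetical helper: pick the largest image by area, ignoring images whose
# aspect ratio is more extreme than 1:10 or 10:1.
def largest_image(images)
  candidates = images.select do |_url, width, height|
    next false if width.to_i <= 0 || height.to_i <= 0

    ratio = width.to_f / height
    ratio > 0.1 && ratio < 10
  end
  best = candidates.max_by { |_url, width, height| width * height }
  best && best.first
end

largest_image([['banner.png', 2000, 50], ['photo.jpg', 800, 600], ['dot.gif', 1, 1]])
```

In this sample the banner has the widest dimensions but is excluded by the ratio check, so the photo is selected.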
- The criteria for page.images.best has changed slightly: we'll now return the largest image instead of the first image if no owner-suggested image is found.
- Introduces the :normalize_url option, which lets you disable URL normalization.
- The links API has been changed, now instead of page.links, page.internal_links and page.external_links we have:
page.links.raw # Returns all links found, unprocessed
page.links.all # Returns all links found, unrelativized and absolutified
page.links.http # Returns all HTTP links found
page.links.non_http # Returns all non-HTTP links found
page.links.internal # Returns all internal HTTP links found
page.links.external # Returns all external HTTP links found
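How these groups relate to each other can be sketched with the standard library. This is an approximation for illustration, not the gem's code: `classify_links` is a hypothetical helper, it compares hosts exactly, and it would raise on malformed hrefs.

```ruby
require 'uri'

# Hypothetical helper: classify raw hrefs roughly the way the list above describes.
def classify_links(raw, page_url)
  page_host = URI.parse(page_url).host
  # Resolve every href against the page URL and drop duplicates.
  all  = raw.map { |href| URI.join(page_url, href).to_s }.uniq
  http = all.select { |url| url.start_with?('http://', 'https://') }
  internal, external = http.partition { |url| URI.parse(url).host == page_host }

  { raw: raw, all: all, http: http, non_http: all - http,
    internal: internal, external: external }
end

links = classify_links(['/about', 'http://other.com/', 'mailto:me@example.com'],
                       'http://example.com/blog/')
```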
- The images API has been changed, now instead of page.image we have page.images.best, and instead of page.favicon we have page.images.favicon.
- Now page.image will return the first image in page.images if no OG or Twitter image found, instead of returning nil.
- You can now specify 2 different timeouts, connection_timeout and read_timeout, instead of the previous single timeout.
- The redirect API has been changed, now the :allow_redirections option will expect only a boolean, which by default is true. That is, no more specifying :safe, :unsafe or :all.
- We've dropped support for Ruby < 2.
Also, we've introduced a new feature:
- Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.
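Why persisting cookies avoids a redirection loop can be shown with a toy simulation. Everything here is hypothetical: `FAKE_SERVER` stands in for a site that sets a session cookie and then redirects, and `follow` stands in for a redirect-following client; neither is part of the gem.

```ruby
# Hypothetical stand-in for a server: maps a URL to [status, headers].
# It only serves /login once the session cookie is sent back.
FAKE_SERVER = lambda do |url, cookies|
  case url
  when '/start' then [302, { 'Location' => '/login', 'Set-Cookie' => 'session=abc; Path=/' }]
  when '/login' then cookies['session'] ? [200, {}] : [302, { 'Location' => '/start' }]
  end
end

# Follows redirects, remembering cookies between hops when keep_cookies is true.
def follow(url, keep_cookies: true, limit: 5)
  cookies = {}
  limit.times do
    status, headers = FAKE_SERVER.call(url, cookies)
    if keep_cookies && headers['Set-Cookie']
      name, value = headers['Set-Cookie'].split(';').first.split('=', 2)
      cookies[name] = value
    end
    return url if status == 200

    url = headers['Location']
  end
  raise 'too many redirects'
end

follow('/start') # reaches /login because the session cookie is sent back
```

Calling `follow('/start', keep_cookies: false)` bounces between the two URLs until the limit is hit, which is the loop this change addresses.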