Merge Dev-0.9.3 branch (#623)
* feat(lang): ⚡ Rework of tokenizer. Additionally implemented new (easier) way of adding languages to the package
* feat(lang): added language dependencies as optional
* feat(lang): add Bengali, Nepali, Tamil, Georgian, Marathi, Telugu, Latvian, Czech, Slovak, Burmese language support
* refactor(lang): moving all language related files in languages folder
* refactor(lang): added valid_languages function that returns available languages
* refactor(misc): ⚡ removed ParsingCandidate, RawHelper, URLHelper classes. Removed link_hash from article (was never used)
* refactor(parse): article.link_hash is no longer available
* fix(cli): json output in stdout missing []
* feat(parse): 🔥 article is now pickleable
* feat(parse): 🔥 Source object is now pickleable
* refactor(parse): ✨ Tidying up the gravity scoring process. No changes in the final score result
* refactor(parse): 🚀 compute word statistics for a node taking children nodes into account
* fix(parse): ⚡ Bug with auto detecting website language. If no language supplied, the detected language was not used
* fix(parse): ⚡ added figure as a tag to be removed before text generation
* fix(parse): 🔥 better article paragraph detection
* fix(parse): ⚡ get feeds fixed, it was not parsing the main page for possible feeds
* fix(misc): ✨ tidying up some code in urls.py
* feat(misc): better typing support and type hinting Author: Tom Parker-Shemilt <palfrey@***.net>
* feat(misc): Simplify favicon return Author: Tom Parker-Shemilt <palfrey@***.net>
* feat(misc): Basic mypy support Author: Tom Parker-Shemilt <palfrey@***.net>
* feat(doc): 📝 adding evaluation results
* feat(doc): 🚀 Documentation Update. Added Examples, documented new features
* refactor(core): Minimum Python now 3.8; Also test 3.10/11/12 Author: Tom Parker-Shemilt <palfrey@***.net>
* refactor(core): run gh actions on PRs. Author: Tom Parker-Shemilt <palfrey@***.net>
* refactor(core): Set SETUPTOOLS_USE_DISTUTILS. setuptools as per numpy recommendations. Upgrade numpy and pandas for >= 3.9. Author: Tom Parker-Shemilt <palfrey@***.net>
* refactor(core): Upgrade regex, virtualenv to avoid breaking pre-commit, distutils for everyone. Author: Tom Parker-Shemilt <palfrey@***.net>
* feat(sources): ✨ new option when building sources. You can limit the article parsing to the source home page only. Other categories or feeds are then ignored
* feat(misc): 📈 added cloudscraper as optional dependency. If installed, it will be used as a layer over requests. Cloudscraper tries to bypass Cloudflare protection
* feat(lang): ✨ New integration of Google news using GNews module. You can now use GoogleNewsSource to search and parse news based on keywords, topic, location or website
* fix(parse): ⚡ Better title parsing. Added language specific regex for article titles
* feat(parse): ✨ added filter that limits the source.build to a specific category. Use newspaper.build(url, only_in_path=True) to scrape only stories that are in the starting url path
* fix(parse): 🔥 better binary content detection
* fix(lang): ⚡ better is_highlink_density for non-latin languages
* feat(lang): 📝 added stopwords for af, br, ca, eo, eu, ga, gl, gu, ha, hy, ku, ms, so, st, tl, ur, yo, zu from https://github.com/stopwords-iso
* refactor(parse): 💥 deprecated text_cleaned, clean_doc. Removed clean_top_node; accessing article.clean_top_node now fails
* feat(lang): 🚀 added support for another 13 languages
* fix(misc): 🎨 mypy stubs for gnews and cloudscraper + small typing fixes
* fix(parse): 🐛 better feed discovery in Source objects
* fix(parse): 🐛 fixed an issue with non latin high density detection
* docs(doc): 🔥 Added typing and docstrings to most of the code
* fix(types): 🎨 added stubs for gnews
* fix(misc): 🚑 python-setup github action version bump

Co-authored-by: Tom Parker-Shemilt <palfrey@***.net>
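
The commit message says that Article and Source objects are now pickleable. As a rough illustration of the general technique only (not the code from this commit), the sketch below uses a hypothetical `ParsedPage` class that holds one unpicklable attribute and drops it during pickling. The class name, its attributes, and the use of a `threading.Lock` as the stand-in for unpicklable state are all assumptions made for the example.

```python
import pickle
import threading


class ParsedPage:
    """Hypothetical stand-in for an object with unpicklable internal state."""

    def __init__(self, url: str, text: str) -> None:
        self.url = url
        self.text = text
        # A lock cannot be pickled; it plays the role of any unpicklable member.
        self._lock = threading.Lock()

    def __getstate__(self) -> dict:
        # Copy the instance dict and leave out the unpicklable entry.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state: dict) -> None:
        # Restore the plain attributes and rebuild the dropped member.
        self.__dict__.update(state)
        self._lock = threading.Lock()


page = ParsedPage("https://example.com/story", "Body text")
restored = pickle.loads(pickle.dumps(page))
```

Once an object survives a `pickle.dumps` / `pickle.loads` round trip like this, it can be cached on disk or handed to worker processes, which is presumably the motivation for the change.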
AndyTheFactory and palfrey authored Mar 17, 2024
1 parent 9d99beb commit 1bf3879
Showing 169 changed files with 19,273 additions and 2,039 deletions.
1 change: 1 addition & 0 deletions .codespell-dictionary.txt
@@ -1,3 +1,4 @@
doubleclick
te
shotcut
annonces
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.md
@@ -25,7 +25,7 @@ If applicable, add screenshots to help explain your problem.

**System information**
- OS: [Windows / Linux / Macos]
- Python version [e.g. 3.6, 3.9]
- Python version [e.g. 3.8, 3.9]
- Library version [e.g. 0.9.0]

**Additional context**
21 changes: 13 additions & 8 deletions .github/workflows/pipeline.yml
@@ -4,6 +4,7 @@ on: # events that trigger our pipeline: push on any branch and release creation
push:
release:
types: [created]
pull_request:

jobs: # jobs. We will have two jobs (test and publish) with multiple steps.
test:
@@ -13,11 +14,11 @@ jobs: # jobs. We will have two jobs (test and publish) with multiple steps.
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]

steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
- uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Run image # install poetry
@@ -28,15 +29,19 @@ jobs: # jobs. We will have two jobs (test and publish) with multiple steps.
run: |
python -m pip install --upgrade pip
poetry config virtualenvs.create false --local
poetry install
pip install pytest pylint coverage mypy coveralls
python -m nltk.downloader punkt stopwords
poetry install --all-extras
pip install pylint coveralls
# python -m nltk.downloader punkt stopwords
env:
SETUPTOOLS_USE_DISTUTILS: local
- name: Pylint # Run pylint static analysis
run: |
poetry run pylint newspaper --fail-under=8.0
# - name: mypy # Run mypy static analysis
# run: |
# poetry run mypy -p newspaper
- name: mypy # Run mypy static analysis
run: |
poetry run mypy -p newspaper --config-file mypy.ini
env:
MYPYPATH: stubs
- name: Pytest # Run pytest
run: |
poetry run coverage run -m --source=newspaper pytest tests
6 changes: 4 additions & 2 deletions .github/workflows/pylint.yml
@@ -1,6 +1,8 @@
name: Pylint

on: [push]
on:
push:
pull_request:

jobs:
build:
@@ -11,7 +13,7 @@ jobs:
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
allow-prereleases: false
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -42,6 +42,7 @@ repos:
.*/stopwords.*\.txt|
tests/data/.*|
newspaper/languages.py|
newspaper/languages/.*|
newspaper/resources/.*
)$
additional_dependencies:
4 changes: 1 addition & 3 deletions README.md
@@ -15,9 +15,7 @@ I have duplicated all issues on the original project and will try to fix them. I


## Python compatibility
- Recommended: Python 3.8+
- Python 3.6+ minimum
- Fixes for Python < 3.8 are low priority and might not be merged
- Python 3.8+ minimum

# Quick start

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -36,10 +36,10 @@
# -- General configuration

extensions = [
"sphinx.ext.napoleon",
"sphinx.ext.duration",
"sphinx.ext.doctest",
"sphinx.ext.autodoc",
"sphinx.ext.napoleon",
"sphinx.ext.autosummary",
"sphinx.ext.intersphinx",
"sphinxarg.ext",
4 changes: 1 addition & 3 deletions docs/index.rst
@@ -24,9 +24,7 @@ coding API is kept as much as possible.
Python compatibility
--------------------

- Recommended: Python 3.8+
- Python 3.6+ minimum
- Fixes for Python < 3.8 are low priority and might not be merged
- Python 3.8+ minimum


At a glance:
1 change: 0 additions & 1 deletion docs/user_guide/api_reference.rst
@@ -55,7 +55,6 @@ Source
.. autoclass:: newspaper.Source
.. automethod:: newspaper.Source.__init__
.. automethod:: newspaper.Source.build()
.. automethod:: newspaper.Source.purge_articles()
.. automethod:: newspaper.Source.feeds_to_articles()
.. automethod:: newspaper.Source.categories_to_articles()
.. automethod:: newspaper.Source.generate_articles()
Binary file added docs/user_guide/assets/logo_v1_150.png
Binary file added docs/user_guide/assets/logo_v1_670.png
2 changes: 1 addition & 1 deletion docs/user_guide/installation.rst
@@ -29,7 +29,7 @@ If you want to install the latest version from Github, you can do so::
Requirements
------------

``newspaper4k`` requires Python 3.7 and above to run. It was not tested on
``newspaper4k`` requires Python 3.8 and above to run. It was not tested on
lower versions.

The newspaper4k package has the following dependencies:
1 change: 1 addition & 0 deletions docs/user_guide/languages.rst
@@ -15,6 +15,7 @@ At the moment newspaper supports 37 languages.
ar Arabic
be Belarusian
bg Bulgarian
bn Bengali
da Danish
de German
el Greek
5 changes: 5 additions & 0 deletions mypy.ini
@@ -0,0 +1,5 @@
[mypy]
warn_redundant_casts = True
warn_unused_ignores = True
show_error_codes = True
mypy_path = stubs
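
To make the first two options in this `mypy.ini` more concrete, here is a small sketch of code that runs fine but that mypy would be expected to flag under this configuration. The function `count_words` is invented for the example and does not come from newspaper4k.

```python
from typing import List, cast


def count_words(words: List[str]) -> int:
    # len() is already typed as int, so warn_redundant_casts should
    # report this cast as redundant.
    total = cast(int, len(words))
    # Nothing on this line needs suppressing, so warn_unused_ignores
    # should report the ignore comment as unused.
    return total  # type: ignore


word_total = count_words(["news", "paper"])
```

With `show_error_codes = True`, each reported problem also carries its error code in square brackets, which makes targeted `# type: ignore[code]` comments easier to write.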
11 changes: 7 additions & 4 deletions newspaper/__init__.py
@@ -1,14 +1,14 @@
# -*- coding: utf-8 -*-

# Copyright (c) [2023] [Andrei Paraschiv]
# Copyright (c) [2023-] [Andrei Paraschiv]
#
# This file is part of [Newspaper4k].
# https://github.com/AndyTheFactory/newspaper4k
#
# [Newspaper4k] includes code from the original project,
# [Newspaper3k], which is licensed under [MIT].
# [newspaper4k], which is licensed under [MIT].
#
# I would like to express gratitude to the creator of [Newspaper3k],
# I would like to express gratitude to the creator of [newspaper4k],
# Lucas Ou-Yang (codelucas) for their valuable work.
# You can find the original project here: https://github.com/codelucas/newspaper

@@ -29,12 +29,14 @@
import logging
from logging import NullHandler
from .exceptions import ArticleBinaryDataException, ArticleException
from .languages import valid_languages


# Set default logging handler to avoid "No handler found" warnings.
logging.getLogger(__name__).addHandler(NullHandler())


def article(url: str, language: Optional[str] = "en", **kwargs) -> Article:
def article(url: str, language: Optional[str] = None, **kwargs) -> Article:
"""Shortcut function to fetch and parse a newspaper article from a URL.
Args:
@@ -69,6 +71,7 @@ def article(url: str, language: Optional[str] = "en", **kwargs) -> Article:
"fulltext",
"hot",
"languages",
"valid_languages",
"popular_urls",
"Config",
"Article",
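
The change of the `language` default in `article()` from `"en"` to `None` goes together with the fix in the commit message about the detected website language not being used when no language was supplied. A minimal sketch of the intended precedence is shown below. The helper `resolve_language`, its parameters, and the final `"en"` fallback are assumptions for illustration, not the library's actual code.

```python
from typing import Optional


def resolve_language(
    requested: Optional[str],
    detected: Optional[str],
    fallback: str = "en",  # assumed default, used only when nothing else is known
) -> str:
    """Pick the language to parse with (hypothetical helper)."""
    if requested:
        # An explicitly supplied language always wins.
        return requested
    if detected:
        # Otherwise use what was detected from the page itself.
        return detected
    return fallback


auto = resolve_language(None, "de")
explicit = resolve_language("fr", "de")
unknown = resolve_language(None, None)
```

With a hard-coded `"en"` default, the `requested` branch would always be taken and detection could never take effect, which matches the bug described in the commit message.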
8 changes: 8 additions & 0 deletions newspaper/__main__.py
@@ -1,3 +1,11 @@
"""
Entry point for the newspaper package.
python -m newspaper
This script is used to run the command-line interface (CLI) for the newspaper package.
It imports the `main` function from the `newspaper.cli` module and calls it.
"""

from newspaper.cli import main

if __name__ == "__main__":
51 changes: 37 additions & 14 deletions newspaper/api.py
@@ -1,20 +1,29 @@
# -*- coding: utf-8 -*-
# Much of the code here was forked from https://github.com/codelucas/newspaper
# Copyright (c) Lucas Ou-Yang (codelucas)

"""Module providing a simple API for the newspaper library, wrapping several
classes and functions into simple calls.
"""

from typing import List
import feedparser

from .article import Article
from .configuration import Configuration
from .settings import POPULAR_URLS, TRENDING_URL
from .source import Source
from .utils import print_available_languages
import newspaper.parsers as parsers


def build(url="", dry=False, config=None, **kwargs) -> Source:
from newspaper.article import Article
from newspaper.configuration import Configuration
from newspaper.settings import POPULAR_URLS, TRENDING_URL
from newspaper.source import Source
from newspaper.utils import print_available_languages


def build(
url="",
dry=False,
only_homepage=False,
only_in_path=False,
input_html=None,
config=None,
**kwargs
) -> Source:
"""Returns a constructed :any:`Source` object without
downloading or parsing the articles
@@ -23,6 +32,14 @@ def build(url="", dry=False, config=None, **kwargs) -> Source:
`https://www.cnn.com`.
dry (bool): If true, the source object will be constructed but not
downloaded or parsed.
only_homepage (bool): If true, the source object will only parse
the homepage of the source.
only_in_path (bool): If true, the source object will only
parse the articles that are in the same path as the source's
homepage. You can scrape a specific category this way.
Defaults to False.
input_html (str): The HTML of the source to parse. Use this to pass cached
HTML to the source object.
config (Configuration): A configuration object to use for the source.
kwargs: Any other keyword arguments to pass to the Source constructor.
If you omit the config object, you can add any configuration
@@ -37,7 +54,11 @@ def build(url="", dry=False, config=None, **kwargs) -> Source:
url = url or ""
s = Source(url, config=config)
if not dry:
s.build()
s.build(
only_homepage=only_homepage,
only_in_path=only_in_path,
input_html=input_html,
)
return s
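
The `only_in_path` flag documented above restricts parsing to articles that sit under the same path as the starting URL. How the library decides this is not visible in this hunk, so the following is only a sketch of one plausible rule, written with the standard library. The helper name `is_in_path` and the example URLs are made up.

```python
from urllib.parse import urlparse


def is_in_path(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url lies under the path of start_url."""
    start = urlparse(start_url)
    candidate = urlparse(candidate_url)
    # Articles on a different host are never "in path".
    if start.netloc.lower() != candidate.netloc.lower():
        return False
    base = start.path.rstrip("/")
    # Match the base path itself or anything below it; requiring the
    # trailing slash keeps /sportswear from matching /sports.
    return candidate.path == base or candidate.path.startswith(base + "/")


start = "https://example.com/sports"
inside = is_in_path(start, "https://example.com/sports/2024/final")
sibling = is_in_path(start, "https://example.com/sportswear/sale")
other = is_in_path(start, "https://example.com/politics/vote")
```

According to the signature in this diff, the real call would look like `build("https://example.com/sports", only_in_path=True)`, with `only_homepage=True` as the stricter alternative that ignores categories and feeds.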


@@ -77,9 +98,10 @@ def hot():
return None


def fulltext(html, language="en"):
"""Takes article HTML string input and outputs the fulltext
Input string is decoded via UnicodeDammit if needed
def fulltext(html: str, language: str = "en") -> str:
"""Takes article HTML string input and outputs the extracted
article text. No Title, Author, Date parsing is done.
No http requests are performed.
"""
from .cleaners import DocumentCleaner
from .configuration import Configuration
@@ -88,6 +110,7 @@ def fulltext(html, language="en"):

config = Configuration()
config.language = language
config.fetch_images = False

extractor = ContentExtractor(config)
document_cleaner = DocumentCleaner(config)