Feature request: filter duplicate images #244

Closed
zengyufei opened this issue May 24, 2024 · 6 comments · Fixed by #245
Labels
enhancement New feature or request

Comments

@zengyufei

Many manhua repeat the same image pages at the head and tail of every chapter.

I'd like the downloader to compute each page's MD5 at download time and count how many times each MD5 value appears. Once a hash appears more than a configurable threshold, e.g. more than 5 times, every subsequently downloaded chapter filters that duplicate image out.

Suppose a series has 200 chapters with 50 images each, 10,000 images in total. If computing the MD5 adds at most 50 ms per image, that is an extra 500 seconds overall, which is acceptable.

In the ideal case, the first 5 chapters yield 2 images (one at the head, one at the tail) that each repeat 5 times, so the remaining 195 chapters can skip 195 × 2 images.

Why this matters: Korean manhua open each chapter with at least 3-5 pages repeating the previous chapter, which makes for very tiring thumb-scrolling. With the threshold set to ≥ 2, the repeated recap pages at the head of every chapter get dropped and the reading experience improves.

I hope you'll accept this request and implement it as a plugin.
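
To make the proposal concrete, here is a minimal sketch of the on-the-fly counting idea. The download loop, `fetch_images`, and `save` are hypothetical placeholders, not jmcomic API:

import hashlib
from collections import Counter

THRESHOLD = 5  # once a hash has been kept this many times, skip further copies
seen = Counter()  # MD5 -> number of times the image has appeared so far

def should_keep(image_bytes: bytes) -> bool:
    """Return False once this image's MD5 has already appeared THRESHOLD times."""
    md5 = hashlib.md5(image_bytes).hexdigest()
    seen[md5] += 1
    return seen[md5] <= THRESHOLD

# hypothetical download loop, for illustration only:
# for chapter in album:
#     for image_bytes in fetch_images(chapter):
#         if should_keep(image_bytes):
#             save(image_bytes)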

@hect0x7
Owner

hect0x7 commented May 24, 2024

Good feature request. Can you give me an album id to use as a test case?

@zengyufei
Author

Thinking about it more: running the duplicate check only after the whole download finishes, matching the results against the user's threshold, and then physically deleting the files would also work.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import hashlib
from collections import defaultdict


def calculate_md5(file_path):
    """Compute the MD5 hash of a file, reading it in 4 KB chunks."""
    hash_md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def find_duplicate_files(root_folder):
    """Recursively walk the folder and count occurrences of each MD5."""
    md5_dict = defaultdict(list)

    for root, _, files in os.walk(root_folder):
        for file in files:
            file_path = os.path.join(root, file)
            file_md5 = calculate_md5(file_path)
            md5_dict[file_md5].append(file_path)

    # print every MD5 that occurs at least twice
    for md5, paths in md5_dict.items():
        if len(paths) >= 2:
            print(f"MD5: {md5} occurrences: {len(paths)}")
            for path in paths:
                print(f"  {path}")


if __name__ == '__main__':
    dir_path = r"G:\Nexon\20240521\故鄉的那些女人"
    find_duplicate_files(dir_path)

Output:

MD5: b6bf41a00359961fbf4a97872df3fbe2 occurrences: 2
  G:\Nexon\20240521\故鄉的那些女人\1\00154.webp
  G:\Nexon\20240521\故鄉的那些女人\10\00050.webp
MD5: f48a18948b692b421742e3557d495443 occurrences: 5
  G:\Nexon\20240521\故鄉的那些女人\31\00119.webp
  G:\Nexon\20240521\故鄉的那些女人\32\00124.webp
  G:\Nexon\20240521\故鄉的那些女人\33\00114.webp
  G:\Nexon\20240521\故鄉的那些女人\34\00116.webp
  G:\Nexon\20240521\故鄉的那些女人\35\00133.webp
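
The script above only reports duplicates; the physical deletion the comment mentions could look like the sketch below, assuming `find_duplicate_files` is adapted to return `md5_dict` instead of printing (whether to spare the first copy of each hash is a policy choice; the plugin further down deletes the whole group):

import os

def delete_duplicates(md5_dict, threshold=2, keep_first=True):
    """Delete files whose MD5 occurs at least `threshold` times."""
    for md5, paths in md5_dict.items():
        if len(paths) < threshold:
            continue
        # optionally spare the first occurrence so one copy survives
        victims = paths[1:] if keep_first else paths
        for path in victims:
            os.remove(path)
            print(f"deleted: {path}")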

@zengyufei
Author

Good feature request. Can you give me an album id to use as a test case?

JM212707

@hect0x7
Owner

hect0x7 commented May 24, 2024

Thinking about it more: running the duplicate check only after the whole download finishes, matching the results against the user's threshold, and then physically deleting the files would also work.

My first thought was the same: detect duplicates after everything has been downloaded. That way this feature can be implemented completely independently of jmcomic.

@hect0x7
Owner

hect0x7 commented May 24, 2024

Implementation example + test code

Test environment: the latest jmcomic code on the dev branch.

from jmcomic import *

import hashlib
import os
from collections import defaultdict


# plugin definition
class DeleteDuplicatedFilesPlugin(JmOptionPlugin):
    plugin_key = 'delete_duplicated_files'

    def calculate_md5(self, file_path):
        """Compute the MD5 hash of a file, reading it in 4 KB chunks."""
        hash_md5 = hashlib.md5()
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        return hash_md5.hexdigest()

    def find_duplicate_files(self, root_folder):
        """Recursively walk the folder and group file paths by MD5."""
        md5_dict = defaultdict(list)

        for root, _, files in os.walk(root_folder):
            for file in files:
                file_path = os.path.join(root, file)
                file_md5 = self.calculate_md5(file_path)
                md5_dict[file_md5].append(file_path)

        return md5_dict

    def invoke(self,
               album=None,
               downloader=None,
               limit=2,
               delete_if_exceed_limit=True,
               **kwargs,
               ) -> None:
        if album is None:
            return

        # resolve the root directory the album was downloaded into
        # (this method was newly added on the latest dev branch)
        root_folder = self.option.dir_rule.decide_album_root_dir(album)
        md5_dict = self.find_duplicate_files(root_folder)

        # report every MD5 that occurs at least `limit` times
        for md5, paths in md5_dict.items():
            if len(paths) >= limit:
                print(f"MD5: {md5} occurrences: {len(paths)}")
                for path in paths:
                    print(f"  {path}")

                # depending on configuration, also delete the files
                if delete_if_exceed_limit:
                    self.do_delete(paths)

    def do_delete(self, paths):
        """Reuse the parent class's deletion logic."""
        self.delete_original_file = True
        # delete the files
        self.execute_deletion(paths)


# register the plugin manually
JmModuleConfig.register_plugin(DeleteDuplicatedFilesPlugin)
op = create_option_by_env()
op.download_album(123)

option configuration

plugins:
  after_album: # invoked each time an album finishes downloading
    - plugin: delete_duplicated_files
      kwargs:
        # occurrence-count limit for an MD5
        limit: 1
        # whether to delete a file when its MD5 occurrence count >= limit
        # note: with limit: 1, the effect is to delete every file
        delete_if_exceed_limit: true
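
To tie the pieces together, a hedged usage sketch that reuses only calls already shown in this thread (`register_plugin`, `create_option_by_env`, `download_album`), assuming the plugin class defined above is in scope and the environment points at the YAML configuration:

from jmcomic import *

# register the plugin class before loading the option,
# so the plugin key in the option file can resolve to it
JmModuleConfig.register_plugin(DeleteDuplicatedFilesPlugin)

# create_option_by_env resolves the option file path from the environment;
# it is assumed here to point at the YAML configuration above
op = create_option_by_env()

# JM212707 was offered as the test case earlier in this thread
op.download_album(212707)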

hect0x7 pinned this issue May 24, 2024
hect0x7 linked a pull request May 24, 2024 that will close this issue
hect0x7 added the enhancement (New feature or request) label May 24, 2024
@zengyufei
Author

Tested it; it works. A threshold of 2 is completely fine for my personal use.

If you set a default value, I'd suggest starting the threshold at 3; 2 feels a bit risky as a default.
