feat(table): add table support for md reader and builder #350

Open
wants to merge 53 commits into
base: 0.6.1_dev
2ce212d
add client
royzhao Jan 7, 2025
9bafff6
add law
royzhao Jan 7, 2025
f3f6c16
fix
royzhao Jan 8, 2025
90c4af1
add extra
royzhao Jan 8, 2025
5203220
ok
royzhao Jan 8, 2025
e37a993
add corpus
royzhao Jan 12, 2025
e24d234
Merge remote-tracking branch 'origin/master' into kag_law_test
royzhao Jan 12, 2025
671a9a0
fix mix reader (#270)
zhuzhongshu123 Jan 14, 2025
6494fd2
feat(builder): add Azure Open AI Compatibility (#269)
joseosvaldo16 Jan 14, 2025
248b225
fix(builder): fix markdown reader for id (#273)
northmachine Jan 14, 2025
a40980a
fix(examples): fix qa file name (#251)
zhuzhongshu123 Jan 14, 2025
d4e0094
Merge remote-tracking branch 'origin/master' into kag_law_test
royzhao Jan 15, 2025
ca31351
support custom kag config file (#279)
zhuzhongshu123 Jan 15, 2025
deae277
feat(bridge): spg server bridge supports config check and run solver …
zhuzhongshu123 Jan 17, 2025
7666ca4
feat(kag): catch unexpected exceptions (#298)
zhuzhongshu123 Jan 17, 2025
4ad5bde
delete checkpoint of postprocess (#302)
zhuzhongshu123 Jan 18, 2025
e1e285a
support math and deduce
royzhao Jan 20, 2025
dd3222d
add law
royzhao Jan 20, 2025
1e57016
disable entity linking in postprocess by default (#304)
zhuzhongshu123 Jan 20, 2025
474e9f7
add law qa prompt
royzhao Jan 20, 2025
dd8233c
Merge branch 'master' of github.com:OpenSPG/KAG into kag_law_test
royzhao Jan 20, 2025
cdf0ea3
add retry (#306)
royzhao Jan 20, 2025
8f435ba
add custom memory
royzhao Jan 20, 2025
3348dfe
use json repair for llm client (#312)
zhuzhongshu123 Jan 21, 2025
c186871
merge master
royzhao Jan 21, 2025
8323b0e
fix ner
royzhao Feb 8, 2025
319d061
Merge remote-tracking branch 'origin/kag_law_test' into kag_law_test
royzhao Feb 8, 2025
80d2729
table build
youdonghai Feb 18, 2025
c743e56
rename
royzhao Feb 19, 2025
7d2b5f6
update
youdonghai Feb 19, 2025
3bed697
add rate limiter
royzhao Feb 19, 2025
8f21d17
add rate limiter
royzhao Feb 19, 2025
59aeb4c
add LawOB leaderboard
royzhao Feb 19, 2025
f42ef3a
merge 0.6.1_dev
royzhao Feb 19, 2025
3e1cfda
fix conflict
royzhao Feb 19, 2025
a8f9ea1
add parsing test code
royzhao Feb 19, 2025
a846c8d
modify prompt
royzhao Feb 19, 2025
8470aae
modify prompt
royzhao Feb 19, 2025
b196ac7
fix prompt
royzhao Feb 19, 2025
0fab747
add solver memory
royzhao Feb 19, 2025
13e1343
support math and deduce with rate limiter
royzhao Jan 17, 2025
662d104
tmp
youdonghai Feb 21, 2025
883d27e
tmp
youdonghai Feb 24, 2025
d080dc1
merge
youdonghai Feb 24, 2025
517adf9
update
youdonghai Feb 25, 2025
4b9ff95
update
youdonghai Feb 25, 2025
02449d1
update
youdonghai Feb 25, 2025
e1a02c8
update
youdonghai Feb 25, 2025
3535430
update
youdonghai Feb 25, 2025
3a29ff5
update
youdonghai Feb 25, 2025
adaa905
update
youdonghai Feb 25, 2025
0e217e5
update
youdonghai Feb 25, 2025
f71998b
update
youdonghai Feb 25, 2025
4 changes: 3 additions & 1 deletion .gitignore
@@ -15,4 +15,6 @@
.idea/
.venv/
__pycache__/
.DS_Store
.DS_Store
.env
**ckpt/
1 change: 1 addition & 0 deletions kag/__init__.py
@@ -224,6 +224,7 @@
import kag.solver.prompt
import kag.common.vectorize_model
import kag.common.llm
import kag.common.rate_limiter
import kag.common.checkpointer
import kag.solver
import kag.bin.commands
14 changes: 14 additions & 0 deletions kag/benchmark/LawOB/builder/__init__.py
@@ -0,0 +1,14 @@
# Copyright 2023 OpenSPG Authors
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied.

"""
Builder Dir.
"""
14 changes: 14 additions & 0 deletions kag/benchmark/LawOB/builder/data/__init__.py
@@ -0,0 +1,14 @@
# Copyright 2023 OpenSPG Authors
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied.

"""
Place the files to be used for building the index in this directory.
"""
16,822 changes: 16,822 additions & 0 deletions kag/benchmark/LawOB/builder/data/law_corpus.json

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions kag/benchmark/LawOB/builder/data/law_corpus_sub.json
@@ -0,0 +1,14 @@
{
"最高人民法院、最高人民检察院关于办理寻衅滋事刑事案件适用法律若干问题的解释 第六条": [
"最高人民法院、最高人民检察院关于办理寻衅滋事刑事案件适用法律若干问题的解释 第六条 -- 纠集他人三次以上实施寻衅滋事犯罪,未经处理的,应当依照刑法第二百九十三条第二款的规定处罚。"
],
"最高人民法院、最高人民检察院关于办理寻衅滋事刑事案件适用法律若干问题的解释\t法条具体内容": [
"法条具体内容 -- 《最高人民法院、最高人民检察院关于办理寻衅滋事刑事案件适用法律若干问题的解释》旨在依法惩治寻衅滋事犯罪,维护社会秩序。该解释明确了寻衅滋事的认定标准,指出行为人无事生非、借故生非或因日常生活矛盾纠纷而实施的特定行为,一般不认定为寻衅滋事,但有特定情形除外。对于随意殴打他人、追逐、拦截、辱骂、恐吓他人以及强拿硬要或任意损毁、占用公私财物等行为,解释详细列举了构成“情节恶劣”或“情节严重”的具体情形。此外,解释还对公共场所起哄闹事、纠集他人实施寻衅滋事犯罪、以及寻衅滋事与其它犯罪的并罚原则进行了规定。最后,对于认罪、悔罪、赔偿损失或取得谅解的行为人,解释提供了从轻处罚或免予刑事处罚的可能性。"
],
"最高人民法院、最高人民检察院关于办理寻衅滋事刑事案件适用法律若干问题的解释 第八条": [
"最高人民法院、最高人民检察院关于办理寻衅滋事刑事案件适用法律若干问题的解释 第八条 -- 行为人认罪、悔罪,积极赔偿被害人损失或者取得被害人谅解的,可以从轻处罚;犯罪情节轻微的,可以不起诉或者免予刑事处罚。"
],
"最高人民法院、最高人民检察院关于办理寻衅滋事刑事案件适用法律若干问题的解释 第一条": [
"最高人民法院、最高人民检察院关于办理寻衅滋事刑事案件适用法律若干问题的解释 第一条 -- 行为人为寻求刺激、发泄情绪、逞强耍横等,无事生非,实施刑法第二百九十三条规定的行为的,应当认定为“寻衅滋事”。行为人因日常生活中的偶发矛盾纠纷,借故生非,实施刑法第二百九十三条规定的行为的,应当认定为“寻衅滋事”,但矛盾系由被害人故意引发或者被害人对矛盾激化负有主要责任的除外。行为人因婚恋、家庭、邻里、债务等纠纷,实施殴打、辱骂、恐吓他人或者损毁、占用他人财物等行为的,一般不认定为“寻衅滋事”,但经有关部门批评制止或者处理处罚后,继续实施前列行为,破坏社会秩序的除外。"
]
}
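The entries in `law_corpus_sub.json` follow a regular shape: each key is a law name and an article label separated by whitespace (a space or a tab), and each value is a list of passages of the form `<title> -- <content>`. A minimal sketch of how such a file could be turned into dict records with the keys `law_name`, `item_name`, `item_content` and `index` follows. The helper name `corpus_to_records`, the choice of the key as `item_name`, and the enumeration-based `index` are assumptions for illustration, not the reader used in this PR.

```python
from typing import Dict, Iterator, List


def corpus_to_records(corpus: Dict[str, List[str]]) -> Iterator[dict]:
    """Yield one record per passage (hypothetical helper, not part of KAG)."""
    for i, (key, passages) in enumerate(corpus.items()):
        parts = key.split(maxsplit=1)  # splits on a space or a tab
        if not parts:
            continue  # skip empty keys
        law_name = parts[0]
        for passage in passages:
            # Passages look like "<title> -- <content>".
            _title, sep, content = passage.partition(" -- ")
            if not sep:
                continue  # skip passages without the separator
            yield {
                "law_name": law_name,
                "item_name": key,        # assumption: the key doubles as the item name
                "item_content": content,
                "index": i + 1,          # assumption: 1-based position in the file
            }


if __name__ == "__main__":
    sample = {"LawA Article 6": ["LawA Article 6 -- some text"]}
    for record in corpus_to_records(sample):
        print(record)
```

Splitting on the first run of whitespace works for the sample because the law names themselves contain no spaces; a corpus with spaces inside law names would need a different delimiter.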
Empty file.
308 changes: 308 additions & 0 deletions kag/benchmark/LawOB/builder/impl/law_extra.py
@@ -0,0 +1,308 @@
import copy
import json
import pandas as pd
from typing import Type, Dict, List

from kag.common.utils import processing_phrases, to_camel_case
from kag.solver.logic.core_modules.common.schema_utils import SchemaUtils
from kag.solver.logic.core_modules.common.text_sim_by_vector import TextSimilarity
from kag.solver.logic.core_modules.config import LogicFormConfiguration
from knext.search.client import SearchClient

from kag.builder.model.sub_graph import SubGraph
from knext.common.base.runnable import Input, Output
from kag.common.conf import KAG_PROJECT_CONF
from knext.schema.client import SchemaClient

from kag.interface import LLMClient, PromptABC, VectorizeModelABC, ExtractorABC


@ExtractorABC.register("law_schema_constraint_extractor")
class LawSchemaConstraintExtractor(ExtractorABC):
    """
    Perform knowledge extraction for enforcing schema constraints, including entities, events and their edges.
    The types of entities and events, along with their respective attributes, are automatically inherited from the project's schema.
    """

    def __init__(
        self,
        vectorize_model: VectorizeModelABC,
        llm: LLMClient
    ):
        """
        Initializes the LawSchemaConstraintExtractor instance.

        Args:
            vectorize_model (VectorizeModelABC): The embedding model used for text similarity matching.
            llm (LLMClient): The language model client used for extraction.
        """
        super().__init__()
        self.llm = llm
        self.vectorize_model = vectorize_model
        self.text_similarity = TextSimilarity(vectorize_model)
        self.schema: SchemaUtils = SchemaUtils(
            LogicFormConfiguration(
                {
                    "KAG_PROJECT_ID": KAG_PROJECT_CONF.project_id,
                    "KAG_PROJECT_HOST_ADDR": KAG_PROJECT_CONF.host_addr,
                }
            )
        )
        self._init_search()
        # NOTE: hard-coded absolute path to a local file; consider making it configurable.
        self.df = pd.read_csv('/Users/peilong/Downloads/chargeInLaw.csv')
        self.df.set_index("name", inplace=True)
        # Maps each charge name to its CSV row.
        self.charge_law = self.df.to_dict(orient='index')

        # Inverted index: legal item -> names of the charges that refer to it.
        self.item_2_charge = {}
        for k, v in self.charge_law.items():
            legal_item = v['legalRepresentative']
            if legal_item in self.item_2_charge:
                self.item_2_charge[legal_item].append(k)
            else:
                self.item_2_charge[legal_item] = [k]


    def _init_search(self):
        """
        Initializes the search client for entity linking.
        """
        self._search_client = SearchClient(
            KAG_PROJECT_CONF.host_addr, KAG_PROJECT_CONF.project_id
        )

    @property
    def input_types(self) -> Type[Input]:
        return Dict

    @property
    def output_types(self) -> Type[Output]:
        return SubGraph

    def parse_nodes_and_edges(self, entities: List[Dict], category: str = None):
        """
        Parses nodes and edges from a list of entities.

        Args:
            entities (List[Dict]): The list of entities.
            category (str): The fallback category for entities that do not carry one.

        Returns:
            Tuple[List[Tuple[str, str]], List[Node], List[Edge]]: The (name, label) pairs
            of the root nodes, followed by the parsed nodes and edges.
        """
        graph = SubGraph([], [])
        entities = copy.deepcopy(entities)
        root_nodes = []
        for record in entities:
            if record is None:
                continue
            if isinstance(record, str):
                record = {"name": record}
            s_name = record.get("name", "")
            s_label = record.get("category", category)
            linked_entity = self.link_entity(entity_name=s_name, entity_type=s_label)
            s_name = linked_entity.get("name", "")
            s_label = linked_entity.get("category", category)

            properties = record.get("properties", {})
            # At times, the name and/or label is placed in the properties.
            if not s_name:
                s_name = properties.pop("name", "")
            if not s_label:
                s_label = properties.pop("category", "")
            if not s_name or not s_label:
                continue
            s_name = processing_phrases(s_name)
            root_nodes.append((s_name, s_label))
            linked_properties = linked_entity.get("properties", {})
            linked_properties.update(properties)
            tmp_properties = copy.deepcopy(linked_properties)
            # Drop properties whose value is None from the merged copy.
            for prop_name, prop_value in properties.items():
                if prop_value is None:
                    tmp_properties.pop(prop_name)
                    continue
            record["properties"] = tmp_properties
            # NOTE: For property converted to nodes/edges, we keep a copy of the original property values.
            # Perhaps it is not necessary?
            graph.add_node(id=s_name, name=s_name, label=s_label, properties=properties)

            if "official_name" in record:
                official_name = processing_phrases(record["official_name"])
                if official_name != s_name:
                    graph.add_node(
                        id=official_name,
                        name=official_name,
                        label=s_label,
                        properties=properties,
                    )
                    graph.add_edge(
                        s_id=s_name,
                        s_label=s_label,
                        p="OfficialName",
                        o_id=official_name,
                        o_label=s_label,
                    )

        return root_nodes, graph.nodes, graph.edges

    @staticmethod
    def add_relations_to_graph(
        sub_graph: SubGraph, entities: List[Dict], relations: List[list]
    ):
        """
        Add edges to the subgraph based on a list of relations and entities.

        Args:
            sub_graph (SubGraph): The subgraph to add edges to.
            entities (List[Dict]): A list of entities, for looking up category information.
            relations (List[list]): A list of relations, each representing a relationship to be added to the subgraph.

        Returns:
            The constructed subgraph.
        """

        for rel in relations:
            if len(rel) != 5:
                continue
            s_name, s_category, predicate, o_name, o_category = rel
            s_name = processing_phrases(s_name)
            sub_graph.add_node(s_name, s_name, s_category)
            o_name = processing_phrases(o_name)
            sub_graph.add_node(o_name, o_name, o_category)
            edge_type = to_camel_case(predicate)
            if edge_type:
                sub_graph.add_edge(s_name, s_category, edge_type, o_name, o_category)
        return sub_graph

    def assemble_subgraph(
        self,
        entities: List[Dict],
        relations: List[list],
    ):
        """
        Assembles a subgraph from the given entities and relations.

        Args:
            entities (List[Dict]): The list of entities.
            relations (List[list]): The list of relations, each a 5-element list of
                [subject name, subject category, predicate, object name, object category].

        Returns:
            The constructed subgraph.
        """
        graph = SubGraph([], [])
        _, entity_nodes, entity_edges = self.parse_nodes_and_edges(entities)
        graph.nodes.extend(entity_nodes)
        graph.edges.extend(entity_edges)
        self.add_relations_to_graph(graph, entities, relations)
        return graph

    def link_entity(self, entity_type, entity_name):
        return {
            "name": entity_name,
            "category": entity_type
        }
# res = self._search_client.search_vector(self.schema.get_label_within_prefix(entity_type), property_key="name",
# query_vector=self.vectorize_model.vectorize(entity_name), topk=1)
# if len(res) == 0 or res[0]['score'] < 0.95:
# if len(res) and res[0]['score'] > 0.9:
# print(f"{res[0]['node']['name']} not same with {entity_name}")
# return {
# "name": entity_type,
# "category": entity_type
# }
# def extra_label(node):
# labels = node['__labels__']
# for label in labels:
# if label != "Entity":
# return self.schema.get_label_without_prefix(label)
# return None
#
# def extra_properties(node):
# prop = {}
# for k,v in node.items():
# if k.startswith("_"):
# continue
# prop[k]=v
# return prop
# label = extra_label(res[0]['node'])
# prop = extra_properties(res[0]['node'])
# if label is None:
# return {
# "name": entity_type,
# "category": entity_type
# }
# return {
# "name": res[0]['node']['name'],
# "category": label,
# "properties": prop
# }

    def _invoke(self, input: Input, **kwargs) -> List[Output]:
        """
        Invokes the extractor on the given input.

        Args:
            input (Input): The input data, a dict of the form:
                "law_name": law_name,
                "item_name": item_name,
                "item_content": item_content,
                "index": i+1,
            **kwargs: Additional keyword arguments.

        Returns:
            List[Output]: The list of output results.
        """
        law_name = input["law_name"]
        entities = [{
            "name": law_name,
            "category": "LegalName"
        }]
        relations = []
        # Relations produced below:
        #   LegalItem-relatedChargeName->ChargeName
        #   LegalItem-belongToLaw->LegalName
        #   LegalItem-belongToItem->ItemIndex

        item_name = input['item_name']
        item_content = input['item_content']
        entities.append({
            "name": item_name,
            "category": "LegalItem",
            "properties": {
                "name": item_name,
                "content": item_content
            }
        })
        relations.append([
            item_name, "LegalItem", "belongToLaw", law_name, "LegalName"
        ])
        # Item index taken from the item name with the law name stripped off.
        entities.append({
            "name": item_name.replace(law_name, ''),
            "category": "ItemIndex"
        })
        relations.append([
            item_name, "LegalItem", "belongToItem", item_name.replace(law_name, ''), "ItemIndex"
        ])
        # Item index built from the numeric position: "第N条" means "Article N".
        entities.append({
            "name": f"第{input['index']}条",
            "category": "ItemIndex"
        })
        relations.append([
            item_name, "LegalItem", "belongToItem", f"第{input['index']}条", "ItemIndex"
        ])
        # "刑法" means "Criminal Law"; only criminal-law items are linked to charge names.
        if "刑法" in item_name:
            charge_name_set = self.text_similarity.text_sim_result(
                item_name, list(self.item_2_charge.keys()), topk=1,
                low_score=0.96, is_cached=False
            )
            if len(charge_name_set):
                print(f"charge name {item_name} sim {charge_name_set}")
                charge_item = charge_name_set[0][0]
                charges = list(set(self.item_2_charge[charge_item]))
                for c in charges:
                    entities.append({
                        "name": processing_phrases(c),
                        "category": "ChargeName"
                    })
                    relations.append([
                        item_name, "LegalItem", "relatedChargeName", processing_phrases(c), "ChargeName"
                    ])

        subgraph = self.assemble_subgraph(entities, relations)
        return [subgraph]
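
The data flow of `law_extra.py` can be followed without a running KAG project. The sketch below mirrors two steps in plain Python: inverting the charge table into a legal-item-to-charges index (as `__init__` does with `item_2_charge`) and emitting the three fixed relations that `_invoke` builds for every input record. The function names `build_item_to_charges` and `build_relations` are hypothetical, the sample rows are made up, and the similarity-based `relatedChargeName` step is left out because it depends on the vectorize model.

```python
from collections import defaultdict
from typing import Dict, List


def build_item_to_charges(charge_law: Dict[str, dict]) -> Dict[str, List[str]]:
    """Invert {charge name: row} into {legal item: [charge names]}."""
    item_to_charges = defaultdict(list)
    for charge_name, row in charge_law.items():
        item_to_charges[row["legalRepresentative"]].append(charge_name)
    return dict(item_to_charges)


def build_relations(record: dict) -> List[List[str]]:
    """Return the 5-element relation lists emitted for one legal item."""
    law_name = record["law_name"]
    item_name = record["item_name"]
    item_suffix = item_name.replace(law_name, "")  # e.g. " 第六条"
    article_label = f"第{record['index']}条"        # "Article N"
    return [
        [item_name, "LegalItem", "belongToLaw", law_name, "LegalName"],
        [item_name, "LegalItem", "belongToItem", item_suffix, "ItemIndex"],
        [item_name, "LegalItem", "belongToItem", article_label, "ItemIndex"],
    ]


if __name__ == "__main__":
    # Made-up rows standing in for the CSV content.
    table = {
        "theft": {"legalRepresentative": "Art 264"},
        "robbery": {"legalRepresentative": "Art 263"},
        "burglary": {"legalRepresentative": "Art 264"},
    }
    print(build_item_to_charges(table))
    print(build_relations({"law_name": "刑法", "item_name": "刑法 第一条", "index": 1}))
```

Note that the stripped suffix keeps its leading whitespace here; in the extractor, names pass through `processing_phrases` before they become node IDs.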