Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated Parser Generation and Serialization #333

Open
wants to merge 18 commits into
base: develop
Choose a base branch
from

Conversation

dllliu
Copy link
Collaborator

@dllliu dllliu commented Jul 9, 2024

  • Package parser/serialization/metrics work into SerializedParserMetrics class
  • Pre processing/clustering for building point labels with DBScan
  • Parsers dynamically ran via dynamic module loading at runtime to obtain emitted tokens
  • Multiple abbreviation list support (tools to merge/sort abbreviations)

Checklist

  • Example notebook showcasing how to use
  • Unit tests for SerializedParserMetrics class
  • Detailed Documentation for relevant functions and SerializedParserMetrics class
  • Exception Handling for file/io errors, not enough points to cluster, invalid llm output
  • Parser and Serializer work has been combined

Notes

  • Schikit-Learn requires Python version of at least 3.9
  • Other added packages are: langchain, langchain-community, pyenchant, scikit-learn
  • Running all unit tests will take longer (SerializedParserMetrics class takes time to populate)

@dllliu dllliu requested a review from gtfierro July 9, 2024 20:56
@gtfierro
Copy link
Collaborator

Hey @dllliu and @TShapinsky -- does this incorporate changes from any outstanding PRs? I want to make sure my review sticks to @dllliu's code

@TShapinsky
Copy link
Member

Hey @dllliu and @TShapinsky -- does this incorporate changes from any outstanding PRs? I want to make sure my review sticks to @dllliu's code

There's a cherrypicked commit from the create parser UI branch which fixes the serialization. Besides that, no

Copy link
Collaborator

@gtfierro gtfierro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @dllliu ! Let me know if you have any questions on my comments

notebooks/examples/basic.csv Outdated Show resolved Hide resolved
buildingmotif/label_parsing/SerializedParserMetrics.py Outdated Show resolved Hide resolved
buildingmotif/label_parsing/SerializedParserMetrics.py Outdated Show resolved Hide resolved
from buildingmotif.label_parsing.tools import abbreviationsTool, codeLinter


class SerializedParserMetrics:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you had something else in mind, but I'm wondering why this class is building the parser. I think this metrics class is better suited as an output of another method which builds the parsers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I am misinterpreting, but the generate_parsers_for_points() and generate_parsers_for_clusters() methods in __init__ are building the parsers, while the class just populates the relevant instance variables with the appropriate information.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but these are called from this class. I think the entrypoint should be something like a ParserBuilder class which calls those generation methods, and emits the Metrics class. We are not constructing the metrics. We are constructing parsers! The metrics are a side-effect of this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gtfierro The ParserBuilder class now emits a ParserMetrics class which takes care of serialization and gathering parsing related metrics. The ParserBuilder class still keeps track of the clustering and distance metrics, as that is directly related to parser generation. The notebook and tests have also been updated, let me know what you think.

buildingmotif/label_parsing/combinators.py Outdated Show resolved Hide resolved
buildingmotif/label_parsing/combinators.py Outdated Show resolved Hide resolved
buildingmotif/label_parsing/docs/usage.py Outdated Show resolved Hide resolved
buildingmotif/label_parsing/docs/usage.py Outdated Show resolved Hide resolved
pyproject.toml Outdated
Comment on lines 42 to 46
abbreviations = "^0.2.5"
langchain = "^0.2.3"
langchain-community = "^0.2.4"
pyenchant = "^3.2.2"
scikit-learn = "^1.5.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we gate your dependencies behind a feature flag? Call it something like label_parsing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit also addresses this, please let me know if this is resolved.

notebooks/examples/usage.md Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants