simplify directory structure for framework/ #50

Closed
jspaaks opened this issue Jul 13, 2020 · 9 comments
@jspaaks
Contributor

jspaaks commented Jul 13, 2020

Below is what I think is needed to prepare the model for use in the web interface.

Couple observations:

  • I think we can ditch docker-compose (for /framework, not for /).
  • I've incorporated the English demo dataset from #47 (multi-language sample data sets).
  • /preprocessing/Dockerfile is only there to provide hassle-free dependency management, so that people don't have to install conda, Jupyter Notebook and so on.

Update: the updated directory structure is further down, see #50 (comment).

.
├── corpora
│   ├── en-us
│   │   └── police-aggression
│   │       ├── file1.txt
│   │       └── file2.txt
│   ├── fr
│   │   ├── baguettes
│   │   │   ├── file1.txt
│   │   │   └── file2.txt
│   │   └── croissants
│   │       └── file1.txt
│   └── nl
│       └── getuigenverhalen
│           ├── file1.txt
│           ├── file2.txt
│           └── file3.txt
├── docker-compose.yml
├── output
│   ├── README.md
│   ├── tag-1
│   │   ├── model
│   │   ├── report.log
│   │   └── vocabulary
│   ├── tag-2
│   │   ├── model
│   │   ├── report.log
│   │   └── vocabulary
│   └── tag-n
│       ├── model
│       ├── report.log
│       └── vocabulary
├── model-generation
│   ├── Dockerfile
│   └── notebooks
│       ├── corpus_reinferral.ipynb
│       ├── filepaths.ipynb
│       ├── generate_doc2vec_model.ipynb
│       ├── preprocess_corpus.ipynb
│       └── tokenizer.ipynb
└── ui
    ├── indexer
    ├── internal
    ├── testdata
    └── ui
@sverhoeven
Member

  1. A model (aka doc2vec json) is based on a single corpus. How would you store a model for each corpus in this directory structure?
  2. Is a vocabulary a (pre)processed corpus?
  3. Handling different languages is not part of the milestone. I think we should only introduce a language directory if we are going to use it. Can the language dir be removed?

@sverhoeven
Member

output is a confusing directory name, as it is input for the ui. Can output be renamed or removed?

@sverhoeven
Member

The directory structure ties in with how we want a user to use the repo, which is drafted in #41

@jspaaks
Contributor Author

jspaaks commented Jul 13, 2020

> 1. A model (aka doc2vec json) is based on a single corpus. How would you store a model for each corpus in this directory structure?

I updated the tree to have a tag directory, where a tag serves to identify a given analysis.

> Is a vocabulary a (pre)processed corpus?

Vocabulary is the combination of what is now _ids and _corpus, i.e. whatever is produced by gensim.

> Handling different languages is not part of the milestone. I think we should only introduce a language directory if we are going to use it. Can the language dir be removed?

Just to be clear, your preference is to have corpora/getuigenverhalen instead of corpora/nl/getuigenverhalen even if nl doesn't have siblings?

@sverhoeven
Member

> > 1. A model (aka doc2vec json) is based on a single corpus. How would you store a model for each corpus in this directory structure?
>
> I updated the tree to have a tag directory, where a tag serves to identify a given analysis.

How would an output/tag-1 output be traceable to one of the corpora? By using the corpus name in the tag name?

> > Is a vocabulary a (pre)processed corpus?
>
> Vocabulary is the combination of what is now _ids and _corpus, i.e. whatever is produced by gensim.

> > Handling different languages is not part of the milestone. I think we should only introduce a language directory if we are going to use it. Can the language dir be removed?
>
> Just to be clear, your preference is to have corpora/getuigenverhalen instead of corpora/nl/getuigenverhalen even if nl doesn't have siblings?

Yes, corpora/getuigenverhalen is my preference.

@fdiblen removed the standup label Jul 13, 2020
@fdiblen added this to the Milestone 1 milestone Jul 13, 2020
@jspaaks self-assigned this Jul 13, 2020
@jspaaks
Contributor Author

jspaaks commented Jul 13, 2020

(Updated directory structure after some discussion)

.
├── docker-compose.yml
├── experiments
│   ├── demo
│   │   ├── config.env
│   │   ├── corpus
│   │   │   ├── file1.txt
│   │   │   ├── file2.txt
│   │   │   └── file3.txt
│   │   ├── model
│   │   ├── report.log
│   │   └── vocabulary.json
│   ├── getuigenverhalen
│   │   ├── config.env
│   │   ├── corpus
│   │   │   ├── file1.txt
│   │   │   ├── file2.txt
│   │   │   └── file3.txt
│   │   ├── model
│   │   ├── report.log
│   │   └── vocabulary.json
│   └── README.md
├── model-generation
│   ├── Dockerfile
│   └── notebooks
│       ├── corpus_reinferral.ipynb
│       ├── filepaths.ipynb
│       ├── generate_doc2vec_model.ipynb
│       ├── preprocess_corpus.ipynb
│       └── tokenizer.ipynb
└── ui
    ├── indexer
    ├── internal
    ├── testdata
    └── ui
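
To illustrate the per-experiment layout in the tree above, here is a minimal sketch (not part of the repository) that checks whether each directory under experiments/ contains the entries shown. The function names `missing_entries` and `check_experiments` are hypothetical, and because the thread does not say whether `model` is a file or a directory, the sketch only tests that each entry exists.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Entries each experiment directory contains in the tree above.
# Assumption: only existence is checked, not whether an entry is a file or a directory.
EXPECTED_ENTRIES = ("config.env", "corpus", "model", "report.log", "vocabulary.json")


def missing_entries(experiment_dir: Path) -> List[str]:
    """Return the expected entries that are absent from one experiment directory."""
    return [name for name in EXPECTED_ENTRIES if not (experiment_dir / name).exists()]


def check_experiments(experiments_root: Path) -> Dict[str, List[str]]:
    """Map each experiment name to its missing entries; plain files such as README.md are skipped."""
    return {
        child.name: missing_entries(child)
        for child in sorted(experiments_root.iterdir())
        if child.is_dir()
    }


if __name__ == "__main__":
    # Build a throwaway copy of the layout so the check can be run anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "experiments"
        for name in ("demo", "getuigenverhalen"):
            experiment = root / name
            (experiment / "corpus").mkdir(parents=True)
            for entry in ("config.env", "model", "report.log", "vocabulary.json"):
                (experiment / entry).touch()
        (root / "README.md").touch()
        for name, missing in check_experiments(root).items():
            print(name, "OK" if not missing else "missing: " + ", ".join(missing))
```

Keeping the corpus, model and vocabulary together under one experiment name is what makes a model traceable to its corpus, which was the concern raised about the earlier output/tag-n layout.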

@jspaaks
Contributor Author

jspaaks commented Jul 13, 2020

working on it in #58

@jspaaks
Contributor Author

jspaaks commented Jul 15, 2020

I think this can be closed.

@fdiblen
Contributor

fdiblen commented Jul 15, 2020

implemented in #61

@fdiblen closed this as completed Jul 15, 2020