Choosing parameters for large dataset of short texts #2

Open
bwang482 opened this issue Mar 1, 2018 · 0 comments

bwang482 commented Mar 1, 2018

Thanks for your great work Joe!

Following the provided notebook, I have been trying to use hlda to infer topics on a large set (~100,000 docs) of short texts with a vocabulary size of 15,000. Sampling is very slow: 10 iterations (n_samples = 10) took about 11 hours.
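
To put that rate in perspective, here is a rough back-of-the-envelope sketch. The helper name `estimate_hours` is hypothetical, and it assumes the cost per iteration stays constant, which may not hold as the topic tree grows or shrinks during sampling.

```python
def estimate_hours(measured_hours, measured_iters, target_iters):
    """Linearly extrapolate wall-clock time from a measured run.

    Assumes every iteration costs the same, which is only an approximation.
    """
    return measured_hours / measured_iters * target_iters


# Measured above: about 11 hours for 10 iterations.
hours_per_iteration = estimate_hours(11, 10, 1)

# Hypothetical target of 500 iterations, chosen only for illustration.
hours_for_500 = estimate_hours(11, 10, 500)

print(f"{hours_per_iteration:.2f} hours per iteration")
print(f"{hours_for_500:.0f} hours ({hours_for_500 / 24:.1f} days) for 500 iterations")
```

At this rate, a run of a few hundred iterations would take weeks, so any intuition for reducing the cost would help.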

From my results as well as your demo, it seems level-0 has only one topic, which contains all docs. That makes sense, since level-0 is the top of the hierarchy. But I would like to confirm: if I want 4 levels of topics, each with different topic/cluster assignments, should I set num_levels = 5?

Finally, may I ask how to choose values for alpha and gamma, or whether there is any intuition I can use? I am especially interested in the case of a large set of short texts.
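
The only intuition I have so far is the following sketch, which assumes that gamma plays the role of the concentration parameter of the nested Chinese restaurant process, as in the hLDA paper. Under a plain CRP prior, the expected number of occupied tables after n customers is the sum of gamma / (gamma + i) for i from 0 to n - 1. The function name `expected_crp_tables` is hypothetical and not part of hlda, and this describes the prior for the children of a single node only, not what the sampler ends up with after seeing the data.

```python
def expected_crp_tables(n_customers, gamma):
    """Expected number of occupied tables in a CRP with concentration gamma.

    Customer i (0-based) opens a new table with probability gamma / (gamma + i),
    so the expectation is the sum of those probabilities.
    """
    return sum(gamma / (gamma + i) for i in range(n_customers))


n_docs = 100_000  # roughly the size of my corpus

# The gamma values below are arbitrary, chosen only to show the trend.
for gamma in (0.1, 1.0, 10.0):
    tables = expected_crp_tables(n_docs, gamma)
    print(f"gamma={gamma}: about {tables:.1f} expected children under the root (prior only)")
```

If this reading is right, a larger gamma favours more branches per node and the count grows only logarithmically with the number of docs, but I am not sure how this interacts with alpha for short texts.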

Thanks again.
