Pulls Tweets with same text and different id's #15

Open
austinpgraham opened this issue Apr 7, 2017 · 4 comments

Comments

@austinpgraham
Contributor

I found why the labeler keeps pulling the same tweet: our database contains tweets with identical text but different IDs, and some texts are repeated 20-30 times.

@cegme
Member

cegme commented Apr 7, 2017

Can you post some examples? Also post the code that is used to generate the tweets.

@austinpgraham
Contributor Author

For example, this query:

Query: SELECT t1.status, t2.status FROM tweets t1, tweets t2 WHERE t1.tweetid = 849076591657353217 AND t2.tweetid = 849076590088736770

The result:
(u'RT @Datosdeunamor: M\xc9DICO JAPON\xc9S REVELA COMO MATAR DE RA\xcdZ LA BACTERIA DE HELICOBACTER PYLORI QUE PROVOCA GASTRITIS, \xdaLCERAS Y M\xc1S!\u2026 ',

u'RT @Datosdeunamor: M\xc9DICO JAPON\xc9S REVELA COMO MATAR DE RA\xcdZ LA BACTERIA DE HELICOBACTER PYLORI QUE PROVOCA GASTRITIS, \xdaLCERAS Y M\xc1S!\u2026 ')

In my test data set (364 tweets pulled from Manchester), only 281 have distinct text.
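To make the duplication easier to quantify, here is a rough sketch of how the counts could be reproduced. It assumes the `tweets` table and the `tweetid` and `status` columns from the query above; the in-memory SQLite database, the toy rows, and the helper names `count_distinct` and `duplicate_texts` are only illustrative stand-ins, not our actual setup.

```python
import sqlite3


def count_distinct(conn):
    """Return (total rows, number of distinct status texts)."""
    return conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT status) FROM tweets"
    ).fetchone()


def duplicate_texts(conn, min_copies=2):
    """Return (status, copies) for each text stored under at least min_copies tweet IDs."""
    query = """
        SELECT status, COUNT(DISTINCT tweetid) AS copies
        FROM tweets
        GROUP BY status
        HAVING COUNT(DISTINCT tweetid) >= ?
        ORDER BY copies DESC, status
    """
    return conn.execute(query, (min_copies,)).fetchall()


# Toy data standing in for the real table (hypothetical rows, not from our database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweetid INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [
        (1, "RT @someone: same text"),
        (2, "RT @someone: same text"),
        (3, "an original tweet"),
    ],
)

print(count_distinct(conn))
print(duplicate_texts(conn))
```

Both example rows above begin with "RT @Datosdeunamor", so the repeats may simply be retweets, each of which gets its own ID. If that turns out to be the cause, the labeler could select one row per distinct `status` instead of one row per `tweetid`.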

@cegme
Member

cegme commented Apr 7, 2017

Are these tweetids also duplicated in the index?

@austinpgraham
Contributor Author

No sir, they are not.
