-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/vapaaehtoistyö.fi importer #456
base: master
Are you sure you want to change the base?
The head ref may contain hidden characters: "feature/vapaaehtoisty\u00F6.fi-importer"
Conversation
The first one is based on memcached and regex. The columns of interest are collected in a separate memcached store which is being populated via cron job. The query string is then parsed and transformed into a regex which is applied to the cache and retrieves the event ids. Second search is based on the Postgres full-text search functionality and hence is language specific. Separate tsvector columns are created in the db for Finnish, Swedish, and English. They are populated and kept up-to-date on the db side, see migration 0080. urls.py has a small correction of the imports, not related to the main topic of the commit.
KeywordMatcher takes as input the strings of words, mostly without language specification, and has to transform them into a list of Keywords. This commit suggests the following logic: first an exactly matching KeywordLabel is searched for irrespective of the language. If no exact matches found, Postgres full-text search is used to find a label matched by lexeme. The results are ranked with TrigramSimilarity as ft SearchRank is not suitable for ranking matched individual words. If no language is passed we cycle through all the options as specified in FULLTEXT_SEARCH_LANGUAGES and select the best match according to similarity. If no match is found the string is checked for the possibility that it could be split and search for matches is repeated.
…to make migration succeed on Travis
Codecov Report
@@ Coverage Diff @@
## master #456 +/- ##
==========================================
- Coverage 22.48% 22.13% -0.35%
==========================================
Files 169 177 +8
Lines 11986 12480 +494
==========================================
+ Hits 2695 2763 +68
- Misses 9291 9717 +426
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to request unit tests, because they would have already been for use while reviewing. They would explain the purpose, demo the usage and proof the working.
Some comments added already to #455 which seems to be included in this PR also.
def to_dict(self): | ||
newrecord = { | ||
"id": self.id, | ||
"organization_id": self.organization_id, | ||
"organization_name": self.organization_name, | ||
"title": self.title, | ||
"address": self.address, | ||
"address_coordinates": self.address_coordinates, | ||
|
||
"no_time": self.no_time, | ||
"timestamp_start": self.timestamp_start, | ||
"timestamp_end": self.timestamp_end, | ||
|
||
"description": self.description, | ||
"contact_details": self.contact_details, | ||
"themes": self.themes, | ||
"timestamp_publish": self.timestamp_publish, | ||
|
||
"status": self.status, | ||
"creator_id": self.creator_id, | ||
"timestamp_inserted": self.timestamp_inserted, | ||
"timestamp_updated": self.timestamp_updated | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the only place this to_dict() -method is used?
def _import_event(self, event_obj):
event = event_obj.to_dict()
Doesn't __dict__
do the same?
In [50]: r1.__class__
Out[50]: events.importer.helper.importers.vapaaehtoistyofi.record.Record
In [51]: r1.__dict__
Out[51]:
{'id': 'id',
'organization_id': 'organization_id',
'organization_name': 'organization_name',
'title': 'title',
'address': 'address',
'address_coordinates': 'address_coordinates',
'tags': [],
'no_time': None,
'timestamp_start': None,
'timestamp_end': None,
'description': None,
'contact_details': None,
'themes': None,
'timestamp_publish': None,
'status': None,
'creator_id': None,
'timestamp_inserted': None,
'timestamp_updated': None}
In [52]: len(r1.__dict__)
Out[52]: 18
In [53]: len(r1.to_dict())
Out[53]: 17
Tags is missing from __dict__,
but is it missing in purpose and if it is, it would be better to just delete it from the new instance done with __dict__
.
def __copy__(self) -> object: | ||
newrecord = Record(None) | ||
newrecord.id = self.id | ||
newrecord.organization_id = self.organization_id | ||
newrecord.organization_name = self.organization_name | ||
newrecord.title = self.title | ||
newrecord.address = self.address | ||
newrecord.address_coordinates = self.address_coordinates | ||
newrecord.tags = self.tags | ||
|
||
newrecord.no_time = self.no_time | ||
newrecord.timestamp_start = self.timestamp_start | ||
newrecord.timestamp_end = self.timestamp_end | ||
|
||
newrecord.description = self.description | ||
newrecord.contact_details = self.contact_details | ||
newrecord.themes = self.themes | ||
newrecord.timestamp_publish = self.timestamp_publish | ||
|
||
newrecord.status = self.status | ||
newrecord.creator_id = self.creator_id | ||
newrecord.timestamp_inserted = self.timestamp_inserted | ||
newrecord.timestamp_updated = self.timestamp_updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? Doesn't self.__copy__()
already do the same?
In [50]: r1.__class__
Out[50]: events.importer.helper.importers.vapaaehtoistyofi.record.Record
In [51]: r1.__dict__
Out[51]:
{'id': 'id',
'organization_id': 'organization_id',
'organization_name': 'organization_name',
'title': 'title',
'address': 'address',
'address_coordinates': 'address_coordinates',
'tags': [],
'no_time': None,
'timestamp_start': None,
'timestamp_end': None,
'description': None,
'contact_details': None,
'themes': None,
'timestamp_publish': None,
'status': None,
'creator_id': None,
'timestamp_inserted': None,
'timestamp_updated': None}
In [57]: r2 = r1.__copy__()
In [58]: r2.__dict__ == r1.__dict__
Out[58]: True
I could have tried if there would be unit tests for this. ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My Python:
AttributeError: 'Record' object has no attribute '__copy__'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one solved by dropping __copy()
.
def __init__(self, json_dict): | ||
if json_dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it better to destruct the json before initializing a new class instance or at least use another constructor method like from_json(json_object)
. So I would map the json values to the class instance with a json serializer. Then the class instance creation would be as simple as Record(**record_dict)
.
I'll start working with the requested changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some changes in this PR that belongs to another feature so it's quite hard to understand the whole thing without looking at #443
return self.entries[id] | ||
|
||
def _load_entry_api(self, id): | ||
http_client = self._setup_client() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO It'd be better to add the requests.Session()
client to the Reader class property, then set it up in the constructor function init(), that way you don't have to call the _setup_client()
before making every HTTP call, just use self.client_session
. Isn't it the goal of using Session 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 6bd3a28
http_client = self._setup_client() | ||
page = 1 | ||
total_records = None | ||
batch_size = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd get rid of batch_size
here, change the while loop to True
, then break the loop if len(data['data']['records']) == 0
. This could make this shorter and a bit easier to understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion is heavily opinionated.
I typically tend not to use while True
-stucture as it can yield a forever-loop.
response.status_code) | ||
|
||
data = response.json() | ||
if 'status' not in data or data['status'] != "ok": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part repeats quite many times, maybe consider moving it to a separated private function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added check-function in 6bd3a28
raise ValueError("Unknown locale '%s'!" % locale) | ||
return self.TASK_URL_FORMAT % (locale, self.id) | ||
|
||
def __copy__(self) -> object: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the use of this function in the PR?
For copying objects, there are a copy.copy()
or copy.deepcopy()
already available in python copy
module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__copy__()
is already dropped in 65c526d.
The reason #443 files have been copied here is to get database migration to work at all. Exactly the same is happening with unit tests, they do not work. At all. Assumption is for
My only conclusion is, that codebase isn't in a shape for unit testing to work for this branch. |
Adding new importer for Vapaaehtoistyö.fi volunteering tasks.
Note: Setting-variable
VAPAAEHTOISTYOFI_API_KEY
is required for successful access to the resource.