Reorganising cluster data #144
base: develop
Conversation
docker-compose down
docker-compose run spec sh /app/test/run_tests.sh
docker-compose down
Make sure that any Docker containers that are hanging around don't accidentally contaminate test runs.
# prefix: markov_i2
# title: Markov clustering, inflation = 2
I'll formalise this with the Jacobson group some time in the coming week or two. Right now, the cluster prefix is hardcoded in the parser, but ideally I'd get this information from the cluster file or from a metadata file in the clusters directory.
def test_fetch_phenotypes_no_results(self):

    resp = self.submit_query('djornl_fetch_phenotypes', {
        # gene node
        "keys": ["AT1G01010"],
    })
    self.assertEqual(resp['results'][0], self.no_results)
Combined the tests for fetch calls with no results into the tests for calls with results.
@unittest.skip('This test is disabled until automated view loading is possible')
Enabled this test.
Looks good, only had one comment above.
…es that make up the data, plus code to validate the manifest. Created manifests for all test files and updated tests accordingly. Added djornl data source (GitHub repo).
"oneOf": [
    {
        "properties": {
            "data_type": { "enum": ["cluster"] }
        },
        "required": [ "prefix" ]
    },
    {
        "properties": {
            "data_type": { "enum": [ "node", "edge" ] }
        }
    }
],
If the data type is "cluster", there must be a "prefix" field, which dictates the cluster label.
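The `oneOf` constraint can be sketched in plain Python; a JSON Schema validator enforces the real rule, and the function name here is made up for illustration:

```python
def validate_manifest_entry(entry):
    """Sketch of the oneOf rule above: 'cluster' entries must carry
    a 'prefix' field; 'node' and 'edge' entries need no extra fields."""
    if entry.get("data_type") == "cluster" and "prefix" not in entry:
        raise ValueError("cluster entries require a 'prefix' field")
    return True
```

For example, `validate_manifest_entry({"data_type": "cluster", "prefix": "markov_i2"})` passes, while omitting the prefix raises an error.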
This pull request introduces 4 alerts when merging 93205a2 into 53eecfe - view the new alerts on LGTM.com.
def check_deltas(self, edge_data={}, node_metadata={}, cluster_data={}):

    edge_nodes = set([e['_key'] for e in edge_data['nodes']])
    node_metadata_nodes = set([e['_key'] for e in node_metadata['nodes']])
    cluster_nodes = set([e['_key'] for e in cluster_data['nodes']])
    all_nodes = edge_nodes.union(node_metadata_nodes).union(cluster_nodes)

    # check all nodes in cluster_data have node_metadata
    clstr_no_node_md_set = cluster_nodes.difference(node_metadata_nodes)
    if clstr_no_node_md_set:
        print({'clusters with no node metadata': clstr_no_node_md_set})

    # check all nodes in the edge_data have node_metadata
    edge_no_node_md_set = edge_nodes.difference(node_metadata_nodes)
    if edge_no_node_md_set:
        print({'edges with no node metadata': edge_no_node_md_set})

    # count all edges
    print("Dataset contains " + str(len(edge_data['edges'])) + " edges")
    # count all nodes
    print("Dataset contains " + str(len(all_nodes)) + " nodes")
Some basic sanity checks on the parsed data, plus a very high-level overview of what is being added.
# self.json_data[query_name][param_name][param_value]["distance"][distance_param]
# if the primary param is an array, join the array entries with "__"
# e.g. for fetch_clusters data:
Reorganised the results data, so retrieving the appropriate results has changed further down in this file.
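Under that layout, building the lookup key and fetching expected results might look like this; `json_data` and the key names below are illustrative, not the repo's actual fixtures:

```python
def result_key(param_value):
    """If the query parameter is a list, join its entries with '__'
    to form the lookup key; otherwise use the scalar value directly."""
    if isinstance(param_value, list):
        return "__".join(param_value)
    return param_value

# e.g. for fetch_clusters data with two cluster IDs and distance 1:
json_data = {
    "fetch_clusters": {
        "cluster_ids": {
            "markov_i2:1__markov_i2:5": {"distance": {"1": ["some", "results"]}}
        }
    }
}
key = result_key(["markov_i2:1", "markov_i2:5"])
expected = json_data["fetch_clusters"]["cluster_ids"][key]["distance"]["1"]
```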
I had one comment, otherwise it is looking good
'user_notes': cols[19],
}
nodes.append(doc)
headers = []
Since there is a bunch of repeated parsing code between the node and edge loaders, we could have a generator function for reuse:
import csv

def iterate_rows(path):
    expected_col_count = 0
    with open(path) as fd:
        csv_reader = csv.reader(fd, delimiter='\t')
        for line_no, row in enumerate(csv_reader, start=1):
            if len(row) <= 1 or row[0][0] == '#':
                # comment / metadata
                continue
            cols = [c.strip() for c in row]
            if expected_col_count == 0:
                # the first data row sets the expected column count
                expected_col_count = len(cols)
            elif len(cols) != expected_col_count:
                raise RuntimeError(f"{path} line {line_no}: expected {expected_col_count} cols, found {len(cols)}")
            yield cols
I just wrote this out without testing, but it should have the general idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I may as well make the separator also configurable/guessable from info in the manifest file.
Reorganising cluster data to be a single field with an array of clusters in the form `<cluster_name>:<cluster_id>`. The current set of clusters have been renamed from 'cluster_i2' to 'markov_i2' as they were created using Markov clustering with inflation set to 2.
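As a concrete illustration of the new shape, a node's cluster memberships collapse from one field per clustering scheme into a single array of `<cluster_name>:<cluster_id>` strings (field names and values below are illustrative):

```python
# before: one field per clustering scheme
old_node = {"_key": "AT1G01010", "cluster_i2": 5}

# after: a single array of "<cluster_name>:<cluster_id>" strings
new_node = {"_key": "AT1G01010", "clusters": ["markov_i2:5"]}

def to_cluster_entry(cluster_name, cluster_id):
    """Build a '<cluster_name>:<cluster_id>' string for the clusters array."""
    return f"{cluster_name}:{cluster_id}"
```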