<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title><![CDATA[bio4j]]></title>
<link href="http://bio4j.com/atom.xml" rel="self"/>
<link href="http://bio4j.com/"/>
<updated>2015-03-23T09:56:09+01:00</updated>
<id>http://bio4j.com/</id>
<author>
<name><![CDATA[oh no sequences!]]></name>
</author>
<generator uri="http://octopress.org/">Octopress</generator>
<entry>
<title type="html"><![CDATA[Bio4j preprint available]]></title>
<link href="http://bio4j.com/blog/2015/03/bio4j-preprint-available/"/>
<updated>2015-03-22T18:20:00+01:00</updated>
<id>http://bio4j.com/blog/2015/03/bio4j-preprint-available</id>
<content type="html"><![CDATA[<p>A citable preprint in the <a href="http://biorxiv.org/">bioRxiv</a> describing Bio4j went online yesterday:</p>
<ul>
<li><strong><a href="http://biorxiv.org/content/early/2015/03/20/016758">Bio4j: a high-performance cloud-enabled graph-based data platform</a></strong></li>
</ul>
<p>It serves (we hope) as a good introduction to what Bio4j is and what it has to offer, especially if you would rather read prose than code to get a general idea of Bio4j. If you are using Bio4j for something that you want to publish, citing it is much easier now: all bioRxiv preprints are assigned a DOI. Comments, thoughts, and opinions are all more than welcome! We will submit a paper based on this preprint to an open access journal. For completeness, here’s the citation info and the abstract:</p>
<hr />
<p><br /></p>
<h3 id="bio4j-a-high-performance-cloud-enabled-graph-based-data-platform">Bio4j: a high-performance cloud-enabled graph-based data platform</h3>
<p><em>Pablo Pareja-Tobes, Raquel Tobes, Marina Manrique, Eduardo Pareja, Eduardo Pareja-Tobes</em> <br />
<strong>bioRxiv</strong> – <strong>doi</strong>: <a href="http://dx.doi.org/10.1101/016758">10.1101/016758</a></p>
<!-- ### Abstract -->
<p><strong>Background.</strong> Next Generation Sequencing and other high-throughput technologies have brought a revolution to the bioinformatics landscape, by offering sheer amounts of data about previously unaccessible domains in a cheap and scalable way. However, fast, reproducible, and cost-effective data analysis at such scale remains elusive. A key need for achieving it is being able to access and query the vast amount of publicly available data, specially so in the case of knowledge-intensive, semantically rich data: incredibly valuable information about proteins and their functions, genes, pathways, or all sort of biological knowledge encoded in ontologies remains scattered, semantically and physically fragmented.</p>
<p><strong>Methods and Results.</strong> Guided by this, we have designed and developed Bio4j. It aims to offer a platform for the integration of semantically rich biological data using typed graph models. We have modeled and integrated most publicly available data linked with proteins into a set of interdependent graphs. Data querying is possible through a data model aware Domain Specific Language implemented in Java, letting the user write typed graph traversals over the integrated data. A ready to use cloud-based data distribution, based on the Titan graph database engine is provided; generic data import code can also be used for in-house deployment.</p>
<p><strong>Conclusion.</strong> Bio4j represents a unique resource for the current Bioinformatician, providing at once a solution for several key problems: data integration; expressive, high performance data access; and a cost-effective scalable cloud deployment model.</p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4j: updates]]></title>
<link href="http://bio4j.com/blog/2015/03/bio4j-updates/"/>
<updated>2015-03-11T17:18:00+01:00</updated>
<id>http://bio4j.com/blog/2015/03/bio4j-updates</id>
<content type="html"><![CDATA[<p>We’ve spent the past few months working <em>really</em> hard on Bio4j. There has not been a lot of updates here basically because there were too many new things happening :) </p>
<p>But now things are stabilizing and it’s about time we start to introduce all the new features and improvements we have in store. In this first post I just want to give an overview of Bio4j’s current state, going into more detail in subsequent posts.</p>
<h2 id="bio4j-now">Bio4j now</h2>
<h3 id="a-new-graph-schema-and-api">A new graph schema and API</h3>
<p>We now have a strongly typed graph schema and traversal API in <strong><a href="https://github.com/bio4j/bio4j">bio4j/bio4j</a></strong>, based on <strong><a href="https://github.com/bio4j/angulillos">angulillos</a></strong> (more about angulillos later). With it, you can write traversals over Bio4j data abstractly, and then execute them over any implementation. These queries are checked to be correct both structurally (you cannot ask for the source of a vertex) and with respect to the Bio4j schema. Vertices and edges are now part of graphs, which can declare dependencies; writing your own extensions to the model is now much easier than before. As part of these changes we did a thorough graph-by-graph review of the Bio4j model, which resulted in some significant improvements.</p>
<p>Of course a schema is not that useful without actual data conforming to it; we also wrote generic importers for all graphs. These importers can be executed using any implementation of the angulillos API.</p>
<h3 id="a-titan-based-implementation-and-data-distribution">A Titan-based implementation and data distribution</h3>
<p>With much of the work already done at the level of bio4j/bio4j, providing a data distribution of Bio4j becomes pretty simple; you just need to</p>
<ol>
<li>implement angulillos for your database technology of choice; this is what you have for <a href="http://thinkaurelius.github.io/titan/">Titan</a> in <strong><a href="https://github.com/bio4j/angulillos-titan">angulillos-titan</a></strong>.</li>
<li>if your database has support for type definitions and schemas, create those corresponding to the Bio4j schema; this is what we do for each graph in <strong><a href="https://github.com/bio4j/bio4j-titan">bio4j-titan</a></strong>.</li>
</ol>
<p>We finished running the importing process for all graphs just a few hours ago. A pretty sizable <code>.tar</code> containing all the Titan files is available from an S3 bucket. With that, you just need to spin up an EC2 instance, download and extract the archive, and start using Bio4j. Or, if you don’t want to use AWS, you can of course run the import process on your own infrastructure.</p>
<h3 id="angulillos-generic-typed-property-graphs-in-java">Angulillos: generic typed property graphs in Java</h3>
<p>Writing <em>correct</em> queries for Bio4j was becoming harder and harder as we integrated more databases and resources, and we had no way of expressing the graph schemas, even for documentation purposes. That is what <strong><a href="https://github.com/bio4j/angulillos">angulillos</a></strong> strives to solve. You can think of angulillos as a strongly typed version of the property graph model: first you describe a graph schema in terms of types, and then you can write generic traversals over it, which are guaranteed to be well-typed. This means that for example</p>
<ul>
<li>you cannot retrieve the outgoing edges of an edge</li>
<li>and you can get the tweets that a user tweeted, but not the users that a tweet follows!</li>
</ul>
<p>The API is really straightforward to implement, and its only dependency is Java 8 (for Streams and lambdas). <strong><a href="https://github.com/bio4j/angulillos-titan">angulillos-titan</a></strong> can serve as an example of how this can be done.</p>
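<p>To make the idea concrete, here is a minimal sketch of how edge types can carry their source and target vertex types using plain Java generics. It is not the angulillos API: every name in it (<code>TypedGraphSketch</code>, <code>Vertex</code>, <code>Edge</code>, <code>User</code>, <code>Tweet</code>, <code>Tweeted</code>, <code>Follows</code>, <code>out</code>) is made up for illustration, and a real implementation would sit on top of a graph database rather than an in-memory list.</p>

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only; these are NOT the angulillos types.
final class TypedGraphSketch {

    // Every vertex has an identifier; nothing else is needed here.
    interface Vertex {
        String id();
    }

    // An edge records, in its type parameters, which vertex types it connects.
    static class Edge<S extends Vertex, T extends Vertex> {
        final S source;
        final T target;

        Edge(S source, T target) {
            this.source = source;
            this.target = target;
        }
    }

    static final class User implements Vertex {
        private final String id;
        User(String id) { this.id = id; }
        @Override public String id() { return id; }
    }

    static final class Tweet implements Vertex {
        private final String id;
        Tweet(String id) { this.id = id; }
        @Override public String id() { return id; }
    }

    // Toy schema: users tweet tweets, users follow users.
    static final class Tweeted extends Edge<User, Tweet> {
        Tweeted(User user, Tweet tweet) { super(user, tweet); }
    }

    static final class Follows extends Edge<User, User> {
        Follows(User follower, User followed) { super(follower, followed); }
    }

    // Targets reachable from `source` through the given edges.
    // The compiler only accepts edges whose source type matches `source`.
    static <S extends Vertex, T extends Vertex> List<T> out(S source, List<? extends Edge<S, T>> edges) {
        return edges.stream()
                .filter(e -> e.source.id().equals(source.id()))
                .map(e -> e.target)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        User alice = new User("alice");
        Tweet hello = new Tweet("tweet-1");
        List<Tweeted> tweeted = Collections.singletonList(new Tweeted(alice, hello));

        // Well-typed: the tweets that a user tweeted.
        List<Tweet> tweets = out(alice, tweeted);
        System.out.println(tweets.get(0).id());

        // Ill-typed, rejected at compile time: a Follows edge starts at a User, not a Tweet.
        // List<Follows> follows = Collections.emptyList();
        // out(hello, follows);
    }
}
```

<p>The point of the design is that the schema lives in the types: asking for "the users that a tweet follows" is not a query that returns nothing, it is a program that does not compile.</p>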
<h3 id="the-future">The future</h3>
<p>The next post will be dedicated to a tentative roadmap, explaining what we are working on now. A (really nice) Scala API, data distribution and AWS deployment improvements, and new integrations of genomic data sources are coming in the following months!</p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4j goes to GSoC mentor summit 2014]]></title>
<link href="http://bio4j.com/blog/2014/10/bio4j-goes-to-gsoc-mentor-summit-2014/"/>
<updated>2014-10-29T17:18:00+01:00</updated>
<id>http://bio4j.com/blog/2014/10/bio4j-goes-to-gsoc-mentor-summit-2014</id>
<content type="html"><![CDATA[<p><img src="http://bio4j.com/images/bio4jGsoc.png" /></p>
<p>I just got home yesterday from San Francisco after attending the 10th edition of the Google Summer of Code mentor summit together with <a href="https://twitter.com/eparejatobes">@eparejatobes</a>. It’s been a great experience that I would like to share with you all in this blog post ;)
For those of you who still don’t know what <a href="https://developers.google.com/open-source/soc/?csw=1">GSoC</a> is, here’s a brief description:</p>
<blockquote>
<p>Google Summer of Code is a program that offers student developers stipends to write code for various open source projects. We work with many open source, free software, and technology-related groups to identify and fund projects over a three month period. </p>
</blockquote>
<p>This was Bio4j’s first year as a GSoC organization and we got three students who worked on the following projects:</p>
<ul>
<li><a href="https://github.com/bio4j/dynamograph">dynamograph</a></li>
<li><a href="https://github.com/bio4j/exporter">exporter</a></li>
<li><a href="https://github.com/bio4j/el-grafo">el-grafo</a></li>
</ul>
<p>It was also my first experience as a mentor, and I must say that I both learned a lot and enjoyed the process.</p>
<p>The events started on Friday with a complimentary visit to the theme park <em>Great America</em> (nice!), followed by a really cool dinner reception at the <a href="http://www.thetech.org/">San Jose Tech Museum of Innovation</a>, where we had surprise speakers such as Linus Torvalds plus the opportunity to explore the museum’s geeky exhibits while having some drinks.</p>
<p>We were supposed to dress smart for a change, which was interesting, seeing all these people wearing nice clothes :)</p>
<p><img class="right" src="http://bio4j.com/images/fotoTechMuseum.jpg" width="280" /></p>
<blockquote>
<p>I must say that I had to watch around 20 minutes of YouTube videos before I managed to get the tie knot right… xD</p>
</blockquote>
<p>Sessions started early the next day with more than eight simultaneous rooms <em>(without taking into account the impromptu sessions that were organized at the ballroom from time to time)</em> and went on till the evening.</p>
<p>It was the first time that I went to an <strong><a href="http://en.wikipedia.org/wiki/Unconference">unconference</a></strong> and I just loved it.
It is actually great to have the opportunity to explore the different sessions and meet up with people on the way spontaneously, without all the rigidity that so many times comes with <em>“standard”</em> conferences. </p>
<p><img class="left" src="http://bio4j.com/images/stickers.jpg" width="260" /></p>
<p>Meeting people from the <a href="http://www.reactome.org/">Reactome database</a> project in person was cool, since we plan to include this data source in Bio4j in the near future. It was also nice to meet in person some of the people I’ve been following on Twitter for a while, like <a href="https://twitter.com/braincode">@braincode</a> among others.
I also liked the idea of having both the sticker exchange table and the tea room filled with chocolates from all over the world! The day ended with a quiz show that I unfortunately couldn’t join but, from what I read on Twitter, it was quite fun.</p>
<p>On Sunday we opened the day with a trip to the <a href="http://en.wikipedia.org/wiki/Googleplex">Googleplex</a>, where we could see the actual place where the Google folks work.</p>
<p><img class="right" src="http://bio4j.com/images/chocolates.png" width="240" /></p>
<p>There was some time left for a couple more sessions and then we unfortunately had to say bye to all the new acquaintances we made after attending the closing session at the hotel. </p>
<p>I would like to end this post by thanking all the people who helped organize this awesome summit.
Also a special thanks to <a href="https://twitter.com/fossygrl">@fossygirl</a>, great job!</p>
<p>Stay tuned for the next post, we will be releasing a shiny new version of Bio4j based on Titan very soon ;)</p>
<p><img src="http://bio4j.com/images/fotoGoogleAndroid.png" /></p>
<p><a href="https://twitter.com/pablopareja">@pablopareja</a></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4j accepted for Google Summer of Code 2014]]></title>
<link href="http://bio4j.com/blog/2014/02/bio4j-accepted-for-google-summer-of-code-2014/"/>
<updated>2014-02-25T17:18:00+01:00</updated>
<id>http://bio4j.com/blog/2014/02/bio4j-accepted-for-google-summer-of-code-2014</id>
<content type="html"><![CDATA[<p><img class="right" src="http://bio4j.com/images/GoogleSummer_2014logo.jpg" width="300" height="270" /></p>
<p>We are really excited to announce that <strong>Bio4j</strong> has been <strong>accepted</strong> as a <a href="https://www.google-melange.com/gsoc/org2/google/gsoc2014/bio4j">mentoring organization</a> for <strong><a href="https://www.google-melange.com/gsoc/homepage/google/gsoc2014">Google Summer of Code 2014</a></strong>. This was the first year we applied for it, and it feels just great being part of this initiative!</p>
<p>We think this is a great opportunity for students to hack on pretty cool stuff around graph databases, bio big data and cloud computing.</p>
<h2 id="how-to-participate">how to participate</h2>
<p>If this sounds amazing and you are a student (PhD, masters, undergraduate, <a href="https://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2014/help_page#2._Whos_eligible_to_participate_as_a">whatever</a>) or know someone who is,</p>
<ol>
<li><strong><a href="https://github.com/bio4j/gsoc14/wiki/ideas">check our ideas list</a></strong> and then</li>
<li><strong>contact a potential mentor</strong> or, if you don’t know whom to contact, just ping <a href="https://github.com/eparejatobes">@eparejatobes</a> or <a href="https://github.com/pablopareja">@pablopareja</a></li>
</ol>
<p>You can read more about it in the <a href="https://github.com/bio4j/gsoc14/wiki">bio4j/gsoc14 wiki</a>.</p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Berkeley Phylogenomics Group receives an NSF grant to develop a graph DB for Big Data challenges in genomics building on Bio4j]]></title>
<link href="http://bio4j.com/blog/2013/11/new-bio4j-success-berkeley-phylogenomics-grant/"/>
<updated>2013-11-12T12:00:00+01:00</updated>
<id>http://bio4j.com/blog/2013/11/new-bio4j-success-berkeley-phylogenomics-grant</id>
<content type="html"><![CDATA[<p>The <a href="http://phylogenomics.berkeley.edu/">Sjölander Lab</a> at the <a href="http://www.berkeley.edu/index.html">University of California, Berkeley</a>, has recently been awarded a <strong>250K</strong> US dollars <em>EAGER</em> grant from the National Science Foundation to build a graph database for Big Data challenges in genomics. Naturally, <strong>they’re building on Bio4j</strong>.</p>
<p>The project “<strong>EAGER: Towards a self-organizing map and hyper-dimensional information network for the human genome</strong>” aims to create a graph database of genome and proteome data for the human genome and related species to allow biologists and computational biologists to mine the information in gene family trees, biological networks and other graph data that cannot be represented effectively in relational databases. For these goals, they will develop on top of the pioneering graph-based bioinformatics platform <strong>Bio4j</strong>. </p>
<p>“<em>We are excited to see how Bio4j is used by top research groups to build cutting-edge bioinformatics solutions</em>”, said <strong>Eduardo Pareja</strong>, <strong><a href="http://www.era7bioinformatics.com">Era7 Bioinformatics</a> CEO</strong>. “<em>To reach an even broader user base, we are pleased to announce that we now provide versions for both Neo4j and Titan graph databases, for which we have developed another layer of abstraction for the domain model using Blueprints</em>.”</p>
<p>“<em>EAGER stands for Early-concept Grants for Exploratory Research</em>”, explained <strong>Professor Kimmen Sjölander</strong>, <strong>head of the <a href="http://phylogenomics.berkeley.edu/">Berkeley Phylogenomics Group</a></strong>: “<em>NSF awards these grants to support exploratory work in its early stages on untested, but potentially transformative, research ideas or approaches</em>”. “<em>My lab’s focus is on machine learning methods for Big Data challenges in biology, particularly for graphical data such as gene trees, networks, pathways and protein structures. The limitations of relational database technologies for graph data, particularly BIG graph data, restrict scientists’ ability to get any real information from that data. When we decided to switch to a graph database, we did a lot of research into the options. When we found out about Bio4j, we knew we’d found our solution. The Bio4j team has made our development tasks so much easier, and we look forward to a long and fruitful collaboration in this open-source project</em>”.</p>
<p>You can find more information here:</p>
<ul>
<li><a href="http://era7bioinformatics.com/en/download_file.cfm?file=1695&news=17"><strong>PHYLOGENOMICS_BERKELEY_BIO4J_ERA7_BIOINFORMATICS.pdf</strong></a></li>
</ul>
<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4j 0.9 the billion relationships are here!]]></title>
<link href="http://bio4j.com/blog/2013/10/bio4j-09-the-billion-relationships-is-here/"/>
<updated>2013-10-15T06:33:27+02:00</updated>
<id>http://bio4j.com/blog/2013/10/bio4j-09-the-billion-relationships-is-here</id>
<content type="html"><![CDATA[<p>Hi everyone!</p>
<p>So <a href="https://github.com/bio4j/Bio4j/wiki/Bio4j-0.9"><strong>Bio4j 0.9</strong></a> finally made its way out and it’s here bringing you more than 1 billion relationships. These are approximately its main numbers:</p>
<ul>
<li><strong>1.216.993.547</strong> relationships</li>
<li><strong>190.625.351</strong> nodes</li>
<li><strong>584.436.429</strong> properties</li>
</ul>
<p>A lot of new features and improvements have been incorporated, including the following <em>(I will go into more detail in later posts specifically dedicated to each of them)</em>:</p>
<h2 id="refurbishing-the-domain-model">Refurbishing the domain model</h2>
<p><img src="http://bio4j.com/images/domainModelThumbnail.png" style="float:right" />We have introduced a new level of abstraction for the domain model by decoupling the inner database implementation from the relationships among the entities themselves. An interface has been developed for each node and relationship present in the database, including both methods to access the properties of the entity it represents and utility methods that make it easy to navigate to the entities linked to it.
All this can be found under the package <em>com.era7.bioinfo.bio4j.model</em></p>
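<p>The decoupling described above can be pictured with a small sketch. The interface and class names below (<code>Protein</code>, <code>Organism</code>, <code>InMemoryProtein</code>, <code>InMemoryOrganism</code>) and the sample values are hypothetical and only illustrate the pattern; they are not copied from <em>com.era7.bioinfo.bio4j.model</em>.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; this only illustrates the interface-per-entity pattern.
final class DomainModelSketch {

    // Domain interfaces: property accessors plus navigation to linked entities.
    interface Organism {
        String scientificName();
    }

    interface Protein {
        String accession();   // property access
        Organism organism();  // navigation to a linked entity
    }

    // One possible backend: plain in-memory maps instead of a graph database.
    static final class InMemoryOrganism implements Organism {
        private final Map<String, String> properties = new HashMap<>();

        InMemoryOrganism(String scientificName) {
            properties.put("scientific_name", scientificName);
        }

        @Override public String scientificName() {
            return properties.get("scientific_name");
        }
    }

    static final class InMemoryProtein implements Protein {
        private final Map<String, String> properties = new HashMap<>();
        private final Organism organism;

        InMemoryProtein(String accession, Organism organism) {
            properties.put("accession", accession);
            this.organism = organism;
        }

        @Override public String accession() {
            return properties.get("accession");
        }

        @Override public Organism organism() {
            return organism;
        }
    }

    // Client code depends on the interfaces only, so the backend can be swapped.
    static String describe(Protein protein) {
        return protein.accession() + " (" + protein.organism().scientificName() + ")";
    }

    public static void main(String[] args) {
        Protein protein = new InMemoryProtein("EXAMPLE1", new InMemoryOrganism("Example organism"));
        System.out.println(describe(protein));
    }
}
```

<p>Because <code>describe</code> never mentions a concrete class, the same client code would keep working if the in-memory classes were replaced by ones backed by a graph database.</p>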
<h2 id="new-blueprints-layer">New Blueprints layer</h2>
<p><img src="http://bio4j.com/images/blueprints.png" style="float:left" /> Apart from the set of interfaces we’ve developed another layer for the domain model using <a href="http://blueprints.tinkerpop.com/"><strong>Blueprints</strong></a>. This way we’re going one step further for making the domain model independent from the choice of database technology.</p>
<h2 id="new-titan-implementation">New Titan implementation</h2>
<p><img src="http://bio4j.com/images/titan.png" style="float:right" /> After the problems we had with the so-called <a href="http://thinkaurelius.com/2012/10/25/a-solution-to-the-supernode-problem/"><em><strong>supernodes</strong></em></a> (which are quite common indeed), we decided to give the <a href="http://thinkaurelius.github.io/titan/"><strong>Titan Graph Database</strong></a> technology a try and see how it behaves in such situations. Both the wrapper classes for each entity and the importing programs have already been implemented. This new prototype still needs some testing, but be sure you’ll be hearing more about this pretty soon! ;)</p>
<h2 id="bye-bye-reference-node">Bye bye reference node</h2>
<p>We decided to finally stop using the reference node for indexing purposes <em>(actually there’s no use for it anymore in Bio4j)</em>.
I have to admit it, I never was a fan of it and it was about time to do it. So now auxiliary relationships such as, for instance, <em>MainTaxonRel</em> or <em>MainDatasetRel</em> have been replaced by a standard node index.</p>
<h2 id="bug-fixes">Bug fixes</h2>
<p>This new release comes with many fixes including:</p>
<ol>
<li><strong>EnzymeNode</strong>: The node type property was not stored in previous releases.</li>
<li><strong>DatasetNode</strong>: Name property was not properly indexed. </li>
<li><strong>OrganismNode</strong>: NCBI tax-id property was not stored in some scenarios.</li>
<li>Redundant sequence conflict feature relationships have been fixed.</li>
<li>Duplicated submissions fixed</li>
<li>ProteinUnpublishedObservationCitation relationship was missing</li>
<li>The following node types were not properly indexed by their type till now: <em>BookNode, ArticleNode, OnlineArticleNode, SubmissionNode, PatentNode, PublisherNode, OnlineJournalNode, JournalNode</em></li>
</ol>
<h2 id="java-7">Java 7</h2>
<p>Bio4j uses Java 7 now ;)</p>
<p>OK, so that’s all for now, I’ll be posting much more information about this new release soon.</p>
<p>Cheers!</p>
<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4j modules, adapt the database to your own needs]]></title>
<link href="http://bio4j.com/blog/2012/10/bio4j-modules-adapt-the-database-to-your-own-needs/"/>
<updated>2012-10-30T05:33:27+01:00</updated>
<id>http://bio4j.com/blog/2012/10/bio4j-modules-adapt-the-database-to-your-own-needs</id>
<content type="html"><![CDATA[<p>Hi!</p>
<p><strong>Bio4j 0.8 includes</strong> a few <strong>different data sources</strong> and you may not always be interested in having all of them. For example you might be interested in playing around with the Gene Ontology DAG alone and let’s face it, having to import a ~105 GB database to do that wouldn’t make much sense…</p>
<p>That’s why <strong>the importing process is modular and customizable, allowing you to import just the data you are interested in</strong>.
Here’s the big picture of where entities and relationships come from in the general domain model:</p>
<p><a href="https://raw.github.com/bio4j/Bio4j/master/Bio4jDomainModelWithCardinality.jpg"><img src="http://bio4j.com/images/DomainModelWithDataSourceView.png" /></a></p>
<p>There is, however, one thing that you have to <strong>keep in mind: you must be consistent when choosing the data sources</strong> you want to include in your database. That is to say, you cannot have, for example, the Uniref clusters without previously importing Uniprot KB; otherwise there wouldn’t be proteins to connect to when importing the clusters!</p>
<p>Here you have a basic schema showing the dependencies among the different modules:</p>
<p><a href="http://bio4j.com/images/ModuleDependencies.png"><img src="http://bio4j.com/images/ModuleDependencies.png" /></a></p>
<p><em>(Let me remind you that two data sources not being connected by an arrow here does NOT mean that they are unrelated; the arrows only indicate whether a data source can be imported on its own or needs other data sources to be already present in the database.)</em></p>
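<p>A selection of modules can be checked against such dependencies before starting an import. The sketch below is only an illustration of that check: the class <code>ModuleSelection</code>, its method, and the module name strings are invented here, and the table contains just the one dependency mentioned in this post (Uniref requires Uniprot KB). A real table would follow the dependency schema shown above.</p>

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of a pre-import check; the class and the module names are illustrative.
final class ModuleSelection {

    // Direct dependencies per module. Only the dependency stated in the post is listed;
    // modules absent from this table are treated as having no known requirements.
    private static final Map<String, Set<String>> REQUIRES = new HashMap<>();
    static {
        REQUIRES.put("UniRef", new HashSet<String>(Collections.singletonList("UniprotKB")));
    }

    // Returns the modules that the selection directly requires but does not contain.
    static Set<String> missingDependencies(Set<String> selected) {
        Set<String> missing = new TreeSet<>();
        for (String module : selected) {
            Set<String> required = REQUIRES.getOrDefault(module, Collections.<String>emptySet());
            for (String dependency : required) {
                if (!selected.contains(dependency)) {
                    missing.add(dependency);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> selection = new HashSet<String>(Arrays.asList("GeneOntology", "UniRef"));
        System.out.println("Missing modules: " + missingDependencies(selection));
    }
}
```

<p>The check only looks at direct dependencies; with a fuller table you would repeat it until nothing is missing, or walk the dependencies transitively.</p>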
<p>I’m going to create a wiki page where I will be going into more detail regarding database size and importing process time depending on your modules choice, but meanwhile you can find some more information about how to do this in the <a href="https://github.com/bio4j/Bio4j/wiki/Importing-bio4j">Importing Bio4j wiki page</a>.</p>
<p>Have a good day!</p>
<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4j 0.8, some numbers]]></title>
<link href="http://bio4j.com/blog/2012/10/bio4j-0-8-some-numbers/"/>
<updated>2012-10-18T06:33:27+02:00</updated>
<id>http://bio4j.com/blog/2012/10/bio4j-0-8-some-numbers</id>
<content type="html"><![CDATA[<p>Hi everyone!</p>
<p>Bio4j 0.8 was recently released and now it’s time to have a deeper look at its numbers <em>(as you can see, we are quickly approaching 1 billion relationships and 100M nodes)</em>:</p>
<ul>
<li>Number of Relationships: <strong>717.484.649</strong></li>
<li>Number of Nodes: <strong>92.667.745</strong></li>
<li>Relationship types: <strong>144</strong></li>
<li>Node types: <strong>42</strong></li>
</ul>
<p>Ok, but how are those relationships and nodes distributed among the different types? In this chart you can see the <strong>first 20 Relationship types</strong>:</p>
<p><a href="http://bio4j.com/images/bio4j08first20RelTypes.png"><img src="http://bio4j.com/images/bio4j08first20RelTypes.png" /></a></p>
<p>Here, the same thing but for the <strong>first 20 Node types</strong>:</p>
<p><a href="http://bio4j.com/images/bio4j08first20NodeTypes.png"><img src="http://bio4j.com/images/bio4j08first20NodeTypes.png" /></a></p>
<p>You can also check these two files including the numbers for all the existing types:</p>
<ul>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/releases/0.8/statistics/Bio4j08NodeStatistics.txt">Node statistics</a></li>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/releases/0.8/statistics/Bio4j08RelStatistics.txt">Relationship statistics</a></li>
</ul>
<p>All this data was obtained with the program <a href="https://github.com/bio4j/Bio4jTools/blob/master/src/com/era7/bioinfo/bio4j/tools/GetNodeAndRelsStatistics.java"><strong>GetNodeAndRelsStatistics</strong></a>.</p>
<p>Have a good weekend!</p>
<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4j 0.8 is here!]]></title>
<link href="http://bio4j.com/blog/2012/09/bio4j-0-8-is-here/"/>
<updated>2012-09-22T18:50:52+02:00</updated>
<id>http://bio4j.com/blog/2012/09/bio4j-0-8-is-here</id>
<content type="html"><![CDATA[<p>Hi everyone!</p>
<p>I’m glad to announce the release of <a href="https://github.com/bio4j/Bio4j/wiki/Bio4j-0.8"><strong>Bio4j 0.8</strong></a> including more than <strong>5.488.000 new proteins</strong> and <strong>3.233.000 genes</strong> among others, plus the following improvements and features:</p>
<h2 id="pfam-families">Pfam families</h2>
<p>Bio4j now includes all Pfam families present in Uniprot KB (both Swiss-Prot and TrEMBL). For that, both a new node type and a new relationship type have been created: </p>
<ul>
<li>
<p><a href="http://www.bio4j.com/docs/bio4j/apidocs/com/era7/bioinfo/bio4j/model/nodes/PfamNode.html">PfamNode</a></p>
</li>
<li>
<p><a href="http://www.bio4j.com/docs/bio4j/apidocs/com/era7/bioinfo/bio4j/model/relationships/protein/ProteinPfamRel.html">ProteinPfamRel</a> (this relationship connects a protein and the respective Pfam families associated to it)</p>
</li>
</ul>
<p>The following properties have been added to the Pfam node:</p>
<ul>
<li>ID</li>
<li>Name</li>
</ul>
<p>Besides, an exact index for the Pfam family ID property has also been created <em>( pfam_id_index ).</em></p>
<h2 id="ncbi-taxonomy-tree-gi-index-improved">NCBI taxonomy tree GI index improved</h2>
<p>Old merged node IDs have been incorporated into the Gene Identifier <–> Taxonomy units index. That means that all the GI-TaxID pairs which included an old merged Tax-ID are now also part of the index, resulting in a higher hit rate when using the index.
For that we used the file <strong>merged.dmp</strong> included in the <a href="ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz">official tax dump file</a> provided by the NCBI.</p>
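<p>As a rough sketch of how such a file can be used, the snippet below loads old-to-new Tax-ID pairs and resolves a possibly outdated ID before an index lookup. It assumes each line of <strong>merged.dmp</strong> has the form <code>old_tax_id | new_tax_id |</code> with tab-padded pipes; the class <code>MergedTaxIds</code>, its methods, and the sample IDs are invented for illustration and are not part of the Bio4j importers.</p>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper, not Bio4j code: maps old (merged) Tax-IDs to their current ones.
final class MergedTaxIds {

    // Parses lines assumed to look like "old_tax_id\t|\tnew_tax_id\t|".
    static Map<Integer, Integer> parse(List<String> lines) {
        Map<Integer, Integer> merged = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\\|");
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            int oldId = Integer.parseInt(fields[0].trim());
            int newId = Integer.parseInt(fields[1].trim());
            merged.put(oldId, newId);
        }
        return merged;
    }

    // Returns the current Tax-ID for a possibly outdated one.
    static int resolve(Map<Integer, Integer> merged, int taxId) {
        return merged.getOrDefault(taxId, taxId);
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed layout.
        List<String> sample = Arrays.asList("1001\t|\t2002\t|", "1003\t|\t2004\t|");
        Map<Integer, Integer> merged = parse(sample);
        System.out.println(resolve(merged, 1001));
        System.out.println(resolve(merged, 5000));
    }
}
```

<p>Resolving IDs this way before querying a GI-to-taxonomy index is one way to get the same effect at lookup time; the release instead adds the old pairs to the index itself, so no extra resolution step is needed.</p>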
<h2 id="bio4j-and-bio4jmodel-projects-unification">Bio4j and Bio4jModel projects unification</h2>
<p><a href="https://github.com/bio4j/Bio4j">Bio4j</a> project has absorbed <a href="https://github.com/bio4j/Bio4jModel">Bio4jModel</a> project from this release on.</p>
<p>Until now, Bio4jModel library included the core classes for the manipulation and traversal of the graph while Bio4j project only included the importing programs. I’ve been thinking for a while that this could be confusing and, since there was no real need to keep them as independent projects, I decided to put it all under Bio4j <em>(you just need one jar file now ;) ).</em> </p>
<h2 id="new-script-for-the-importing-process">New script for the importing process</h2>
<p>You don’t have to worry anymore about manually downloading/decompressing/etc… the sources for the DB in case you want to import Bio4j in your own cluster/machine. Just run the script <strong><a href="https://github.com/bio4j/Bio4j/blob/master/DownloadAndPrepareBio4jSources.sh">DownloadAndPrepareBio4jSources.sh</a></strong> and it will do it all for you.</p>
<h2 id="bug-fixes">Bug fixes</h2>
<ol>
<li><strong>MetalIonBindingSiteFeature</strong> This feature relationship had an erroneous name assigned and it’s been fixed.</li>
</ol>
<p>Well, that’s all for now, I’ll be posting more information about this new release soon ;)</p>
<p>Cheers,</p>
<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[New Bio4j general domain model schema available]]></title>
<link href="http://bio4j.com/blog/2012/05/new-bio4j-general-domain-model-schema-available/"/>
<updated>2012-05-17T23:11:50+02:00</updated>
<id>http://bio4j.com/blog/2012/05/new-bio4j-general-domain-model-schema-available</id>
<content type="html"><![CDATA[<p>Hi everyone!</p>
<p>It’s been a few months since I published the last post, but that doesn’t mean that the development of Bio4j has stopped. On the contrary, I have been working on the integration of Bio4j with other DB-related projects as well as pipelines and tools. Actually, I’m right now staying in the US for a couple of months working on the implementation and integration of a new database around Bio4j including grasses genomic data as part of a collaboration with the Ohio State University (I promise to give more details about this and more in upcoming posts).</p>
<p>Ok, but let’s get to the point of this post. Even though a web tool to explore the Bio4j data structure is already available (<a href="http://gotools.bio4j.com:8080/Bio4jExplorerServer/Bio4jExplorer.html"><strong>Bio4jExplorer</strong></a>), I felt that something else was missing in order to get the big picture of all the data included and how it’s interrelated. So I got to work and created this general domain model including all node types and relationships (also specifying their cardinality).</p>
<p><a href="https://raw.github.com/bio4j/Bio4j/master/Bio4jDomainModelWithCardinality.jpg"><img src="http://bio4j.com/images/Bio4jDomainModelWithCardinality.png" /></a></p>
<p>I didn’t include “auxiliary” relationships linked to the reference node in order to not pollute the schema with relationships that don’t have any semantic meaning but rather indexing purposes. Also, the text included in both boxes represents different relationships all linking the same nodes -specifically Protein with CommentType and FeatureType. I could have drawn them as the rest but then I would have ended up with a hairball instead of a meaningful schema.</p>
<p>As always, any feedback is welcome!</p>
<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4jExplorer, new features and design!]]></title>
<link href="http://bio4j.com/blog/2012/03/bio4jexplorer-new-features-and-design/"/>
<updated>2012-03-09T21:57:56+01:00</updated>
<id>http://bio4j.com/blog/2012/03/bio4jexplorer-new-features-and-design</id>
<content type="html"><![CDATA[<p>Hello everyone,</p>
<p>I’m happy to announce a new set of features for our tool Bio4jExplorer plus some changes in its design. I hope this may help both potential and current users to get a better understanding of Bio4j DB structure and contents.</p>
<p><a href="http://gotools.bio4j.com:8080/Bio4jExplorerServer/Bio4jExplorer.html"><img src="http://bio4j.com/images/bio4jExplorerScreenshot-1024x712.png" /></a></p>
<h3 id="node--relationship-properties">Node & Relationship properties</h3>
<p>With Bio4jExplorer you can now check the properties of a node or relationship in the table located in the lower part of the interface. Five columns are included:</p>
<ul>
<li><strong>Name:</strong> property name</li>
<li><strong>Type:</strong> property type (<code>String</code>, <code>int</code>, <code>float</code>, <code>String[]</code>, …)</li>
<li><strong>Indexed:</strong> whether the property is indexed or not (yes/no)</li>
<li><strong>Index name</strong>: name of the index associated to this property -if there’s any</li>
<li><strong>Index type</strong>: type of the index associated to this property -if there’s any </li>
</ul>
<p><img src="http://bio4j.com/images/bio4jExplorerPropertiesTable.png" /></p>
<h3 id="node--relationship-data-source">Node & Relationship Data source</h3>
<p>You can also see now from which source a Node or Relationship was imported, <em>some examples would be Uniprot, Uniref, GO, RefSeq…</em></p>
<p><img src="http://bio4j.com/images/bio4jExplorerDataSourceLabel.png" /></p>
<h3 id="relationships-name-property">Relationships Name property</h3>
<p>With this new version you can directly check here the “internal” name of relationships without having to go to the respective javadoc documentation. </p>
<p><img src="h/images/bio4jExplorerRelationshipsNameProperty.png" /></p>
<p>This is quite useful when you are writing your Cypher or Gremlin queries, just check it, copy it, and paste it in your query. An example using the relationship shown in the picture would be this query included in the <a href="https://github.com/bio4j/Bio4j/wiki/Bio4j-cypher-cheat-sheet">Bio4j Cypher cheatsheet</a>:</p>
<p><strong><em>Get proteins (accession and names) associated to an interpro motif (limited to 10 results)</em></strong></p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class=""><span class="line">START i=node:interpro_id_index(interpro_id_index = "IPR023306")
</span><span class="line"> MATCH i <-[:PROTEIN_INTERPRO]- p
</span><span class="line"> return p.accession, p.fullname, p.name, p.short_name
</span><span class="line"> limit 10</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>The url for Bio4jExplorer is the same as before:</p>
<ul>
<li><a href="http://gotools.bio4j.com:8080/Bio4jExplorerServer/Bio4jExplorer.html"><strong>http://gotools.bio4j.com:8080/Bio4jExplorerServer/Bio4jExplorer.html</strong></a></li>
</ul>
<p>In case you are interested in how the tool is implemented, please go to <a href="blog//2011/10/bio4jexplorer-familiarize-yourself-with-bio4j-nodes-and-relationships">the previous post about Bio4jExplorer</a> where you can find information about the different code repos and more info.</p>
<p><strong>If you want to check the files including the hard-coded information regarding how nodes, relationships, and indexes are organized</strong>, which are the input for the program that creates the AWS SimpleDB domain, I just uploaded them to the bio4j-public S3 bucket. Please click on their names to download them:</p>
<ul>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/NodesBio4j.txt"><strong>NodesBio4j.txt</strong></a></li>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/NodeIndexesBio4j.txt"><strong>NodeIndexesBio4j.txt</strong></a></li>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/NodePropertiesBio4j.txt"><strong>NodePropertiesBio4j.txt</strong></a></li>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/RelationshipsBio4j.txt"><strong>RelationshipsBio4j.txt</strong></a></li>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/RelationshipPropertiesBio4j.txt"><strong>RelationshipPropertiesBio4j.txt</strong></a></li>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/RelationshipIndexesBio4j.txt"><strong>RelationshipIndexesBio4j.txt</strong></a></li>
</ul>
<p>I wish you all a great weekend!</p>
<p>I’ll have mine at the beach enjoying our great springy weather with lots of sun down here in Andalucia ;)</p>
<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4j 0.7, some numbers]]></title>
<link href="http://bio4j.com/blog/2012/03/bio4j-0-7-some-numbers/"/>
<updated>2012-03-05T13:28:27+01:00</updated>
<id>http://bio4j.com/blog/2012/03/bio4j-0-7-some-numbers</id>
<content type="html"><![CDATA[<p>Hi everyone!</p>
<p>There have already been a good few posts showing different uses and applications of Bio4j, but what about Bio4j data itself?
Today I’m going to show you some <strong>basic statistics</strong> about the different types of nodes and relationships Bio4j is made up of.
Just as a heads up, here are the <strong>general numbers of Bio4j 0.7</strong> :</p>
<ul>
<li>Number of Relationships: <strong>530.642.683</strong></li>
<li>Number of Nodes: <strong>76.071.411</strong></li>
<li>Relationship types: <strong>139</strong></li>
<li>Node types: <strong>38</strong></li>
</ul>
<p>Ok, but how are those relationships and nodes distributed among the different types? In this chart you can see the <strong>first 20 Relationship types</strong> (click on the image below to check the interactive chart):</p>
<p><a href="http://bio4j.com/imgs/release07/relsBarChart.html"><img src="http://bio4j.com/images/first20RelTypesChart-1024x797.png" /></a></p>
<p>Here, the same thing but for the <strong>first 20 Node types</strong> (click on the image below to check the interactive chart):</p>
<p><a href="http://bio4j.com/imgs/release07/nodesBarChart.html"><img src="http://bio4j.com/images/first20NodeTypesChart-1024x794.png" /></a></p>
<p>You can also check these two files including the numbers from all existing types:</p>
<ul>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/releases/0.7/statistics/Bio4j07NodeStatistics.txt">Node statistics</a></li>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/releases/0.7/statistics/Bio4j07RelStatistics.txt">Relationship statistics</a></li>
</ul>
<p>All this data was obtained with the program <a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/GetNodeAndRelsStatistics.java"><strong>GetNodeAndRelsStatistics</strong></a>.</p>
<p>Have a good day!</p>
<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
<h2 id="comments">comments</h2>
<ul>
<li>
<p><strong>Patrick Durusau</strong>
Excellent!
Question: When I checked at PubMed, I did not find Neo4j cited in any of the medical literature. I am not a medical professional but am interested in what might promote Bio4j in the medical research community?
It is too good of a resource to be unnoticed.
Patrick</p>
<ul>
<li><strong>ppareja</strong>
Hi Patrick,
I’m glad you liked the post.
It’s true that Bio4j may not have caught the attention of many people yet who could definitely make a good use out of it. What are the reasons for that? Well, I think it could be a mixture of factors.
Some people don’t much like learning new technologies/strategies/workflows… and tend to stick to things they already know as long as possible – which is totally respectable and understandable. Other people, though, may simply not have found out about it yet… It’s also possible that, due to the lack of well structured project documentation, potential users get lost along the way when trying to figure out what Bio4j is about and/or miss the parts they could be interested in.
I could keep on going with more possible reasons that are coming to my mind but still, couldn’t be really objective – it’s me who created this project :D
The point you bring up is actually one of the reasons why we value so much any sort of feedback for the project (especially constructive ‘bad’ feedback that helps us realize its weaknesses).
Let me know if you come up with an idea to let more people know about Bio4j !
Pablo</li>
</ul>
</li>
</ul>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4j REST Server configures itself now thanks to the updated CF template]]></title>
<link href="http://bio4j.com/blog/2012/02/bio4j-rest-server-configures-itself-now-thanks-to-the-updated-cf-template/"/>
<updated>2012-02-24T14:21:39+01:00</updated>
<id>http://bio4j.com/blog/2012/02/bio4j-rest-server-configures-itself-now-thanks-to-the-updated-cf-template</id>
<content type="html"><![CDATA[<p>Hi all,</p>
<p>I just wanted to write a very short post informing about the changes in the <a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/Bio4jBasicRestServerTemplate.txt"><strong>Bio4jBasicRestServerTemplate</strong></a>. </p>
<blockquote>
<p>Template what!? </p>
</blockquote>
<p>If that’s what you’re thinking, please go <a href="http://blog.ohnosequences.com/2011/12/neo4j-server-and-aws-become-good-friends/">here</a> to get an idea of what this is all about.</p>
<p>From now on, this CloudFormation template adapts the server configuration files:</p>
<ul>
<li><code>neo4j-wrapper.conf</code></li>
<li><code>neo4j.properties</code></li>
</ul>
<p>to the characteristics of the instance type the server is running in, so that it can make the best out of it.</p>
<blockquote>
<p>These configurations assume that the server is running alone in the machine.</p>
</blockquote>
<p>For that I created these two new mappings in the template:</p>
<ul>
<li><code>AWSInstanceType2WrapperConfFile</code></li>
<li><code>AWSInstanceType2Neo4jPropertiesFile</code></li>
</ul>
<p>Default configuration values are available in the <strong>bio4j-public S3 bucket</strong>. For example in order to have access to the server configuration files of a <code>m1.xlarge</code> instance, just go to this url:</p>
<ul>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/server/conf-files/m1.xlarge/neo4j-wrapper.conf">neo4j-wrapper.conf - m1.xlarge</a></li>
</ul>
<p>same thing for the other file:</p>
<ul>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/server/conf-files/m1.xlarge/neo4j.properties">neo4j.properties - m1.xlarge</a></li>
</ul>
<p>If you want to check the conf files for any other instance type, you just have to change the <strong>instance type name</strong> in the urls linked above.</p>
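<p>The URL pattern described above can be written down as a tiny helper. This is only a sketch: the base URL and file names are taken from the links in this post, the function name is made up, and nothing here checks that a file actually exists for a given instance type.</p>

```python
# Base location of the per-instance-type configuration files (from the links above).
BASE_URL = "https://s3-eu-west-1.amazonaws.com/bio4j-public/server/conf-files"
CONF_FILES = ("neo4j-wrapper.conf", "neo4j.properties")


def conf_file_urls(instance_type):
    """Build the URL of each server configuration file for an instance type."""
    return {name: f"{BASE_URL}/{instance_type}/{name}" for name in CONF_FILES}


urls = conf_file_urls("m1.xlarge")
for name, url in urls.items():
    print(name, "->", url)
```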
<p>Have a good weekend!</p>
<p><strong><a href="http://www.twitter.com/pablopareja">@pablopareja</a></strong></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j]]></title>
<link href="http://bio4j.com/blog/2012/02/finding-the-lowest-common-ancestor-of-a-set-of-ncbi-taxonomy-nodes-with-bio4j/"/>
<updated>2012-02-08T21:20:53+01:00</updated>
<id>http://bio4j.com/blog/2012/02/finding-the-lowest-common-ancestor-of-a-set-of-ncbi-taxonomy-nodes-with-bio4j</id>
<content type="html"><![CDATA[<p>I don’t know if you have ever heard of the <a href="http://en.wikipedia.org/wiki/Lowest_common_ancestor"><strong>lowest common ancestor problem</strong></a> in graph theory and computer science but it’s actually pretty simple. As its name says, it consists of finding the common ancestor for two different nodes which has the lowest level possible in the tree/graph.</p>
<p>Even though it is normally defined for only two nodes given <strong>it can easily be extended for a set of nodes with an arbitrary size</strong>. This is a quite common scenario that can be found across multiple fields and **taxonomy **is one of them.</p>
<p>The reason I’m talking about all this is because today I ran into the need to make use of such algorithm as part of some improvements in our <strong>metagenomics</strong> <a href="http://www.era7bioinformatics.com/en/metagenomics_mg7.html">MG7 method</a>. After doing some research looking for existing solutions, I came to the conclusion that I should implement my own, I couldn’t find any applicable implementation that was thought for more than just <strong>two</strong> nodes.</p>
<p>Ok, but let’s get into detail and see my algorithm:</p>
<p>We start from a set of nodes with an arbitrary length -<em>4 in this sample</em>, which are spread through the taxonomy tree:</p>
<p><img src="http://bio4j.com/images/LCAfirstStep.png" /></p>
<p>We then fetch the first node from the set and calculate its whole ancestor list up to the main root of the taxonomy.</p>
<p><img src="http://bio4j.com/images/LCAsecondStep.png" /></p>
<p>Now that we have the list, we take the second node of the set and check if it’s contained in the list. If not, we keep going up through its ancestors until we find a hit. Once the hit has been found, we get rid of the elements that precede it in the list (if any) so that they are not taken into account in the next iterations of the algorithm.</p>
<p><img src="http://bio4j.com/images/LCAthirdStep.png" /></p>
<p>We keep going through our node set, and C also removes some elements of the list…</p>
<p><img src="http://bio4j.com/images/LCAfourthStep.png" /></p>
<p>Finally we reach the last node of our set, but no element is removed from our list as a result.</p>
<p><img src="http://bio4j.com/images/LCAfifthStep.png" /></p>
<p>The last thing we have to do is simply get the first element of the resulting list and there we have our lowest common ancestor!</p>
<p><img src="http://bio4j.com/images/LCAsixthStep.png" /></p>
<p>This algorithm is encapsulated in the class <a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/algo/TaxonomyAlgo.java"><strong>TaxonomyAlgo</strong></a>, specifically in the static method <code>lowestCommonAncestor()</code> that expects a list of <strong>NCBITaxonNode</strong> as parameter and returns their LCA node.</p>
<p>You can also check the class <a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/taxonomy/LowestCommonAncestorTest.java"><strong>LowestCommonAncestorTest</strong></a> where a simple test program that makes use of this method is implemented. </p>
<p>This program expects as parameters:</p>
<ol>
<li>Bio4j DB folder</li>
<li>An arbitrary number of NCBI taxonomy IDs representing the node set</li>
</ol>
<p>The Scientific name and the NCBI tax ID of the LCA are printed in the console as result.</p>
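<p>The steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Java code in <code>TaxonomyAlgo</code>: the tree is represented by a hypothetical <code>parent</code> dictionary instead of Bio4j nodes, all nodes are assumed to belong to the same rooted tree, and the node names are made up.</p>

```python
def ancestors_to_root(node, parent):
    """Return [node, its parent, its grandparent, ..., root]."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(nodes, parent):
    """LCA of a non-empty list of nodes that share the same rooted tree.

    `parent` maps every non-root node to its parent.
    """
    # Ancestor list of the first node, ordered from the node up to the root.
    candidates = ancestors_to_root(nodes[0], parent)
    for node in nodes[1:]:
        positions = {name: index for index, name in enumerate(candidates)}
        current = node
        # Climb from this node until we hit an element of the candidate list.
        while current not in positions:
            current = parent[current]
        # Drop everything below the hit: those nodes cannot be common ancestors.
        hit = positions[current]
        candidates = candidates[hit:]
    # The first remaining element is the lowest common ancestor.
    return candidates[0]


# Made-up tree: root -> Z -> Y -> X -> A, with B under X and C, D under Z.
parent = {"A": "X", "X": "Y", "Y": "Z", "Z": "root", "B": "X", "C": "Z", "D": "Z"}
lca = lowest_common_ancestor(["A", "B", "C", "D"], parent)
print(lca)
```

<p>In this made-up tree B trims the list down to X, C trims it further to Z, and D removes nothing, which mirrors the walkthrough in the pictures. Since the candidate list only ever gets shorter, nodes that were already discarded are never looked at again.</p>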
<p>Enjoy!</p>
<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
<h2 id="comments">comments</h2>
<ul>
<li>
<p><strong>Paul Agapow</strong>
Oddly enough, I had to solve this exact problem a few years ago (to see how much of a tree is left after an extinction, for calculating the biodiversity impact) and then just a few weeks ago (but for the unrooted case). Both times I was sure this had to be a solved problem, but there was no obvious solution out there.</p>
<ul>
<li><strong>Pablo Pareja</strong>
Hi Paul,
I was also quite surprised there wasn’t any ‘official’/obvious solution for this, specially when I’d say it’s quite a common problem.
Now that you mention it, I think I’ll extend the implementation for the unrooted case as well.
By the way, just out of curiosity, what kind of solution did you come up with in the end?</li>
</ul>
</li>
<li>
<p><strong>Victor de Jager</strong>
Hi Pablo,
interesting post. I solved a very similar problem a few years ago using an early version of the ETE toolkit. http://ete.cgenomics.org/
It’s well documented, with plenty of examples.</p>
<ul>
<li><strong>ppareja</strong>
Hi Victor,
Thanks for the link. Just a quick question, is it open-source?</li>
</ul>
</li>
<li>
<p><strong>Jaime</strong>
Hi,
You may be interested in this python script based on the ETE library: https://github.com/jhcepas/ncbi_taxonomy
BTW, ETE is free software</p>
</li>
<li>
<p><strong>Miguel</strong>
The LCA problem is closely related to the Range Minimum Query problem in graph theory. Working on metagenomics I had to implement a fast algorithm to search for the LCA of an arbitrary number of leaves in a taxonomic tree. Given that the tree is always the same, you can pre-process it for fast searches. I ended up implementing the Sparse Table algorithm for RMQ explained here:
<a href="http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=lowestCommonAncestor">http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=lowestCommonAncestor</a>
You say in your post that you couldn’t find any solution out there for more than 2 nodes. The reason is simple: the LCA of N nodes can be decomposed to N-1 times the LCAs of 2 nodes (for example, the LCA of 3 nodes is the LCA of one of them and the LCA of the other 2).</p>
<ul>
<li><strong>ppareja</strong>
Hi Miguel,
Thanks for the link ;)
In my case though I didn’t want to do any pre-processing on purpose. Having everything stored as a graph gives you a great advantage both in terms of speed and scalability and I didn’t want to throw that away. On the other hand this sort of algorithm is one that could be applied to other sub-graphs of Bio4j, not only the taxonomy tree, so once you implement it in this way it’d be trivial to adapt it to other such cases.
I know that the problem can be decomposed so that you end up with a set of 2-node problems. What I meant, however, was that I expected to find algorithms for this problem with some sort of specific optimizations for dealing with a big set of nodes, not only two; for example, somehow not passing again through nodes already visited, which will happen when you decompose the problem into “isolated” pairs of nodes.</li>
</ul>
</li>
</ul>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Mining Bio4j data: finding topological patterns in PPI networks]]></title>
<link href="http://bio4j.com/blog/2012/01/mining-bio4j-data-finding-topological-patterns-in-ppi-networks/"/>
<updated>2012-01-24T16:42:56+01:00</updated>
<id>http://bio4j.com/blog/2012/01/mining-bio4j-data-finding-topological-patterns-in-ppi-networks</id>
<content type="html"><![CDATA[<p>Hi everyone!</p>
<p>After writing <a href="http://blog.bio4j.com/2011/12/using-bio4j-neo4j-graph-algo-component-for-finding-protein-protein-interaction-paths/"><strong>this post</strong></a> in December, I’ve been thinking of doing something similar, yet different, using the Neo4j Cypher query language.</p>
<p>That’s how I came up with the idea of looking for <strong>topological patterns</strong> through a large <strong>sub-set of the Protein-Protein interactions network</strong> included in Bio4j, rather than focusing on a few proteins selected a priori.</p>
<p>I decided to mine the data in order to find <strong>circuits/simple cycles of length 3</strong> where <strong>at least one protein is from Swiss-Prot dataset</strong>:</p>
<p><img src="http://bio4j.com/images/PPICircuit.png" /></p>
<p>I would like to point out that the <strong>direction</strong> here <strong>is important</strong> and these two cycles:</p>
<ul>
<li><code>A --> B --> C --> A</code></li>
<li><code>A --> C --> B --> A</code></li>
</ul>
<p>are <strong>not</strong> the same. Ok, so once this has been said, let’s see what the Cypher query looks like:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class=""><span class="line">START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
</span><span class="line">MATCH d <-[r:PROTEIN_DATASET]- p,
</span><span class="line">circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
</span><span class="line"> return p.accession, p2.accession, p3.accession</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>As you can see it’s really simple and straightforward. In the first two lines we match the proteins from the Swiss-Prot dataset, and then we retrieve the ones which form a 3-length cycle as described before. Once the query has finished, you should be getting something like this:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class=""><span class="line">cypher>
</span><span class="line">==> +---------------------------------------------------------+
</span><span class="line">p.accession | p2.accession | p3.accession |
</span><span class="line">==> +---------------------------------------------------------+
</span><span class="line">Q08465 P35189 P3421
</span><span class="line">Q08465 P34218 P35189
</span><span class="line">Q8GXA4 Q8L7E5 Q9LE82
</span><span class="line">Q8GXA4 Q9FH18 Q8L7E5
</span><span class="line">....
</span><span class="line">==> +---------------------------------------------------------+
</span><span class="line">==> 6632 rows, 1019211 ms</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>As you can see, the query took <strong>about 17 minutes</strong> to complete <strong>on a 100% fresh DB</strong> (there was no information cached at all yet) on an <a href="http://aws.amazon.com/ec2/instance-types/"><strong>m1.large</strong> AWS machine</a>, which has <strong>7.5GB</strong> of <strong>RAM</strong>.</p>
<p>Not bad, right!? </p>
<p>We have to beware of something though, this query returns cycles such as:</p>
<ul>
<li><code>A --> B --> C --> A</code></li>
<li><code>B --> C --> A --> B</code></li>
</ul>
<p>as different cycles when they are actually not. That’s why I developed a <a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/RemoveRepetitionsFromPPICircuits.java"><strong>simple program</strong></a> to remove these repetitions as well as for fetching some statistics information.
After running the program you get two files:</p>
<ol>
<li><strong>PPICircuitsLength3NoRepeats</strong> file: download it <a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/PPICircuitsBlogPost/PPICircuitsL3SwissProtNoRepeats.txt">here</a></li>
<li><strong>PPICircuitsProteinsFreq</strong> file: download it <a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/PPICircuitsBlogPost/PPICircuitsL3SwissProtProteinsFreq.txt">here</a>.</li>
</ol>
<p>The <strong>final circuits found</strong> were reduced after performing the filtering to <strong>2226 records</strong>.</p>
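<p>The filtering idea can be sketched as follows. This is a minimal Python illustration and not the linked Java program: it assumes each row of the query result is a tuple of three accessions, treats two rows as the same circuit when one is a rotation of the other, and uses placeholder letters instead of real accessions.</p>

```python
from collections import Counter


def canonical_rotation(cycle):
    """Rotate a directed cycle so that its smallest element comes first."""
    start = cycle.index(min(cycle))
    return tuple(cycle[start:] + cycle[:start])


def remove_rotations(cycles):
    """Keep one representative per directed cycle, preserving input order."""
    seen = set()
    unique = []
    for cycle in cycles:
        key = canonical_rotation(list(cycle))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


rows = [
    ("A", "B", "C"),
    ("B", "C", "A"),  # same circuit as the first row, just rotated
    ("A", "C", "B"),  # opposite direction, so a different circuit
]
unique = remove_rotations(rows)

# Absolute frequency of each protein across the de-duplicated circuits.
frequency = Counter(protein for cycle in unique for protein in cycle)
print(unique)
print(frequency.most_common(20))
```

<p>Rotating instead of sorting matters here: sorting the three accessions would also merge circuits that run in opposite directions, which, as pointed out above, are not the same.</p>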
<p>Finally, I also created a really simple chart including the absolute frequency of the 20 proteins with the most occurrences in the cycles that were found.</p>
<p><img src="http://bio4j.com/images/proteinsFrequencyChart.png" /></p>
<p>Well, that’s all for now. Have a good day!</p>
<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Bio4j release 0.7 is out!]]></title>
<link href="http://bio4j.com/blog/2012/01/bio4j-release-0-7-is-out/"/>
<updated>2012-01-11T17:50:52+01:00</updated>
<id>http://bio4j.com/blog/2012/01/bio4j-release-0-7-is-out</id>
<content type="html"><![CDATA[<p>Hi!</p>
<p>I’m happy to announce that the version 0.7 of Bio4j has been released. Check out the wide set of new features, tools and improvements:</p>
<h2 id="expasy-enzyme-database-integration">Expasy Enzyme database integration</h2>
<p>From now on you have the whole <a href="http://enzyme.expasy.org"><strong>Enzyme DB</strong></a> included in Bio4j. For that, both a new node type and relationship type have been created: </p>
<ul>
<li><a href="http://www.bio4j.com/docs/bio4jmodel/apidocs/com/era7/bioinfo/bio4jmodel/nodes/EnzymeNode.html">EnzymeNode</a></li>
<li><a href="http://www.bio4j.com/docs/bio4jmodel/apidocs/com/era7/bioinfo/bio4jmodel/relationships/protein/ProteinEnzymaticActivityRel.html">ProteinEnzymaticActivityRel</a> (this relationship connects a protein and the respective enzyme nodes associated to it)</li>
</ul>
<p>All properties found have been incorporated into the enzyme node, including:</p>
<ul>
<li>ID</li>
<li>Official name</li>
<li>Alternate names</li>
<li>Cofactors</li>
<li>Comments</li>
<li>Catalytic activity</li>
<li>Prosite cross-references</li>
</ul>
<h2 id="node-type-indexing">Node type indexing</h2>
<p>From now on, every node present in the database has an indexed property <em><strong>nodeType</strong></em> holding its type. That way you can now access all nodes belonging to a specific type really easily. </p>
<h2 id="availability-in-all-regions">Availability in all Regions</h2>
<p><a href="http://aws.amazon.com"><img class="right" src="http://d36cz9buwru1tt.cloudfront.net/logo_aws.gif" /></a></p>
<p>The AWS region you are based in won’t be a problem for using Bio4j anymore. EBS Snapshots have been created in all regions, and the CloudFormation templates have been updated so that they can now be used regardless of the region where you want to create the stack. </p>
<blockquote>
<p>Only the Asia Pacific (Singapore) <code>ap-southeast-1</code> region is not ready, due to ongoing issues on the AWS side regarding extremely slow S3 object downloading. Hope we can find a workaround for this soon!</p>
</blockquote>
<h2 id="new-cloudformation-templates">New CloudFormation templates</h2>
<h3 id="basic-bio4j-instance-updated">Basic Bio4j instance (updated)</h3>
<p>The basic Bio4j instance template has been updated so that now you can use it from all zones. Check out more info about this in the <a href="http://blog.bio4j.com/2011/12/bio4j-aws-cloudformation-your-own-fresh-baked-db-in-less-than-a-minute/"><strong>updated blog post</strong></a></p>
<h3 id="basic-bio4j-rest-server">Basic Bio4j REST server</h3>
<p>A new template has been developed so that you can easily deploy your Neo4j-Bio4j REST server in less than a minute.</p>
<p>This template is available in the following address:</p>
<ul>
<li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/Bio4jBasicRestServerTemplate.txt"><strong>https://s3-eu-west-1.amazonaws.com/bio4j-public/Bio4jBasicRestServerTemplate.txt</strong></a></li>
</ul>
<p>The steps you should follow to create the stack are really simple. Actually, you can use as a guide <a href="http://blog.ohnosequences.com/2011/12/neo4j-server-and-aws-become-good-friends/"><strong>this blog post</strong></a> about the template I created for deploying a general Neo4j server; <em>only one or two parameters vary</em>.</p>
<h2 id="bio4j-rest-server">Bio4j REST server</h2>
<p>Once you get your server running thanks to the useful template I just mentioned before, using Neo4j WebAdmin with Bio4j as source you will be able to:</p>
<h3 id="explore-you-database-with-the-data-browser">Explore your database with the Data browser</h3>
<p>Using the data browser tab of the Web administration tool you can explore in real-time the contents of Bio4j!</p>
<p><img src="http://bio4j.com/images/Bio4jDataBrowser-1024x699.png" /></p>
<p>To get visualizations like the one shown above, you should make use of <strong>visualization profiles</strong>. There you can specify different styles associated with customizable rules, which can be expressed in terms of the node properties. Here’s a screenshot showing what the visualization profile I used for the visualization above looks like:</p>
<p><img src="http://bio4j.com/images/Bio4jDataBrowserVizProfile-1024x752.png" /></p>
<blockquote>
<p>Just beware of one thing: the tool does not distinguish between highly connected nodes and more isolated ones. Because of this, clicking nodes such as the <strong>Trembl</strong> dataset node is not advisable unless you want to see the tool freeze forever (<em>this node has more than 15 million relationships linking it to proteins</em>).</p>
</blockquote>
<h2 id="run-queries-with-cypher">Run queries with Cypher</h2>
<p>Cypher what?!</p>
<blockquote>
<p><a href="http://docs.neo4j.org/chunked/milestone/cypher-query-lang.html"><img class="right" src="http://a1.twimg.com/profile_images/195275920/square-logo-no-text-2_normal.png" /></a>
<strong>Cypher is a declarative language</strong> which allows for expressive and efficient querying of the graph store without having to write traversers in code. It <strong>focuses on the clarity of expressing what to retrieve</strong> from a graph, <strong>not how to do it</strong>, in contrast to imperative languages like Java, and scripting languages like Gremlin.</p>
</blockquote>
<p>A query that retrieves protein interaction circuits of length 3 starting from proteins that belong to the Swiss-Prot dataset (limited to 5 results) would look like this in Cypher:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class=""><span class="line">START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
</span><span class="line"> MATCH d <-[r:PROTEIN_DATASET]- p,
</span><span class="line"> circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
</span><span class="line"> return p.accession, p2.accession, p3.accession, p.accession
</span><span class="line"> limit 5</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>If you want to check out more examples of Bio4j + Cypher, have a look at our <a href="https://github.com/bio4j/Bio4j/wiki/Bio4j-cypher-cheat-sheet"><strong>Bio4j cypher cheat sheet</strong></a>, which we will be updating from time to time.</p>
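<p>To make the idea of an "interaction circuit of length 3" more concrete, here is a minimal plain-Java sketch of what the pattern in the query above looks for. It is only an illustration under a few assumptions: the class <code>InteractionCircuits</code>, the method <code>findCircuits</code> and the toy adjacency map are hypothetical names that are not part of Bio4j or Neo4j, interactions are treated as directed, and only the starting protein is required to belong to the dataset, as in the Cypher pattern.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Bio4j or Neo4j.
public class InteractionCircuits {

    /**
     * Returns up to `limit` directed circuits p -> p2 -> p3 -> p,
     * where the starting protein p belongs to `datasetMembers`.
     * `interactions` maps a protein accession to the accessions it interacts with.
     */
    public static List<List<String>> findCircuits(Map<String, Set<String>> interactions,
                                                  Set<String> datasetMembers,
                                                  int limit) {
        List<List<String>> circuits = new ArrayList<>();
        if (limit <= 0) {
            return circuits;
        }
        for (String p : datasetMembers) {
            for (String p2 : interactions.getOrDefault(p, Set.<String>of())) {
                for (String p3 : interactions.getOrDefault(p2, Set.<String>of())) {
                    // The circuit closes only if p3 interacts back with p.
                    if (interactions.getOrDefault(p3, Set.<String>of()).contains(p)) {
                        circuits.add(List.of(p, p2, p3, p));
                        if (circuits.size() >= limit) {
                            return circuits;
                        }
                    }
                }
            }
        }
        return circuits;
    }

    public static void main(String[] args) {
        // Toy data: A -> B -> C -> A is the only circuit of length 3.
        Map<String, Set<String>> interactions = Map.of(
                "A", Set.of("B"),
                "B", Set.of("C"),
                "C", Set.of("A", "D"),
                "D", Set.<String>of());
        System.out.println(findCircuits(interactions, Set.of("A"), 5)); // [[A, B, C, A]]
    }
}
```

<p>The nested loops mirror the three <code>PROTEIN_PROTEIN_INTERACTION</code> hops in the pattern. With Cypher you only describe the shape of the circuit and let the query engine decide how to traverse it.</p>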
<h2 id="querying-bio4j-with-gremlin">Querying Bio4j with Gremlin</h2>
<p>Gremlins? What do they have to do with Bio4j!?</p>
<blockquote>
<p><a href="https://github.com/tinkerpop/gremlin/wiki"><img class="right" src="https://raw.github.com/tinkerpop/gremlin/master/doc/images/gremlin-standing-small.png" /></a>
<strong>Gremlin is a graph traversal language that can be natively used in various JVM languages</strong> - it currently provides native support for Java, Groovy, and Scala. However, it can express in a few lines of code what it would take many, many lines of code in Java to express.</p>
</blockquote>
<p>Querying the proteins associated with the InterPro motif with id <code>IPR023306</code> in Bio4j with Gremlin would look like this (limited to 5 results):</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class=""><span class="line">gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..4]
</span><span class="line">==> E2GK26
</span><span class="line">==> G3PMS4
</span><span class="line">==> G3Q865
</span><span class="line">==> G3PIL8
</span><span class="line">==> G3NNA4
</span><span class="line">gremlin> </span></code></pre></td></tr></table></div></figure></notextile></div>
<p>If you want to check out more examples of Bio4j + Gremlin, have a look at our <a href="https://github.com/bio4j/Bio4j/wiki/Bio4j-gremlin-cheat-sheet"><strong>Bio4j gremlin cheat sheet</strong></a>, which we will be updating from time to time.</p>
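<p>For readers more used to Java than to Gremlin, the traversal above can be read step by step: look up the motif, take its incoming <code>PROTEIN_INTERPRO</code> edges, move to the protein at the other end of each edge, read its accession, and keep the first five. The following plain-Java sketch mimics those steps on an in-memory list of edges. It is only an analogy: <code>InterproLookup</code>, <code>Edge</code>, <code>proteinsForMotif</code> and the accessions in the toy data are made up for illustration and are not part of Bio4j, Neo4j or Gremlin.</p>

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Bio4j, Neo4j or Gremlin.
public class InterproLookup {

    /** Toy stand-in for a PROTEIN_INTERPRO relationship: protein -> InterPro motif. */
    public static final class Edge {
        final String proteinAccession;
        final String interproId;

        public Edge(String proteinAccession, String interproId) {
            this.proteinAccession = proteinAccession;
            this.interproId = interproId;
        }
    }

    /** Returns the accessions of at most `limit` proteins linked to the given motif. */
    public static List<String> proteinsForMotif(List<Edge> edges, String interproId, int limit) {
        return edges.stream()
                .filter(e -> e.interproId.equals(interproId)) // motif lookup + inE('PROTEIN_INTERPRO')
                .map(e -> e.proteinAccession)                 // outV.accession
                .limit(Math.max(0, limit))                    // [0..4] when limit is 5
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up accessions and a made-up second motif id, for illustration only.
        List<Edge> edges = List.of(
                new Edge("X00001", "IPR023306"),
                new Edge("X00002", "IPR_OTHER"),
                new Edge("X00003", "IPR023306"));
        System.out.println(proteinsForMotif(edges, "IPR023306", 5)); // [X00001, X00003]
    }
}
```

<p>Unlike this sketch, which scans every edge, the Gremlin query starts from an index lookup on the motif node and only touches the edges attached to it.</p>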
<h2 id="bug-fixes">Bug fixes</h2>
<ol>
<li><strong>Dataset nodes</strong>: there was a bug in the importing process which resulted in the creation of a new dataset node every time a new UniProt entry was stored. This is now fixed!</li>
</ol>
<p>So that’s all for now! I hope you enjoy all these changes and new features I’ve been working on over the last couple of months. As always, feel free to send any feedback you may have; I’m looking forward to it ;)</p>
<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Using Bio4j + Neo4j Graph-algo component for finding protein-protein interaction paths]]></title>
<link href="http://bio4j.com/blog/2011/12/using-bio4j-neo4j-graph-algo-component-for-finding-protein-protein-interaction-paths/"/>
<updated>2011-12-19T22:35:41+01:00</updated>
<id>http://bio4j.com/blog/2011/12/using-bio4j-neo4j-graph-algo-component-for-finding-protein-protein-interaction-paths</id>
<content type="html"><![CDATA[<p>Hi all!</p>
<p>Today I managed to find some time to check out the <a href="http://wiki.neo4j.org/content/Graph-algo"><strong>Graph-algo component</strong></a> from Neo4j, and after playing with it together with Bio4j for a bit, I have to say it seems pretty cool.
For those who don’t know what I’m talking about, here is the description you can find in the Neo4j wiki:</p>
<blockquote>
<p>This is a component that offers implementations of common graph algorithms on top of Neo4j. It is mostly focused around finding paths, like finding the shortest path between two nodes, but it also contains a few different centrality measures, like betweenness centrality for nodes.</p>
</blockquote>
<p>The algorithm for finding the <strong>shortest path between two nodes</strong> caught my attention, and I started to wonder how I could try it out on the data included in Bio4j. I then realized that <strong>protein-protein interactions</strong> could be a good candidate, so I got down to work and created the utility method:</p>
<ul>
<li><code>findShortestInteractionPath(ProteinNode proteinSource, ProteinNode proteinTarget, int maxDepth, int maxResultsNumber)</code></li>
</ul>
<p>which returns at most <code>maxResultsNumber</code> paths between <code>proteinSource</code> and <code>proteinTarget</code> with a maximum path depth of <code>maxDepth</code>.
You can check the <a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/algo/InteractionsPathFinder.java"><strong>source code here</strong></a>.</p>
<p>I also wrote a <strong><a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/algo/FindInteractionPaths.java">small test program</a></strong> which prints out the paths found between two proteins.</p>
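<p>If you are curious about what a depth-limited shortest-path search does conceptually, here is a small plain-Java sketch. It is not the Graph-algo implementation, nor the code in <code>InteractionsPathFinder</code>; it is just a breadth-first search over a toy adjacency map. The names <code>ShortestInteractionPaths</code> and <code>findShortestPaths</code> are hypothetical. The sketch also assumes that depth is counted in relationships and that interactions are followed only in the direction given in the map.</p>

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Bio4j or Neo4j.
public class ShortestInteractionPaths {

    /**
     * Breadth-first search returning at most `maxResults` shortest paths from
     * `source` to `target`, each using at most `maxDepth` interactions.
     */
    public static List<List<String>> findShortestPaths(Map<String, List<String>> interactions,
                                                       String source, String target,
                                                       int maxDepth, int maxResults) {
        List<List<String>> results = new ArrayList<>();
        if (maxResults <= 0 || maxDepth < 0) {
            return results;
        }
        if (source.equals(target)) {
            results.add(List.of(source));
            return results;
        }
        List<List<String>> frontier = new ArrayList<>();
        frontier.add(List.of(source));
        Set<String> visited = new HashSet<>(); // nodes reached at an earlier depth
        visited.add(source);

        // Expand one level at a time and stop at the first depth where the target shows up.
        for (int depth = 1; depth <= maxDepth && !frontier.isEmpty() && results.isEmpty(); depth++) {
            List<List<String>> next = new ArrayList<>();
            Set<String> reachedAtThisDepth = new HashSet<>();
            for (List<String> path : frontier) {
                String last = path.get(path.size() - 1);
                for (String neighbour : interactions.getOrDefault(last, List.<String>of())) {
                    if (visited.contains(neighbour)) {
                        continue; // a shorter path to this node already exists
                    }
                    List<String> extended = new ArrayList<>(path);
                    extended.add(neighbour);
                    if (neighbour.equals(target)) {
                        results.add(extended);
                        if (results.size() == maxResults) {
                            return results;
                        }
                    } else {
                        next.add(extended);
                        reachedAtThisDepth.add(neighbour);
                    }
                }
            }
            visited.addAll(reachedAtThisDepth);
            frontier = next;
        }
        return results;
    }

    public static void main(String[] args) {
        // Toy interaction network with two equally short routes from A to D.
        Map<String, List<String>> interactions = Map.of(
                "A", List.of("B", "C"),
                "B", List.of("D"),
                "C", List.of("D"),
                "D", List.of("E"));
        System.out.println(findShortestPaths(interactions, "A", "D", 4, 10)); // [[A, B, D], [A, C, D]]
    }
}
```

<p>Nodes are marked as visited only after a whole level has been expanded, so every path of the minimal length is kept instead of just the first one found. With Graph-algo you get this kind of search ready-made and running directly on the graph store.</p>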
<p>Even though I missed having a wider choice of algorithms, it’s really cool to have at least this small set already implemented, sparing you the low-level coding.
Apart from that, I’ve been thinking about how <strong>Bio4j could open a lot of doors for topology/network analysis around all the data it includes</strong>. Such analysis could otherwise be quite hard to perform for several reasons, such as the lack of data integration between different data sources and storage paradigms that limit topology/network analysis, among others.</p>
<p><strong>With Bio4j however, you just have to move around the nodes and get the information you’re looking for!</strong> ;)</p>
<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
<h2 id="comments">comments</h2>
<ul>
<li>
<p><strong>alper yilmaz</strong>
it’s getting more interesting.. :)
that’s what I meant by “data-mining” during our skype conference.. cool..</p>
</li>
<li>
<p><strong>Roji</strong>
I follow neo4j with much interest. It is a novel approach, however I think property searches are very important and neo4j is not very good at this. So, for example, implementing a complete social website with millions of users would not be feasible with neo4j, I think. I am not sure, of course. What is also interesting is the emergence of native XML databases. They also solve the impedance mismatch to a certain extent. However, their models are trees, not graphs; graphs are more general in this sense, but I think more optimizations are possible if you choose trees.</p>
<ul>
<li><strong>ppareja</strong>
Hi Roji,