ArangoDB Datasets

A Python package for loading pre-configured graph datasets into an ArangoDB instance.

Installation

pip install arango-datasets

Usage

from arango import ArangoClient
from arango_datasets import Datasets

# Connect to database
db = ArangoClient(hosts=...).db(username=..., password=..., verify=True)

# Connect to datasets
datasets = Datasets(db)

# List datasets
print(datasets.list_datasets())

# List more information about a particular dataset
print(datasets.dataset_info("FLIGHTS"))

# Load a dataset
datasets.load("FLIGHTS")
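Because `load` takes a dataset name, it can help to check that name against `list_datasets()` first. The following is a minimal sketch, not part of the package: `load_if_available` is a hypothetical helper, and it assumes `list_datasets()` returns an iterable of dataset name strings.

```python
def load_if_available(datasets, name):
    """Load `name` only if the Datasets object reports it as available.

    `datasets` is expected to be an arango_datasets.Datasets instance.
    Assumption: list_datasets() returns an iterable of name strings.
    """
    available = list(datasets.list_datasets())
    if name not in available:
        raise ValueError(
            f"Unknown dataset {name!r}; available: {sorted(available)}"
        )
    datasets.load(name)
    return name
```

Calling `load_if_available(datasets, "FLIGHTS")` then behaves like `datasets.load("FLIGHTS")`, but a misspelled name fails early with a message listing the valid choices.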

Notable Datasets

Synthea P100

Synthea is an open-source synthetic patient dataset that simulates health records for a diverse set of fictional individuals. It includes demographic, clinical, and social data such as diagnoses, medications, procedures, and encounters over a patient’s lifetime. The data is generated using realistic patterns derived from real-world healthcare statistics, enabling its use in research, development, and testing of health IT systems while preserving patient privacy.

Source: https://synthea.mitre.org/

Size: 145514 nodes, 311701 edges

print(datasets.dataset_info("SYNTHEA_P100"))

datasets.load("SYNTHEA_P100")
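After a load, the node and edge totals listed under Size can be compared against what is actually in the database. The sketch below is an illustration under stated assumptions: `count_nodes_and_edges` is a hypothetical helper, `db` is a python-arango database handle, and the totals only match the listed size if the database holds no other user collections.

```python
def count_nodes_and_edges(db):
    """Return (nodes, edges) totals across all non-system collections.

    `db` is expected to be a python-arango database handle, whose
    collections() call returns dicts with "name", "system" and "type" keys.
    """
    nodes = 0
    edges = 0
    for info in db.collections():
        if info["system"]:
            continue  # skip ArangoDB's internal collections
        count = db.collection(info["name"]).count()
        if info["type"] == "edge":
            edges += count
        else:
            nodes += count
    return nodes, edges
```

For example, `nodes, edges = count_nodes_and_edges(db)` can be run right after `datasets.load("SYNTHEA_P100")` on an otherwise empty database.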

Common Vulnerabilities and Exposures

This dataset contains information on Common Vulnerabilities and Exposures (CVE), providing details on known security vulnerabilities in software and hardware. It includes fields such as CVE ID, descriptions, severity scores (CVSS), affected products, and references. The dataset is useful for cybersecurity research, threat analysis, and vulnerability management, helping organizations track and mitigate security risks.

Source: https://www.kaggle.com/datasets/andrewkronser/cve-common-vulnerabilities-and-exposures

Size: 145506 nodes, 316967 edges

print(datasets.dataset_info("CVE"))

datasets.load("CVE")

Flights

The Flights dataset contains flight-related data, including information on routes, airports, and airlines. It is structured as a graph dataset in which airports are nodes and the flights between them are edges. This dataset is useful for demonstrating graph queries, shortest path analysis, and network connectivity.

Source: https://github.com/arangodb/example-datasets/tree/master/Data%20Loader

Size: 3375 nodes, 286463 edges

print(datasets.dataset_info("FLIGHTS"))

datasets.load("FLIGHTS")
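As an illustration of the shortest path analysis mentioned above, here is a minimal sketch of an AQL query issued through python-arango. It assumes the loaded data uses an edge collection named `flights` and airport document IDs such as `airports/JFK`; both names are assumptions, so check `dataset_info("FLIGHTS")` for the actual collection names. `shortest_route` is a hypothetical helper.

```python
# Assumption: the edge collection is called "flights".
SHORTEST_ROUTE_QUERY = """
FOR v IN OUTBOUND SHORTEST_PATH @start TO @target flights
    RETURN v._key
"""


def shortest_route(db, start_id, target_id):
    """Return the airport keys along one shortest path between two airports.

    `db` is expected to be a python-arango database handle;
    `start_id` and `target_id` are full document IDs.
    """
    cursor = db.aql.execute(
        SHORTEST_ROUTE_QUERY,
        bind_vars={"start": start_id, "target": target_id},
    )
    return list(cursor)
```

A call such as `shortest_route(db, "airports/JFK", "airports/LAX")` would return the keys of the airports visited, in order, if those documents exist under the assumed names.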

GDELT Open Intelligence

The GDELT Project (Global Database of Events, Language, and Tone) is an open dataset that monitors global news media in real-time. It captures and analyzes events, themes, emotions, and relationships across countries, organizations, and people. Covering millions of articles from various sources, GDELT provides insights into geopolitical trends, conflicts, and societal changes. The dataset is widely used in research, journalism, and AI applications for tracking global events and sentiment analysis.

Source: https://www.gdeltproject.org/

Size: 80047 nodes, 321819 edges

print(datasets.dataset_info("OPEN_INTELLIGENCE"))

datasets.load("OPEN_INTELLIGENCE")
