HATS renaming (#443)
* Initial renaming hipscat -> hats (#418)

* Initial renaming

* Add the data back...

* Rename test file.

* Update notebooks.

* Update requirement in branch

* Fiiiine

* Initial work toward properties file. (#422)

* Initial work toward properties file.

* Responses to code review comment.

* Fix reference to partition info constant (#423)

* Fix reference to partition info constant

* Remove unused import (and unused typedef)

* Fix tests for table properties, and add additional on creation.

* catalog name in margin generation

* Un-skip tests. Fix data so tests can pass. (#425)

* Fix reference to partition info constant

* Remove unused import (and unused typedef)

* Fix tests for table properties, and add additional on creation.

* catalog name in margin generation

* Un-skip tests. Fix data so tests can pass.

* change to spatial index

* regen test data

* fix unit tests

* regenerate test files with point map files

* unskip test

* fix mypy

* update index type

* fix review

* Update test data for dataset insertion (#440)

* Update test data for dataset insertion

* Update dependency.

* Fix nan references for numpy 2

* Update repo links

---------

Co-authored-by: Sean McGuire <[email protected]>
Co-authored-by: Sean McGuire <[email protected]>
3 people authored Oct 17, 2024
1 parent c478401 commit aa1780a
Showing 413 changed files with 2,801 additions and 3,106 deletions.
6 changes: 2 additions & 4 deletions .gitignore
@@ -156,8 +156,6 @@ _html/
# Project initialization script
.initialize_new_project.sh

# large, unused fits files
point_map.fits

# test notebook
dev/test.ipynb
dev/test.ipynb
docs/tutorials/pre_executed/data
10 changes: 5 additions & 5 deletions README.md
@@ -17,17 +17,17 @@

A framework to facilitate and enable spatial analysis for extremely large astronomical databases
(i.e. querying and crossmatching O(1B) sources). This package uses dask to parallelize operations across
multiple HiPSCat partitioned surveys.
multiple HATS partitioned surveys.

Check out our [ReadTheDocs site](https://lsdb.readthedocs.io/en/stable/)
for more information on partitioning, installation, and contributing.

See related projects:

* HiPSCat ([on GitHub](https://github.com/astronomy-commons/hipscat))
([on ReadTheDocs](https://hipscat.readthedocs.io/en/stable/))
* HiPSCat Import ([on GitHub](https://github.com/astronomy-commons/hipscat-import))
([on ReadTheDocs](https://hipscat-import.readthedocs.io/en/stable/))
* HATS ([on GitHub](https://github.com/astronomy-commons/hats))
([on ReadTheDocs](https://hats.readthedocs.io/en/stable/))
* HATS Import ([on GitHub](https://github.com/astronomy-commons/hats-import))
([on ReadTheDocs](https://hats-import.readthedocs.io/en/stable/))

## Contributing

10 changes: 5 additions & 5 deletions benchmarks/benchmarks.py
@@ -22,15 +22,15 @@


def load_small_sky():
return lsdb.read_hipscat(TEST_DIR / DATA_DIR_NAME / SMALL_SKY_DIR_NAME, catalog_type=lsdb.Catalog)
return lsdb.read_hats(TEST_DIR / DATA_DIR_NAME / SMALL_SKY_DIR_NAME, catalog_type=lsdb.Catalog)


def load_small_sky_order1():
return lsdb.read_hipscat(TEST_DIR / DATA_DIR_NAME / SMALL_SKY_ORDER1, catalog_type=lsdb.Catalog)
return lsdb.read_hats(TEST_DIR / DATA_DIR_NAME / SMALL_SKY_ORDER1, catalog_type=lsdb.Catalog)


def load_small_sky_xmatch():
return lsdb.read_hipscat(TEST_DIR / DATA_DIR_NAME / SMALL_SKY_XMATCH_NAME, catalog_type=lsdb.Catalog)
return lsdb.read_hats(TEST_DIR / DATA_DIR_NAME / SMALL_SKY_XMATCH_NAME, catalog_type=lsdb.Catalog)


def time_kdtree_crossmatch():
@@ -63,8 +63,8 @@ def time_box_filter_on_partition():


def time_create_midsize_catalog():
return lsdb.read_hipscat(BENCH_DATA_DIR / "midsize_catalog")
return lsdb.read_hats(BENCH_DATA_DIR / "midsize_catalog")


def time_create_large_catalog():
return lsdb.read_hipscat(BENCH_DATA_DIR / "large_catalog")
return lsdb.read_hats(BENCH_DATA_DIR / "large_catalog")
12 changes: 6 additions & 6 deletions docs/_static/lazy_diagram.svg
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -74,8 +74,8 @@

pygments_style = "sphinx"

# Cross-link hipscat documentation from the API reference:
# Cross-link hats documentation from the API reference:
# https://docs.readthedocs.io/en/stable/guides/intersphinx.html
intersphinx_mapping = {
"hipscat": ("http://hipscat.readthedocs.io/en/stable/", None),
"hats": ("http://hats.readthedocs.io/en/stable/", None),
}
2 changes: 1 addition & 1 deletion docs/developer/contributing.rst
@@ -55,7 +55,7 @@ the GitHub repository. The next steps assume the creation of branches and PRs ar
If you are (or expect to be) a frequent contributor, you should consider requesting
access to the `hipscat-friends <https://github.com/orgs/astronomy-commons/teams/hipscat-friends>`_
working group. Members of this GitHub group should be able to create branches and PRs directly
on LSDB, hipscat and hipscat-import, without the need of a fork.
on LSDB, hats and hats-import, without the need of a fork.

Create a branch
-------------------------------------------------------------------------------
23 changes: 11 additions & 12 deletions docs/getting-started.rst
@@ -62,14 +62,14 @@ for more information.
Loading a Catalog
~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's start by loading a HiPSCat formatted Catalog into LSDB. Use the :func:`lsdb.read_hipscat` function to
Let's start by loading a HATS formatted Catalog into LSDB. Use the :func:`lsdb.read_hats` function to
lazy load a catalog object. We'll pass in the URL to load the Zwicky Transient Facility Data Release 14
Catalog, and specify which columns we want to use from it.

.. code-block:: python
import lsdb
ztf = lsdb.read_hipscat(
ztf = lsdb.read_hats(
'https://data.lsdb.io/unstable/ztf/ztf_dr14/',
columns=["ra", "dec", "ps1_objid", "nobs_r", "mean_mag_r"],
)
@@ -94,7 +94,7 @@ usually see values).

Where to get Catalogs
~~~~~~~~~~~~~~~~~~~~~~~~~~
LSDB can load any catalogs in the HiPSCat format, locally or from remote sources. There are a number of
LSDB can load any catalogs in the HATS format, locally or from remote sources. There are a number of
catalogs available publicly to use from the cloud. You can see them with their URLs to load in LSDB at our
website `data.lsdb.io <https://data.lsdb.io>`_

@@ -107,7 +107,7 @@ If you have your own data not in this format, you can import it by following the
Performing Filters
~~~~~~~~~~~~~~~~~~~~~~~~~~

LSDB can perform spatial filters fast, taking advantage of HiPSCat's spatial partitioning. These optimized
LSDB can perform spatial filters fast, taking advantage of HATS's spatial partitioning. These optimized
filters have their own methods, such as :func:`cone_search <lsdb.catalog.Catalog.cone_search>`. For the list
of these methods see the full docs for the :func:`Catalog <lsdb.catalog.Catalog>` class.

@@ -132,7 +132,7 @@ get accurate results. This should be provided with the catalog by the catalog's

.. code-block:: python
gaia = lsdb.read_hipscat(
gaia = lsdb.read_hats(
'https://data.lsdb.io/unstable/gaia_dr3/gaia/',
columns=["ra", "dec", "phot_g_n_obs", "phot_g_mean_flux", "pm"],
margin_cache="https://data.lsdb.io/unstable/gaia_dr3/gaia_10arcs/",
@@ -166,13 +166,13 @@ Saving the Result
~~~~~~~~~~~~~~~~~~~~~~~~~~

For large results, it won't be possible to ``compute()`` since the full result won't be able to fit into memory.
So instead, we can run the computation and save the results directly to disk in hipscat format.
So instead, we can run the computation and save the results directly to disk in hats format.

.. code-block:: python
ztf_x_gaia.to_hipscat("./ztf_x_gaia")
ztf_x_gaia.to_hats("./ztf_x_gaia")
This creates the following HiPSCat Catalog on disk:
This creates the following HATS Catalog on disk:

.. code-block::
@@ -182,11 +182,10 @@ This creates the following HiPSCat Catalog on disk:
│ │ ├── Npix=57.parquet
│ │ └── ...
│ └── ...
├── _metadata
├── _common_metadata
├── catalog_info.json
├── partition_info.csv
└── provenance_info.json
├── _metadata
├── properties
└── partition_info.csv
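The getting-started page above walks through loading, filtering, and saving with the renamed API. As a minimal sketch under those names (the cone-search center, radius, keyword names, and output path are illustrative assumptions, not part of this diff):

```python
import lsdb

# Lazily load ZTF DR14 with a small column subset (read_hipscat is now read_hats).
ztf = lsdb.read_hats(
    "https://data.lsdb.io/unstable/ztf/ztf_dr14/",
    columns=["ra", "dec", "ps1_objid", "nobs_r", "mean_mag_r"],
)

# Spatial filters stay lazy and exploit the HATS partitioning; the center and
# radius below are illustrative values only.
ztf_cone = ztf.cone_search(ra=180.0, dec=10.0, radius_arcsec=600)

# Run the computation and write the result to disk in HATS format
# (to_hipscat is now to_hats); the output path is illustrative.
ztf_cone.to_hats("./ztf_cone")
```
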
Creation of Jupyter Kernel
--------------------------
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -11,7 +11,7 @@ large astronomical catalogs (e.g. querying and crossmatching O(1B) sources). It
data processing challenges, in particular those brought up by `LSST <https://www.lsst.org/about>`_.

Built on top of Dask to efficiently scale and parallelize operations across multiple distributed workers, it
uses the `HiPSCat <https://hipscat.readthedocs.io/en/stable/>`_ data format to efficiently perform spatial
uses the `HATS <https://hats.readthedocs.io/en/stable/>`_ data format to efficiently perform spatial
operations.

.. figure:: _static/gaia.png
3 changes: 2 additions & 1 deletion docs/requirements.txt
@@ -10,4 +10,5 @@ sphinx-autoapi
sphinx-copybutton
sphinx-book-theme
sphinx-design
git+https://github.com/astronomy-commons/hipscat.git@main
git+https://github.com/astronomy-commons/hats.git@main
git+https://github.com/astronomy-commons/hats-import.git@main
6 changes: 3 additions & 3 deletions docs/tutorials/exporting_results.ipynb
@@ -6,16 +6,16 @@
"source": [
"# Exporting results\n",
"\n",
"You can save the catalogs that result from running your workflow to disk, in parquet format, using the `to_hipscat` call. \n",
"You can save the catalogs that result from running your workflow to disk, in parquet format, using the `to_hats` call. \n",
"\n",
"You must provide a `base_catalog_path`, which is the output path for your catalog directory, and (optionally) a name for your catalog, `catalog_name`. The `catalog_name` is the catalog's internal name and therefore may differ from the catalog's base directory name. If the directory already exists and you want to overwrite its content set the `overwrite` flag to True. Do not forget to provide the necessary credentials, as `storage_options` to the UPath construction, when trying to export the catalog to protected remote storage.\n",
"\n",
"For example, to save a catalog that contains the results of crossmatching Gaia with ZTF to `\"./my_catalogs/gaia_x_ztf\"` one could run:\n",
"```python\n",
"gaia_x_ztf_catalog.to_hipscat(base_catalog_path=\"./my_catalogs/gaia_x_ztf\", catalog_name=\"gaia_x_ztf\")\n",
"gaia_x_ztf_catalog.to_hats(base_catalog_path=\"./my_catalogs/gaia_x_ztf\", catalog_name=\"gaia_x_ztf\")\n",
"```\n",
"\n",
"The HiPSCat catalogs on disk follow a well-defined directory structure:\n",
"The HATS catalogs on disk follow a well-defined directory structure:\n",
"\n",
"```\n",
"gaia_x_ztf/\n",
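A short, hedged sketch of the renamed export call this notebook describes (the output paths, the `overwrite` value, and the UPath storage options are illustrative; `gaia_x_ztf_catalog` is assumed to be the crossmatch result named in the text):

```python
# Local export with an explicit internal catalog name; overwrite any existing output.
gaia_x_ztf_catalog.to_hats(
    base_catalog_path="./my_catalogs/gaia_x_ztf",
    catalog_name="gaia_x_ztf",
    overwrite=True,
)

# For protected remote storage, credentials go in as storage options on a UPath;
# the bucket name and the anon flag below are illustrative.
from upath import UPath

remote_path = UPath("s3://my-bucket/catalogs/gaia_x_ztf", anon=False)
gaia_x_ztf_catalog.to_hats(base_catalog_path=remote_path, catalog_name="gaia_x_ztf")
```
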
6 changes: 3 additions & 3 deletions docs/tutorials/filtering_large_catalogs.ipynb
@@ -9,7 +9,7 @@
"source": [
"# Filtering large catalogs\n",
"\n",
"Large astronomical surveys contain a massive volume of data. Billion object, multi-terabyte sized catalogs are challenging to store and manipulate because they demand state-of-the-art hardware. Processing them is expensive, both in terms of runtime and memory consumption, and performing it in a single machine has become impractical. LSDB is a solution that enables scalable algorithm execution. It handles loading, querying, filtering and crossmatching astronomical data (of HiPSCat format) in a distributed environment. \n",
"Large astronomical surveys contain a massive volume of data. Billion object, multi-terabyte sized catalogs are challenging to store and manipulate because they demand state-of-the-art hardware. Processing them is expensive, both in terms of runtime and memory consumption, and performing it in a single machine has become impractical. LSDB is a solution that enables scalable algorithm execution. It handles loading, querying, filtering and crossmatching astronomical data (of HATS format) in a distributed environment. \n",
"\n",
"In this tutorial, we will demonstrate how to:\n",
"\n",
@@ -93,7 +93,7 @@
"outputs": [],
"source": [
"ztf_object_path = f\"{surveys_path}/ztf/ztf_dr14\"\n",
"ztf_object = lsdb.read_hipscat(ztf_object_path, columns=[\"ps1_objid\", \"ra\", \"dec\"])\n",
"ztf_object = lsdb.read_hats(ztf_object_path, columns=[\"ps1_objid\", \"ra\", \"dec\"])\n",
"ztf_object"
]
},
@@ -318,7 +318,7 @@
"id": "9a887b31",
"metadata": {},
"source": [
"We can stack a several number of filters, which are applied in sequence. For example, `catalog.box_search().polygon_search()` should result in a perfectly valid HiPSCat catalog containing the objects that match both filters."
"We can stack a several number of filters, which are applied in sequence. For example, `catalog.box_search().polygon_search()` should result in a perfectly valid HATS catalog containing the objects that match both filters."
]
},
{
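A hedged sketch of the chained filtering this notebook describes (the `surveys_path` value, box ranges, and polygon vertices are illustrative assumptions; the filter keyword names follow the current LSDB API):

```python
import lsdb

surveys_path = "https://data.lsdb.io/unstable"  # assumed value; not shown in this diff
ztf_object = lsdb.read_hats(
    f"{surveys_path}/ztf/ztf_dr14",
    columns=["ps1_objid", "ra", "dec"],
)

# Filters are lazy and can be chained; the box ranges and polygon vertices are
# illustrative (ra/dec in degrees).
stacked = ztf_object.box_search(ra=(40.0, 60.0), dec=(-30.0, -20.0)).polygon_search(
    [(42.0, -28.0), (58.0, -28.0), (50.0, -22.0)]
)

# Only this call reads and filters the data, returning an in-memory dataframe.
result = stacked.compute()
```
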
14 changes: 7 additions & 7 deletions docs/tutorials/getting_data.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Getting data into LSDB\n",
"\n",
"The most practical way to load data into LSDB is from catalogs in HiPSCat format, hosted locally or on a remote source. We recommend you to visit our own cloud repository, [data.lsdb.io](https://data.lsdb.io), where you are able to find large surveys publicly available to use."
"The most practical way to load data into LSDB is from catalogs in HATS format, hosted locally or on a remote source. We recommend you to visit our own cloud repository, [data.lsdb.io](https://data.lsdb.io), where you are able to find large surveys publicly available to use."
]
},
{
@@ -24,7 +24,7 @@
"source": [
"### Example: Loading Gaia DR3\n",
"\n",
"Let's get Gaia DR3 into our workflow, as an example. It is as simple as invoking `read_hipscat` with the respective catalog URL, which you can copy directly from our website."
"Let's get Gaia DR3 into our workflow, as an example. It is as simple as invoking `read_hats` with the respective catalog URL, which you can copy directly from our website."
]
},
{
@@ -33,7 +33,7 @@
"metadata": {},
"outputs": [],
"source": [
"gaia_dr3 = lsdb.read_hipscat(\"https://data.lsdb.io/unstable/gaia_dr3/gaia/\")\n",
"gaia_dr3 = lsdb.read_hats(\"https://data.lsdb.io/unstable/gaia_dr3/gaia/\")\n",
"gaia_dr3"
]
},
@@ -59,11 +59,11 @@
"source": [
"Note that it's important (and highly recommended) to:\n",
"\n",
"- **Pre-select a small subset of columns** that satisfies your scientific needs. Loading an unnecessarily large amount of data leads to computationally expensive and inefficient workflows. To see which columns are available before even having to invoke `read_hipscat`, please refer to the column descriptions in each catalog's section on [data.lsdb.io](https://data.lsdb.io).\n",
"- **Pre-select a small subset of columns** that satisfies your scientific needs. Loading an unnecessarily large amount of data leads to computationally expensive and inefficient workflows. To see which columns are available before even having to invoke `read_hats`, please refer to the column descriptions in each catalog's section on [data.lsdb.io](https://data.lsdb.io).\n",
"\n",
"- **Load catalogs with their respective margin caches**, when available. These margins are necessary to obtain accurate results in several operations such as joining and crossmatching. For more information about margins please visit our [Margins](margins.ipynb) topic notebook.\n",
"\n",
"Let's define the set of columns we need and add the margin catalog's path to our `read_hipscat` call."
"Let's define the set of columns we need and add the margin catalog's path to our `read_hats` call."
]
},
{
@@ -72,7 +72,7 @@
"metadata": {},
"outputs": [],
"source": [
"gaia_dr3 = lsdb.read_hipscat(\n",
"gaia_dr3 = lsdb.read_hats(\n",
" \"https://data.lsdb.io/unstable/gaia_dr3/gaia/\",\n",
" margin_cache=\"https://data.lsdb.io/unstable/gaia_dr3/gaia_10arcs/\",\n",
" columns=[\n",
@@ -99,7 +99,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When invoking `read_hipscat` only metadata information about that catalog (e.g. sky coverage, number of total rows and column schema) is loaded into memory! Notice that the ellipses in the previous catalog representation are just placeholders.\n",
"When invoking `read_hats` only metadata information about that catalog (e.g. sky coverage, number of total rows and column schema) is loaded into memory! Notice that the ellipses in the previous catalog representation are just placeholders.\n",
"\n",
"You will find that most use cases start with **LAZY** loading and planning operations, followed by more expensive **COMPUTE** operations. The data is only loaded into memory when we trigger the workflow computations, usually with a `compute` call.\n",
"\n",
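A hedged sketch of the lazy-loading pattern this notebook describes, combining the renamed `read_hats` call with a margin cache and a column subset (the cone-search center and radius are illustrative assumptions):

```python
import lsdb

# Lazy load: only metadata (sky coverage, row counts, column schema) is read here.
gaia_dr3 = lsdb.read_hats(
    "https://data.lsdb.io/unstable/gaia_dr3/gaia/",
    margin_cache="https://data.lsdb.io/unstable/gaia_dr3/gaia_10arcs/",
    columns=["ra", "dec", "phot_g_n_obs", "phot_g_mean_flux", "pm"],
)

# Nothing is pulled into memory until a compute() (or a write) triggers the workflow;
# the cone-search values below are illustrative.
nearby = gaia_dr3.cone_search(ra=280.0, dec=-60.0, radius_arcsec=3600)
result = nearby.compute()
```
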
24 changes: 12 additions & 12 deletions docs/tutorials/import_catalogs.ipynb
@@ -7,12 +7,12 @@
"collapsed": false
},
"source": [
"# Importing catalogs to HiPSCat format\n",
"# Importing catalogs to HATS format\n",
"\n",
"This notebook presents two modes of importing catalogs to HiPSCat format:\n",
"This notebook presents two modes of importing catalogs to HATS format:\n",
"\n",
"1. `lsdb.from_dataframe()` method: helpful to load smaller catalogs from a single dataframe. data should have fewer than 1-2 million rows and the pandas dataframe should be less than 1-2G in-memory. if your data is larger, the format is complicated, you need more flexibility, or you notice any performance issues when importing with this mode, use the next mode.\n",
"2. `hipscat-import` package: for large datasets (1G - 100s of TB). this is a purpose-built map-reduce pipeline for creating hipscat catalogs from various datasets. in this notebook, we use a very basic dataset and basic import options. please see [the full package documentation](https://hipscat-import.readthedocs.io/) if you need to do anything more complicated."
"2. `hats-import` package: for large datasets (1G - 100s of TB). this is a purpose-built map-reduce pipeline for creating HATS catalogs from various datasets. in this notebook, we use a very basic dataset and basic import options. please see [the full package documentation](https://hats-import.readthedocs.io/) if you need to do anything more complicated."
]
},
{
@@ -119,8 +119,8 @@
" threshold=100,\n",
")\n",
"\n",
"# Save it to disk in HiPSCat format\n",
"catalog.to_hipscat(f\"{tmp_dir.name}/from_dataframe\")"
"# Save it to disk in HATS format\n",
"catalog.to_hats(f\"{tmp_dir.name}/from_dataframe\")"
]
},
{
@@ -130,15 +130,15 @@
"collapsed": false
},
"source": [
"## HiPSCat import pipeline"
"## HATS import pipeline"
]
},
{
"cell_type": "markdown",
"id": "3842520c",
"metadata": {},
"source": [
"Let's install the latest release of hipscat-import:"
"Let's install the latest release of hats-import:"
]
},
{
@@ -153,7 +153,7 @@
},
"outputs": [],
"source": [
"!pip install git+https://github.com/astronomy-commons/hipscat-import.git@main --quiet"
"!pip install git+https://github.com/astronomy-commons/hats-import.git@main --quiet"
]
},
{
@@ -169,8 +169,8 @@
"outputs": [],
"source": [
"from dask.distributed import Client\n",
"from hipscat_import.catalog.arguments import ImportArguments\n",
"from hipscat_import.pipeline import pipeline_with_client"
"from hats_import.catalog.arguments import ImportArguments\n",
"from hats_import.pipeline import pipeline_with_client"
]
},
{
@@ -226,7 +226,7 @@
},
"outputs": [],
"source": [
"from_dataframe_catalog = lsdb.read_hipscat(f\"{tmp_dir.name}/from_dataframe\")\n",
"from_dataframe_catalog = lsdb.read_hats(f\"{tmp_dir.name}/from_dataframe\")\n",
"from_dataframe_catalog"
]
},
@@ -242,7 +242,7 @@
},
"outputs": [],
"source": [
"from_import_pipeline_catalog = lsdb.read_hipscat(f\"{tmp_dir.name}/from_import_pipeline\")\n",
"from_import_pipeline_catalog = lsdb.read_hats(f\"{tmp_dir.name}/from_import_pipeline\")\n",
"from_import_pipeline_catalog"
]
},
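A hedged sketch of the two import modes this notebook contrasts, using the renamed `lsdb.from_dataframe`, `to_hats`, `read_hats`, and `hats_import` entry points (the input file, catalog name, and worker count are illustrative assumptions):

```python
import pandas as pd
import lsdb

# Mode 1: a small catalog from a single in-memory dataframe. The input file and
# catalog name are illustrative; threshold matches the notebook's partitioning value.
df = pd.read_csv("my_objects.csv")  # expected to contain ra/dec columns
catalog = lsdb.from_dataframe(df, catalog_name="from_dataframe", threshold=100)
catalog.to_hats("./from_dataframe")

# Mode 2, for large datasets, runs the hats-import pipeline under a Dask client,
# using the renamed imports shown in the notebook:
#   from dask.distributed import Client
#   from hats_import.catalog.arguments import ImportArguments
#   from hats_import.pipeline import pipeline_with_client
#   with Client(n_workers=4) as client:
#       pipeline_with_client(ImportArguments(...), client)

# Either output can then be read back lazily with the renamed reader.
from_dataframe_catalog = lsdb.read_hats("./from_dataframe")
```
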
(Diffs for the remaining changed files are not shown.)
