Automatic generation of Meta-Data for a dataset
Table of Contents
- Motivation
- How to run the application
- Project Structure
- Methodology
- Where this Library fits in the overall architecture
- Approach to determine Meta-Data
- Column Names
- File Path
- Units
- Temporal Coverage
- Granularity
- Spatial Coverage
- File Formats Available
- Is Public Dataset
poetry run uvicorn app.main:app --reload --port 8005
docker compose up --build
Files related to the application are in the app or tests directories.
Application parts are:
app
├── api - web related stuff.
│ └── routes - web routes.
├── core - application configuration, startup events, logging.
├── models - pydantic models for this application.
├── services - logic that is not just crud related.
└── main.py - FastAPI application creation and configuration.
- How are columns categorised?
- The library categorises columns into the following categories:
Column Entity | Columns |
---|---|
Date-Time | non_calendar_year, calender_year, other_year, quarter, month, date |
Geography | country, state, district |
Unit | unit |
Note | note |
Unmapped | Any unmapped columns |
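As a rough illustration of the categorisation above, the following sketch groups column names by entity. The `COLUMN_CATEGORIES` mapping and the `categorise_columns` function are hypothetical names, and exact, case-insensitive name matching is an assumption; the library itself may match on keywords instead.

```python
# Hypothetical mapping from column entity to recognised column names,
# copied from the categories listed above (not the library's actual config).
COLUMN_CATEGORIES = {
    "Date-Time": ["non_calendar_year", "calender_year", "other_year",
                  "quarter", "month", "date"],
    "Geography": ["country", "state", "district"],
    "Unit": ["unit"],
    "Note": ["note"],
}


def categorise_columns(columns):
    """Group column names by entity; unrecognised names go to 'Unmapped'."""
    result = {entity: [] for entity in COLUMN_CATEGORIES}
    result["Unmapped"] = []
    for column in columns:
        name = column.strip().lower()  # assumption: case-insensitive exact match
        for entity, names in COLUMN_CATEGORIES.items():
            if name in names:
                result[entity].append(column)
                break
        else:
            # No entity claimed this column.
            result["Unmapped"].append(column)
    return result
```

Calling `categorise_columns(["State", "calender_year", "value"])` would place `State` under Geography, `calender_year` under Date-Time, and `value` under Unmapped.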
- General Workflow
graph LR
    A[Dataset] --> B{Unit column exists?}
    B -- NO --> C(RETURN Null String)
    B -- YES --> D[Get all unique units from UNIT column]
    D --> E[Prepare list of all separate units]
    E --> F(RETURN all units as STRING SEPARATED WITH COMMAS)
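The unit workflow above can be sketched as follows. This is a minimal sketch under several assumptions: the dataset is represented as a plain dict of column name to list of values (standing in for a DataFrame), the "Null String" is taken to mean an empty string, and a single cell may hold several comma-separated units. The function name `get_units` is hypothetical.

```python
def get_units(dataset, unit_column="unit"):
    """Return all distinct units as one comma-separated string.

    Returns "" when the dataset has no unit column.
    """
    if unit_column not in dataset:
        return ""
    units = []
    for cell in dataset[unit_column]:
        if cell is None:
            continue
        # Assumption: one cell may list several units separated by commas.
        for unit in str(cell).split(","):
            unit = unit.strip()
            if unit and unit not in units:
                units.append(unit)  # keep first-seen order
    return ", ".join(units)
```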
- General Workflow
flowchart LR
    A(Dataset) --> B{Year column exists?}
    B -- NO --> C(RETURN Null String)
    B -- YES --> D[Calendar / Non-Calendar Year Columns]
    D --> E{Years are in sequence?}
    E -- YES --> F(RETURN string representation of range \n example : 2012 to 2020 or \n 2012-13 to 2020-21)
    E -- NO --> G(RETURN comma separated values for all years, \n example : 2012, 2015, 2018 or \n 2012-13, 2015-16, 2018-19)
Notes:
- Temporal coverage is determined from the year column, so a dataset without one has no temporal coverage.
- If both calendar-year and non-calendar-year columns are present in the dataset, priority is given to the calendar year.
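The temporal coverage rules above could look roughly like this in code. It is only a sketch: the dataset is modelled as a dict of column name to list of values, the column names `calender_year` and `non_calendar_year` are taken from the column categories, every year value is assumed to start with a four-digit year (for example `2012` or `2012-13`), and the function name `temporal_coverage` is hypothetical.

```python
def temporal_coverage(dataset):
    """Return a range string for consecutive years, else a comma-separated list."""
    # Calendar year takes priority over non-calendar year.
    for column in ("calender_year", "non_calendar_year"):
        if column in dataset:
            years = sorted({str(value).strip() for value in dataset[column]})
            break
    else:
        return ""  # no year column present
    if not years:
        return ""
    # Assumption: each value starts with a four-digit year, e.g. "2012-13".
    starts = [int(year[:4]) for year in years]
    in_sequence = all(b - a == 1 for a, b in zip(starts, starts[1:]))
    if in_sequence and len(years) > 1:
        return f"{years[0]} to {years[-1]}"
    return ", ".join(years)
```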
- General Workflow
flowchart LR
    A(Dataset) --> B{Any Date-Time or \nGeography columns exist?}
    B -- NO --> C(RETURN Null String)
    B -- YES --> D[Map all column levels in \nsorted order for respective domains]
    D --> E[Map the column groups according to \nthe Granularity naming convention]
    E --> F(RETURN comma separated values of all Granularities \n example : Quarterly, District)
Notes:
- Granularity is calculated for 2 domains:
  - Geography
  - Date-Time
- config.py defines the granularity ranks for each domain.
- config.py also lists the keywords used to detect granularity in datasets.
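One possible reading of the granularity workflow is sketched below. The rank order and the labels other than "Quarterly" and "District" are illustrative guesses, not the values in config.py, and the sketch assumes that only the finest level present in each domain is reported. `GRANULARITY_RANKS` and `granularity` are hypothetical names.

```python
# Hypothetical ranks per domain, finest level first, paired with a display
# label. The real ranks and keywords are defined in config.py.
GRANULARITY_RANKS = {
    "Date-Time": [
        ("date", "Daily"),
        ("month", "Monthly"),
        ("quarter", "Quarterly"),
        ("calender_year", "Annual"),
        ("non_calendar_year", "Annual"),
    ],
    "Geography": [
        ("district", "District"),
        ("state", "State"),
        ("country", "Country"),
    ],
}


def granularity(columns):
    """Return the finest granularity found in each domain, comma separated."""
    present = {column.strip().lower() for column in columns}
    labels = []
    for domain in ("Date-Time", "Geography"):
        for column, label in GRANULARITY_RANKS[domain]:
            if column in present:
                labels.append(label)
                break  # assumption: report only the finest level per domain
    return ", ".join(labels)
```

With columns `calender_year`, `quarter`, `state` and `district`, this sketch yields `Quarterly, District`, matching the example in the workflow.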
The cases for Spatial Coverage are listed below:
Spatial Location | Dataset with categories as | Methodology | Spatial Coverage |
---|---|---|---|
Countries | India, Pakistan, China, etc | Country | |
Specific Country | India | represent it with the specific Country Name | India |
States of a Country | Andhra Pradesh, Assam, etc | States of India | |
Regions of a country | South India, NE states etc | Regions of India | |
Specific State of a country | Andhra Pradesh | represent it with the specific State Name | Andhra Pradesh |
Districts of a State/ States | Adilabad, Hyderabad etc | Districts of Telangana or Districts of India | |
Specific District of a state | Hyderabad | represent it with specific District Name | Hyderabad |
- General Workflow
flowchart LR
    A(Dataset) --> B{Geographical columns exist?}
    B -- NO --> C(RETURN Default Value as INDIA)
    B -- YES --> D[Sort the order of different \nGeographical Levels]
    D --> E(RETURN Value of biggest order of Geographical Column \nwith proper naming convention)
Notes:
- This library currently supports only the Country, State and District levels of Spatial Coverage.
- Geographic columns are mapped to levels by their column names, not their values, so changing a column name changes the mapping.
- If there is no Geographic column, the result defaults to INDIA.
- The spatial coverage order, keyword mapping and naming convention are defined in config.py.
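The cases in the table can be approximated with the sketch below. It rests on several assumptions that the source does not confirm: the dataset is a dict of column name to list of values, the finest geographic column present decides the result, a single distinct value is returned as-is, and multiple values are described relative to the parent level (falling back to the default country). `GEOGRAPHY_LEVELS`, `PLURALS` and `spatial_coverage` are hypothetical names; the real order and naming convention are in config.py.

```python
# Hypothetical level order, broadest first.
GEOGRAPHY_LEVELS = ["country", "state", "district"]
PLURALS = {"state": "States", "district": "Districts"}


def spatial_coverage(dataset, default="India"):
    """Describe the spatial coverage of a dataset from its geographic columns."""
    present = [level for level in GEOGRAPHY_LEVELS if level in dataset]
    if not present:
        return default  # no geographic column at all
    level = present[-1]  # assumption: the finest level present decides
    values = {str(value).strip() for value in dataset[level]}
    if len(values) == 1:
        return values.pop()  # e.g. "Andhra Pradesh" or "Hyderabad"
    if level == "country":
        return "Country"
    # Name the parent: the single value of the next broader column if there
    # is exactly one, otherwise the default country.
    parent = default
    broader = GEOGRAPHY_LEVELS[GEOGRAPHY_LEVELS.index(level) - 1]
    if broader in dataset:
        parents = {str(value).strip() for value in dataset[broader]}
        if len(parents) == 1:
            parent = parents.pop()
    return f"{PLURALS[level]} of {parent}"
```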
Notes:
- The file format is read from the file name.
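A minimal sketch of reading the format from the file name, assuming the format is simply the file extension; the function name `file_format` is hypothetical and the library may normalise the result differently.

```python
from pathlib import Path


def file_format(file_name):
    """Return the lower-cased extension of a file name, without the dot."""
    # Assumption: the format is the final extension of the file name.
    return Path(file_name).suffix.lstrip(".").lower()
```

For example, `file_format("data/rainfall.CSV")` gives `csv`, and a name without an extension gives an empty string.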