diff --git a/docs/public/assets/pages/docs/cli/export-cypher.png b/docs/public/assets/pages/docs/cli/export-cypher.png
new file mode 100644
index 000000000..942d4823e
Binary files /dev/null and b/docs/public/assets/pages/docs/cli/export-cypher.png differ
diff --git a/docs/public/assets/pages/docs/cli/export-excel.png b/docs/public/assets/pages/docs/cli/export-excel.png
new file mode 100644
index 000000000..41d0d35f9
Binary files /dev/null and b/docs/public/assets/pages/docs/cli/export-excel.png differ
diff --git a/docs/public/assets/pages/docs/cli/export-gephi.png b/docs/public/assets/pages/docs/cli/export-gephi.png
new file mode 100644
index 000000000..334d9cf4e
Binary files /dev/null and b/docs/public/assets/pages/docs/cli/export-gephi.png differ
diff --git a/docs/public/assets/pages/docs/cli/mapping-result.png b/docs/public/assets/pages/docs/cli/mapping-result.png
new file mode 100644
index 000000000..ca2296fb6
Binary files /dev/null and b/docs/public/assets/pages/docs/cli/mapping-result.png differ
diff --git a/docs/src/pages/docs/cli.mdx b/docs/src/pages/docs/cli.mdx
index c4176caa0..b52deb155 100644
--- a/docs/src/pages/docs/cli.mdx
+++ b/docs/src/pages/docs/cli.mdx
@@ -3,60 +3,250 @@ layout: '@layouts/DocsLayout.astro'
 title: CLI
 ---
 
-# Command-Line Functions
+# CLI
 
-Many of the functions of _followthemoney_ can be used interactively or in scripts via the command line. Please first refer to the [Aleph documentation](https://docs.aleph.occrp.org/developers/followthemoney/ftm) for an intro to the `ftm` utility.
+The `ftm` command-line tool can be used to generate, process and export streams of entities in a line-based JSON format. Typical uses would include:
 
-Key to understanding the `ftm` tool is the notion of [streams](/docs#streams): entities can be transferred between programs and processing steps as a series of JSON objects, one per line. This notion is supported by the related [alephclient](https://docs.aleph.occrp.org/developers/alephclient) command, which can serve as a source, and a sink for entity streams, backed by the Aleph API.
+* Generating FollowTheMoney entities by applying an [entity mapping to structured data tables](/docs/mappings) (CSV, SQL).
+* Converting an existing stream of FollowTheMoney entities into another format, such as CSV, Excel, Gephi GEXF or Neo4J's Cypher language.
+* Converting data in complex formats, such as the Open Contracting Data Standard, into FollowTheMoney entities.
 
-## Examples
+## Installation
 
-The command line sequence below uses shell pipes to a) [map data](/docs/mappings) into entities from a database, b) apply a [namespace](/docs/namespace) to the entity IDs, c) aggregate [entity fragments](/docs/fragments) created by the mapping, and d) export the resulting entity stream into a sequence of CYPHER statements that can be executed on a Neo4J database to generate a property graph:
+To install `ftm`, you need to have Python 3 installed and working on your computer. You may also want to create a virtual environment using virtualenv or pyenv. With that done, type:
 
 ```bash
-ftm map companies_from_db.yml | \
-  ftm sign -s my_namespace | \
-  ftm aggregate | \
-  ftm export-cypher -o graph.cypher
+pip install followthemoney
+ftm --help
 ```
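+
+To confirm that the installation works, you can pipe a single hand-written entity through `ftm validate`, which re-parses and checks the data. The entity below is a minimal sketch for illustration; real entity IDs are usually generated hashes:
+
+```bash
+echo '{"id": "test-entity", "schema": "Person", "properties": {"name": ["Jane Doe"]}}' | ftm validate
+```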
 
-Here's another example that fetches pre-generated entities from a URL and loads
-them into a local Aleph instance:
+### Optional: Enhanced transliteration support
+
+One of the jobs of followthemoney is to transliterate text from various alphabets into the Latin script to support the comparison of names. The normal tool used for this is prone to fail with certain alphabets, e.g. the Azeri language. For that reason, we recommend also installing ICU (International Components for Unicode).
+
+On a Debian-based Linux system, installing ICU is relatively simple:
 
 ```bash
-export URL=https://public.data.occrp.org/datasets/icij/panama_papers.ijson
-curl -s $URL | \
-  ftm validate | \
-  alephclient write-entities -f icij_panama_papers
+apt install libicu-dev
+pip install pyicu
 ```
 
-## Reference
+The macOS version of installing ICU is a bit more complicated, and requires you to have Homebrew installed:
 
-Please refer to the output of `ftm --help` for a detailed reference of the `ftm` CLI:
+```bash
+brew install icu4c
+export CFLAGS=-I/usr/local/opt/icu4c/include
+export LDFLAGS=-L/usr/local/opt/icu4c/lib
+export PATH=$PATH:/usr/local/opt/icu4c/bin
+pip install pyicu
+```
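+
+Installation of PyICU is a common source of trouble, so a quick smoke test can be worth it. The sample string below is arbitrary Cyrillic-script text chosen purely for illustration:
+
+```bash
+python -c "from icu import Transliterator; print(Transliterator.createInstance('Any-Latin').transliterate('Азәрбајҹан'))"
+```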
+
+## Executing a data mapping
+
+Probably the most common task for `ftm` is to generate FollowTheMoney entities from some structured data source. This is done using a YAML-formatted mapping file, [described here](/docs/mappings). With such a YAML file in hand, you can generate entities like this:
+
+```bash
+curl -o md_companies.yml https://raw.githubusercontent.com/alephdata/aleph/main/mappings/md_companies.yml
+ftm map md_companies.yml
+```
-Usage: ftm [OPTIONS] COMMAND [ARGS]...
-
-  Utility for FollowTheMoney graph data
 
+This will yield a line-based JSON stream of every company in Moldova, their directors and principal shareholders.
-Options:
-  --help  Show this message and exit.
 
+Screenshot of a terminal window. The terminal shows the output of the `ftm map` command to generate the Moldovan company data.
 
-Commands:
-  aggregate          Aggregate multiple fragments of entities
-  dump-model         Export the current schema model
-  export-csv         Export to CSV
-  export-cypher      Export to Cypher script
-  export-excel       Export to Excel
-  export-gexf        Export to GEXF (Gephi) format
-  export-neo4j-bulk  Export to Neo4J bulk import
-  export-rdf         Export to RDF NTriples
-  import-vis         Load a .VIS file and get entities
-  map                Execute a mapping file and emit objects
-  map-csv            Map CSV data from stdin and emit objects
-  pretty             Format a stream of entities to make it readable
-  sieve              Filter out parts of entities.
-  sign               Apply a HMAC signature to entity IDs
-  sorted-aggregate   Aggregate sorted fragments of entities
-  validate           Re-parse and validate the given data
+You might note, however, that this actually generates multiple entity fragments for each company (i.e. multiple entities with the same ID). This is due to the way the md_companies mapping is written: each query section generates a partial company record. To merge these fragments into whole entities, you need to perform entity aggregation:
+
+```bash
+curl -o md_companies.yml https://raw.githubusercontent.com/alephdata/aleph/main/mappings/md_companies.yml
+ftm map md_companies.yml | ftm aggregate > moldova.ijson
+```
+
+The call to `ftm aggregate` will retain the entire dataset in memory, which is impossible for large databases. In such cases, it's recommended to use an on-disk entity aggregation tool, `followthemoney-store`.
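+
+With `followthemoney-store` installed (see [below](#aggregating-entities-using-ftm-store)), the same pipeline can run on disk instead of in memory. This sketch reuses the store commands documented later on this page:
+
+```bash
+ftm map md_companies.yml | ftm store write -d md_companies
+ftm store iterate -d md_companies > moldova.ijson
+```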
+
+### Loading data from a local CSV file
+
+One peculiarity of `ftm map` is that the source data is referenced within the YAML mapping file as an absolute URL. While this makes sense for data sourced from a SQL database or a public CSV file, you might sometimes want to map a local CSV file instead. For this, a modified version of `ftm map` is provided: `ftm map-csv`. It ignores the source URLs specified in the mapping and reads data from standard input:
+
+```bash
+cat people_of_interest.csv | ftm map-csv people_of_interest.yml | ftm aggregate
+```
+
+## Exporting entities to Excel or CSV
+
+FollowTheMoney data can be exported to tabular formats, such as modern Excel (XLSX) files and comma-separated values (CSV). Since each schema of entities has a different set of properties, it makes sense to turn each schema into a separate table: `People` go into one, `Directorships` into another.
+
+To export to an Excel file, use the `ftm export-excel` command:
+
+```bash
+curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
+cat us_ofac.ijson | ftm validate | ftm export-excel -o OFAC.xlsx
+```
+
+Since writing the binary data of an Excel file to standard output is awkward, it is mandatory to specify a file name with the `-o` option.
+
+Screenshot of Microsoft Excel showing the export from the example above. The Excel file has multiple sheets, one for each entity type (e.g. People, Companies, and Ownerships).
+
+  When exporting to Excel format, it's easy to generate a workbook larger than what Microsoft Excel and similar office programs can actually open. Only export small and mid-size datasets.
+
+When exporting to CSV format using `ftm export-csv`, the exporter will usually generate multiple output files, one for each schema of entities present in the input stream. To handle this, it expects to be given a directory name:
+
+```bash
+curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
+cat us_ofac.ijson | ftm validate | ftm export-csv -o OFAC/
+```
+
+In the given directory, you will find files named `Person.csv`, `LegalEntity.csv`, `Vessel.csv`, etc.
+
+## Exporting data to a network graph
+
+FollowTheMoney sees every unit of information as an entity with a set of properties. To analyse this information as a network with nodes and edges, we need to decide what logic should govern the transformation of entities into nodes and edges. Different strategies are available:
+
+* Some entity schemata, such as `Directorship`, `Ownership`, `Family` or `Payment`, contain annotations that define how they can be transformed into an edge with a source and target.
+* Entities also naturally reference others. For example, an `Email` has an `emitters` property that refers to a `LegalEntity`, the sender. The `emitters` property connects the two entities and can also be turned into an edge.
+* Finally, some types of properties (e.g. `email`, `iban`, `names`) can be formed into nodes, with an edge drawn from each entity that shares the property value. For example, an `address` node for "40 Wall Street" would show links to all the companies registered there, or a node representing the name "Frank Smith" would connect all the documents mentioning that name. It rarely makes sense to turn all property types into nodes, so the set of types that should be reified can be passed as options to the graph exporter.
+
+### Cypher commands for Neo4J
+
+[Neo4J](https://neo4j.com/) is a popular open source graph database that can be queried and edited [using the Cypher language](https://neo4j.com/docs/cypher-refcard/current/). It can be used as a database backend or queried directly to perform advanced analysis, e.g. to find all paths between two entities.
+
+The example below uses Neo4J's `cypher-shell` command to load the US sanctions list into a local instance of the database:
+
+```bash
+curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
+cat us_ofac.ijson | ftm export-cypher | cypher-shell -u user -p password
+```
+
+Screenshot of FtM entities imported to a Neo4J instance.
+
+By default, this will only create explicit edges based on entity-to-entity relationships. If you want to reify specific property types, use the `-e` option:
+
+```bash
+cat us_ofac.ijson | ftm export-cypher -e name -e iban -e entity -e address
+```
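+
+Once loaded, the data can be explored in the Neo4J browser or in `cypher-shell`. As a sketch, this assumes the exporter labels nodes with their FollowTheMoney schema names (as with `Page` in the clean-up queries below):
+
+```
+MATCH (p:Person)-[r]-(c:Company) RETURN p, r, c LIMIT 25;
+```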
+
+When working with file-based datasets, you may want to delete folder hierarchies from the imported data in Neo4J, to avoid file co-location biasing path and density analyses:
+
+```
+// Delete folder hierarchies:
+MATCH ()-[r:ANCESTORS]-() DELETE r;
+MATCH ()-[r:PARENT]-() DELETE r;
+// Delete entities representing individual pages:
+MATCH (n:Page) DETACH DELETE n;
+// Delete names, emails or addresses only used once:
+MATCH (n:name) WHERE size((n)--()) <= 1 DETACH DELETE (n);
+MATCH (n:email) WHERE size((n)--()) <= 1 DETACH DELETE (n);
+MATCH (n:address) WHERE size((n)--()) <= 1 DETACH DELETE (n);
+// ... and so on, for all reified value types ...
+```
+
+At any time, you can flush the entire Neo4J database and start from scratch:
+
+```
+MATCH (n) DETACH DELETE n;
+```
+
+#### Bulk loading data
+
+Another option for loading data into Neo4J is to export a set of entities into CSV files and then use the `neo4j-admin import` command to load them into an empty database. This requires shutting down the Neo4J server and then running a command that will write the new database.
+
+To generate data in CSV format suitable for Neo4J import, use the following command:
+
+```bash
+cat us_ofac.ijson | ftm export-neo4j-bulk -o folder_name -e iban -e entity -e address
+```
+
+This will generate a set of CSV files in a folder, along with a shell script that contains the `neo4j-admin` import command needed to load the data into a graph store.
+
+### GEXF for Gephi/Sigma.js
+
+[GEXF](https://gephi.org/gexf/format/) (Graph Exchange XML Format) is a file format used by the network analysis software [Gephi](https://gephi.org/) and other tools developed in the periphery of the [Media Lab at Sciences Po](http://tools.medialab.sciences-po.fr/). Gephi is particularly suited to quantitative analysis of graphs with tens of thousands of nodes. It can calculate network metrics like centrality or PageRank, and generate complex visual layouts.
+
+The command line works analogously to the Neo4J export, also accepting the `-e` flag for property types that should be turned into nodes:
+
+```bash
+curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
+cat us_ofac.ijson | ftm validate | ftm export-gexf -e iban -o ofac.gexf
+```
+
+Screenshot of Gephi. A small trove of emails has been visualized as a network. The entity schema type has been used to color nodes, while node size reflects the number of inbound links (i.e. in-degree).
+
+## Exporting entities to RDF/Linked Data
+
+Entity streams of FollowTheMoney data can also be exported to linked data in the `NTriples` format:
+
+```bash
+curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
+cat us_ofac.ijson | ftm validate | ftm export-rdf
+```
+
+It is unclear to the author why this functionality exists; it was just really easy to implement. For those developers who enjoy working with RDF, it might be worthwhile to point out that the underlying ontology (FollowTheMoney) is also regularly published in [RDF/XML](https://followthemoney.tech/ns/ftm.xml) format.
+
+By default, the RDF exporter tries to map each entity property to a fully-qualified RDF predicate. The schema definitions include some mappings to FOAF and similar ontologies.
+
+## Importing Open Contracting data
+
+The [Open Contracting Data Standard](https://standard.open-contracting.org/latest/en/) (OCDS) is commonly serialised as a series of JSON objects. `ftm` includes a function to transform a stream of OCDS objects into `Contract` and `ContractAward` entities. This was developed in particular to import data from the DIGIWHIST [OpenTender.eu](https://opentender.eu/all/download) site, so other implementations of OCDS may require extending the importer to accommodate their formats.
+
+Here's how you can convert all Cyprus government procurement data to FollowTheMoney objects:
+
+```bash
+curl -o CY_ocds_data.json.tar.gz https://opentender.eu/data/files/CY_ocds_data.json.tar.gz
+tar xvfz CY_ocds_data.json.tar.gz
+cat CY_ocds_data.json | ftm import-ocds | ftm aggregate > cy_contracts.ijson
+```
+
+Depending on how large the OCDS dataset is, you may want to use `followthemoney-store` instead of `ftm aggregate`.
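+
+For example, the imported entities could be written to `followthemoney-store` directly, using the store commands described in the next section:
+
+```bash
+cat CY_ocds_data.json | ftm import-ocds | ftm store write -d cy_contracts
+```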
+
+## Aggregating entities using ftm-store
+
+While the method of streaming FollowTheMoney entities is very convenient, there are situations where not all information about an entity is known at the time it is generated. For example, think of a [mapping](/docs/mappings) that loads company names from one CSV file, while the corresponding addresses are in a second, separate CSV table. In such cases, it is easier to generate two entities with the same ID and to merge them later.
+
+Merging such entity fragments requires sorting all the entities in the given dataset by their ID in order to aggregate their properties. For small datasets, this can be done in application memory using the `ftm aggregate` command.
+
+Once the dataset size approaches the amount of available memory, however, sorting must be performed on disk. This is also true when entity fragments are generated on different nodes of a computing cluster.
+
+For this purpose, `followthemoney-store` is available as a Python library and a command line tool. It can use any SQL database as a backend, with a local SQLite file as the default. When using PostgreSQL as a backend, `followthemoney-store` can use the database's built-in upsert functionality, which makes it more performant than the other backends.
+
+To use `followthemoney-store` with SQLite, install it like this:
+
+```bash
+pip install followthemoney-store
+```
+
+For PostgreSQL support, use the following settings:
+
+```bash
+pip install followthemoney-store[postgresql]
+export FTM_STORE_URI=postgresql://localhost/followthemoney
+```
+
+Once installed, you can operate the `followthemoney-store` command in read or write mode:
+
+```bash
+curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
+cat us_ofac.ijson | ftm store write -d us_ofac
+ftm store iterate -d us_ofac | alephclient write-entities -f us_ofac
+ftm store delete -d us_ofac
+```
+
+  When aggregating entities with large fragments of text, a size limit applies. By default, no entity is allowed to grow larger than 50MB of raw text. Additional text fragments are discarded with a warning.
+
diff --git a/docs/src/pages/docs/mappings.mdx b/docs/src/pages/docs/mappings.mdx
index 3cca4b261..2dd59e50d 100644
--- a/docs/src/pages/docs/mappings.mdx
+++ b/docs/src/pages/docs/mappings.mdx
@@ -5,6 +5,453 @@ title: Mappings
 
 # Mappings
 
-Mappings are a mechanism for generating entities from structured data sources, including tabular data and SQL databases.
+Mappings are a mechanism for generating entities from structured data sources, including tabular data and SQL databases. Mappings are defined in YAML files that describe how information from a table can be projected to FollowTheMoney entities.
 
-Please refer to the [Aleph mappings documentation](https://docs.aleph.occrp.org/developers/mappings) for details.
+## Getting started
+
+In order to map data to the FollowTheMoney model, you will need the following:
+
+* a source data table,
+* a tool to process the mapping,
+* and a mapping file (to direct the tool).
+
+**Source data** can be either a CSV (comma-separated values) file using the UTF-8 character encoding, or a valid [connection string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) for a SQL database. Using SQL as a source also lets you perform JOINs within the database while mapping data.
+
+In order to **execute a mapping**, you need to install the [ftm command-line](/docs/cli) utility.
+
+To write a **mapping file**, you will first need to identify:
+
+- the types of entities included in the dataset (e.g. `People`, `Companies`, `Directorships`),
+- the properties that describe each entity (e.g. the `name` of a `Company`, or the `birthDate` of a `Person`),
+- and the field or combination of fields that can be used to generate a `key` (this is used to uniquely identify each entity in the dataset). Find more details on these requirements [below](#generating-unique-keys).
+
+## A simple mapping example
+
+Writing a mapping file is often an iterative process: we can start small and gradually expand to refine the data model.
+
+Below is a simple mapping file. It downloads a list of British members of parliament and transforms them into `Person` entities.
+
+```yaml title="brexitonians.yml"
+gb_parliament_57:
+  queries:
+    - csv_url: http://bit.ly/uk-mps-csv
+      entities:
+        member:
+          schema: Person
+          keys:
+            - id
+          properties:
+            name:
+              column: name
+```
+
+The mapping file specifies a dataset name (`gb_parliament_57`) and uses a single query to pull data from a CSV file (the dataset is from the excellent EveryPolitician project). The query generates a `Person` entity, which uses the CSV's `id` column as a key, and maps the CSV's `name` column to the property `name`.
+
+Try saving this file to your computer and executing it with the [ftm command-line tool](/docs/cli):
+
+```
+ftm map brexitonians.yml
+```
+
+The command will output a `Person` entity (formatted as a JSON object) for every unique record in the source table.
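+
+Each output line is a JSON object along these lines (abridged here for readability; the actual ID is a generated hash and the name will vary per record):
+
+```json
+{"id": "4000b425...", "schema": "Person", "properties": {"name": ["Jane Doe"]}}
+```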
+
+### Assigning additional properties
+
+However, the source CSV file has far more detail on each MP, from e-mail addresses to political party affiliation. To include this data in `gb_parliament_57`, we need to map each CSV column to the respective property defined in the FollowTheMoney schema. The available properties vary based on the type of entity (a `Person` will have different properties from a `Company`).
+
+To find out what properties exist for a particular schema, you can [check out the YAML-based schema definitions](https://github.com/alephdata/followthemoney/tree/master/followthemoney/schema) on GitHub or the [Model Explorer](/explorer).
+
+Here's an updated mapping file, which maps additional columns from the CSV file to properties in the `Person` schema (`email`, `nationality`, and `alias`):
+
+```yaml title="brexitoids.yml"
+gb_parliament_57:
+  queries:
+    - csv_url: http://bit.ly/uk-mps-csv
+      entities:
+        member:
+          schema: Person
+          keys:
+            - id
+          properties:
+            name:
+              column: name
+            alias:
+              column: sort_name
+            email:
+              column: email
+            nationality:
+              literal: GB
+```
+
+### Generating multiple entities
+
+Now that we've generated a detailed record for each MP, we might want to add their party membership. First, let's map a party entity (line 12 onwards):
+
+```yaml title="brexicels.yml"
+gb_parliament_57:
+  queries:
+    - csv_url: http://bit.ly/uk-mps-csv
+      entities:
+        member:
+          schema: Person
+          keys:
+            - id
+          properties:
+            name:
+              column: name
+        party:
+          schema: Organization
+          keys:
+            - group_id
+          properties:
+            name:
+              column: group
+```
+
+When run, this will create twice as many entities as before: the MPs, and their parties. Note how each party is generated multiple times (once for each of its members). When you're using the command line, you will need to perform [entity aggregation](/docs/cli#aggregating-entities-using-ftm-store) to merge these duplicates.
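+
+On the command line, that aggregation step is the same one used for any other mapping (the output file name is arbitrary):
+
+```bash
+ftm map brexicels.yml | ftm aggregate > gb_parliament_57.ijson
+```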
+
+### Creating relationships between entities
+
+What this does not yet do, however, is explicitly create a link between each MP and their party. In FollowTheMoney parlance, links (or relationships) are just another entity type. Note how, on lines 5 and 12 of the mapping above, we assigned temporary names for the `member` and the `party`. We can use these references when generating a third entity, the `Membership`:
+
+```yaml title="brexosaurs.yml"
+gb_parliament_57:
+  queries:
+    - csv_url: http://bit.ly/uk-mps-csv
+      entities:
+        member:
+          schema: Person
+          keys:
+            - id
+          properties:
+            name:
+              column: name
+        party:
+          schema: Organization
+          keys:
+            - group_id
+          properties:
+            name:
+              column: group
+        membership:
+          schema: Membership
+          keys:
+            - id
+            - group_id
+          properties:
+            organization:
+              entity: party
+            member:
+              entity: member
+```
+
+When loaded into a FollowTheMoney-compatible tool such as [Aleph](https://docs.aleph.occrp.org), this mapping will show browsable entities for each member and party, and list the memberships on each of their profile pages. You can also export this data [to a more conventional node-graph data model](/docs/cli#exporting-data-to-a-network-graph) for use in Neo4J or Gephi.
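+
+For example, sticking to the commands covered in the [CLI documentation](/docs/cli), you could aggregate the mapping output and export it as a GEXF graph (file names illustrative):
+
+```bash
+ftm map brexosaurs.yml | ftm aggregate | ftm export-gexf -o gb_parliament_57.gexf
+```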
+
+## A more realistic, complex mapping
+
+The companies registry of the Republic of Moldova is an open dataset that consists of three separate source files which, taken together, produce a graph of company information, ownership and management:
+
+- `companies.csv` with company details like name, ID, address and incorporation date;
+- `directors.csv` with the names of directors and their details;
+- `founders.csv` also with names and details of the founding entities (i.e. major shareholders).
+
+The mapping example given below describes the relationships between the companies stored in `companies.csv` and the directors and founders stored in `directors.csv` and `founders.csv` respectively.
+
+```yaml title="moldova.yml"
+md_companies:
+  queries:
+    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/companies.csv
+      # Entity definition section.
+      entities:
+        # This is an arbitrary entity name that will be used throughout this
+        # query section of the mapping.
+        company:
+          # Entity schema type from the FollowTheMoney model.
+          schema: Company
+          # List of columns that are used as unique identifiers for each record.
+          # This can also be viewed as record aggregation: when there are
+          # several records for the same company that differ only in, for
+          # example, the address field, the resulting entity will contain
+          # address values merged from the different source data records.
+          keys:
+            - IDNO
+            - Denumirea_completă
+          # A set of properties that describe the chosen schema type.
+          # For each property, one or several columns can be used as the source
+          # of the value. A literal string value can be given instead of a
+          # column value, e.g. for a country code.
+          properties:
+            name:
+              column: Denumirea_completă
+            registrationNumber:
+              column: IDNO
+            incorporationDate:
+              column: Data_înregistrării
+            address:
+              column: Adresa
+            jurisdiction:
+              literal: MD
+            legalForm:
+              column: Forma_org
+            status:
+              column: Statutul
+    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/directors.csv
+      # With this query, director records are loaded and the Directorship
+      # relation is defined between directors and companies.
+      entities:
+        # Again, a Company entity is constructed using the same set of keys
+        # as in the query above, so that it can be referred to in the
+        # Directorship definition.
+        company:
+          schema: Company
+          keys:
+            - Company_IDNO
+            - Company_Name
+        director:
+          schema: LegalEntity
+          keys:
+            - Company_Name
+            - Company_IDNO
+            - Director
+          properties:
+            name:
+              column: Director
+              # Only include records that have a non-empty `Director` column.
+              required: true
+        directorship:
+          schema: Directorship
+          # To avoid key collisions between directors and directorships, an
+          # additional literal string value is mixed in with `key_literal`.
+          key_literal: Directorship
+          keys:
+            - Company_Name
+            - Company_IDNO
+            - Director
+          properties:
+            # Linking together directors and companies: the director and
+            # organization properties of the Directorship interval contain
+            # references to the director and company entities that were
+            # constructed previously.
+            director:
+              entity: director
+              required: true
+            organization:
+              entity: company
+              required: true
+            role:
+              literal: director
+    # Similar to directors, in order to link founders to companies through an
+    # ownership event, the company and founder entities have to be declared
+    # again in each query section.
+    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/founders.csv
+      entities:
+        company:
+          schema: Company
+          keys:
+            - Company_IDNO
+            - Company_Name
+        founder:
+          schema: LegalEntity
+          keys:
+            - Company_Name
+            - Company_IDNO
+            - Founder
+          properties:
+            name:
+              column: Founder
+              required: true
+        ownership:
+          schema: Ownership
+          key_literal: Ownership
+          keys:
+            - Company_Name
+            - Company_IDNO
+            - Founder
+          properties:
+            owner:
+              entity: founder
+              required: true
+            asset:
+              entity: company
+              required: true
+            role:
+              literal: founder
+    # In case there are extra tables with data that has to be linked to the
+    # companies, the same set of keys is repeated and the relevant properties
+    # are declared.
+    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/licensed.csv
+      entities:
+        company:
+          schema: Company
+          keys:
+            - Company_IDNO
+            - Company_Name
+          properties:
+            sector:
+              column: Denumire
+    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/unlicensed.csv
+      entities:
+        company:
+          schema: Company
+          keys:
+            - Company_IDNO
+            - Company_Name
+          properties:
+            sector:
+              column: Denumire
+            caemCode:
+              column: Cod_CAEM
+```
+
+## Generating unique keys
+
+When creating entities from a dataset, each generated entity must be assigned a unique ID. This ID is computed from the `keys` defined in the mapping file. When writing the file, it is therefore necessary to understand what combination of source columns from the original table can be used to uniquely identify an entity in the context of the dataset. Failing to do so will result in "key collisions", a problem that produces a variety of errors which are sometimes hard to diagnose:
+
+- Entities' properties contain values from different, unrelated records (e.g. addresses, dates of birth);
+- Entities have the wrong type (`Person` entities are generated as `LegalEntity` instead);
+- Related entities are merged together in various ways;
+- Error messages are shown when trying to load the mapping (e.g. `Cannot index abstract schema` or `No common ancestor: ...`).
+
+For example, given a table of people with their personal details, the mapping below might not always be valid, because different people can have the same first and last name (and thus a key collision will happen):
+
+```yaml title="bad.yml"
+entities:
+  person:
+    schema: Person
+    keys:
+      - FirstName
+      - LastName
+    properties:
+      firstName:
+        column: FirstName
+      lastName:
+        column: LastName
+      birthDate:
+        column: DoB
+```
+
+The solution is to include in the list of keys as many properties as are necessary and sufficient to eliminate any overlap between unrelated entities of the same type:
+
+```yaml title="good.yml"
+entities:
+  person:
+    schema: Person
+    keys:
+      - FirstName
+      - LastName
+      - DoB
+    properties:
+      firstName:
+        column: FirstName
+      lastName:
+        column: LastName
+      birthDate:
+        column: DoB
+```
+
+Keys for events (`Ownership`, `Sanction`, `Family`) will usually be a product of the keys of the entities that such an event links together:
+
+```yaml title="combined.yml"
+entities:
+  company:
+    schema: Company
+    keys:
+      - company_name
+  owner:
+    schema: Person
+    keys:
+      - owner_name
+  ownership:
+    schema: Ownership
+    keys:
+      - company_name
+      - owner_name
+```
+
+## Loading a mapping from a SQL database
+
+In the examples shown above, data has been loaded from CSV files. The mapping system can also connect to a SQL database using SQLAlchemy. Depending on the database system used, additional Python drivers (such as `psycopg2` or `mysqlclient`) might be required for specific backends.
+
+When loading from a SQL database, you can begin your query with a specification of the tables you wish to access, and how they should be joined:
+
+```yaml title="database.yml"
+za_cipc:
+  queries:
+    - database: postgresql://localhost/cipc
+      tables:
+        - table: za_cipc_companies
+          alias: companies
+        - table: za_cipc_directors
+          alias: directors
+      joins:
+        - left: companies.regno
+          right: directors.company_regno
+```
+
+Please note that when you query more than one table at the same time, all the column names used in the mapping need to be qualified with the table name, i.e. `companies.name` or `directors.name` instead of just `name`.
+
+Mappings support substitution of environment variables. Instead of storing your database credentials in a mapping file, you might want to reference an environment variable like `${DATABASE_URI}` in the mapping file, and define the username and password externally.
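+
+For example, with the query's `database` key set to `${DATABASE_URI}`, the mapping could be run like this (the connection string is illustrative):
+
+```bash
+export DATABASE_URI=postgresql://user:password@localhost:5432/cipc
+ftm map database.yml
+```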
+
+## Filtering source data
+
+When loading data from a mapping, you may sometimes want to filter the data so that only part of a table is imported. FollowTheMoney mappings only support equality filters; anything more complex than that should be considered data cleaning and be done prior to executing the mapping.
+
+```yaml title="filters.yml"
+gb_parliament_57:
+  queries:
+    - csv_url: http://bit.ly/uk-mps-csv
+      filters:
+        group: 'Conservative'
+      filters_not:
+        gender: 'male'
+      entities:
+        member:
+          schema: Person
+          keys:
+            - id
+          properties:
+            name:
+              column: name
+```
+
+## Extra functions for property values
+
+Mapping a column value to a property is normally a straight copy operation:
+
+```yaml
+[...]
+properties:
+  name:
+    column: person_name
+```
+
+There are, however, some tricks available:
+
+```yaml title="hacks.yml"
+# Setting multiple values for a property:
+properties:
+  name:
+    columns:
+      - person_name
+      - maiden_name
+
+# Merging values ad-hoc:
+properties:
+  name:
+    columns:
+      - first_name
+      - last_name
+    join: " "
+
+# Setting a constant value:
+properties:
+  country:
+    literal: "SS"
+
+# Defining a date format:
+properties:
+  birthDate:
+    column: dob
+    format: "%d.%m.%Y"
+```
+
+In general, we are not seeking to incorporate further data cleaning functionality into the mapping process. It's generally a good idea to design your data pipeline such that loading entities via a mapping is preceded by a data cleanup step in which the necessary normalisations are applied.