Home
File organization can be more flexible and file browsing less tedious.
Here are some files with associated user-defined metadata:
/home/work/projects/ginzoo2000/doc/letter_iron_cie.doc -> author=me, doc_type=letter, recipient=iron_cie, date=2017-01-01
/home/work/projects/ginzoo2000/data/survey.xls -> author=[com_team, me], doc_type=spreadsheet, datatype=poll, recipient=iron_cie, date=2016-04-02
/home/work/meetings/weekly_2015_08_10.doc -> author=secretary, doc_type=minutes, date=2015-08-10
/home/work/meetings/weekly_2016_08_17.doc -> author=secretary, doc_type=minutes, date=2016-08-17
/home/work/meetings/weekly_2016_08_24.doc -> author=secretary, doc_type=minutes, date=2016-08-24
/home/work/meetings/weekly_2017_08_31.doc -> author=secretary, doc_type=minutes, date=2017-08-31
/home/work/meetings/weekly_2017_08_31_group.jpg -> date=2017-08-31
/home/work/events/letter_host.doc -> author=me, doc_type=letter, recipient=ruby_hotel, date=2016-06-15
I wish to list all letters I wrote:
$ ~ > list_files author=me doc_type=letter
/home/work/projects/ginzoo2000/doc/letter_iron_cie.doc
/home/work/events/letter_host.doc
and list all meeting reports in 2016:
$ ~ > list_files doc_type=minutes date>=2016 date<2017
/home/work/meetings/weekly_2016_08_17.doc
/home/work/meetings/weekly_2016_08_24.doc
Ultimately I wish to browse files using metadata, independently of the underlying folder organization:
$ ~ > tree_view author doc_type
├── com_team
│   └── spreadsheet
│       └── survey.xls
├── me
│   ├── letter
│   │   ├── letter_host.doc
│   │   └── letter_iron_cie.doc
│   └── spreadsheet
│       └── survey.xls
├── secretary
│   └── minutes
│       ├── weekly_2015_08_10.doc
│       ├── weekly_2016_08_17.doc
│       ├── weekly_2016_08_24.doc
│       └── weekly_2017_08_31.doc
└── unsorted
    └── /home/work/meetings/weekly_2017_08_31_group.jpg
The classical organization via static nested folders is becoming less efficient, first because the number of data files keeps increasing, but also because their content is becoming more heterogeneous. Moreover, current file systems still lack a proper way of storing user-defined descriptors of files, i.e. metadata.
Very powerful third-party solutions exist, mostly relying on databases and often targeted at web applications. They add a layer on top of the file system, and the information is often enclosed in non-human-readable containers (databases). They thus require dedicated software to access, modify and query information. For desktop environments, this matter is partially addressed by so-called "intelligent" assistants like Cortana for MS Windows. They parse all the data files and try to guess as much metadata as possible. They offer a basic keyword-based query engine whose queries are meant to be simple to formulate: the user simply types words describing what they are looking for, akin to a Google search. Some accomplished and efficient tools in this vein are Recoll and Beagle.
However, these systems do not let users supply their own metadata, and they have their own way of labeling things. One tool that does answer this need is TagSpaces, which provides a comprehensive user interface but no console tools. An important shortcoming is that its metadata are unstructured: for example, there is no semantic difference between a tag indicating a project and a tag indicating a rating.
The proposed set of tools aims at a more efficient way of browsing files and intends to be:
- user-driven: the user provides most of the meaningful metadata. Only limited automatic discovery is provided.
- future-proof: rely on human-readable formats, store metadata next to the data. No information is lost if tools are uninstalled.
- as simple as possible: provide equivalent of cd and ls commands that can query metadata.
- more flexible: enable metadata-based file browsing. The user can create as many transversal views as needed with no data duplication. Each view covers specific metadata organized in a custom order.
Metadata basically include descriptors stored by the file system (FS): size, creation/modification/access times, credentials, filetype ... and depend on the OS. Among these, here are the FS metadata that are automatically gathered:
- file_type (either file or folder)
- file_modification_date
Additional metadata are considered here to be user-driven, i.e. the user has control over the definition of metadata. Some automatic discovery tools could be used, but they are not meant to be part of the core tools and would be seen as helpers for the user to fill in large amounts of metadata. The choice of not relying on automatic discovery is made to avoid turning the mess generated by a large amount of data into a mess in the quantity and diversity of metadata.
Since the primary goal of using metadata here is to provide access to files by means of queries on descriptors, users are simply left to define their own system of descriptors. This may be seen as a rather selfish, user-limited view that could prevent easy sharing of metadata. However, it is the responsibility of a group of users or a community to agree on a standard metadata specification. Software brings molds, users bring dough.
In practice, metadata relate to files or folders on the drive and are stored in side-car files in the JSON format (see Metadata JSON file format). The name of each metadata file is the name of the associated file or folder, followed by the extension .mdf. For example:
/home/me/personal/CV/cv_long_en.doc
/home/me/personal/CV/cv_long_en.doc.mdf
/home/me/personal/CV.mdf
IMPORTANT: naming consistency has to be maintained by the user (they have control ... and responsibility). Future improvements may rely on file content hashing and on change monitoring via inotify, as Beagle did, or via tools such as incron or fswatch.
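Since naming consistency is left to the user, a small helper can make the convention explicit and report metadata files whose target has gone missing. The following is only a sketch of the naming rule described above; `sidecar_path`, `described_path` and `orphan_sidecars` are hypothetical names and are not part of the tools.

```python
from pathlib import Path

MDF_SUFFIX = ".mdf"  # extension named in the text


def sidecar_path(target):
    """Return the metadata file path associated with a file or folder."""
    target = Path(target)
    # The extension is appended to the full name, not substituted:
    # cv_long_en.doc -> cv_long_en.doc.mdf, CV -> CV.mdf
    return target.with_name(target.name + MDF_SUFFIX)


def described_path(mdf):
    """Return the file or folder that a metadata file describes."""
    mdf = Path(mdf)
    if not mdf.name.endswith(MDF_SUFFIX) or mdf.name == MDF_SUFFIX:
        raise ValueError("not a metadata file: %s" % mdf)
    return mdf.with_name(mdf.name[: -len(MDF_SUFFIX)])


def orphan_sidecars(root):
    """List metadata files under root whose described item no longer exists."""
    orphans = []
    for mdf in Path(root).rglob("*" + MDF_SUFFIX):
        if mdf.name == MDF_SUFFIX:
            continue  # a file named exactly ".mdf" describes nothing
        if not described_path(mdf).exists():
            orphans.append(mdf)
    return sorted(orphans)
```

Running `orphan_sidecars` after moving or renaming files gives a list of side-car files that need to be renamed or deleted by hand.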
Metadata are defined by a set of attribute/value pairs. For example:
- project="ginzoo2000"
- author=["me", "myself", "irene"]
- date="#2016-05-24"
An attribute name follows the Python identifier format. Valid characters are the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.
Examples of valid attribute names:
doc_type, composer, author, keyword, project, protocol, reviewer, creation_date, guideline101
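The naming rule for attributes can be checked with a short regular expression. This is a minimal sketch; `is_valid_attribute` is a hypothetical helper, and it assumes the ASCII-only reading of the rule stated above.

```python
import re

# ASCII letters or underscore first, then ASCII letters, digits or underscore.
# Python's own str.isidentifier() also accepts non-ASCII letters, so it is
# deliberately not used here.
ATTRIBUTE_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_attribute(name):
    """Tell whether name is an acceptable attribute name."""
    return ATTRIBUTE_RE.fullmatch(name) is not None
```

For instance, `is_valid_attribute("guideline101")` holds, while `"101guideline"` and `"doc-type"` are rejected.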
An attribute is associated with an array of values. The type of an attribute is determined by the common type of its associated values. It is resolved the first time an attribute is encountered. For other occurrences, values must be consistent with this type, else an error is produced. This applies to all values of a given attribute for all files or folders that have this attribute.
- A value must only contain the following characters:
  - Alphanumerical characters: A to Z, a to z and 0 to 9
  - The hashtag # can only be used as first character, to indicate a date (see Date format)
  - Other allowed characters: _, -, +, :, . and @
Other characters (like space, &, >, etc.) are not allowed to avoid ambiguity with query operators and shell processing. Consider using underscores to replace spaces.
In general, metadata content is not meant to have a fancy form, but only to robustly represent semantics. It is advised to reuse metadata values as much as possible.
- Formatting advice:
  - adopt singular everywhere (no plural)
  - use upper case only when necessary
  - minimize the number of words
Examples of valid values:
justin_time, kay_oss, jean-pierre_jeunot, 64.42, #2016-12-17T13:29, #2013-12, #2011
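The character rules for values translate directly into a pattern. The sketch below only checks the allowed characters of the textual form of a value; it does not check that a value starting with # is a well-formed date. `is_valid_value` is a hypothetical helper.

```python
import re

# Optional leading '#', then one or more of the allowed characters:
# letters, digits, '_', '-', '+', ':', '.' and '@'.
VALUE_RE = re.compile(r"#?[A-Za-z0-9_\-+:.@]+")


def is_valid_value(value):
    """Tell whether the textual form of a value uses allowed characters only."""
    return VALUE_RE.fullmatch(value) is not None
```

With this pattern `jean-pierre_jeunot` and `#2013-12` are accepted, while `jean pierre`, `a&b` and `a#b` are rejected.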
Supported metadata types for values are:
- string
- number
- boolean (true|false)
- date (see Date format)
All values for a given attribute must have the same type, across all files and folders. Else an error is produced.
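The consistency rule can be pictured as a single pass over all metadata, where the first occurrence of an attribute fixes its type. This is a sketch under assumptions: metadata are given as already-loaded dictionaries, dates are recognised by their leading #, and the names `value_type`, `check_types` and `AttributeTypeError` are hypothetical.

```python
class AttributeTypeError(ValueError):
    """Raised when an attribute is used with values of different types."""


def value_type(value):
    """Return one of 'boolean', 'number', 'date' or 'string'."""
    # bool is tested first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "date" if value.startswith("#") else "string"
    raise AttributeTypeError("unsupported value: %r" % (value,))


def check_types(entries):
    """Resolve the type of each attribute over (path, metadata) pairs.

    The type is fixed by the first value encountered; any later value of
    another type raises AttributeTypeError.
    """
    known = {}
    for path, metadata in entries:
        for attribute, values in metadata.items():
            for value in values:
                found = value_type(value)
                expected = known.setdefault(attribute, found)
                if found != expected:
                    raise AttributeTypeError(
                        "%s: attribute %r is %s, got %s value %r"
                        % (path, attribute, expected, found, value)
                    )
    return known
```

The returned mapping (attribute name to type name) can then be used to convert query values before comparison.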
The date format is ISO 8601: [+-]YYYY-MM-DDThh:mm:sec[Z|+hh:mm]. The implementation provided by the Python iso8601 module is used, except that the space character is not allowed as a separator between date and time (T is used instead).
It is highly recommended to use fully qualified dates as much as possible (with at least year/month/day). If not, the 1st month / day / hour / minute / sec is used. Example: 2019-02 is interpreted as 2019-02-01T00:00:00
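The defaulting rule for partial dates can be illustrated with the standard library alone. This sketch is a simplified stand-in, not the iso8601 module mentioned above: it ignores signs and time zones, and `parse_date` is a hypothetical helper.

```python
import re
from datetime import datetime

# Year is mandatory; every following field is optional and nested, so that
# a day cannot be given without a month, and so on. A leading '#' is accepted.
DATE_RE = re.compile(
    r"#?(?P<year>\d{4})"
    r"(?:-(?P<month>\d{2})"
    r"(?:-(?P<day>\d{2})"
    r"(?:T(?P<hour>\d{2})"
    r"(?::(?P<minute>\d{2})"
    r"(?::(?P<second>\d{2}))?)?)?)?)?"
)


def parse_date(text):
    """Parse a possibly partial date, filling missing fields with the
    first month / day and with zero hours, minutes and seconds."""
    match = DATE_RE.fullmatch(text)
    if match is None:
        raise ValueError("unsupported date: %r" % text)
    parts = match.groupdict()
    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
    )
```

With this helper, `parse_date("2019-02")` equals `datetime(2019, 2, 1)`, and `parse_date("#2015-04-01") > parse_date("#2015")` holds because "#2015" stands for January 1st 2015 at midnight.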
Metadata are stored in JSON files (.mdf extension) containing mappings between unique attributes (or categories) and values. Values are in an array of homogeneous type (string, number, boolean):
{
    "attribute_with_one_string_value": ["value_string_1"],
    "attribute_with_multiple_string_values": ["value_string_2", "value_string_3"],
    "attribute_with_one_numerical_value": [45.6],
    "attribute_with_numerical_values": [4, 8, 15, 16, 23, 42],
    "attribute_with_boolean_value": [false],
    "attribute_with_string_date": ["#2015-06-04"]
}
Note that for a given attribute and a given file or folder, associated values should also be unique. If not, duplicates are ignored anyway.
The empty string is ignored for attributes and values (the user is warned).
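Reading such a file only needs the standard `json` module. The sketch below applies the two rules just stated (duplicates dropped, empty strings ignored with a warning); `load_metadata` is a hypothetical helper and takes the file content as a string.

```python
import json
import warnings


def load_metadata(text):
    """Parse the content of an .mdf file into {attribute: [values]}."""
    raw = json.loads(text)
    cleaned = {}
    for attribute, values in raw.items():
        if attribute == "":
            warnings.warn("empty attribute name ignored")
            continue
        kept = []
        for value in values:
            if value == "":
                warnings.warn("empty value ignored for %r" % attribute)
                continue
            if value not in kept:  # drop duplicates, keep first-seen order
                kept.append(value)
        cleaned[attribute] = kept
    return cleaned
```

Note that strict JSON does not allow a comma after the last entry of an object, so `json.loads` rejects files that end with one.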
- Reserved attributes used to store metadata from the file system are:
  - file_type (string): either 'file' or 'folder'. TODO: how to handle symlinks? (unix only)
  - file_modification_date (date): when the file content was last modified.
  - file_access_date (date): time of most recent access.
- lsx: list files by querying metadata
- cdx: change directory by querying metadata
- treex: show a tree view based on given metadata attributes
Queries are logical conjunctions (logical AND) of predicates, separated by spaces. A predicate can be in two forms:
- <attribute_name><operator><qvalue>
  Select items where attribute_name matches the given constraint for any of its associated values. Examples: author=mister_tea, date<2016. qvalue must be convertible to the type of the given attribute.
- [<negation>]<qvalue>
  Without negation: select all items where any value of any attribute with string type is equal to qvalue. With negation: select all items where all values of any attribute with string type are not equal to qvalue. Note that qvalue is always interpreted as a string here. Indexed values are converted to string before comparing to qvalue.
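The matching semantics of the two predicate forms can be sketched on a single metadata dictionary. This is an illustration under assumptions, not the actual implementation: the function names are hypothetical, only string, number and boolean attributes are handled, and dates are left out since they would need to be parsed before comparison.

```python
import operator

OPERATORS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def convert(qvalue, sample):
    """Convert the query string to the type of a stored value."""
    if isinstance(sample, bool):  # tested first: bool is a subclass of int
        if qvalue not in ("true", "false"):
            raise ValueError("not a boolean: %r" % qvalue)
        return qvalue == "true"
    if isinstance(sample, (int, float)):
        return float(qvalue)  # raises ValueError for e.g. "fifty"
    return qvalue


def matches_attribute(metadata, attribute, op, qvalue):
    """First form: true if ANY value of the attribute satisfies the constraint."""
    values = metadata.get(attribute, [])
    if not values:
        return False
    target = convert(qvalue, values[0])
    return any(OPERATORS[op](value, target) for value in values)


def matches_value(metadata, qvalue, negated=False):
    """Second form: compare qvalue with every string value of every attribute."""
    strings = [v for values in metadata.values() for v in values
               if isinstance(v, str)]
    found = qvalue in strings
    # With negation, ALL string values must differ from qvalue.
    return not found if negated else found
```

A full query is then the conjunction of its predicates: an item is selected only if every predicate matches its metadata.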
The negation character ! used immediately before a string is unary and relates to the value immediately following it. It cannot be used before attribute names.
Examples of valid negations:
!felix_cited !docx
Examples of invalid usage of !:
felix_cited! felix_!cited
The equality operator = is binary and relates to the attribute name preceding it (left operand) and the value following it (right operand).
Examples of valid equalities:
author=felix_cited author=felix_cited doc_type=letter
Examples of invalid equalities:
author= felix_cited
author = felix_cited
author =felix_cited
=author
The non-equality operator != is binary and relates to the attribute name preceding it (left operand) and the value following it (right operand).
Examples of valid non-equalities:
author!=felix_cited reviewed!=True
The relational operators <, <=, >, >= are binary and relate to the attribute name preceding them and the value following them.
Examples of valid usage of relational operators:
nb_pages<=50 temperature_celsius<37.2 creation_date>=2016-09-01 creation_date<2016-07-01 author_name>=joh
Examples of invalid usage of relational operators:
nb_pages <= 50
nb_pages<= 50
nb_pages <=50
nb_pages<=fifty
2016-09-01<creation_date
author_name=<c
<nb_pages
nb_pages>
2016-09-01<creation_date<2016-07-01
creation_date=juin
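The syntax rules for predicates can be summarised by a small tokenizer. This is a hedged sketch, not the parser used by the tools: `Predicate`, `parse_predicate` and `parse_query` are hypothetical names. It only checks syntax, so `nb_pages<=fifty` and `creation_date=juin` pass here and would be rejected later, when the value is converted to the type of the attribute.

```python
import re
from collections import namedtuple

Predicate = namedtuple("Predicate", "attribute operator qvalue negated")

ATTRIBUTE = r"[A-Za-z_][A-Za-z0-9_]*"
QVALUE = r"#?[A-Za-z0-9_\-+:.@]+"
# Two-character operators are listed first so that "<=" is not read as "<".
ATTRIBUTE_FORM = re.compile(r"(%s)(!=|<=|>=|=|<|>)(%s)" % (ATTRIBUTE, QVALUE))
VALUE_FORM = re.compile(r"(!?)(%s)" % QVALUE)


def parse_predicate(token):
    """Parse one space-free token into a Predicate, or raise ValueError."""
    match = ATTRIBUTE_FORM.fullmatch(token)
    if match:
        attribute, operator, qvalue = match.groups()
        return Predicate(attribute, operator, qvalue, False)
    match = VALUE_FORM.fullmatch(token)
    if match:
        return Predicate(None, None, match.group(2), match.group(1) == "!")
    raise ValueError("invalid predicate: %r" % token)


def parse_query(query):
    """Split a query on whitespace and parse each predicate."""
    return [parse_predicate(token) for token in query.split()]
```

Because predicates are separated by spaces, a query such as `nb_pages <= 50` is split into three tokens and fails on the lone `<=`, which is why spaces around operators are invalid.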
Dates follow ISO 8601. Important: dates that are not fully qualified are set to the 1st matching month / day / hour / minute / sec. For example, "#2015" is interpreted as "#2015-01-01T00:00:00" and "#2015-07" as "#2015-07-01T00:00:00". Hence the query date>#2015 does not mean "any date strictly after the year 2015" (i.e. 2016, 2017...), but "any date strictly after January 1st 2015 at midnight". The dates "#2015-04-01" and "#2015-01-01T00:00:01" will match this query.
To actually get entries whose date is strictly after the year 2015, use date>=#2016. In general, strict comparisons on dates are often misleading.
Change directory by selecting a folder based on metadata query.
usage: cdx PREDICATE1 [PREDICATE2 ...]
If the set of given predicates yields a unique directory, then cd to it. Otherwise, the available choices are simply displayed.
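The decision rule of cdx is small enough to be written down. The sketch below assumes the folders matching the query have already been collected; `choose_directory` is a hypothetical helper, and how the calling shell actually changes directory is left aside.

```python
def choose_directory(matching_folders):
    """Return the folder to change to, or None when the choice is not unique.

    When zero or several folders match, the candidates are printed so the
    user can refine the query.
    """
    candidates = sorted(set(matching_folders))
    if len(candidates) == 1:
        return candidates[0]
    for folder in candidates:
        print(folder)
    return None
```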
Show a tree view of files and folders according to the given attributes.
usage: treex ATTRIBUTE1 [ATTRIBUTE2 ...] [--show_unsorted]
Files or folders actually having the given attributes will be displayed. Each layer of the tree corresponds to a given attribute. If the option --show_unsorted is given, then all other files and folders that don't have one of the given attributes will be displayed in an "unsorted" section at the end.
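The layering described above can be sketched as a nested dictionary with one level per attribute. This is an illustration only: `build_tree` and `place` are hypothetical names, and the index is assumed to be a plain mapping from path to metadata dictionary. An item with several values for an attribute appears under each of them, with no data duplication on disk.

```python
import os


def build_tree(index, attributes, show_unsorted=False):
    """Group paths by the values of the given attributes, one tree layer
    per attribute. Leaves are lists of file names."""
    tree = {}
    unsorted = []
    for path, metadata in sorted(index.items()):
        # Items lacking any of the requested attributes are not sorted.
        if not all(metadata.get(attribute) for attribute in attributes):
            unsorted.append(path)
            continue
        place(tree, attributes, metadata, path)
    if show_unsorted and unsorted:
        # Caveat of this sketch: collides with a value literally named "unsorted".
        tree["unsorted"] = unsorted
    return tree


def place(node, attributes, metadata, path):
    """Insert path below every value of the first attribute, recursively."""
    first, rest = attributes[0], attributes[1:]
    for value in metadata[first]:
        if rest:
            place(node.setdefault(value, {}), rest, metadata, path)
        else:
            node.setdefault(value, []).append(os.path.basename(path))
```

For a file with author=[com_team, me] and doc_type=spreadsheet, calling `build_tree(index, ["author", "doc_type"])` lists it under both `com_team/spreadsheet` and `me/spreadsheet`.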
import medinx

# Parse .mdf files in a given directory and its subdirectories
full_index = medinx.parse_folder('.')

# Start a selection
selection = full_index.filter('author=me')
# Refine selection
selection.filter('doc_type=letter')
# Finally gather files in the selection
selected_files = selection.get_files()

# Start another selection, for folders
selection = full_index.filter('file_type=folder')
# Filter by value (any attribute)
selection.filter('ginzoo2000')
selected_folders = selection.get_files()

# Build a tree view: one layer per attribute
view = full_index.tree_view(['author', 'doc_type'])
# items() works on both Python 2 and Python 3 mappings
for author_name, author_view in view.items():
    print('author: %s' % author_name)
    for doc_type, doc_fns in author_view.items():
        print(' * doc type: %s' % doc_type)
        for doc_fn in doc_fns:
            print('   - %s' % doc_fn)