This document is archived.
The latest version is available at the GitHub repository of the FAIR DataPoint Specification: https://github.com/DTL-FAIRData/FAIRDataPoint-Spec/
The purpose of this document is to specify the FAIR Data Point (FDP) software. This document includes requirements, architecture and design of the FDP software. This specification is primarily intended to be a reference for developing the first version of the FDP software by the DTL FAIR engineering team.
The FDP is software that, on the one hand, allows data owners to expose datasets in a FAIR manner and, on the other hand, allows data users to discover properties of the offered datasets (metadata) and, if the license conditions allow, to access the actual data.
The FDP software is initially being developed as a stand-alone web application. However, the functionality and behaviour of the FDP can also be embedded in other applications to provide FAIR data accessibility to the application’s datasets. For instance, an existing data repository may choose to implement the FDP's API and metadata content, thereby also behaving as an FDP.
From the different data interoperability projects in which we are involved, we have identified the following usage scenarios. We used these scenarios to derive the requirements for the data storage and accessibility infrastructure and to guide the design and development of the solution.
A researcher needs to find datasets containing data about proteins that are activated in specific tissues and to combine these data with information about which genes are involved in the production of such proteins. In another situation, the researcher needs to know which biobanks carry a given type of biosample (e.g., blood samples) from patients with a specific phenotype (e.g., Alzheimer's disease), taken from a patient registry, whose onset age was below 45 years. These data users need a straightforward search application that allows them to find the required information.
Once a data user has found the needed datasets, including the information about their licenses and access protocols, the user wants to access and retrieve the data. Because in many situations the data user will integrate many different datasets, she/he needs the formats in which the data are retrieved and the access methods to be standardised. In other words, the method by which the data are accessed should be common to all datasets and data providers. Also, the datasets should use a common representation technology that facilitates data integration.
A research group is running a project in which data are being created. As the data will be used for analysis during the project and may also be useful to other users, the group would like to publish them in a way that allows potential users to retrieve information about the datasets (metadata), data search engines to index the datasets' metadata, and users to retrieve the data. Some of the produced datasets have an open license, but others have more restrictive licenses. Therefore, the data storage and accessibility infrastructure should be able to enforce the license by imposing conditions that users must meet to access the restricted datasets.
Data metrics gathering
The owners of the data storage and accessibility infrastructure need information about how their infrastructure is used. Different kinds of information should be gathered, such as the number of users accessing the metadata and data, who they are, and where they come from. This information is used to assess the amount of computing resources needed to cope with the requests, to assess the interest in each of the offered datasets, and to understand which types of users are interested in which datasets. This information may also be used as evidence of the relevance of the datasets, helping the data owners to justify requests for funding to keep the datasets available. The gathered metrics are to be used primarily by the owner of the FDP. The owner can opt to make the information, or part of it, publicly accessible. However, privacy concerns should be taken into account if identifiable information is gathered.
From the usage scenarios, we have identified a need for a data storage and accessibility infrastructure that we call FAIR Data Point (FDP). The FDP has the following goals:
Based on these goals, Figure 1 depicts the general architecture of an FDP. In this architecture, the FDP exposes its functionality to the users through an application programming interface (API) and a graphical user interface (GUI). The former is intended for software clients and the latter for human clients. The figure also depicts four internal components, each one responsible for one of the four main behaviours expected from an FDP, namely (i) provisioning of metadata information, (ii) access to the offered datasets, (iii) gathering of metrics on metadata and data access and usage, and (iv) access control when the dataset's license imposes restrictions.
Fig. 1 - FDP General architecture based on the application's goals
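As a rough illustration of how the four behaviours could fit together, the sketch below models them in a single Python class. It is a minimal sketch under stated assumptions, not the FDP implementation: the class names, method names, the `restricted` flag and the set of authorised users are all hypothetical and do not come from this specification.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A dataset offered by the FDP (hypothetical structure)."""
    metadata: dict
    data: str
    restricted: bool = False  # True when the license imposes access conditions


@dataclass
class FairDataPointSketch:
    """Toy model of the four behaviours; all names here are assumptions."""
    datasets: dict
    authorised_users: set = field(default_factory=set)
    metrics: Counter = field(default_factory=Counter)

    def get_metadata(self, dataset_id: str) -> dict:
        # (i) metadata provisioning, with (iii) metrics gathering
        self.metrics[("metadata", dataset_id)] += 1
        return self.datasets[dataset_id].metadata

    def get_data(self, dataset_id: str, user: str) -> str:
        dataset = self.datasets[dataset_id]
        # (iv) access control for datasets with a restrictive license
        if dataset.restricted and user not in self.authorised_users:
            self.metrics[("denied", dataset_id)] += 1
            raise PermissionError(f"{user} may not access {dataset_id}")
        # (ii) data access, with (iii) metrics gathering
        self.metrics[("data", dataset_id)] += 1
        return dataset.data


fdp = FairDataPointSketch(
    datasets={
        "open": Dataset({"title": "Open dataset"}, "a,b,c"),
        "closed": Dataset({"title": "Restricted dataset"}, "x,y,z", restricted=True),
    },
    authorised_users={"alice"},
)
print(fdp.get_metadata("closed"))       # metadata stays readable by anyone
print(fdp.get_data("closed", "alice"))  # data only for authorised users
```

In this sketch the metadata of a restricted dataset remains readable by anyone, while only the data itself is guarded; that choice is also an assumption made for the example.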
The FDP initially has two usage purposes: (i) to be used as a stand-alone web application, where data owners give access to their datasets in a FAIR manner, and (ii) to be integrated into larger data interoperability systems, such as the FAIRport, providing the dataset accessibility functionality for such systems. Figure 2 depicts an FDP as a stand-alone application deployed on a web server, exposing its API and GUI to the Web. Figure 3 depicts a set of FDPs integrated as components in a Data FAIRport platform. In this case, each FDP gives access to the datasets published by a given data owner.
Fig. 2 - FAIR Data Point as a stand-alone Web application
Fig. 3 - FAIR Data Point as an application component
In this section we use elements from the ArchiMate notation. The ArchiMate modelling language is an open and independent Enterprise Architecture standard that supports the description, analysis and visualisation of architecture within and across business domains. ArchiMate is one of the open standards hosted by The Open Group and is fully aligned with TOGAF.
The details of what each of these metadata objects represents are given in the Metadata section below. The details of the FAIR Data Point API are given in the Application Programming Interface (API) section below.
Fig. 4 - FAIR Data Point's ArchiMate Application layer architecture
The FAIR Data Point provides metadata about four entities, namely the FAIR Data Point itself, the collection of datasets, each of the offered datasets, and the data within each dataset.
FAIR Data Point metadata
The FAIR Data Point metadata contains information about the FDP itself and its governing authority. Figure 5 depicts the metadata for the FAIR Data Point and some of its attributes.
|Term name|Definition|Comment|
|identifier|An unambiguous and persistent reference to the FDP.|Recommended best practice is to identify the resource by means of a string conforming to a formal identification system.|
|license|A document describing the conditions for access and usage of the FDP.||
|title|The name of the FDP.||
|description|A human-readable description of the FDP.||
|hasVersion|The version of the FDP software.||
|Metadata Version|The version of the FDP API specification implemented in this FDP deployment.||
|issued|Date of formal issuance (e.g., deployment) of the FDP.||
|modified|||
|publisher|The entity responsible for making the FDP available.||
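To make the terms listed above more concrete, the following sketch collects them in a small Python record. It is only an illustration: the class name, the field names and every value (including the example.org URLs) are hypothetical, and this section does not define "modified", so it is treated here simply as a date.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class FdpMetadata:
    """Hypothetical record holding the FDP metadata terms listed above."""
    identifier: str        # unambiguous, persistent reference to the FDP
    license: str           # document with the access and usage conditions
    title: str             # name of the FDP
    description: str       # human-readable description
    has_version: str       # version of the FDP software
    metadata_version: str  # version of the implemented FDP API specification
    issued: date           # date of formal issuance (e.g., deployment)
    modified: date         # not defined in this section; assumed to be a date
    publisher: str         # entity responsible for making the FDP available


# All values below are made up for illustration.
example = FdpMetadata(
    identifier="http://example.org/fdp",
    license="http://example.org/fdp/license",
    title="Example FAIR Data Point",
    description="An example FDP used for illustration only.",
    has_version="0.1",
    metadata_version="0.1",
    issued=date(2016, 1, 1),
    modified=date(2016, 6, 1),
    publisher="http://example.org/organisation",
)

# A flat dictionary is a convenient starting point for serialisation.
record = asdict(example)
print(sorted(record))
```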
Catalog, Dataset and Distribution metadata
For the representation of the catalog of datasets, each of the offered datasets and their distributions, we adopt the W3C's Data Catalog Vocabulary (DCAT) as a basis. DCAT defines three main classes:
- dcat:Catalog: defines the catalog, i.e., the collection of datasets;
- dcat:Dataset: represents an individual dataset in the collection;
- dcat:Distribution: represents an accessible form of a dataset, e.g., a downloadable file or a web service that gives access to the data.
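The relationship between the three classes can be sketched with plain subject-predicate-object tuples. This is a minimal illustration that uses no RDF library; the example.org IRIs, the dataset title and the helper function are hypothetical, while the class names and the dcat:dataset, dcat:distribution and dcat:downloadURL properties come from the DCAT vocabulary.

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical IRIs, used only for this example.
BASE = "http://example.org/fdp"
catalog = f"{BASE}/catalog/1"
dataset = f"{BASE}/dataset/7"
distribution = f"{BASE}/distribution/7-ttl"

# Each tuple is (subject, predicate, object).
triples = [
    (catalog, RDF_TYPE, DCAT + "Catalog"),
    (catalog, DCAT + "dataset", dataset),
    (dataset, RDF_TYPE, DCAT + "Dataset"),
    (dataset, DCT + "title", "Example dataset"),  # a literal, not an IRI
    (dataset, DCAT + "distribution", distribution),
    (distribution, RDF_TYPE, DCAT + "Distribution"),
    (distribution, DCAT + "downloadURL", "http://example.org/files/example.ttl"),
]


def distributions_of(catalog_iri, graph):
    """Follow dcat:dataset, then dcat:distribution, starting from a catalog."""
    datasets = [o for s, p, o in graph
                if s == catalog_iri and p == DCAT + "dataset"]
    return [o for s, p, o in graph
            if s in datasets and p == DCAT + "distribution"]


print(distributions_of(catalog, triples))
```

A real deployment would more likely build and serialise such a graph with an RDF library; tuples are used here only to keep the example self-contained.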
List of required and optional predicates for the catalog metadata
List of required and optional predicates for the dataset metadata
List of required and optional predicates for the distribution metadata
Data content metadata
Application Programming Interface (API)
The FDP's API follows the REST architectural style and, more specifically, the Hypermedia as the Engine of Application State (HATEOAS) pattern. In summary, a HATEOAS API provides information on how to navigate through the API even if the client does not have previous knowledge of the interface.
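The navigation idea can be illustrated with a small in-memory simulation in which a client starts at the root and only ever follows links found in the responses. This is a sketch under stated assumptions: the URLs other than the root, the link relation names and the response layout are hypothetical and are not taken from the FDP API.

```python
# Hypothetical responses, keyed by URL. A real client would issue HTTP requests.
RESPONSES = {
    "/": {"title": "Example FDP", "links": {"catalog": "/catalog/1"}},
    "/catalog/1": {"title": "Example catalog", "links": {"dataset": "/dataset/7"}},
    "/dataset/7": {"title": "Example dataset",
                   "links": {"distribution": "/distribution/7-ttl"}},
    "/distribution/7-ttl": {"title": "Example distribution", "links": {}},
}


def fetch(url):
    """Stand-in for an HTTP GET that returns a parsed response."""
    return RESPONSES[url]


def follow(start_url, relations, fetch=fetch):
    """Navigate from start_url by following one link relation per step.

    The client knows only the entry point and the relation names; every
    URL after the first is discovered from the previous response.
    """
    url = start_url
    visited = [url]
    for relation in relations:
        document = fetch(url)
        url = document["links"][relation]
        visited.append(url)
    return visited


path = follow("/", ["catalog", "dataset", "distribution"])
print(path)
```

The point of the design is that the client hard-codes no URL structure, so the server can reorganise its URLs without breaking clients that follow links.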
Metadata Provider API
Figure 6 depicts the HATEOAS RESTful API of the FDP. In the figure, the upper-left green box represents the FDP service, which responds to requests to the root URL, represented here as "/".
TODO: include a link to the Swagger document. In principle, adopting the HATEOAS guidelines makes a separate API specification unnecessary, because at any point a HATEOAS API provides the information needed to navigate further from that point. It is nevertheless still relevant to document the whole FDP API.
Graphical User Interface (GUI)
The graphical user interface (GUI) allows human users to interact with the metadata and data access APIs in a human-readable manner. The GUI allows the user to navigate through the many levels of information by following the hyperlinks on each page.