Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Page Properties
Target release1.0
Epic
Document status
Status
titleDRAFT
Document owner

Luiz Olavo Bonino (Unlicensed)

DesignerMark Thompson (Unlicensed)
DevelopersKees Burger
QA

...

The purpose of this document is to record the results of the requirements gathering process carried out for the Open, Reusable Knowledge graph Annotator (ORKA) and NanoPub Store products in the context of the ODEX4all project, as well as describe the software specification for the two systems. This project uses an adapted version of the Agile and Scrum methodologies. This adaptation is required by some of the characteristics of the project such as the distributed nature of the development team, the collaboration with other teams from different products and projects, and the possibilities of dynamic variations of the team members. To adapt the classical Agile and Scrum methodologies to the characteristics and needs of ODEX4all, the two main changes has been defined as: (a) more time is dedicated to the analysis and design phase of the development to allow for more detailed specifications of the systems to be developed, and (b) instead of daily stand-up Scrum meetings and long Sprint and evaluation meetings, the team will rely more heavily on online Scrum supporting tools such as Jira, Jira Agile, Jira Portfolio and Atlassian Confluence, with a significant part of the team interaction taking place on electronic means such as Skype calls, instant messaging, etc. 

Goals

  • Allow users/curators to annotate any existing (RDF) graph.
  • Should provide some UI to allow users to “edit” a triple and help the user to locate suitable RDF identifiers for S, P, O.
  • Final annotation will be stored as an open-access nanopublication which may be ingested by the original source or used in other ways.
  • To be implemented as a generic (web) service that can be called from other tools (e.g. BRAIN).

Usage Scenarios

From the initial discussions regarding ORKA and NanoPub Store, the following usage scenarios have been defined to guide the requirements gathering and consequent design and development of the solution.

Triple Annotation

An analyst, using a graph-based analysis tool wishes to edit a certain assertion. The edition of an assertion may have different motivations, such as: the analyst disagrees with the assertion, the analyst wants to define a different predicate for the assertion, the analyst wants to express his agreement with the assertion, etc.

Since the analyst is not the creator/owner of the dataset where this assertion has been originally defined, this edition should be seen as a different piece of knowledge, maintaining, of course, the relation with the original assertion. 

Triple citation

A user, utilizing a graph-based analysis tool finds an assertion for which he wants to have a publicly resolvable identifier in order for him to be able to cite the assertion. To be able to properly cite the assertions, they should also contain provenance information about the author(s) of the assertion, publication date, origin, etc.

Assertion discovery

Users would like to search for available assertions for citation of the assertions or integration of the knowledge represented in the assertions with different pieces of knowledge or data. Therefore, the search results should be both in a format that human users can verify that the result matches his expectations and in a format that can be easily integrated with other datasets.

Use Cases

From the usage scenarios described in Section 2.1, the use cases depicted in Figure 1 have been derived.

Use casesImage Removed
Figure 1 - Use cases

The actors of the use case are divided in two types of users, namely, the NanoPub Author and the NanoPub User. While the former interacts with the system for the purpose of annotating an assertion, publishing the assertion or managing the assertions he published, the later consumes the published assertions by means of searching, retrieving and citing the assertions.

To guarantee the proper provenance of the published/edited assertion, the NanoPub Authors are required to provide his unique identification that will be part of the new publication. The system also requires the NanoPub Author to provide his unique identification to allow the management of all the assertion published by that author.

In the figure, 3 main use cases are defined for the NanoPub Author. The author can annotate (edit) a selected triple he found in a graph analysis tool, can publish a given triple in terms of a nanopublication without editing it, and can list all his/hers published nanopublications.

For the NanoPub User, 3 other use cases have been defined. The user can search for nanopublications, once a nanopublication has been found and selected, the user can retrieve it (for instance, to include the nanopublication in a graph) or cite it using its resolvable id. 

System Architecture

From the use cases described in the last session, we have decided to split the expected functionality of the system in two separate applications, namely, ORKA and NanoPub Store. ORKA is responsible for providing support to users who want to annotate a triple originated from a graph analysis tool by changing the triple's predicate. The edited triple is then transformed into a nanopublication adding the provenance of the author and keeping the relationship with the original triple. The NanoPub Store is responsible for storing the nanopublications created in the Triple Annotator as well as allowing users to search for stored nanopublications and cite them.

Image Removed
Figure 2 - General architecture of ORKA and NanoPub Store applications

Figure 2 depicts the general architecture of ORKA (yellow) and NanoPub (green) applications. The applications have been divided in the following three main integrated but independent layers:

  • Graphical User Interface (GUI) layer - a browser-based user interface supporting the interactions of the applications users. The functionality of the two services are accessible to the users through these interfaces.
  • Application Programming Interface (API)layer - the APIs provide access to the applications functionalities to other software systems. The user's analysis graph tool submits the selected triple for annotation using the Triple Annotator's API. Similarly other tools or the Triple Annotator stores nanopublications using the NanoPub Store API. For both applications, their GUIs also interact with their respective APIs.
  • Service layer - the service layer contains the business logic of the applications. This layer is only reachable form the API layer, providing isolation to the business logic.

In the case of the NanoPub Store, an another layer is needed to provide the actual storage facility. Since nanopublications are based on RDF, the storage facility is depicted in Figure 2 as the Triple Store component.

This multi-layered approach support the following non-functional requirements:

  • Modularity and separation of concerns - by splitting the functionality in different component layers, the development of each layer can focus on the aspects covered by that layer, such as graphical user interface, application programmable interface and business logic. 
  • Maintainability - as long as the communication interfaces among the layers are intact, modifications on the code of one layer do not affect the other layers, facilitating maintenance of the code.
  • Scalability - having the separate layers facilitates scalability since the components of each layers can be deployed in different servers, profiting from infrastructural scalability mechanisms such as load balancing and elasticity.

User interaction

ORKA

Figure 3 depicts a basic sequence of interactions of ORKA for the use case of triple annotation. This interaction starts on the graph tool where the NanoPub Author selects a given triple to be edited. From the graph tool, the author exports the selected triple represented as a nanopublication to ORKA (sequence 1 in the figure). Since ORKA properly stores provenance information of the nanopublications, it requires the author to provide a unique and resolvable identification. This identification is provided by an Authentication Provider such as ORCID, Google and Facebook. Once a nanopublication is received by ORKA, the system invokes the Authentication Provider requesting the authentication of the author. After the author successfully authenticates his credentials on the Authentication Provider, the provider returns to ORKA an authentication token. By using a third party authentication provider, the Triple Annotator does not need to manage user’s credentials and the users do not need to remember another set of username and password.

Image Removed
Figure 3 - Interaction sequence for triple annotation

Once the NanoPub Author is properly authenticated, the Triple Annotator interface provides the facilities to allow the author to edit the original triple (sequence 2). When the author starts editing the triple, the Triple Annotator creates a new nanopublication with the same object, subject and predicate (sequence 2.1) as the original triple adding the provenance information of the authenticated NanoPub Author and the reference to the original triple. In this way, the Triple Annotator maintains the provenance chain. The new nanopublication is then stored in the system.

Triple citation

 

Assertion discovery

...

Goals

  • Allow users/curators to annotate any existing (RDF) graph.
  • Should provide some UI to allow users to “edit” a triple and help the user to locate suitable RDF identifiers for S, P, O.
  • Final annotation will be stored as an open-access nanopublication which may be ingested by the original source or used in other ways.
  • To be implemented as a generic (web) service that can be called from other tools (e.g. BRAIN).

Usage Scenarios

From the initial discussions regarding ORKA and NanoPub Store, the following usage scenarios have been defined to guide the requirements gathering and consequent design and development of the solution.

Triple Annotation

An analyst, using a graph-based analysis tool wishes to edit a certain assertion. The edition of an assertion may have different motivations, such as: the analyst disagrees with the assertion, the analyst wants to define a different predicate for the assertion, the analyst wants to express his agreement with the assertion, etc.

Since the analyst is not the creator/owner of the dataset where this assertion has been originally defined, this edition should be seen as a different piece of knowledge, maintaining, of course, the relation with the original assertion. 

Triple citation

A user, utilizing a graph-based analysis tool finds an assertion for which he wants to have a publicly resolvable identifier in order for him to be able to cite the assertion. To be able to properly cite the assertions, they should also contain provenance information about the author(s) of the assertion, publication date, origin, etc.

Assertion discovery

Users would like to search for available assertions for citation of the assertions or integration of the knowledge represented in the assertions with different pieces of knowledge or data. Therefore, the search results should be both in a format that human users can verify that the result matches his expectations and in a format that can be easily integrated with other datasets.

Use Cases

From the usage scenarios described in Section 2.1, the use cases depicted in Figure 1 have been derived.

Use casesImage Added
Figure 1 - Use cases

The actors of the use case are divided in two types of users, namely, the NanoPub Author and the NanoPub User. While the former interacts with the system for the purpose of annotating an assertion, publishing the assertion or managing the assertions he published, the later consumes the published assertions by means of searching, retrieving and citing the assertions.

To guarantee the proper provenance of the published/edited assertion, the NanoPub Authors are required to provide his unique identification that will be part of the new publication. The system also requires the NanoPub Author to provide his unique identification to allow the management of all the assertion published by that author.

In the figure, 3 main use cases are defined for the NanoPub Author. The author can annotate (edit) a selected triple he found in a graph analysis tool, can publish a given triple in terms of a nanopublication without editing it, and can list all his/hers published nanopublications.

For the NanoPub User, 3 other use cases have been defined. The user can search for nanopublications, once a nanopublication has been found and selected, the user can retrieve it (for instance, to include the nanopublication in a graph) or cite it using its resolvable id. 

System Architecture

From the use cases described in the last session, we have decided to split the expected functionality of the system in two separate applications, namely, ORKA and NanoPub Store. ORKA is responsible for providing support to users who want to annotate a triple originated from a graph analysis tool by changing the triple's predicate. The edited triple is then transformed into a nanopublication adding the provenance of the author and keeping the relationship with the original triple. The NanoPub Store is responsible for storing the nanopublications created in the Triple Annotator as well as allowing users to search for stored nanopublications and cite them.

Image Added
Figure 2 - General architecture of ORKA and NanoPub Store applications

Figure 2 depicts the general architecture of ORKA (yellow) and NanoPub (green) applications. The applications have been divided in the following three main integrated but independent layers:

  • Graphical User Interface (GUI) layer - a browser-based user interface supporting the interactions of the applications users. The functionality of the two services are accessible to the users through these interfaces.
  • Application Programming Interface (API)layer - the APIs provide access to the applications functionalities to other software systems. The user's analysis graph tool submits the selected triple for annotation using the Triple Annotator's API. Similarly other tools or the Triple Annotator stores nanopublications using the NanoPub Store API. For both applications, their GUIs also interact with their respective APIs.
  • Service layer - the service layer contains the business logic of the applications. This layer is only reachable form the API layer, providing isolation to the business logic.

In the case of the NanoPub Store, an another layer is needed to provide the actual storage facility. Since nanopublications are based on RDF, the storage facility is depicted in Figure 2 as the Triple Store component.

This multi-layered approach support the following non-functional requirements:

  • Modularity and separation of concerns - by splitting the functionality in different component layers, the development of each layer can focus on the aspects covered by that layer, such as graphical user interface, application programmable interface and business logic. 
  • Maintainability - as long as the communication interfaces among the layers are intact, modifications on the code of one layer do not affect the other layers, facilitating maintenance of the code.
  • Scalability - having the separate layers facilitates scalability since the components of each layers can be deployed in different servers, profiting from infrastructural scalability mechanisms such as load balancing and elasticity.

User interaction

ORKA

Figure 3 depicts a basic sequence of interactions of ORKA for the use case of triple annotation. This interaction starts on the graph tool where the NanoPub Author selects a given triple to be edited. From the graph tool, the author exports the selected triple represented as a nanopublication to ORKA (sequence 1 in the figure). Since ORKA properly stores provenance information of the nanopublications, it requires the author to provide a unique and resolvable identification. This identification is provided by an Authentication Provider such as ORCID, Google and Facebook. Once a nanopublication is received by ORKA, the system invokes the Authentication Provider requesting the authentication of the author. After the author successfully authenticates his credentials on the Authentication Provider, the provider returns to ORKA an authentication token. By using a third party authentication provider, the Triple Annotator does not need to manage user’s credentials and the users do not need to remember another set of username and password.

Image Added
Figure 3 - Interaction sequence for triple annotation

Once the NanoPub Author is properly authenticated, the Triple Annotator interface provides the facilities to allow the author to edit the original triple (sequence 2). When the author starts editing the triple, the Triple Annotator creates a new nanopublication with the same object, subject and predicate (sequence 2.1) as the original triple adding the provenance information of the authenticated NanoPub Author and the reference to the original triple. In this way, the Triple Annotator maintains the provenance chain. The new nanopublication is then stored in the system.

Triple citation

 

Assertion discovery

 

Attaching evidence to an annotation

Knowledge graph annotations are more valuable in terms of reusability if the annotator can provide supporting evidence. Here we focus on evidence taken from scientific literature, although other sources of supporting evidence may exist: databases, technical reports, electronic lab notebooks, etc. Yet another category consists of (relatively) weak types of evidence: like-buttons, vote up/down, etc.

Here we propose several scenarios of literature-based evidence based on the PubAnnotation community framework.

Scenario 1: cite a document fragment

  • user provides PMID
  • ORKA retrieves abstract from PubAnnotation
  • user selects a fragment (words or sentences) that support his annotation
  • ORKA records the PubAnnotation fragment URI as supporting evidence in the nanopublication

Scenario 2: cite an existing annotation

 

  • user provides PMID
  • ORKA shows embedded TextAE (abstract with highlighted annotations)
  • user selects a PubAnnotation-annotation
  • ORKA records the PubAnnotation annotation URI as supporting evidence in the nanopublication

 

Scenario 3: create and cite annotation (extension of Scenario 2)

 

  • if none of the existing annotations are suitable evidence:
    • user creates a new PubAnnotation-annotation in TextAE (consider adding step-by-step guidance)
  • proceed as in S2

Notes

  1. Note the difference between “ORKA annotations” (curator says relation X should actually be Y with evidence Z) and “PubAnnotation annotations” (essentially just markup of entities and relations in the text). In this proposal evidence for the former type of annotation is provided by a reference to the latter.
  2. PubAnnotation supports markup of relations between entities: Cancer “caused by” Smoking. In PubAnnotation, projects containing entity-type annotations are in the majority. Perhaps we could consider a process where ORKA users follow a step-by-step wizard to identify their evidence: 1) select subject term, 2) select object term and 3) select predicate term.

  3. The previous note’s suggestion as well as S3 create new content that may be pushed into PubAnnotation.

  4. PubAnnotation is still in active development: PubMed/PMC corpora are not complete, many annotation contributions are not yet public. However, PubAnnotation will likely scale up in the near future.

  5. In S2 we want a URI to refer to an individual annotation: currently not supported by PubAnnotation. In discussion with PubAnnotation developers to add this feature.
  6. PubAnnotation aims to store annotations, but its community also develops standard APIs for uniform access to on-the-fly annotation webservices. These may prove useful if ORKA users can not find the entities that make up their evidence.

  7. We have evaluated other annotation services like Domeo (Paolo Ciccarese) and Utopia (Steve Pettifer). However, PubAnnotation seems to offer the most straight-forward possibilities for integration with ORKA. In any case, re-use and interoperability of created annotations between the three platforms should be relatively easy (use of APIs and OA ontology).

Assumptions

Requirements

#TitleUser StoryImportanceNotes
1
     

...