Multi-tenant academic repository software

By Ben Summers

1st June 2018

Multi-tenant repositories seem to be the topic of the day; we’re hearing them mentioned everywhere. Fortunately we’ve been running multi-tenant services for years, so we have lots of experience to share. If you’d like a demo, just let us know.

Let’s go through the three models of multi-tenancy, what we can show you today, and an overview of the underlying technology that elegantly provides multi-tenancy within Haplo.

THE 3 MODELS OF MULTI-TENANCY

1. Single institution, multiple collections

With legacy repository software, institutions often have to use multiple repositories to handle different kinds of outputs. The most common is a pair of repositories to handle traditional outputs and research data. This is because the metadata, workflows, and rules are quite different, and legacy repositories don’t offer sufficient flexibility. This is not a great experience for researchers who need to deposit items, and misses an opportunity to showcase all a researcher’s outputs in a single place.

Haplo Repository has a single repository containing all types of outputs, with a single interface for researchers to deposit items and a guided UI to help them select the right type. Each kind of output has its own metadata, workflow, and rules, and is published in a single public portal with all outputs, along with a full web profile of that researcher’s academic background.

2. Multiple institutions, multiple public repository interfaces

While the first level of multi-tenancy covers the needs of the vast majority of institutions, sometimes a group of institutions wants to use a shared repository to pool resources and expertise. This is hard to do in legacy repository software, because the code is full of assumptions and has a rigid model of permissions and user roles.

Haplo Repository can store items from multiple institutions. The flexible permissions system gives each user a view that only contains items from their institution, and the ingest workflow routes new items to the institution’s metadata team. Users can belong to multiple institutions, or have an oversight role of the entire shared repository so they can assist users from multiple institutions. Finally, reporting delivers insights across the shared repository, or for just the user’s institution.

Each institution has their own public portal, hosted on a URL on their institutional domain (web address) with their own branding and customisation, and can opt in to also publish in a shared repository public interface. This is especially useful for consortia of universities, or collaborating research institutes, who need individual identities but also want to present their work as a whole.

3. Independent repositories, single server

Legacy repository software is expensive to host because it’s hard to share resources. You have to run a separate database and a separate application server for each tenant on its own physical server or virtual machine. This is wasteful of resources and cumbersome to administer, so there are no economies of scale.

To be practical, hosted software has to be truly multi-tenant, where a single database and application server can host multiple clients with complete isolation between them. This is the model that all modern service providers use, especially large providers like Google, Amazon and Microsoft.

Haplo Repository is a true multi-tenant server. A single instance can run hundreds of repository applications, and each of those is totally isolated with an independent customised configuration. This allows us to host advanced repository systems cost-effectively and streamline our infrastructure administration.

Sometimes you’ll need a central repository which contains the metadata of all the items, for managing and reporting across a set of repositories. As well as standard protocols, Haplo supports efficient inter-repository messaging to sync metadata records between otherwise isolated repositories.

MULTI-TENANCY AND HAPLO REPOSITORY

The second model is the most interesting from a technology perspective, so we’ve put together a demonstration multi-tenant repository. This:

  • Manages items from multiple institutions in a single repository.
  • Each institution has their own public repository interface, on a separate URL, which has automatic branding with institution name and logo.
  • To demonstrate how much flexibility there is, an alternative public repository interface has completely different branding and user interface.
  • As well as the per-institution repositories, there’s a shared repository, and a configuration interface to control which institutions appear in it. Changing the configuration instantly changes the institutions included.
  • Users from institutions can log in and self-deposit new items.
  • A workflow routes the new item to the metadata team for that institution, who work with the researcher to prepare the item for publication.
  • When it’s ready, the metadata team can publish it to the repository.
  • Oversight users can see and approve any item from any institution.
  • Oversight users can view reporting for the entire shared repository, and filter down to individual institutions. Authorised users from institutions just see reporting for their institution.
  • An administrator can add another institution through a simple user interface, uploading a logo image and setting the address of their public repository. This configures everything automatically, and the new repository is available in under a second.


We’re particularly proud that, although this is a radically different model to our normal single-institution repository system, the multi-tenancy is implemented by a thin configuration and user interface layer on top of the standard repository system.

If you’d like to see this demonstration repository, we’d be delighted to show it to you.

OTHER BENEFITS OF HAPLO

You don’t just benefit from multi-tenancy with Haplo Repository; you gain all the other advantages of using a modern repository system:

  • A focus on the user experience, resulting in high deposit rates and researcher satisfaction.
  • Integrated web profiles, to showcase author and institutional experience and research.
  • An advanced data model enabling true “data once” (for example, if the ORCID integration adds an ORCID to the author’s record, it’s immediately included in the metadata of all their research outputs).
  • Sophisticated search, including full text indexing and graph queries.
  • The option of including fully integrated CRIS functionality, from REF management to full lifecycle Current Research Information System functionality, like doctoral programme management, ethical approval, research funding, and integration with wider institutional infrastructure.
  • An unparalleled focus on security, developed and hosted under an ISO27001 certified security management system, giving you the confidence to manage confidential research data.

HOW IT’S IMPLEMENTED

One of the many lovely things about working with academic institutions is that their repository administrators and metadata teams love to talk about the technicalities of repository software, and we love to talk about the system we’ve built. So here’s an overview of the technical details!

Each of the different multi-tenancy models is easy to implement with the Haplo Repository, and they can be combined to deliver your perfect repository.

1. Single institution, multiple collections

Storing multiple kinds of item in a single repository follows naturally from the Haplo data model and from the way the system is implemented as many co-operating plugins.

Each item in the repository has a well-defined type, which defines its overall behaviour and metadata fields. The ‘type’ can be thought of as a metadata template with associated rules and behaviours. For example, research data has very different publication and access request workflows, because it is often sensitive data which needs to be prepared before it can be shared.

On top of the core item and metadata handling, the repository functionality is implemented by a large number of plugins; a typical Haplo application may be implemented as over 100 plugins. These plugins are carefully written to provide either functionality or policy. So, you might have an ingest workflow with an implementation of the functionality, but you’d add an additional policy plugin which controlled the fine details of how it behaved. This allows you to implement any rules you need, so the behaviour of the repository can change depending on what type of item it is, the metadata of the item, or perhaps the research discipline of the authors.
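As a rough illustration of the functionality/policy split, here is a minimal Python sketch. It is not the Haplo plugin API: ItemType, IngestWorkflow, research_data_policy and the step names are all hypothetical, chosen only to show a workflow whose behaviour is adjusted by a separate policy based on the item’s type.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# All names here are hypothetical; this is not the Haplo plugin API.

@dataclass(frozen=True)
class ItemType:
    """A type acts as a metadata template for items of that kind."""
    name: str
    metadata_fields: Tuple[str, ...]

@dataclass
class Item:
    item_type: ItemType
    metadata: Dict[str, str] = field(default_factory=dict)

# A policy inspects an item and either returns workflow steps or None (no opinion).
Policy = Callable[[Item], Optional[List[str]]]

class IngestWorkflow:
    """Functionality: knows how to produce steps, but defers details to policies."""
    DEFAULT_STEPS = ["deposit", "metadata_review", "publish"]

    def __init__(self) -> None:
        self.policies: List[Policy] = []

    def add_policy(self, policy: Policy) -> None:
        self.policies.append(policy)

    def steps_for(self, item: Item) -> List[str]:
        # The first policy with an opinion wins; otherwise use the defaults.
        for policy in self.policies:
            steps = policy(item)
            if steps is not None:
                return steps
        return list(self.DEFAULT_STEPS)

def research_data_policy(item: Item) -> Optional[List[str]]:
    """Policy: research data gets an extra preparation step before review."""
    if item.item_type.name == "research_data":
        return ["deposit", "sensitivity_check", "metadata_review", "publish"]
    return None

workflow = IngestWorkflow()
workflow.add_policy(research_data_policy)
```

A policy could just as easily look at item.metadata, for example the research discipline of the authors, without any change to the workflow class itself.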

2. Multiple institutions, multiple public repository interfaces

The shared multi-tenant repository is the most interesting multi-tenancy model, which really shows off the capability of the Haplo platform.

The underlying platform separates the institutions, while allowing them to share common information, such as records about publishers, organisations, institutions, and so on, increasing efficiency and reducing errors through a single source of information. A fine-grained permissions system enforces isolation, and the user interface is designed around providing different subsets of information for different users.

The platform provides several simple building blocks that combine elegantly to implement this multi-tenant shared repository:

  • Each record in the repository is labelled with one or more labels.
  • Each user has a label statement which describes their permissions as a set of Boolean expressions based on the labels applied to a record.
  • A labelling plugin is configured to label items. In this case, the labels would include a label representing the institution.
  • A permissions plugin is configured to generate label statements for each user, restricting users from institutions to viewing repository items labelled with their institution.
  • Permissions are enforced at the lowest level of the platform. A search will only return records the user is permitted to see, naturally restricting an institution’s view of the repository to their own records.
  • A user is prevented from editing or creating content for another institution by the permissions system. Depending on configuration, they may be able to view content of another institution.
  • The labelling system is extended to the reporting system. Users can only run reports on data their label statements permit them to see, naturally filtering reports to their institution’s items.
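The building blocks above can be sketched in a few lines of Python. This is an illustration under assumptions, not Haplo’s implementation: the Record class, the "institution:..." label naming, and the helpers has, any_of, allow_all and search are all invented for the example. A label statement is modelled as a Boolean predicate over a record’s labels, and the search function applies it before anything is returned.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# Hypothetical model; label names such as "institution:north" are assumptions.

@dataclass(frozen=True)
class Record:
    title: str
    labels: FrozenSet[str]

# A label statement: a Boolean expression evaluated against a record's labels.
LabelStatement = Callable[[FrozenSet[str]], bool]

def has(label: str) -> LabelStatement:
    return lambda labels: label in labels

def any_of(*statements: LabelStatement) -> LabelStatement:
    return lambda labels: any(s(labels) for s in statements)

def allow_all(labels: FrozenSet[str]) -> bool:
    """Statement for oversight users who may read every record."""
    return True

def search(records: List[Record], statement: LabelStatement, text: str = "") -> List[Record]:
    # Permission is checked inside the search itself, so callers never
    # receive a record the statement does not permit.
    return [
        r for r in records
        if statement(r.labels) and text.lower() in r.title.lower()
    ]

records = [
    Record("Thesis on tidal energy", frozenset({"institution:north"})),
    Record("Survey dataset", frozenset({"institution:south"})),
]
north_user = has("institution:north")
visible_to_north = search(records, north_user)
```

Because reporting would call the same search function with the same statement, a report run by the north_user here would cover only the first record.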


All this labelling and permission handling is completely automatic, and even administrator users do not need to think about labelling:

  • Labels are automatically applied, using a combination of record metadata and the user’s role. This labelling is configurable through labelling plugins, so the exact requirements of the group of institutions can be implemented.
  • The user interface adapts around the information each user can see, so a user from an institution will effectively see a repository with just their data.
  • Users who administer the entire shared repository have permissions that allow them to access all data.
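Automatic labelling of this kind might look something like the following Python sketch. The function name labels_for, the metadata keys "type" and "sensitive", and the label names are assumptions made for illustration; a real labelling plugin would encode whatever rules the group of institutions agrees on.

```python
from typing import Dict, FrozenSet

def labels_for(metadata: Dict[str, object], user: Dict[str, str]) -> FrozenSet[str]:
    """Derive labels from record metadata and the depositing user (hypothetical rules)."""
    # The institution label comes from the depositing user's affiliation.
    labels = {"institution:" + user["institution"]}
    # The record's type comes from its metadata.
    labels.add("type:" + str(metadata.get("type", "unknown")))
    # Sensitive items get an extra label that permissions can key on.
    if metadata.get("sensitive"):
        labels.add("restricted")
    return frozenset(labels)
```

Nobody chooses labels by hand in this sketch; they are a pure function of data the repository already holds.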


Public repository interfaces can be customised for each institution, along with a public shared repository where institutions can opt in:

  • Haplo’s Web Publisher functionality can publish multiple completely independent repository web sites (publications) from a single shared repository.
  • Each Publication provides its own branding and customised interface to best showcase the research.
  • Each Publication runs under a Service User that restricts the information it can access. Just as with normal users, the Publication’s service user will be restricted by the permissions system to only access records labelled with the relevant institution, naturally providing a public repository for that institution alone.
  • A global shared repository can be provided similarly, with a service user that has permissions to read records for those institutions that opt in. For many records, this is not the canonical repository, so metadata is written to inform crawlers and other systems that the record is a reference to a record in the institution’s own repository.
  • Because each Publication is independent, repository interfaces can be published on URLs owned by the institution.
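To make the Publication idea concrete, here is a small Python sketch. It does not use Haplo’s Web Publisher API: the Publication class, the example hostnames and the /item/ URL layout are hypothetical. The rel="canonical" link element is one common way to tell crawlers where the authoritative copy lives; the text above does not say which mechanism Haplo uses.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Hypothetical model of publications; not the Web Publisher API.

@dataclass(frozen=True)
class Record:
    slug: str
    institution: str

@dataclass(frozen=True)
class Publication:
    hostname: str
    # Institutions the publication's service user is permitted to read.
    readable_institutions: FrozenSet[str]

    def visible(self, records: List[Record]) -> List[Record]:
        return [r for r in records if r.institution in self.readable_institutions]

def canonical_link(record: Record, home: Dict[str, Publication]) -> str:
    """Point crawlers at the record's page in its own institution's repository."""
    hostname = home[record.institution].hostname
    return '<link rel="canonical" href="https://{}/item/{}">'.format(hostname, record.slug)

north = Publication("repository.north.example", frozenset({"north"}))
south = Publication("repository.south.example", frozenset({"south"}))
# The shared publication reads only institutions that have opted in.
shared = Publication("shared.example", frozenset({"north"}))
```

In this sketch, opting an institution in or out of the shared repository is just a change to the shared publication’s readable_institutions set.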

Metadata schemas can be shared or configured for an institution:

  • A shared repository uses a single shared schema for all tenants, which encourages an underlying common vocabulary.
  • Individual tenants can layer their own schema on top of this shared schema, or define their own types.
  • Each tenant will choose the types of information they want to store in their repository, so their users do not see irrelevant options.
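Schema layering can be pictured with a short Python sketch. The type names, field names and the tenant_schema function are invented for this example and do not reflect Haplo’s schema format; the point is only that a tenant’s view is the shared schema, plus an overlay, filtered to the types the tenant has chosen.

```python
from typing import Dict, Iterable, List

# Hypothetical shared schema: type name -> ordered metadata fields.
SHARED_SCHEMA: Dict[str, List[str]] = {
    "article": ["title", "authors", "journal"],
    "dataset": ["title", "authors", "licence"],
}

def tenant_schema(
    shared: Dict[str, List[str]],
    overlay: Dict[str, List[str]],
    enabled_types: Iterable[str],
) -> Dict[str, List[str]]:
    """Build one tenant's schema without modifying the shared one."""
    merged = {name: list(fields) for name, fields in shared.items()}
    for name, fields in overlay.items():
        # Extend a shared type, or define a tenant-specific type.
        target = merged.setdefault(name, [])
        target.extend(f for f in fields if f not in target)
    # Only the types the tenant has chosen are offered to its users.
    return {name: merged[name] for name in enabled_types if name in merged}

north_schema = tenant_schema(
    SHARED_SCHEMA,
    overlay={"article": ["ref_panel"], "exhibition": ["title", "venue"]},
    enabled_types=["article", "exhibition"],
)
```

Here the tenant adds a field to the shared article type, defines its own exhibition type, and hides dataset from its users, while the common vocabulary stays untouched for everyone else.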


3. Independent repositories, single server

The Haplo application server is a single Java process that can serve many isolated applications. Every day, we work on development servers that are hosting hundreds of independent applications on a single Haplo application server.

The database uses PostgreSQL ‘schemas’, a feature that allows each tenant to have strong isolation with their own set of database tables. When serving a request, the platform uses the hostname to choose the right tenant, selects the right database schema, then serves the request.
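The request routing described above could be sketched like this in Python. The hostnames, the tenant_NNNN schema naming and both function names are assumptions; the text does not say how Haplo selects the schema, and setting PostgreSQL’s search_path is simply one standard way to do it. The sketch only builds the SQL statement, so it runs without a database.

```python
import re
from typing import Dict

# Hypothetical mapping from public hostname to that tenant's PostgreSQL schema.
TENANT_SCHEMAS: Dict[str, str] = {
    "repository.example.ac.uk": "tenant_1001",
    "research.example.org": "tenant_1002",
}

_SAFE_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")

def schema_for_host(host_header: str) -> str:
    """Choose the tenant from the request's Host header (IPv6 literals not handled)."""
    hostname = host_header.partition(":")[0].strip().lower()
    if hostname not in TENANT_SCHEMAS:
        raise LookupError("no tenant for hostname: " + hostname)
    return TENANT_SCHEMAS[hostname]

def search_path_sql(schema: str) -> str:
    """Build the statement that makes unqualified table names resolve to one tenant."""
    # Schema names cannot be passed as bound parameters, so validate strictly.
    if not _SAFE_IDENTIFIER.fullmatch(schema):
        raise ValueError("unsafe schema name: " + repr(schema))
    return 'SET search_path TO "{}"'.format(schema)

statement = search_path_sql(schema_for_host("repository.example.ac.uk:443"))
```

Once the search path is set for a connection, the same queries run unchanged for every tenant, each seeing only its own set of tables.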

Within the Haplo platform’s server code, there’s no global state, so each tenant has their own completely separate configuration, caches, and set of plugins, and is isolated and resource controlled to ensure each tenant can’t affect any other. This is not an easy thing to implement, and requires great discipline and careful selection and use of software libraries, so it has to be designed in from the very start.

While these repositories are independent, sometimes they’ll need to communicate. There are two options: standard web protocols over HTTP, or messages sent between repositories using an internal message bus or an external messaging system such as Amazon Kinesis.
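As a rough picture of the messaging option, the Python sketch below uses an in-process publish/subscribe bus to copy metadata into a central index. MessageBus, CentralIndex, the topic name and the message fields are all hypothetical; Haplo’s internal message bus and external systems like Amazon Kinesis will differ in API and delivery guarantees.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[Dict[str, object]], None]

class MessageBus:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, message: Dict[str, object]) -> None:
        for handler in self._subscribers.get(topic, []):
            # Each handler gets its own shallow copy of the message.
            handler(dict(message))

class CentralIndex:
    """Central repository holding metadata from otherwise isolated repositories."""

    def __init__(self) -> None:
        self.records: Dict[Tuple[str, str], object] = {}

    def on_metadata(self, message: Dict[str, object]) -> None:
        key = (str(message["repository"]), str(message["id"]))
        self.records[key] = message["metadata"]  # a later message replaces an earlier one

bus = MessageBus()
central = CentralIndex()
bus.subscribe("metadata.updated", central.on_metadata)
bus.publish(
    "metadata.updated",
    {"repository": "north", "id": "thesis-1", "metadata": {"title": "Tidal energy"}},
)
```

The tenant repositories never read each other’s data in this sketch; they only emit messages, and the central index decides what to keep.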

Haplo has been used to deliver multi-tenant information systems, in production, for over 10 years. It’s a mature, secure, and well tested platform.

WANT TO TALK REPOSITORIES?

We’re always keen to talk about repositories, from the underlying technology, to how an excellent user experience can increase deposit rates and accuracy, to the advantages of a repository as a central part of the institution’s CRIS.


Please get in touch for a demo. (Even if you don’t need advanced multi-tenant features.)