Search indexing Alfresco

Posted by Toon Geens on 6/10/20 11:42 AM


Enterprise Search is the totality of all software for searching for information within an enterprise. It allows you to forget that there is a multitude of content management systems, orchestrated together to handle the inbound information. As long as you are entitled to access a piece of information, it should not matter which business line or which content management system it comes from.

In the context of Content Management Systems, storing information is usually the easy part. More often than not, search capabilities take second place during solution analysis.
Many businesses grow organically: each business line starts out as an individual entity that may later be merged with others.

Along the way, each business line is likely to develop its own domain, as well as its own content management strategies and solutions. But when it comes to using that information, the invisible walls of these content management silos suddenly start to materialize. At that point, our customers start looking for solutions that we refer to as Enterprise Search.

 

Indexing Alfresco

At first glance, many search techniques seem viable, but there are a few issues to overcome. To illustrate common problems that we have encountered, both in the lab and out in the wild, we will use one of the most recent use cases we had to solve and will refer to it as the “search service”.

The big picture: this search service uses the CMIS API and applies query post-filtering by delegating the authorization evaluation to the original information system. Here’s how it works:

  1. An authenticated user bob sends a search query to the search service
  2. The search service queries its index, gets back a list of unfiltered results (unfiltered means that user-permissions have not been evaluated on the results yet)
  3. For each Alfresco document in the result set, the search service verifies with Alfresco that bob has (at least) read-level access to that document
  4. Filter out the results where the user does not have read-access
  5. Present the results to the end-user
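
The five steps above can be sketched in a few lines of Python. This is only an illustration of query post-filtering, not the search service's actual code: `post_filter`, `can_read` and the `READ_ACL` table are hypothetical names, and the in-memory table stands in for the permission check that Alfresco would perform.

```python
from typing import Callable, Iterable, List


def post_filter(
    user: str,
    hits: Iterable[str],
    can_read: Callable[[str, str], bool],
) -> List[str]:
    """Keep only the hits that the user may read (steps 3 and 4)."""
    return [doc_id for doc_id in hits if can_read(user, doc_id)]


# Hypothetical stand-in for the permission check in the source system.
READ_ACL = {"doc-1": {"bob", "alice"}, "doc-2": {"alice"}, "doc-3": {"bob"}}


def can_read(user: str, doc_id: str) -> bool:
    return user in READ_ACL.get(doc_id, set())


unfiltered = ["doc-1", "doc-2", "doc-3"]  # step 2: raw hits from the index
visible = post_filter("bob", unfiltered, can_read)
print(visible)  # ['doc-1', 'doc-3']
```

Note that `can_read` is called once per hit, which is where the query-time cost described later comes from.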

The focus of this article is the index-time challenges. The query-time challenges will be addressed in a follow-up article, but to paint the full picture, we will briefly describe those as well.

 


Figure 1 - System context diagram for the integration of Alfresco Content Services with an Enterprise Search Application

 

The challenges

1. Query-time challenges

The search service applies a post-filtering technique: search results are filtered after the query has executed, so that users see only the results they can actually access. This verification requires a round-trip API call per document, so even with some parallelism applied, the overhead of the permission verification calls can have a big impact on performance (compare the N+1 select problem).
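
To make the per-document overhead concrete, here is a minimal sketch that runs the permission checks on a thread pool. The names `filter_in_parallel` and `slow_can_read` are hypothetical, and the 10 ms sleep is an assumed stand-in for one network round trip. Parallelism shortens the wall-clock time, but the number of calls still grows with the number of hits.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def filter_in_parallel(
    user: str,
    hits: List[str],
    can_read: Callable[[str, str], bool],
    workers: int = 8,
) -> List[str]:
    """Run one permission check per hit on a thread pool, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        allowed = list(pool.map(lambda doc_id: can_read(user, doc_id), hits))
    return [doc_id for doc_id, ok in zip(hits, allowed) if ok]


calls = []


def slow_can_read(user: str, doc_id: str) -> bool:
    """Hypothetical remote check with 10 ms of simulated latency per call."""
    calls.append(doc_id)
    time.sleep(0.01)
    return not doc_id.endswith("0")


hits = [f"doc-{i}" for i in range(40)]
visible = filter_in_parallel("bob", hits, slow_can_read)
print(len(calls), len(visible))  # 40 36
```

Forty hits cost forty permission calls, whatever the pool size.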

2. Index-time challenges

The search service uses the CMIS API and does a full crawl of the folder structure to find new content to index. While this might look viable in a small-scale setup, it can lead to issues down the line:

  • New content is not available until the next full crawl, so in the best case you find new content the next day.
  • CMIS is a rather chatty protocol and the crawl has to walk the complete repository folder structure, so it takes many hours even for a medium-sized repository, regardless of whether there is a lot of new content or none at all.
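
The second point can be illustrated with a toy model. The sketch below is not CMIS code: the `tree` dictionary is a hypothetical in-memory repository, and each dictionary lookup stands in for one "list children" request. It shows that a full crawl makes one request per folder even when nothing has changed.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical repository: folder path -> (sub-folders, documents).
Tree = Dict[str, Tuple[List[str], List[str]]]


def full_crawl(tree: Tree, root: str, indexed: Set[str]) -> Tuple[int, List[str]]:
    """Walk every folder; return (listing calls made, newly found documents)."""
    calls, new_docs, stack = 0, [], [root]
    while stack:
        folder = stack.pop()
        subfolders, docs = tree[folder]  # stands in for one "list children" call
        calls += 1
        stack.extend(subfolders)
        new_docs.extend(d for d in docs if d not in indexed)
    return calls, new_docs


tree: Tree = {
    "/": (["/hr", "/sales"], []),
    "/hr": ([], ["contract.pdf"]),
    "/sales": (["/sales/2020"], ["pitch.pptx"]),
    "/sales/2020": ([], ["offer.docx"]),
}
already_indexed = {"contract.pdf", "pitch.pptx", "offer.docx"}
print(full_crawl(tree, "/", already_indexed))  # (4, [])
```

Four folders cost four calls and yield no new documents. The cost scales with the size of the repository, not with the amount of change.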

 

The solution - Alfresco Solr API

1. The concept

Using CMIS to crawl the full Alfresco repository is a sub-optimal solution. Alfresco provides an API for its own out-of-the-box search-engine, which can be used to efficiently track changes in Alfresco. Our solution is to make use of the existing transaction-tracker-API to index Alfresco, instead of crawling the whole Alfresco repository (using CMIS).


The main advantages are:

  • It’s relatively cheap to query changes in Alfresco - you could poll for changes every 15 seconds - which means faster up-to-date results (~1 minute vs 24 hours) and far less load on the system (100 API calls vs 100,000 API calls)
  • It is transactionally consistent, whereas crawling might miss documents: what happens if I move a document from folder A to folder B while the system is crawling the folder structure?
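
The folder-move question can be answered with a small simulation. This is a toy model with hypothetical names (`crawl`, `fresh`, `report.pdf`), not a reproduction of any real crawler: the crawl visits folder B first, and the document is moved from A to B before the crawl reaches A.

```python
from typing import Dict, List, Optional, Set


def crawl(
    folders: Dict[str, Set[str]],
    order: List[str],
    move_after: Optional[str] = None,
) -> Set[str]:
    """Visit folders in the given order. If move_after is set, 'report.pdf'
    moves from A to B right after that folder has been visited."""
    seen: Set[str] = set()
    for name in order:
        seen |= folders[name]
        if name == move_after:
            folders["A"].discard("report.pdf")
            folders["B"].add("report.pdf")
    return seen


def fresh() -> Dict[str, Set[str]]:
    return {"A": {"report.pdf"}, "B": {"memo.txt"}}


print(sorted(crawl(fresh(), ["B", "A"])))                  # ['memo.txt', 'report.pdf']
print(sorted(crawl(fresh(), ["B", "A"], move_after="B")))  # ['memo.txt']
```

In the second run the document exists the whole time, yet the crawl never sees it. A transaction log does not have this problem, because the move itself is recorded as a change.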

This API uses mutual TLS authentication - which means that the connection is authenticated on both sides, with a TLS server certificate and a TLS client certificate.

This API is JSON-based and provides endpoints for basically everything a search index would need:

  • Transactions - all writes/deletes happen in an append-only transaction log
  • Nodes - which nodes/objects are created/updated/deleted in a given transaction
  • Metadata - full metadata for nodes and links (associations) between nodes
  • Text extraction of the documents (plain text from pdf, docx, ...)
  • ACL changes - not relevant for our use case
  • Dictionary model changes - probably not relevant for our use case

 

Figure 2 - Container Diagram for the Alfresco repository indexation process

 

2. Technical Details

This API has been available since 2012 and is exactly what is needed for this use case. Unfortunately, it is a closed API and Alfresco does not publish documentation for it.

However, Alfresco is open-source, so we can look at how it all works. These are the 3 main endpoints that can be used to index and track the repository:

  • GET /api/solr/transactions?minTxnId={minTxnId}&maxTxnId={maxTxnId}
    • get a list of transaction information - basically used to get a list of txnIds
  • POST /api/solr/nodes
    • with payload { "txnIds": [ 123, 124, 125 ] }
    • or payload { "fromTxnId": 123, "toTxnId": 125 }
    • will return an array of node-status information: nodeId, status (updated, deleted), nodeRef (logical id) - you'll need the nodeIds in the next call
  • POST /api/solr/metadata
    • with payload { "nodeIds": [ 201, 202 ] }
    • returns detailed node (~ document/folders/objects) information (metadata, relations, ...) and can be used to build your search index

All changes in Alfresco are captured in new transactions, so if you keep track of the last indexed transaction-id, you can check for changes with the /api/solr/transactions endpoint and incrementally update your index. Alfresco does exactly that every 15 seconds by default.
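
One polling round of such an incremental update could look roughly like the sketch below. It is an illustration under assumptions, not Alfresco's tracker or our client library: `sync_once` and the three `fetch_*` callables are hypothetical wrappers around the three endpoints, and the dictionary keys (`id`, `txnId`, `status`) and the status strings are simplified stand-ins for the real response fields.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def sync_once(
    last_txn_id: int,
    fetch_transactions: Callable[[int], List[Record]],
    fetch_nodes: Callable[[List[int]], List[Record]],
    fetch_metadata: Callable[[List[int]], List[Record]],
    index: Dict[int, Record],
) -> int:
    """Apply all transactions newer than last_txn_id to the index and
    return the id of the last transaction that was processed."""
    txns = fetch_transactions(last_txn_id + 1)
    if not txns:
        return last_txn_id  # nothing changed since the previous poll
    nodes = fetch_nodes([t["id"] for t in txns])
    for node in nodes:
        if node["status"] == "deleted":
            index.pop(node["id"], None)
    updated = [n["id"] for n in nodes if n["status"] == "updated"]
    if updated:
        for meta in fetch_metadata(updated):
            index[meta["id"]] = meta
    return max(t["id"] for t in txns)


# In-memory fakes (hypothetical data) in place of the HTTP calls.
TXNS = [{"id": 123}, {"id": 124}, {"id": 125}]
NODES = [
    {"id": 201, "txnId": 123, "status": "updated"},
    {"id": 202, "txnId": 124, "status": "updated"},
    {"id": 150, "txnId": 125, "status": "deleted"},
]
index = {150: {"id": 150, "name": "old.doc"}}
last = sync_once(
    122,
    lambda min_id: [t for t in TXNS if t["id"] >= min_id],
    lambda txn_ids: [n for n in NODES if n["txnId"] in txn_ids],
    lambda node_ids: [{"id": i, "name": f"node-{i}"} for i in node_ids],
    index,
)
print(last, sorted(index))  # 125 [201, 202]
```

Calling `sync_once` on a timer and persisting the returned transaction id between runs gives the incremental behaviour described above. A round with no new transactions costs a single request.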

This API is by default only accessible on port 8443/tcp and protected with mTLS authentication (both server and client TLS certificates). Alfresco provides default certificates that can be imported into a browser (alfresco-client.p12 with password alfresco) or can be used with curl (alfresco-client.pem). If the certificates were regenerated for your Alfresco installation - as they should be, although most people forget this step - you might need to extract the client certificate from the Alfresco Search Service (Solr).
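
As a rough sketch of what an mTLS call could look like with only the Python standard library: the helper names below are hypothetical, and the host name and the `/alfresco/service` prefix in the usage comment are assumptions about a typical installation. The sketch also assumes that the PEM file contains both the client certificate and its private key.

```python
import json
import ssl
import urllib.request
from typing import Any, Optional
from urllib.parse import urlencode


def transactions_url(base_url: str, min_txn_id: int, max_txn_id: int) -> str:
    """Build the URL for the transactions endpoint under the given base URL."""
    query = urlencode({"minTxnId": min_txn_id, "maxTxnId": max_txn_id})
    return f"{base_url.rstrip('/')}/api/solr/transactions?{query}"


def mtls_context(client_pem: str, ca_file: Optional[str] = None) -> ssl.SSLContext:
    """TLS context that verifies the server and presents a client certificate."""
    context = ssl.create_default_context(cafile=ca_file)
    context.load_cert_chain(certfile=client_pem)
    return context


def get_transactions(
    base_url: str, min_txn_id: int, max_txn_id: int, context: ssl.SSLContext
) -> Any:
    """Fetch and decode the JSON response of the transactions endpoint."""
    url = transactions_url(base_url, min_txn_id, max_txn_id)
    with urllib.request.urlopen(url, context=context) as response:
        return json.load(response)


# Usage (assumed host, prefix and file locations):
# ctx = mtls_context("alfresco-client.pem", ca_file="alfresco-ca.pem")
# print(get_transactions("https://alfresco.example:8443/alfresco/service", 123, 125, ctx))
```

With self-signed default certificates, server verification will probably only succeed if the matching CA certificate is passed as `ca_file` and the host name matches the server certificate.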

Related to this work, we created a GitHub project with a Java client library for this API (and a few others). It's still a work in progress, but it fully implements and supports the 3 API calls mentioned above.


The future of the cloud is in connected services. Search, as the core component of any information system, can be made both scalable and more accessible for external services. For any external service, we can now match the scalability and accessibility of Alfresco by using the same tools they use - the Alfresco Solr API.

 

 

Topics: Alfresco, Searching for information, API, Indexing, Solr


 
