Skip to main content

A document is basically a set of text fields, like a title, a summary, a full text content, a date, an origin, a URL, etc.

At start, as we wanted to have a search engine that implements elaborated text search on these regulation documents, we needed a solution that is powerful, flexible, offers many advanced search features, and is tuned for text searches.

So we quickly opted for ElasticSearch, a widely recognized document database solution, designed to efficiently store, search, and analyze large volumes of data in real-time.

It provides great search capabilities that meet the expectations set out above. It is also recognized mainly for its easy integration, its distributed architecture enabling high availability, and its ability to handle large dataset and deliver fast, accurate search results. On top of this, ElasticSearch has a huge community and a wealth of documentation.

code1
Here is the main structure of a typical regulatory document in RegReview, retaining only the relevant fields for the search engine:

Specifications and Implementation

The search specifications of the search engine are not so complex. The current functionalities of the search engine can be divided into two primary components:

  1. Full-text search: This component concentrates on the titlesummary, and text_content fields, taking terms from a search bar as an input. It operates in a manner similar to popular search engines like Google, providing results based on matching relevance to the entered terms.
  2. Strict filters: The search engine also incorporates strict filters for other fields such as topicsdate, and source. These filters allow users to narrow down search results based on specific criteria, involving precise and targeted search parameters operating on the document fields.

The search engine is based on some of ElasticSearch’s powerful native search functionalities, which only need to be properly combined to deliver relevant and satisfying results.

Among the functionalities used, notable examples include fuzzy queries, match queries, and span near queries, which exhibit exceptional performance when applied to text field types, especially on a legal documents corpus like ours.

In the end, the RegReview search engine is built with a python module acting as a proxy between the application and the ElasticSearch instance, templating the queries sent to the index in the Search API context.

Alongside this, a number of other ElasticSearch features have proved very helpful in the search engine, such as paginationsorting or fields returned selection.

img3

Query Shape

Let’s illustrate this with a simple example: if a user wants to retrieve documents that meet the following criteria:

  • mention “laundering”
  • contain the “Compliance” or “Basel” topics
  • were published by the “European Banking Authority”
  • were published in 2022
code2
The search parameters provided from the application to the search engine module will look like the following:

The search engine module role is to generate a complete query, matching the desired search parameters and made up of several major parts:

  • Full-text search clauses
  • Topics clause
  • Filter clauses
code3
In the end, all these atomic clauses are wrapped up to form a large query, which has the following shape:

The emergence of a new need in RegReview

Despite the initial search engine’s impressive capabilities and user satisfaction, the expanding scope of scraped regulatory sources revealed a new requirement. As documents in different languages accumulated, a linguistic patchwork emerged, posing challenges for users who speak only one language.

To address this, multilingual management is essential for the document database.

This enhancement would allow users to select their preferred language for document display in the application. This flexibility would enable regulatory monitoring across diverse sources, with documents seamlessly presented in the user’s preferred language.

For example, a French-speaking user could effortlessly access documents in French, regardless of their original language.

Multilingual Properties

The introduction of this new multilingual management functionality significantly changes the structure of the documents themselves. As a result, the search engine and all operations performed on the documents undergo substantial changes.

Before implementing the new multilingual search engine, we must establish specifications and constraints to maintain consistency and robustness.

1. The following fields are affected by the multilingual aspect and specific to each version of a document:

  • title
  • summary
  • text_content
  • topics

2. It is imperative that 2 users with different language preferences are presented with the same set of documents in the search results, even if those are displayed in different languages in the end. This ensures consistency and eliminates any potential confusion, establishing the search engine as impartial towards the user’s language preference.

3. If the document version matching the user’s language preference does not exist, a fallback is made to the original language version of the document.

 

Topics Display: Language-Specific Tagging and Priority Rules

The topics field is written by a customized AI based on language-specific NLP models, with each language having its own taxonomy. It's important to note that the concept of topics is not global in a document, each language version having its own topics field.

Consequently, it is possible for a document to have topics in its English version but not in its French or Chinese versions.

To make the most of the AI results and display as much topics as possible, a specific rule has been established for their presence within a document, following a prioritized sequence:

  1. If available, topics in the user’s preferred language are displayed.
  2. If not available in the preferred language, topics in the document’s original language are displayed.
  3. In the absence of topics in the preferred or original language, English-language topics are displayed.
  4. If there are no topics available in the English version, no topics are displayed.

This prioritized sequence ensures that topics are presented to users in their preferred language whenever possible, while providing fallback options when necessary.

Multilingual integration within RegReview

Considering the above-mentioned elements, we can easily define a new structure for the documents within the compiled database.

 

The new document structure

The new structure is largely based on the last one, ensuring a smooth migration process that minimizes any disruptions:

code4

Each of the language_code under versions key is a ISO 639-1 language code.

A deliberate and ingenious decision has also been made to include a separate language code, referred to as the original language code, consistently present in all documents. This choice offers several advantages:

  • Default language code: In cases where the language cannot be detected for a document, the original language is the sole key in versions.
  • Facilitates fallback: The presence of the original language code simplifies the implementation of the fallback mechanism described above.
  • Essential for query templating: The original language code plays a fundamental role in the templating of queries executed in ElasticSearch, as explained in the next section.
  •  

Search Engine Scope

When considering the search engine, an important constraint revolves around the independence of search results from the user’s language preference (mentioned in the section “Multilingual properties”). This raises a crucial question: What should be the “query scope” of the new multilingual search engine?

To ensure independence, a simple solution emerged: the search engine’s scope must be limited to the original version of the document. This approach offers many core advantages:

  • Unilateral Consistency: the original version remains the same for all users
  • Robustness: the original version is present in all documents
  • Backward Compatibility: it reconciliates with the original search

Thus, the new search engine scope will only be the original version in the documents.

 

The Topics specific behavior

However, the topics specific rules (mentioned in the section “Topics Display: Language-Specific Tagging and Priority Rules”) need careful consideration as this aspect introduces a higher level of complexity in the search engine. Therefore, the topics field cannot be treated the same as other language-specific fields.

To establish the specific rule for displaying topics, we could manipulate the documents returned in the search results and present the desired information to the end user. Though, the challenge does not lie in the display itself, but in filtering the topics field. For consistency reasons, the filtering scope should align with the actual topics displayed to the user.

In other words, the scope of the topics filter must be dynamic, following the rules stated above: sometimes, the scope must be on versions.<preferred_language_code>.topics, sometimes on versions.original.topics, and sometimes on versions.en.topics

Nevertheless, the technical feasibility of this solution has no guarantee.

Addressing the topics Issue

After some investigation, the topics issue turned out to be manageable thanks to a feature released in ElasticSearch 7.11runtime fields.

Runtime fields are a powerful feature in Elasticsearch that allow to dynamically define and compute fields at query time, without the need to modify the mapping or reindex data, offering flexibility and agility.

This perfectly meets our need: the idea would be to create a runtime field for which valuation is done on-the-fly by a script transcribing the specific rules.

 

Dynamic Generation of topics field

In our specific use case, a new topics runtime field is dynamically computed during query execution. This field is then used as a scope for the topics clause of the search query.

To do so, we will use the runtime_mappings parameter, allowing to specify a runtime fields to compute at query time in the search API.

Runtime fields in ElasticSearch mainly rely on scripting, a powerful feature that enables custom scripts execution at run time. These scripts are used to generate custom clauses that Elasticsearch can run and exploit.

Painless language is the most suitable of the supported language. This documentation page provides a comprehensive overview of the Painless context for runtime fields.

code5
For our topics runtime field, that gives something like that:

In this script, the language_code is evaluated with the user's preferred language before performing the query.

The logic of the script is relatively straightforward: it sequentially checks the topics present in the user’s preferred version, followed by the original version, and finally the English version. It is achieved in the Painless language’s doc context, which allows access to document fields.

Ultimately, this topics runtime field can be used as-is in the query build, in the topics clause mentioned above.

The bonus benefit of this runtime field is that it can be included in the search response, and directly provides the appropriate topics for presentation in the application interface, without any further operations needed on client side.

 

Taking advantage of runtime fields

Runtime fields have proven to be valuable not only for addressing the topics rule but also for implementing fallback in a straightforward manner.

Considering the multilingual properties, it is essential for users to see documents in their preferred language version. In Elasticsearch, this is done by selecting the appropriate fields to be returned in the search results.

For instance, if a user’s preference language is French, the source_includes argument of the search API is specified as follows:

   [  "versions.fr.title",

      "versions.fr.summary",

      "versions.fr.text_content"    ]

 

Good. But we then have to deal with the case where the preferred version doesn’t exist in the document. Again, runtime fields offer a simple solution by generating new dedicated fields for fallback titlesummary, and text_content.

code6
Concretely, it is done by the following runtime field definition (which emits value only if the desired preferred language version does not exist):
code7
This time, we use the Painless language’s params context. Runtime fields can also address other simple needs, like retrieving languages available on each document:

Conclusion

conclusion

Ultimately, the implementation of multi-language management in the RegReview search engine showcased the capabilities of ElasticSearch and validated our decision to use this technology. The native features of ElasticSearch effectively addressed our problem, making the implementation process relatively straightforward.

However, if we focus on the runtime fields themselves, certain limitations emerged:

  • The generated runtime fields cannot be of type text, which prevents to perform full-text search on them.
  • Performance is impacted due to the dynamic calculation of declared runtime fields with each call: to optimize performance, significant effort was dedicated to refining scripts.
  • Scripts tend to encounter issues when exploring fields in a document. Robust conditional handling had to be incorporated to address situations where fields were missing or not present in the index mapping.
  • A notable bug was discovered: the limitation of sending only up to 100 values in a runtime field. This constraint poses challenges, particularly for the topics field.

Despite these limitations, the use of runtime fields provided valuable functionality for multi-language management in the search engine, and their advantages outweighed the encountered challenges.

Share

Our publications

Logo

Sia US FS - Weekly Regulatory Update

Sia US FS - Weekly Regulatory Update

2025

Read more
Ai Abstract Art

2024 Key Technology Trends

The introduction of Chat GPT and other Large Language Models (LLMs), Generative AI (Gen AI) stands out as the dominant conversation in 2023.
Original Article: https://www.sia-partners.com/en/insights/publications/2024-key-technology-trends

2025

Read more
OpenStreetMap from Heka

OpenStreetMap: Using open data and guaranteeing …

In this article, we explore the use of OpenStreetMap for extracting geospatial data and implementing an approach to enhance the overall data quality. We delve into the key steps of data extraction including selecting the area of interest, data collection, cleaning, and preparing data for future use.

2025

Read more