Boosting search engine capabilities of RegReview: Introducing document multi-languages management
RegReview is an AI solution for compliance teams, to automate regulatory monitoring and processes, which brings together several tools, the most essential of which is a search engine operating on a compiled database of custom-built regulatory sources.
The database contains ~300k documents.
A document is basically a set of text fields, like a title, a summary, a full text content, a date, an origin, a URL, etc.
At start, as we wanted to have a search engine that implements elaborated text search on these regulation documents, we needed a solution that is powerful, flexible, offers many advanced search features, and is tuned for text searches.
So we quickly opted for ElasticSearch, a widely recognized document database solution, designed to efficiently store, search, and analyze large volumes of data in real-time.
It provides great search capabilities that meet the expectations set out above. It is also recognized mainly for its easy integration, its distributed architecture enabling high availability, and its ability to handle large dataset and deliver fast, accurate search results. On top of this, ElasticSearch has a huge community and a wealth of documentation.
Specifications and Implementation
The search specifications of the search engine are not so complex. The current functionalities of the search engine can be divided into two primary components:
- Full-text search: This component concentrates on the
title
,summary
, andtext_content
fields, taking terms from a search bar as an input. It operates in a manner similar to popular search engines like Google, providing results based on matching relevance to the entered terms. - Strict filters: The search engine also incorporates strict filters for other fields such as
topics
,date
, andsource
. These filters allow users to narrow down search results based on specific criteria, involving precise and targeted search parameters operating on the document fields.
The search engine is based on some of ElasticSearch’s powerful native search functionalities, which only need to be properly combined to deliver relevant and satisfying results.
Among the functionalities used, notable examples include fuzzy queries, match queries, and span near queries, which exhibit exceptional performance when applied to text
field types, especially on a legal documents corpus like ours.
In the end, the RegReview search engine is built with a python module acting as a proxy between the application and the ElasticSearch instance, templating the queries sent to the index in the Search API context.
Alongside this, a number of other ElasticSearch features have proved very helpful in the search engine, such as pagination, sorting or fields returned selection.
Query Shape
Let’s illustrate this with a simple example: if a user wants to retrieve documents that meet the following criteria:
- mention “laundering”
- contain the “Compliance” or “Basel” topics
- were published by the “European Banking Authority”
- were published in 2022
The search engine module role is to generate a complete query, matching the desired search parameters and made up of several major parts:
- Full-text search clauses
- Topics clause
- Filter clauses
The emergence of a new need in RegReview
Despite the initial search engine’s impressive capabilities and user satisfaction, the expanding scope of scraped regulatory sources revealed a new requirement. As documents in different languages accumulated, a linguistic patchwork emerged, posing challenges for users who speak only one language.
To address this, multilingual management is essential for the document database.
This enhancement would allow users to select their preferred language for document display in the application. This flexibility would enable regulatory monitoring across diverse sources, with documents seamlessly presented in the user’s preferred language.
For example, a French-speaking user could effortlessly access documents in French, regardless of their original language.
Multilingual Properties
The introduction of this new multilingual management functionality significantly changes the structure of the documents themselves. As a result, the search engine and all operations performed on the documents undergo substantial changes.
Before implementing the new multilingual search engine, we must establish specifications and constraints to maintain consistency and robustness.
1. The following fields are affected by the multilingual aspect and specific to each version of a document:
title
summary
text_content
topics
2. It is imperative that 2 users with different language preferences are presented with the same set of documents in the search results, even if those are displayed in different languages in the end. This ensures consistency and eliminates any potential confusion, establishing the search engine as impartial towards the user’s language preference.
3. If the document version matching the user’s language preference does not exist, a fallback is made to the original language version of the document.
Topics Display: Language-Specific Tagging and Priority Rules
The topics
field is written by a customized AI based on language-specific NLP models, with each language having its own taxonomy. It's important to note that the concept of topics is not global in a document, each language version having its own topics
field.
Consequently, it is possible for a document to have topics in its English version but not in its French or Chinese versions.
To make the most of the AI results and display as much topics as possible, a specific rule has been established for their presence within a document, following a prioritized sequence:
- If available, topics in the user’s preferred language are displayed.
- If not available in the preferred language, topics in the document’s original language are displayed.
- In the absence of topics in the preferred or original language, English-language topics are displayed.
- If there are no topics available in the English version, no topics are displayed.
This prioritized sequence ensures that topics are presented to users in their preferred language whenever possible, while providing fallback options when necessary.
Multilingual integration within RegReview
Considering the above-mentioned elements, we can easily define a new structure for the documents within the compiled database.
The new document structure
The new structure is largely based on the last one, ensuring a smooth migration process that minimizes any disruptions:
Each of the language_code
under versions
key is a ISO 639-1 language code.
A deliberate and ingenious decision has also been made to include a separate language code, referred to as the original
language code, consistently present in all documents. This choice offers several advantages:
- Default language code: In cases where the language cannot be detected for a document, the
original
language is the sole key inversions
. - Facilitates fallback: The presence of the
original
language code simplifies the implementation of the fallback mechanism described above. - Essential for query templating: The
original
language code plays a fundamental role in the templating of queries executed in ElasticSearch, as explained in the next section.
Search Engine Scope
When considering the search engine, an important constraint revolves around the independence of search results from the user’s language preference (mentioned in the section “Multilingual properties”). This raises a crucial question: What should be the “query scope” of the new multilingual search engine?
To ensure independence, a simple solution emerged: the search engine’s scope must be limited to the original
version of the document. This approach offers many core advantages:
- Unilateral Consistency: the original version remains the same for all users
- Robustness: the original version is present in all documents
- Backward Compatibility: it reconciliates with the original search
Thus, the new search engine scope will only be the original
version in the documents.
The Topics specific behavior
However, the topics specific rules (mentioned in the section “Topics Display: Language-Specific Tagging and Priority Rules”) need careful consideration as this aspect introduces a higher level of complexity in the search engine. Therefore, the topics
field cannot be treated the same as other language-specific fields.
To establish the specific rule for displaying topics, we could manipulate the documents returned in the search results and present the desired information to the end user. Though, the challenge does not lie in the display itself, but in filtering the topics
field. For consistency reasons, the filtering scope should align with the actual topics displayed to the user.
In other words, the scope of the topics filter must be dynamic, following the rules stated above: sometimes, the scope must be on versions.<preferred_language_code>.topics
, sometimes on versions.original.topics
, and sometimes on versions.en.topics
Nevertheless, the technical feasibility of this solution has no guarantee.
Addressing the topics Issue
After some investigation, the topics
issue turned out to be manageable thanks to a feature released in ElasticSearch 7.11
: runtime fields.
Runtime fields are a powerful feature in Elasticsearch that allow to dynamically define and compute fields at query time, without the need to modify the mapping or reindex data, offering flexibility and agility.
This perfectly meets our need: the idea would be to create a runtime field for which valuation is done on-the-fly by a script transcribing the specific rules.
Dynamic Generation of topics
field
In our specific use case, a new topics
runtime field is dynamically computed during query execution. This field is then used as a scope for the topics
clause of the search query.
To do so, we will use the runtime_mappings parameter, allowing to specify a runtime fields to compute at query time in the search API.
Runtime fields in ElasticSearch mainly rely on scripting, a powerful feature that enables custom scripts execution at run time. These scripts are used to generate custom clauses that Elasticsearch can run and exploit.
Painless language is the most suitable of the supported language. This documentation page provides a comprehensive overview of the Painless context for runtime fields.
In this script, the language_code
is evaluated with the user's preferred language before performing the query.
The logic of the script is relatively straightforward: it sequentially checks the topics present in the user’s preferred version, followed by the original version, and finally the English version. It is achieved in the Painless language’s doc
context, which allows access to document fields.
Ultimately, this topics
runtime field can be used as-is in the query build, in the topics
clause mentioned above.
The bonus benefit of this runtime field is that it can be included in the search response, and directly provides the appropriate topics for presentation in the application interface, without any further operations needed on client side.
Taking advantage of runtime fields
Runtime fields have proven to be valuable not only for addressing the topics rule but also for implementing fallback in a straightforward manner.
Considering the multilingual properties, it is essential for users to see documents in their preferred language version. In Elasticsearch, this is done by selecting the appropriate fields to be returned in the search results.
For instance, if a user’s preference language is French, the source_includes
argument of the search API is specified as follows:
[ "versions.fr.title",
"versions.fr.summary",
"versions.fr.text_content" ]
Good. But we then have to deal with the case where the preferred version doesn’t exist in the document. Again, runtime fields offer a simple solution by generating new dedicated fields for fallback title
, summary
, and text_content
.
Conclusion

Ultimately, the implementation of multi-language management in the RegReview search engine showcased the capabilities of ElasticSearch and validated our decision to use this technology. The native features of ElasticSearch effectively addressed our problem, making the implementation process relatively straightforward.
However, if we focus on the runtime fields themselves, certain limitations emerged:
- The generated runtime fields cannot be of type
text
, which prevents to perform full-text search on them. - Performance is impacted due to the dynamic calculation of declared runtime fields with each call: to optimize performance, significant effort was dedicated to refining scripts.
- Scripts tend to encounter issues when exploring fields in a document. Robust conditional handling had to be incorporated to address situations where fields were missing or not present in the index mapping.
- A notable bug was discovered: the limitation of sending only up to 100 values in a runtime field. This constraint poses challenges, particularly for the topics field.
Despite these limitations, the use of runtime fields provided valuable functionality for multi-language management in the search engine, and their advantages outweighed the encountered challenges.