Artificial intelligence at the service of shared data: semantic search and conversational assistants in data spaces

AI Open Space

Managing a catalog with hundreds of datasets from multiple organizations is one of the biggest operational challenges of any data space. Traditional keyword search methods fall short when metadata is heterogeneous, incomplete, or in different languages. Our data space solves this problem by integrating artificial intelligence capabilities directly into the connector infrastructure.

The problem: finding the needle in the data haystack

Imagine a European consortium with 15 participants, each sharing between 20 and 50 datasets. A data consumer needs to quickly locate a dataset containing air quality indices for rural areas of Castilla y León. With exact text search, if the provider labeled the dataset "Datos de monitorización - estaciones rurales CyL" ("Monitoring data - CyL rural stations"), a query in English returns nothing.

Semantic search solves this: instead of comparing text strings, it compares meanings. The system understands that "air quality monitoring" and "monitorización de calidad del aire" are conceptually equivalent.

Sidecar architecture: local AI, sovereign data

One of the fundamental principles of our data space is that data never leaves the participant's infrastructure without explicit consent. That's why the artificial intelligence components are deployed as a sidecar service running alongside the connector, not in a third party's cloud.

This sidecar architecture includes three main components:

  • Configurable LLM engine. The participant chooses which language model to use. It can be a local model running on Ollama, or an external service if the organization's policy allows it. The abstraction is transparent: the rest of the system interacts through a common API regardless of the provider.

  • RAG system (Retrieval-Augmented Generation). Each dataset's metadata is converted into vectors and indexed in a vector database (Qdrant). When a user makes a query, the system retrieves the most relevant fragments and uses them as context to generate a precise response.

  • MCP server (Model Context Protocol). Exposes connector capabilities as tools that any AI agent compatible with the MCP protocol can use. This enables assistants like Claude or custom agents to interact directly with the data space.
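The first component's provider abstraction can be sketched as a minimal interface. The class and function names below are illustrative, not the connector's actual API; the Ollama backend is stubbed so the sketch stays self-contained, where a real implementation would call the local Ollama HTTP endpoint.

```python
from abc import ABC, abstractmethod


class LLMBackend(ABC):
    """Common interface: the rest of the system never sees the provider."""

    @abstractmethod
    def generate(self, prompt: str, context: str = "") -> str: ...


class OllamaBackend(LLMBackend):
    """Local inference via an Ollama endpoint (hypothetical wiring)."""

    def __init__(self, model: str = "llama3", host: str = "http://localhost:11434"):
        self.model, self.host = model, host

    def generate(self, prompt: str, context: str = "") -> str:
        # A real implementation would POST to {host}/api/generate;
        # stubbed here to keep the sketch runnable without a server.
        return f"[{self.model}] answer to: {prompt}"


class EchoBackend(LLMBackend):
    """Trivial stand-in, e.g. for tests or air-gapped demos."""

    def generate(self, prompt: str, context: str = "") -> str:
        return prompt


def make_backend(policy: dict) -> LLMBackend:
    # The participant's policy decides which engine is used;
    # callers only ever see the LLMBackend interface.
    if policy.get("provider") == "ollama":
        return OllamaBackend(model=policy.get("model", "llama3"))
    return EchoBackend()
```

Swapping in an external provider would mean adding one more `LLMBackend` subclass; nothing downstream changes.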

Semantic search: asking in natural language

The vectorization system converts each catalog dataset into a numerical representation that captures its meaning. When a user searches for "environmental sensor data in Castilla y León," the system calculates the semantic distance between the query and each indexed dataset, returning the most relevant ones regardless of the language or exact terminology used.

This capability transforms the data discovery experience: from browsing endless lists of metadata to conversing with the catalog.
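The ranking step can be illustrated with a toy sketch. Here a character-trigram count stands in for the embedding, and the dataset catalog is a plain dictionary; the actual sidecar uses a multilingual sentence-embedding model with vectors stored in Qdrant, but the cosine-similarity ranking works the same way.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a real embedding: character-trigram counts.
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def semantic_search(query, catalog, top_k=3):
    # Rank datasets by semantic distance to the query, not by exact match.
    qv = embed(query)
    ranked = sorted(
        catalog.items(),
        key=lambda kv: cosine(qv, embed(kv[1])),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Illustrative catalog entries (dataset id -> metadata description).
catalog = {
    "air-cyl": "air quality monitoring rural stations Castilla y Leon",
    "traffic-mad": "urban traffic flow sensors Madrid",
}
```

With a genuine multilingual embedding model, the same ranking also bridges languages, so "monitorización de calidad del aire" and "air quality monitoring" land close together in vector space.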

Conversational assistant: managing the connector through dialogue

Beyond search, our data space incorporates a conversational assistant integrated into the connector's web interface. This assistant understands the data space context and can help users with administrative tasks: checking the status of ongoing negotiations, exploring remote data catalogs, or configuring connector parameters, all through natural language.

The assistant uses a WebSocket protocol for real-time communication, enabling fluid and contextual responses without page reloads.
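The source does not specify the wire format, but a chat channel like this typically exchanges small JSON envelopes over the WebSocket. The message types and field names below are assumptions for illustration only:

```python
import json

def user_message(text, session):
    """Frame a user turn for the assistant's channel (assumed format)."""
    return json.dumps({
        "type": "user_message",
        "session": session,
        "text": text,
    })

def parse_event(raw):
    """Decode an incoming event; map unknown types to a safe fallback
    so the web interface degrades gracefully."""
    event = json.loads(raw)
    if event.get("type") not in {"assistant_token", "assistant_done", "error"}:
        event["type"] = "unknown"
    return event
```

Streaming the answer as `assistant_token` events is what makes the responses feel fluid: the UI renders tokens as they arrive instead of waiting for the full reply.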

AI agents as data space participants

Perhaps the most disruptive innovation is the integration of the Model Context Protocol (MCP). This standard, which is gaining traction as a bridge between language models and external systems, lets AI agents connect to the connector and use its functionalities as tools.

In practice, this means an organization can deploy an agent that automatically discovers relevant datasets in the data space, negotiates agreements according to predefined policies, and orchestrates data transfers, all without direct human intervention but under the data sovereignty rules defined by each participant.
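At the wire level, MCP is built on JSON-RPC 2.0: an agent lists the server's tools with `tools/list` and invokes one with `tools/call`. The sketch below builds such requests; the tool name `search_catalog` and its arguments are illustrative, not part of the MCP specification or the connector's real tool set.

```python
import json

def mcp_request(method, params=None, req_id=1):
    """Build a JSON-RPC 2.0 request, the framing MCP uses on the wire."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# Step 1: the agent discovers which tools the connector's MCP server exposes.
list_tools = mcp_request("tools/list", req_id=1)

# Step 2: it invokes one (hypothetical tool name and arguments).
call_tool = mcp_request(
    "tools/call",
    {"name": "search_catalog",
     "arguments": {"query": "air quality Castilla y León"}},
    req_id=2,
)
```

Because the tools are self-describing, the agent needs no connector-specific code: discovery, negotiation, and transfer each appear as just another tool call, evaluated against the participant's policies on the server side.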

Privacy first: local inference, no compromises

A critical aspect of this integration is that all AI processing happens locally. Metadata is vectorized within the participant's infrastructure, queries are processed in the local sidecar, and no data is sent to external AI services unless the participant explicitly configures it.
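A simple way to enforce that opt-in is a guard in front of every outbound AI call. This is a minimal sketch under assumptions: `allow_external` stands in for an explicit entry in the participant's connector configuration, which is not named in the source.

```python
from urllib.parse import urlparse

# Hosts considered inside the participant's own infrastructure.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}

def check_endpoint(url, allow_external=False):
    """Refuse non-local AI endpoints unless the participant opted in.

    `allow_external` is a hypothetical config flag for illustration.
    """
    host = urlparse(url).hostname or ""
    if host not in LOCAL_HOSTS and not allow_external:
        raise PermissionError(f"external AI endpoint blocked by policy: {host}")
    return url
```

With the default policy, a call to a local Ollama endpoint passes, while any external inference service is rejected before metadata can leave the infrastructure.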

This "privacy first" approach enables leveraging artificial intelligence capabilities without surrendering data sovereignty: a balance that many market solutions fail to achieve.

Towards a truly intelligent data space

Integrating AI into data spaces is not a cosmetic addition: it is a fundamental enabler for organizations with diverse technical profiles to participate in the data economy. A researcher unfamiliar with the internal structure of an IDS catalog can ask in natural language and get exactly what they need.

Our data space demonstrates that it is possible to combine the regulatory rigor demanded by European standards with the flexibility and usability provided by modern artificial intelligence.