Data pipelines between organizations: automate, process and analyze without losing control

Sharing data between organizations goes far beyond moving files from point A to point B. Data needs to be transformed, validated, anonymized, and enriched before reaching its destination. Our data space incorporates a pipeline system that lets users design these processing flows visually, run them on demand or on a schedule, and monitor every step of the process.

From simple transfer to intelligent flow

Most data exchange solutions operate with a point-to-point model: a provider offers a dataset and a consumer downloads it. But in real scenarios, data is rarely ready for direct use.

Our data space introduces the concept of a pipeline as a directed graph of nodes. Each node can be a data source (where data is extracted from), a processing application (that transforms or analyzes data), or a sink (where results are stored). Nodes connect to one another, defining a complete data flow.
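
To make the idea concrete, here is a minimal sketch of how such a graph could be described. The class names, fields, and example configuration below are illustrative assumptions, not the data space's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeType(Enum):
    SOURCE = "source"            # where data is extracted from
    APPLICATION = "application"  # a processing or analysis step
    SINK = "sink"                # where results are stored

@dataclass
class Node:
    id: str
    type: NodeType
    config: dict = field(default_factory=dict)

@dataclass
class Pipeline:
    name: str
    nodes: list[Node]
    edges: list[tuple[str, str]]  # directed edges: (from_node_id, to_node_id)

# A flow that anonymizes data from a source and stores the result in a sink
pipeline = Pipeline(
    name="daily-inventory-sync",
    nodes=[
        Node("supplier-db", NodeType.SOURCE, {"format": "csv"}),
        Node("anonymizer", NodeType.APPLICATION, {"endpoint": "https://apps.example.org/anonymize"}),
        Node("distributor-store", NodeType.SINK, {"format": "json"}),
    ],
    edges=[("supplier-db", "anonymizer"), ("anonymizer", "distributor-store")],
)
```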

Data space applications: processing and analysis

Applications are external HTTP services that act as intermediate nodes in the pipeline. There are two types:

  • Processing applications. They receive input data, apply transformations, and produce output data that continues through the pipeline. Examples: personal data anonymization, format conversion (CSV to JSON), synthetic data generation, or enrichment from external sources.

  • Analysis applications. Instead of producing transformed data, they generate visual reports (HTML) displayed in the connector's interface. Examples: data quality analysis (completeness, accuracy, consistency), anomaly detection, or real-time monitoring.

Most importantly: these applications can be developed in any language or framework, as long as they implement the data space's standard HTTP protocol. This allows third parties to contribute specialized applications without modifying the system's core.
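
As an illustration of what such a contribution might look like, the sketch below implements a tiny processing application as an HTTP service in Python with Flask. The /process route, the JSON payload shape, and the pseudonymized "email" field are assumptions made for the example; the data space's standard HTTP protocol defines its own actual contract.

```python
import hashlib
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.post("/process")
def process():
    records = request.get_json()  # input data handed over by the previous pipeline node
    for record in records:
        # Example transformation: pseudonymize the "email" field before it moves on
        if "email" in record:
            record["email"] = hashlib.sha256(record["email"].encode()).hexdigest()
    return jsonify(records)  # output data continues through the pipeline

if __name__ == "__main__":
    app.run(port=8080)
```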

Visual editor: designing flows without code

Our data space offers a visual editor where users can drag and connect nodes to design pipelines. Each node displays its configuration and status, and connections between nodes define the data flow. No coding or underlying API knowledge is required.

The system automatically validates that connections are consistent: that a node's output data types match those expected by the next node, and that all references to sources, sinks, and applications are valid.
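
A minimal sketch of this kind of consistency check, assuming each node declares an input and output data type (the node names and types below are hypothetical):

```python
def validate_connections(edges, node_types):
    """Check that every edge references known nodes and that each node's
    declared output type matches the next node's expected input type."""
    errors = []
    for src, dst in edges:
        if src not in node_types or dst not in node_types:
            errors.append(f"edge {src} -> {dst} references an unknown node")
            continue
        produced = node_types[src]["output"]
        expected = node_types[dst]["input"]
        if produced != expected:
            errors.append(f"{src} produces {produced!r} but {dst} expects {expected!r}")
    return errors

# Example: a CSV source wired straight into a sink that expects JSON
print(validate_connections(
    edges=[("supplier-db", "distributor-store")],
    node_types={
        "supplier-db": {"output": "csv", "input": None},
        "distributor-store": {"output": None, "input": "json"},
    },
))  # -> ["supplier-db produces 'csv' but 'distributor-store' expects 'json'"]
```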

Execution, monitoring, and logs

When a pipeline runs, the Data Transfer Module orchestrates the entire process: reads data from sources, passes it through each application in sequence, and writes results to sinks. The pipeline transitions through clear states (Pending, Running, Completed, Error, Killed) and generates timestamped logs for each step.
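
In pseudocode-like Python, the orchestration loop looks roughly like the sketch below. The function names and log format are illustrative; only the states listed above come from the system itself.

```python
from datetime import datetime, timezone
from enum import Enum

class PipelineState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    ERROR = "Error"
    KILLED = "Killed"

def run_pipeline(read_source, applications, write_sink, log):
    """Illustrative orchestration: source -> applications in sequence -> sink."""
    state = PipelineState.PENDING
    try:
        state = PipelineState.RUNNING
        log(f"{datetime.now(timezone.utc).isoformat()} pipeline started")
        data = read_source()
        for apply_app in applications:  # each application is one HTTP processing step
            data = apply_app(data)
            log(f"{datetime.now(timezone.utc).isoformat()} step {apply_app.__name__} done")
        write_sink(data)
        state = PipelineState.COMPLETED
    except Exception as exc:
        state = PipelineState.ERROR
        log(f"{datetime.now(timezone.utc).isoformat()} failed: {exc}")
    return state
```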

If a pipeline fails or gets stuck in an inconsistent state, the system automatically detects the situation through a staleness detection mechanism, preventing resources from being blocked indefinitely.
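
Conceptually, staleness detection can be as simple as the sketch below: an execution still marked as running with no recent activity gets flagged and released. The threshold and field names are assumptions for the example, not the system's actual values.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)  # illustrative threshold only

def is_stale(execution):
    """A Running execution with no recent log activity is considered stuck."""
    last_activity = execution["last_log_at"]  # timestamp of the latest step log
    return (execution["state"] == "Running"
            and datetime.now(timezone.utc) - last_activity > STALE_AFTER)

def reap_stale_executions(executions):
    for execution in executions:
        if is_stale(execution):
            execution["state"] = "Error"  # release the pipeline so it can run again
```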

Automatic scheduling: self-running pipelines

For periodic synchronization scenarios, pipelines support configurable auto-triggers. A pipeline can be scheduled to run every hour, every day, or at any custom interval (minimum 30 seconds). The system automatically manages the execution queue, ensuring two executions of the same pipeline never overlap.
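
The sketch below shows one way such an auto-trigger could work: a timer fires at the configured interval, and a lock guarantees that a new execution is skipped while the previous one is still running. It is only an illustration of the behavior described above, not the module's actual implementation.

```python
import threading

MIN_INTERVAL_SECONDS = 30  # minimum interval supported, per the text above

def auto_trigger(run_pipeline, interval_seconds):
    """Fire run_pipeline every interval, never letting two executions overlap."""
    interval = max(interval_seconds, MIN_INTERVAL_SECONDS)
    lock = threading.Lock()  # held while an execution of this pipeline is in flight

    def fire():
        timer = threading.Timer(interval, fire)  # schedule the next tick immediately
        timer.daemon = True
        timer.start()
        if lock.acquire(blocking=False):  # skip this tick if the previous run is still going
            try:
                run_pipeline()
            finally:
                lock.release()

    first = threading.Timer(interval, fire)
    first.daemon = True
    first.start()
```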

This enables use cases like daily inventory synchronization between suppliers and distributors, hourly environmental data updates, or overnight transaction processing.

Analysis results: actionable information

Analysis applications integrated into the pipeline can generate three types of results: static (one-time reports), auto-refreshing (periodic updates), and real-time (via WebSocket). These results are displayed directly in the connector's interface, providing data managers with immediate information about the quality, volume, and characteristics of data flowing through their pipelines.
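
A possible way to model these three result types, purely as an illustration (field names are assumptions, not the connector's actual data model):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ResultKind(Enum):
    STATIC = "static"              # one-time HTML report
    AUTO_REFRESH = "auto_refresh"  # the connector re-fetches it periodically
    REALTIME = "realtime"          # pushed over a WebSocket connection

@dataclass
class AnalysisResult:
    kind: ResultKind
    html: Optional[str] = None             # report body for static / auto-refresh results
    refresh_seconds: Optional[int] = None  # polling interval for auto-refresh results
    websocket_url: Optional[str] = None    # stream endpoint for real-time results
```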

The result is a system where sharing data is not a passive act, but an active, monitored, and intelligent process.