

DataHub is a comprehensive platform for acquiring, storing, processing, and sharing data, built on Data Lakehouse architecture. It offers organizations full control over the information lifecycle – from the integration of diverse sources, through analytics and advanced processing, to secure data sharing with users, applications, and external systems.
Key features of the DataHub solution
Data centralization and unification.
DataHub integrates structured, semi-structured, and unstructured data, eliminating information silos and providing access to a single, reliable source of truth within the organization.
Secure and scalable architecture.
The platform is based on containerization (Kubernetes), ensuring dynamic resource allocation, high availability, and data security at every stage of processing. Kerberos authentication, RBAC/ABAC, encryption (AES, TLS), MFA, and extensive auditing meet the highest compliance standards (GDPR, HIPAA).
Self-service data collection and profiling.
The built-in Multipurpose Acquisition Point (MAP) enables automatic integration of sources via API, Kafka, SFTP, and AD/LDAP, supports PKI keys and push/pull transfers, and lets business users categorize data on their own.
Advanced processing and automation.
DataHub supports batch, stream, and near-real-time processing (Apache Spark, Flink, Airflow). The low-code pipeline system streamlines complex ETL/ELT operations, transformations, validation, enrichment, and data anonymization. The Data Quality Engine and AI Driven Analyzer provide automatic quality control, anomaly detection, and corrective recommendations.
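For illustration, a minimal PySpark sketch of one such pipeline step (validation plus anonymization); the paths, table, and column names are placeholders, not the product's actual pipeline API:

    # Minimal PySpark ETL sketch: validate, anonymize, and persist a batch.
    # Paths and column names (orders_raw, email, orders_clean) are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("datahub-etl-sketch").getOrCreate()

    raw = spark.read.parquet("/data/raw/orders_raw")           # RAW zone input

    clean = (
        raw
        .filter(F.col("order_id").isNotNull())                 # basic validation
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
        .withColumn("email", F.sha2(F.col("email"), 256))      # anonymization
    )

    clean.write.mode("overwrite").parquet("/data/clean/orders_clean")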
Lifecycle and metadata management.
Cloudera Data Catalog and Apache Atlas centralize metadata management, lineage, sensitive information classification, and permission control. Automated labeling (masking, tagging) and support for security policies granularly control access and compliance.
Analytics and integration.
Direct access via SQL (Impala, Hive, SparkSQL), REST, GraphQL, ODBC/JDBC. DataHub is fully compatible with BI tools—Tableau, PowerBI, Qlik Sense—and enables integration with external systems and applications.
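A minimal sketch of such direct access, here via the impyla Python client against the Impala endpoint; the host, port, and table are placeholders:

    # Sketch: direct SQL access over the Impala wire protocol (impyla client).
    # Host, port, and table name are hypothetical.
    from impala.dbapi import connect

    conn = connect(host="datahub.example.com", port=21050, use_ssl=True)
    cur = conn.cursor()
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    for region, total in cur.fetchall():
        print(region, total)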
DataHub is the foundation of digital transformation, increasing operational efficiency, improving the quality of reporting, enabling rapid response to changing market conditions, and building the company's data-driven operational capabilities.
Key features and benefits
Centralization and unification of data.
Integration of information from various internal and external sources, such as business applications, databases, files, data streams, and, in particular, IoT systems. Such integration eliminates data silos and provides a consistent, comprehensive view of information, building a "single source of truth" within the organization.
Flexible storage of all types of data.
Structured, semi-structured, and unstructured – allowing even very large volumes of data to be stored in their original form (RAW data) and in forms that enable their free and effective use (extracted, normalized, enriched, corrected, transposed, and transformed data).
Data Lakehouse architecture.
Combines the low-cost, schema-on-read flexibility of a Data Lake with the transactional consistency and query performance of a Data Warehouse, providing an optimal balance between flexibility and performance.
Advanced data processing and transformation.
The system supports both batch processing and real-time analysis, offering tools for automatic cleaning, validation, normalization, and enrichment of data. Support for complex transformations and aggregation of information, as well as the ability to define automated data pipelines, enables effective management of large information sets.
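The sketch below illustrates the idea of declarative quality rules applied during cleaning and validation; the rules and record layout are illustrative only, not the system's actual rule engine:

    # Sketch of declarative data-quality rules; illustrative, not the product API.
    RULES = {
        "customer_id": lambda v: v is not None,
        "age":         lambda v: v is None or 0 <= v <= 120,
        "email":       lambda v: v is None or "@" in v,
    }

    def validate(record: dict) -> list[str]:
        """Return the names of all fields that violate a rule."""
        return [field for field, ok in RULES.items() if not ok(record.get(field))]

    print(validate({"customer_id": 42, "age": 130, "email": "a@b.pl"}))  # ['age']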
Effective data management.
This includes central metadata management, data cataloging, and information dictionary maintenance. The system automatically determines data types, origins, owners, and lifecycles, ensuring high-quality monitoring and maintenance and enabling the implementation of data governance policies.
Data security.
The system is equipped with multi-level authentication and authorization mechanisms, adapted to the sensitivity classification of the information being processed. Encryption techniques are used for both data at rest and in transit, as well as auditing and access tracking tools. All solutions ensure compliance with legal requirements, such as GDPR and industry standards.
Easy data sharing.
Thanks to modern APIs and dedicated connectors, integration with analytical and reporting systems, business applications, and external systems becomes much simpler. It is possible to create virtual data warehouses (Data Marts) for the needs of specific departments, and controlled, granular access to information guarantees security and flexibility in data use.
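For example, a departmental Data Mart can be exposed as a SQL view that omits sensitive columns; the schema and names below are hypothetical:

    # Sketch: exposing a departmental Data Mart as a SQL view (names hypothetical).
    from impala.dbapi import connect

    conn = connect(host="datahub.example.com", port=21050)
    cur = conn.cursor()
    cur.execute("""
        CREATE VIEW IF NOT EXISTS marts.finance_orders AS
        SELECT order_id, order_date, amount   -- no personal columns exposed
        FROM warehouse.orders
        WHERE department = 'finance'
    """)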
Flexibility and scalability.
An additional advantage is the optimization of costs and resources, achieved through the effective use of cloud technologies (in particular private cloud environments) and open-source solutions, automatic resource management, and the ability to flexibly scale the environment. The system architecture is modular and open, which facilitates its expansion with new technologies and data sources, and also enables adaptation to the growing share of unstructured data and specific AI/ML requirements. Support for hybrid solutions, combining local infrastructure with multi-cloud models, allows for optimal control of costs, compliance, and performance, providing a foundation for the future development of the organization's digital competencies.
Data classification covers both technical and business aspects, enabling the full potential of processed resources to be exploited while maintaining compliance with regulations and industry requirements.
Classification based on data structure
Structured data.
Relational tables (SQL databases), spreadsheets, data in the form of columns and rows, characterized by a rigid, predefined structure (schema) and ease of indexing and searching.
Semi-structured data.
Such as JSON, XML, and YAML documents, event log files, and NoSQL data, characterized by a partially defined schema (e.g., a tree-like structure of elements) and flexibility in adding new attributes, but without strict normalization.
Unstructured data.
Free-form text (e-mails, Word/PDF documents), images, video, audio, binary files, and social media data streams, which share the common feature of having no imposed structure or form, making them much more difficult to index, analyze, or interpret.
Metadata.
File descriptions (creation date, author), labels, copyright information, keywords in data repositories, etc. Metadata constitutes a critical set of information describing the data, necessary for its correct interpretation: determining context and business significance, cataloging, auditing, testing adequacy and compliance with standards, and automating lifecycle management.
System architecture
The system is modular, flexible, and scalable to enable seamless ingestion, storage, and sharing of all types of information in line with the requirements of modern organizations. At its core is the Data Lakehouse concept, which combines the advantages of Data Lake and Data Warehouse.
Classification based on data usage
Operational data (OLTP).
Short-term, frequently updated in real time, e.g., e-commerce orders, bank transaction records, various statuses.
Analytical data (OLAP / Big Data).
Collected in data warehouses, data lakes, or Hadoop/Spark clusters; focused on reporting, historical analysis, and pattern exploration. Examples include monthly sales reports, user behavior analysis, and trend prediction.
Streaming data.
Continuous data sources such as machine telemetry, server event logs, IoT sensors, social media channels, or news feeds, processed in the system using Apache Kafka, Apache Flink, and Spark Structured Streaming.
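A minimal Spark Structured Streaming sketch of this pattern, consuming a hypothetical telemetry topic from Kafka; the broker, topic, and paths are placeholders:

    # Sketch: consuming an IoT telemetry topic with Spark Structured Streaming.
    # Requires the spark-sql-kafka connector; broker and topic are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("telemetry-stream").getOrCreate()

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "iot-telemetry")
        .load()
        .select(F.col("value").cast("string").alias("payload"))
    )

    query = (
        events.writeStream.format("parquet")
        .option("path", "/data/stream/telemetry")
        .option("checkpointLocation", "/data/stream/_chk")
        .start()
    )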
Archival data.
Old versions of data, database snapshots, historical logs; less frequently modified, usually archived on tape, in object storage (e.g., S3, Blob), used for audits, in cases where systems need to be restored after a failure, or in connection with long-term analysis.
The system does not favor any type of data, which means that all data entering it will be placed in a single, consistent environment.
Classification based on data sensitivity
Public (open) data.
No access restrictions; e.g., press releases, government statistics, publicly available product catalogs.
Internal data.
Access restricted to organization employees; e.g., company procedures, internal reports, low-risk operational data.
Confidential data.
Access based on permissions; e.g., financial data, commercial contracts, customer data.
Strictly regulated data.
High level of protection: encryption and strict supervision; e.g., sensitive personal data (social security numbers), medical data (diagnoses), information covered by the GDPR, and defense sector data.
Classification based on emerging data domains
IoT and Edge data.
Data from various types of sensors, measuring equipment, signal monitoring systems, meters, autonomous vehicles, and "smart" devices, collected and processed close to their sources, "at the edge of the network."
AI/ML data.
Training sets (datasets), models, embeddings; characterized by large volume and the need for specialized infrastructure (GPU, TPU).
Privacy-enhanced data.
Anonymized, pseudonymized, or homomorphically encrypted data, used in the analysis of sensitive information without compromising privacy.
Web3 / Blockchain data.
Specific to operations carried out in decentralized Blockchain networks, generated and processed by decentralized applications (dApps); examples include transaction records (ledger), smart contracts, and NFTs, characterized by transparency, distribution, and immutability.
Data security architecture
The reference architecture is based on four pillars that ensure the consistency and effectiveness of the implemented solutions. Within this concept, particular emphasis has been placed on issues related to user and service authentication.

Technology
Platform Class: Data Lakehouse
DataHub is built on state-of-the-art open source components and proven Cloudera enterprise technologies. The ecosystem provides full automation of the data lifecycle, enterprise-grade security, and the ability to integrate with external solutions.
Apache Kafka.
A real-time streaming data transmission system. It enables the integration of multiple sources, buffering, and reliable large-scale I/O data transfer.
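A minimal producer sketch using the kafka-python client; the broker address and topic are placeholders:

    # Sketch: publishing a record to an ingestion topic with kafka-python.
    # Broker address and topic name are hypothetical.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("ingest.orders", {"order_id": 1, "amount": 99.90})
    producer.flush()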
Apache Flink / Apache Spark.
Data processing engines: Spark for advanced batch and analytical processing, Flink for real-time streaming data analysis and transformation.
Apache Airflow.
A tool for orchestrating data flows (workflow) and automating complex ETL/ELT processes. It enables graphical planning, scheduling, and monitoring of tasks.
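A minimal Airflow 2.x DAG sketch chaining two stub tasks; the DAG id, schedule, and task bodies are illustrative:

    # Sketch of an Airflow 2.x DAG chaining two ETL steps; task bodies are stubs.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract(): ...
    def load(): ...

    with DAG(
        dag_id="datahub_etl_sketch",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="load", python_callable=load)
        t1 >> t2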
Apache NiFi.
(optional) Facilitates data flow between systems, automates data collection, enrichment, and transfer.
Kubernetes.
A container orchestration platform that automatically scales services and ensures high availability of the DataHub environment.
Apache Ozone (S3-compatible storage).
A distributed data storage system. Provides flexibility, scalability, and support for huge data volumes in open file formats (Parquet, Avro, ORC).
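A sketch of writing an object through Ozone's S3-compatible gateway with boto3; the endpoint, bucket, key, and credentials are placeholders:

    # Sketch: writing an object to Ozone through its S3-compatible gateway.
    # Endpoint URL, bucket, key, and credentials are hypothetical.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://ozone-s3.example.com:9878",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )
    s3.put_object(Bucket="raw-zone",
                  Key="orders/2024/01/orders.parquet",
                  Body=b"...")   # payload shortened for the sketch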
Apache Hive / Impala / Iceberg.
Hive – catalog and data warehouse engine (SQL, batch processing); Impala – fast SQL analytical queries on large sets; Iceberg – Lakehouse table format with support for transactions, versioning, and data integrity at scale.
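A sketch of creating and populating an Iceberg table from Spark; this assumes the iceberg-spark-runtime package is available, and the catalog name and its configuration are assumptions, not prescriptions:

    # Sketch: creating and querying an Iceberg table from Spark.
    # Catalog name "lakehouse" and its configuration are assumed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("iceberg-sketch")
        .config("spark.sql.catalog.lakehouse",
                "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lakehouse.type", "hive")
        .getOrCreate()
    )

    spark.sql("CREATE TABLE IF NOT EXISTS lakehouse.db.orders "
              "(id BIGINT, amount DECIMAL(18,2)) USING iceberg")
    spark.sql("INSERT INTO lakehouse.db.orders VALUES (1, 99.90)")
    print(spark.sql("SELECT * FROM lakehouse.db.orders").count())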
Cloudera Data Platform (CDP).
Integrated platform for managing all DataHub services: cataloging, warehouses, security, analytics, ML.
JanusGraph + Apache Atlas.
Atlas – metadata management, classification, and data lineage. JanusGraph – graph database that helps track connections and the origin of information.
Cloudera Data Catalog.
Centralization of metadata, automatic tagging, classification, support for security policies and data lifecycle.
Apache Ranger.
Access control, security policy management, granular data authorization (row, column, table level) in accordance with RBAC and ABAC.
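A sketch of defining a column-level policy through Ranger's public REST API; the service, resource, and user names are hypothetical, and the endpoint follows the Ranger documentation:

    # Sketch: creating a column-level Ranger policy over its public REST API.
    # Service, database, table, column, and user names are hypothetical.
    import requests

    policy = {
        "service": "cm_hive",
        "name": "finance_orders_no_pii",
        "resources": {
            "database": {"values": ["warehouse"]},
            "table":    {"values": ["orders"]},
            "column":   {"values": ["order_id", "amount"]},  # PII columns omitted
        },
        "policyItems": [{
            "users": ["analyst"],
            "accesses": [{"type": "select", "isAllowed": True}],
        }],
    }
    # Assumes a trusted TLS certificate and basic-auth admin credentials.
    requests.post("https://ranger.example.com:6182/service/public/v2/api/policy",
                  json=policy, auth=("admin", "password"))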
Kerberos, OAuth 2.0, TLS.
Advanced authentication (Active Directory/LDAP with SSO), transmission encryption, and secure communication inside and outside the platform.
Prometheus and Grafana, Cloudera Manager.
Environment monitoring, performance visualization, automatic alerting, and diagnostics.
Integrations and connectors.
ODBC/JDBC, REST API, GraphQL API – allow you to quickly connect DataHub with BI tools (Tableau, Power BI, Qlik), external systems, and third-party applications.
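A minimal sketch of the GraphQL path; the endpoint and query shape are hypothetical, shown only to illustrate the integration pattern:

    # Sketch: pulling data over a GraphQL interface; endpoint and query
    # shape are hypothetical, for illustration of the integration path.
    import requests

    query = "{ orders(limit: 5) { orderId amount } }"
    resp = requests.post("https://datahub.example.com/api/graphql",
                         json={"query": query}, timeout=30)
    print(resp.json())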


Schedule a DataHub presentation
Contact us and find out how DataHub can help you fully leverage the potential of your data – schedule a presentation!
Feel free to contact us!
** I declare that, pursuant to Article 6(1)(a) of Regulation (EU) 2016/679 of the European Parliament and of the Council of April 27, 2016, on the protection of natural persons with regard to the processing of personal data (...) ("GDPR"), I consent to the processing of my personal data for the purpose of performing a contract or communicating with allclouds.pl sp. z o.o. (The content of the personal data notice is available here)
allclouds.pl sp. z o.o.
ul. Jutrzenki 139, 02-231 Warszawa
www.allclouds.pl • office@allclouds.pl
phone: +48 22 100 43 80 • fax: +48 22 100 43 84
NIP: PL5223052539 • REGON: 363597531 • KRS: 0000598708
PN-EN ISO 9001 • PN-EN ISO/IEC 27001 • PN-EN ISO 14001 • PN-EN ISO 22301 • PN-EN ISO/IEC 27017 • PN-EN ISO/IEC 27018

