Data Processing

Big Data – Blog by YY

Architecture design, the heart of any engineering effort, is a complex task.

Consider the architecture of an automobile or of a skyscraper: in each case, the architect needs to take as much as possible into consideration.

When designing a Big Data architecture, data architects need to consider the volume, variety, and velocity of the data. Beyond the traits of the data itself, they also have to keep up with the pace of technology innovation and with competing products on the market, so the challenge is far from trivial.

In this post, I will explain the technical terms for data processing in data architectures, the tooling used for each kind of processing and, finally, how they are handled differently today.

Data, the core of any data architecture, has increased multifold, and it is not just the volume of data that has grown, but also the amount of processing.

In the past, all frequently accessed data was stored in RAM. Because of the multifold growth in volume, data is now stored on multiple disks across a number of machines connected via a network. With the emergence of distributed data processing, the processing is moved closer to the data, which significantly reduces network I/O as well as RAM consumption on each individual server.
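To make the idea of moving the processing closer to the data more concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the partitions, the function names count_locally and merge, and the log-level values are all made up for this example, and real machines are simulated with in-memory lists.

```python
from collections import Counter

# Hypothetical data set, already split into partitions that would live on
# different machines (simulated here with three in-memory lists).
partitions = [
    ["error", "info", "info"],
    ["info", "warn"],
    ["error", "error", "info"],
]


def count_locally(partition):
    """Would run on the machine that stores the partition; returns a small summary."""
    return Counter(partition)


def merge(summaries):
    """Would run on a coordinator; only the small summaries need to travel."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


local_summaries = [count_locally(p) for p in partitions]
totals = merge(local_summaries)
print(dict(totals))
```

The point of the sketch is that the raw records never leave their partition. Only the per-partition counts are sent to the coordinator, which is where the saving in network I/O and memory comes from.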


This gave rise to three terms in data processing: batch, real-time and, finally, hybrid processing.


Batch processing

Collects input over a specified interval of time and runs data transformations on it on a schedule. Traditional historical data loads and KPI calculations are examples of batch processing.

Technology Used: Hadoop, Spark, Flink, Hive, Pig.
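As a rough illustration of the batch idea, the sketch below groups already-collected events into fixed time intervals and computes one KPI (revenue) per interval. It uses plain Python rather than any of the tools listed above. The event data, the 60-second interval, and the function name run_batch are assumptions made only for this example.

```python
from collections import defaultdict

# Hypothetical input: (timestamp_in_seconds, order_amount) pairs that were
# collected before the scheduled job runs.
events = [(5, 10.0), (42, 5.5), (61, 20.0), (119, 4.5), (130, 1.0)]


def run_batch(events, interval_seconds):
    """Group events into fixed intervals and sum the amounts per interval."""
    revenue_per_interval = defaultdict(float)
    for timestamp, amount in events:
        # Start of the interval this event falls into, e.g. 61 -> 60.
        interval_start = (timestamp // interval_seconds) * interval_seconds
        revenue_per_interval[interval_start] += amount
    return dict(revenue_per_interval)


kpi = run_batch(events, interval_seconds=60)
print(kpi)
```

In a real setup, a scheduler would trigger such a job, for example once per hour or per day, and the job would read its input from storage instead of from a list.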


Real-time processing

Runs data transformations as soon as the data is delivered.

Technology Used: Spark, Spark SQL, Flink, Flink SQL, Impala.
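In contrast to a batch job, a real-time job updates its result once per incoming event. The following plain-Python sketch shows the principle with a running average. The class RunningAverage and the generator sensor_stream are hypothetical, and the generator merely stands in for a real source such as a message queue.

```python
class RunningAverage:
    """Keeps an average that is updated once per incoming event."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_event(self, value):
        # Update the state and return the fresh result right away.
        self.count += 1
        self.total += value
        return self.total / self.count


def sensor_stream():
    """Hypothetical stand-in for a message queue or socket."""
    for reading in [20.0, 22.0, 24.0]:
        yield reading


average = RunningAverage()
outputs = [average.on_event(reading) for reading in sensor_stream()]
print(outputs)
```

Each event produces an up-to-date result immediately, and only a small amount of state (a count and a total) is kept between events.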


Hybrid processing

Combines batch and real-time data processing.
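One possible way to picture such a combination is a query that merges a result computed by the last batch run with a result maintained in real time since then. The sketch below is an assumption about how this could look, not a description of a specific product. The page names, the counts, and the functions update_realtime_view and query are invented for the example.

```python
# Hypothetical batch view: page-view totals computed by the last scheduled job.
batch_view = {"page_a": 100, "page_b": 40}

# Hypothetical events that arrived after the last batch run.
recent_events = ["page_a", "page_c", "page_a"]


def update_realtime_view(view, page):
    """Real-time path: increment the counter as soon as an event arrives."""
    view[page] = view.get(page, 0) + 1


def query(batch_view, realtime_view, page):
    """Hybrid answer: batch result plus whatever arrived since the batch run."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)


realtime_view = {}
for page in recent_events:
    update_realtime_view(realtime_view, page)

print(query(batch_view, realtime_view, "page_a"))
```

When the next batch run finishes, the batch view would be replaced and the real-time view reset, so the two never count the same event twice. That hand-over is left out of the sketch.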

After data processing comes another layer, which consumes the output of data processing. Different users, such as business users, administrators, partners, and analysts, can then use the data in different forms, for example in business processes or in analysis for a recommendation engine.

Stay tuned for the next post, which covers the layer after data processing in more detail.

Posted in ICG Blog.