Big Data – Blog by YY
Data Processing II
In the previous part of this series, we looked at various data processing techniques, from the traditional one (batch processing) to the one most commonly used nowadays (hybrid processing). That part focused only on the “INPUT” side. The “OUTPUT” side, however, is what upper-level managers care about, because it covers how the INPUTs are calculated, which fields in the INPUTs are joined, and so on. All of this feeds the numbers and analytics charts in reports, as well as the indicators that show the status of the organization.
OUTPUTs consume what the processing layer has produced from the INPUTs. Different users, such as administrators, business users, vendors, and analysts, can consume the data in different formats. Furthermore, the OUTPUTs can be used by an analytics platform for further analysis, such as Machine Learning (ML) and Deep Learning (DL), and by business processes.
The different forms of OUTPUT are:
– Export Datasets:
This is how datasets are generated for further analysis or processing. The commonly used approaches are a Hive export or pulling data directly from HDFS. The choice, however, depends entirely on the organization's preferred storage for datasets.
– Reporting and visualization:
Different reporting and visualization tools can work seamlessly with Hadoop and databases with the assistance of connectors.
– Data Exploration:
This is where most Data Scientists work, building models and performing deep exploration in a staging/sandbox environment.
A sandbox is either a separate cluster or a separate workspace within the same cluster that contains a subset of the data.
– Middle-tier Query:
This is commonly called interactive query. It can be implemented with SparkSQL, Hive, or Impala.
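To make the export and interactive-query forms above more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the standard-library sqlite3 module stands in for a real SQL engine such as SparkSQL, Hive, or Impala, and the tables (customers, orders), their columns, and the helper function names are all hypothetical.

```python
import csv
import io
import sqlite3


def build_demo_db():
    """Create an in-memory database with two tiny, made-up INPUT tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "north"), (2, "south")]
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5)],
    )
    return conn


def revenue_by_region(conn):
    """Interactive-style query: join the two inputs and aggregate per region."""
    sql = """
        SELECT c.region, SUM(o.amount) AS revenue
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    return conn.execute(sql).fetchall()


def export_csv(rows, header):
    """Export a result set as CSV text, e.g. for a reporting tool to pick up."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


conn = build_demo_db()
rows = revenue_by_region(conn)          # the OUTPUT of the query
report = export_csv(rows, ["region", "revenue"])  # the exported dataset
print(report)
```

The same pattern (join the inputs, aggregate, then export) carries over to a cluster engine; only the connection and the export target would change.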
In short, the key things in designing a BIG DATA architecture are:
– Identify the use case: The following factors should be taken into consideration.
The form and frequency of the data, the types of data, the processing types, and the analytics types.
– Myriad of tooling: There are tons of tools in the BIG DATA market, and the proliferation of software has led to a lot of confusion about what to use and when, with multiple technologies offering similar features and each claiming to be better than the others. Narrowing down the technologies that satisfy your organization's requirements before designing the architecture is always the best policy.
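One way to picture the two points above is as a simple filter: describe the use case first, then keep only the tools that cover it. The Python sketch below is a hedged illustration, not a recommendation. The class names, the field values, and the tools (tool_a, tool_b, tool_c) are all hypothetical placeholders and do not describe any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """The factors to identify before choosing any technology."""
    data_form: str    # e.g. "structured" or "semi-structured"
    frequency: str    # e.g. "daily" or "continuous"
    processing: str   # e.g. "batch" or "real-time"
    analytics: str    # e.g. "reporting" or "interactive"


@dataclass(frozen=True)
class Tool:
    """A candidate tool and what it claims to support (hypothetical data)."""
    name: str
    data_forms: frozenset
    frequencies: frozenset
    processing: frozenset
    analytics: frozenset


def shortlist(use_case, tools):
    """Return the names of tools that satisfy every factor of the use case."""
    return [
        t.name
        for t in tools
        if use_case.data_form in t.data_forms
        and use_case.frequency in t.frequencies
        and use_case.processing in t.processing
        and use_case.analytics in t.analytics
    ]


# Hypothetical candidates; real capabilities must come from your own evaluation.
candidates = [
    Tool("tool_a", frozenset({"structured"}), frozenset({"daily"}),
         frozenset({"batch"}), frozenset({"reporting"})),
    Tool("tool_b", frozenset({"structured", "semi-structured"}),
         frozenset({"daily", "continuous"}),
         frozenset({"batch", "real-time"}),
         frozenset({"reporting", "interactive"})),
    Tool("tool_c", frozenset({"semi-structured"}), frozenset({"continuous"}),
         frozenset({"real-time"}), frozenset({"interactive"})),
]

need = UseCase("structured", "continuous", "real-time", "interactive")
print(shortlist(need, candidates))
```

Writing the requirements down first keeps the comparison tied to the organization's needs rather than to each tool's marketing claims.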