Introduction to Data

(English version only)

Big Data – Blog by YY


In this week's blog post, I'm going to introduce the characteristics of “data.”

 

Yes, “data”: this four-letter word has come to stand for a proprietary asset held by individuals and organizations ever since the term “BIG DATA” burst into the spotlight, and it is also one of the most amazing and intriguing subjects once you spend time on it.

The “data” being processed can be categorized as follows:

–       structured,

Information with a well-defined length and format and a high degree of organization, such as data stored in a database. It can be computer-, machine-, or human-generated.

  • example
    • computer-generated: server/web log data (see http://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html)
    • human-generated: database data
    • machine-generated: sensor data
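To make "well-defined length and format" concrete, here is a minimal sketch that parses one web log line into typed fields. The five-field, space-separated layout and the names `LogRecord` and `parse_log_line` are assumptions made up for illustration; this is not the Amazon S3 log format linked above, which has many more fields.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    """One row of a hypothetical, simplified web log."""
    timestamp: str
    method: str
    path: str
    status: int
    size_bytes: int


def parse_log_line(line: str) -> LogRecord:
    # Assumed layout: exactly five space-separated fields in a fixed order.
    # A line with a different number of fields raises ValueError.
    timestamp, method, path, status, size = line.split()
    return LogRecord(timestamp, method, path, int(status), int(size))


record = parse_log_line("2015-03-01T12:00:00Z GET /index.html 200 5120")
print(record.status, record.size_bytes)
```

Because every line follows the same layout, each field can be given a name and a type up front, which is what makes the data structured.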

 

–       unstructured,

The opposite of structured data, and the highest-volume type of data in the world.

Information with NO well-defined length or format and little internal organization.

  • example
    • video/photos/audio files
    • email

 

–       semi-structured (another form of structured data),

Information that has some organizational properties, which makes it feasible to understand and analyze, but that still contains unorganized data.

  • example
    • CSV/XML/JSON files, etc.

For example, in a CSV file the data type of each field is consistent across rows, except for the first row (the header).
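As a rough illustration of the semi-structured formats above, the sketch below reads a small CSV and a small JSON document using Python's standard library. The sample data (`csv_text`, `json_text`) is invented for this example. The CSV has a header row followed by consistent rows, while the JSON records share some keys but not all.

```python
import csv
import io
import json

# CSV: the first row is the header; the remaining rows share the same fields.
csv_text = "id,name,age\n1,Alice,30\n2,Bob,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: records are self-describing, but a key may be missing in some of them.
json_text = '[{"id": 1, "name": "Alice", "tags": ["admin"]}, {"id": 2, "name": "Bob"}]'
records = json.loads(json_text)
keys_per_record = [sorted(r.keys()) for r in records]

print(rows[0]["name"], keys_per_record)
```

Note that `csv.DictReader` returns every value as a string, so converting `age` to an integer is still up to the reader. That leftover work is part of what separates semi-structured from fully structured data.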

 

Sometimes it’s hard to tell semi-structured and unstructured data apart.

To keep life simple, some experts argue that semi-structured data does not exist as a separate category; others disagree.

 

Extracting elements from raw information is the fundamental task in data analysis. The most common approach to data processing is to keep all the information in the raw data while transforming it into structured form, and then to run subsequent processing or analytics tasks on the structured result.
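The raw-to-structured step can be sketched as follows, using an email-like message as the raw input. The message text, the function name `to_structured`, and the choice of output fields are all assumptions for illustration; real email parsing is considerably more involved.

```python
import re

# Hypothetical raw input: header lines, a blank line, then free-form body text.
raw_email = (
    "From: alice@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi team, the report is attached."
)


def to_structured(raw: str) -> dict:
    # Split headers from the body at the first blank line.
    header, _, body = raw.partition("\n\n")
    sender = re.search(r"^From:\s*(.+)$", header, re.MULTILINE)
    subject = re.search(r"^Subject:\s*(.+)$", header, re.MULTILINE)
    return {
        "sender": sender.group(1) if sender else None,
        "subject": subject.group(1) if subject else None,
        # The body stays unstructured; only a simple numeric feature is derived.
        "body_word_count": len(body.split()),
    }


row = to_structured(raw_email)
print(row)
```

Each message becomes one row with fixed, named fields, so later processing can treat a collection of messages like a table.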

 

The reason may be that it is easier to figure out correlations among elements in structured data than in unstructured data.

Posted in ICG Blog (ICG日志).