storage - Implementing large-scale log file analytics


Can anyone point me to a reference or provide a high-level overview of how companies such as Facebook, Yahoo, and Google perform the large-scale (for example, multi-TB) log analysis that they do for operations and especially web analytics?

Focusing on web analytics in particular, I am interested in two closely related aspects: query execution and data storage.

I know that the common approach is to use MapReduce to distribute each query over a cluster (for example, using Hadoop). However, what is the most efficient storage format to use? This is log data, so we can assume that each event has a timestamp and that, in general, the data is structured and not sparse. Most web analytics queries involve analyzing slices of data between two arbitrary timestamps and retrieving aggregate statistics or anomalies in that data.
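To make the access pattern concrete, here is a minimal sketch of time-partitioned storage with a range aggregate. It is only an illustration of the query shape described above, not how any of the named companies store logs; the class name `PartitionedLog`, the hourly `BUCKET_SECONDS`, and the integer timestamps are all assumptions made for the example.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # assumption: one partition per hour of log data


class PartitionedLog:
    """Hypothetical in-memory stand-in for time-partitioned log storage."""

    def __init__(self):
        # bucket start time -> list of (timestamp, value) events
        self.partitions = defaultdict(list)

    def append(self, ts, value):
        bucket = ts - ts % BUCKET_SECONDS
        self.partitions[bucket].append((ts, value))

    def aggregate(self, start, end):
        """Return (count, total) over events with start <= ts < end."""
        count = total = 0
        first = start - start % BUCKET_SECONDS
        # Only partitions overlapping [start, end) are touched at all.
        for bucket in range(first, end, BUCKET_SECONDS):
            for ts, value in self.partitions.get(bucket, ()):
                if start <= ts < end:
                    count += 1
                    total += value
        return count, total
```

The point of the partitioning is that a query between two arbitrary timestamps skips every bucket outside the range and only filters row by row inside the buckets it does read.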

Would a column-oriented DB such as Bigtable (or HBase) be an efficient way to store and, more importantly, query such data? Does the fact that you are selecting a subset of rows (based on timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, for example in a reverse index?
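One way to think about the row-subset concern is a toy columnar layout in which rows are kept in timestamp order, so a time range becomes a contiguous slice of every column. This is a sketch under that assumption only, not a description of how Bigtable or HBase work internally; `ColumnarLog` and the field names `ts` and `bytes` are hypothetical.

```python
from bisect import bisect_left


class ColumnarLog:
    """Toy column store: one list per field, rows sorted by timestamp."""

    def __init__(self, rows):
        # Assumes a non-empty list of dicts that all share the same keys,
        # including a "ts" field.
        rows = sorted(rows, key=lambda r: r["ts"])
        self.columns = {name: [r[name] for r in rows] for name in rows[0]}

    def row_range(self, start, end):
        """Return the index range [lo, hi) of rows with start <= ts < end."""
        ts = self.columns["ts"]
        return bisect_left(ts, start), bisect_left(ts, end)

    def sum_column(self, name, start, end):
        """Sum one column over the time range, reading no other column."""
        lo, hi = self.row_range(start, end)
        return sum(self.columns[name][lo:hi])
```

With this layout the timestamp filter is a binary search on one column, and the aggregate then reads only the column it needs, so the two properties do not have to conflict as long as the sort order matches the dominant query.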

Unfortunately, there is no one-size-fits-all answer.

I am currently using Cascading, Hadoop, S3, and Aster Data to process hundreds of gigabytes a day through a staged pipeline inside AWS.

Aster Data is used for queries and reporting because it provides a SQL interface to the large data sets cleaned and parsed by Cascading processes on Hadoop. Using the Cascading JDBC interface, loading Aster Data is a trivial process.

Keep in mind that tools like HBase and Hypertable are key/value stores, so they do not do ad-hoc queries and joins without outside help, for example from a MapReduce job.
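As a rough illustration of the kind of help a key/value store needs for joins, here is a reduce-side join written in plain Python. It mimics the map and reduce phases in a single process and is not Cascading or Hadoop API code; the function name `reduce_side_join` and the `user_id`, `url`, and `country` fields are made up for the example.

```python
from collections import defaultdict


def reduce_side_join(events, users):
    """Inner-join event records to user records on user_id."""
    # "Map" phase: group every record under its join key, tagged by source.
    groups = defaultdict(lambda: {"event": [], "user": []})
    for e in events:
        groups[e["user_id"]]["event"].append(e)
    for u in users:
        groups[u["user_id"]]["user"].append(u)

    # "Reduce" phase: per key, pair each event with each matching user.
    joined = []
    for key in sorted(groups):
        for e in groups[key]["event"]:
            for u in groups[key]["user"]:
                joined.append({**u, **e})
    return joined
```

Keys that appear on only one side produce no output, which is the inner-join behaviour; in a real cluster the grouping step would be the shuffle between mappers and reducers.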

In full disclosure, I am a developer on the Cascading project.

