At AppNexus, over 2MM log events are ingested into our data pipeline every second. Log records are sent from upstream systems in the form of protobuf messages. Raw logs are compressed in Snappy when stored on HDFS. That said, even with compression, this still leads to over 25TB of log data collected every day. On top of logs, we also have 100s of MapReduce jobs that process and generate aggregated data. Collectively, we store petabytes of data in our primary Hadoop cluster.
Parquet is a columnar storage format in the Hadoop ecosystem. Compared to a traditional row oriented format, it is much more efficient in storage and has better query performance. Parquet is widely used in the Hadoop world for analytics workloads by many query engines. Among them are engines on top of Hadoop, such as Hive, Impala and systems which go beyond MapReduce to improve performance(Spark, Presto).
Parquet stores binary data in a column-oriented way, where the values of each column are organized so that they are all adjacent, enabling better compression. It is especially good for queries which read particular columns from a “wide” (with many columns) table, since only needed columns are read and IO is minimized. Read this for more details on Parquet.