Twitter and Big Data!
Did you know the Twitter community generates more than 12 terabytes of data per day? That works out to 84 terabytes per week and about 4,368 terabytes (roughly 4.3 petabytes) per year. That is certainly a lot of data. Individual posts, known as tweets, are small, but at this volume they add up to an enormous number of bytes. As of 2014, Twitter estimated it was handling over 200 billion tweets per year.
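A quick back-of-the-envelope check of those figures (using 52 weeks per year, as the totals above do):

```python
# Back-of-the-envelope check of the storage figures above.
TB_PER_DAY = 12

tb_per_week = TB_PER_DAY * 7      # 84 TB per week
tb_per_year = tb_per_week * 52    # 4,368 TB per year
pb_per_year = tb_per_year / 1000  # ~4.37 PB per year

print(tb_per_week, tb_per_year, round(pb_per_year, 2))
```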
This is Big Data, and Twitter tackles it with distributed computing. Distributed storage makes data at this scale manageable. Imagine you have far more data than any single machine can hold. With Hadoop, you can create a cluster of hundreds of computers (nodes) and treat the combined hard drives of those machines as one gigantic drive at your disposal.
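The idea behind that "gigantic drive" can be sketched in a few lines. This is a toy illustration only, not HDFS's actual placement algorithm; the block size and replication factor mirror HDFS defaults (128 MB blocks, 3 copies), and the node names are made up:

```python
from itertools import cycle

# Toy illustration of the HDFS idea: a large file is split into
# fixed-size blocks, and each block is stored on several nodes for
# fault tolerance. Values mirror HDFS defaults but the placement
# logic here is a conceptual sketch, not the real algorithm.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def place_blocks(file_size_mb, nodes):
    """Return a mapping of block index -> nodes holding a copy."""
    n_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    node_cycle = cycle(nodes)
    return {
        block: [next(node_cycle) for _ in range(REPLICATION)]
        for block in range(n_blocks)
    }

# A 1 GB file spread over a five-node cluster:
layout = place_blocks(1024, ["node1", "node2", "node3", "node4", "node5"])
print(len(layout))  # 8 blocks of 128 MB, each replicated 3 times
```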
Twitter uses Hadoop, Pig, and HBase extensively to manage and analyze the huge volumes of data it collects. Hadoop is used to analyze distributed big data, streaming data, timestamped data, and text data.
Hadoop is at the core of Twitter's data management. With hundreds of thousands of servers, the Hadoop file systems at Twitter host over 300 PB of data. The physical storage capacity of the Hadoop clusters that store and analyze this data adds up to more than 1 exabyte. A typical cluster, with over 100,000 hard drives, can be assumed to have about 100 petabytes of logical storage.
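The gap between physical and logical capacity is largely replication: HDFS keeps multiple copies of every block (three by default). A rough sketch of that relationship, with an assumed per-drive size chosen purely for illustration rather than taken from Twitter's actual hardware:

```python
# Physical vs. logical HDFS capacity under replication.
# The 3 TB drive size is an assumption for illustration; the article
# does not state Twitter's actual drive sizes or replication settings.
REPLICATION = 3            # HDFS default: three copies of every block

drives = 100_000
drive_size_tb = 3          # assumed per-drive capacity

physical_pb = drives * drive_size_tb / 1000  # 300 PB of raw disk
logical_pb = physical_pb / REPLICATION       # 100 PB usable (logical)

print(round(physical_pb), round(logical_pb))
```

Under these assumptions, 100,000 drives yield roughly the 100 PB of logical storage mentioned above.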
The 500 PB of data is divided into several groups (real-time, processing, data warehouse, and cold storage) stored across multiple clusters. Each cluster can be scaled to more than 10,000 nodes.
Twitter handles over a trillion messages per day, all of which are processed into more than 500 categories and then replicated across its clusters. Each cluster in the Twitter environment has a specific purpose: one data center may store production processing data while another holds ad-hoc data-warehousing workloads.