How are websites like twitter and facebook doing it?

Discussion in 'PHP' started by JEET, Aug 19, 2015.

  1. #1
    Hello,
    This is just something that has confused me for a very very long time.
    If you look at twitter or facebook, you'll find many people who have thousands and thousands of friends. Some even 70K and so on.
    I'm not talking about people who follow me, but people "I" follow. (following 70k)
    Now when I login, I see tweets from all those 70K people I follow. Newest ones on top and so on.
    Which database system or SQL server can handle that kind of query?
    The easiest query I can think of is something like this:

    select id, tweet, username from table where id_of_logged_in_member='123' and username in ('abc','xyz','zzz') order by time desc limit 30

    I'm sure I cannot make a list of 70K usernames to use in that query.
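    A minimal sketch of the join-based alternative: instead of listing 70K usernames in the query, the follow relationships live in their own table and get joined in. SQLite here purely for illustration; all table and column names are made up:

    ```python
    # Illustration only: the 70K-name IN list is replaced by a join against
    # a "follows" table, so the list never has to be built in application code.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
        CREATE TABLE tweets  (id INTEGER PRIMARY KEY, user_id INTEGER,
                              tweet TEXT, created_at INTEGER);
        CREATE INDEX idx_follows ON follows (follower_id, followee_id);
        CREATE INDEX idx_tweets  ON tweets (user_id, created_at);
    """)
    db.executemany("INSERT INTO follows VALUES (?, ?)", [(123, 1), (123, 2)])
    db.executemany("INSERT INTO tweets (user_id, tweet, created_at) VALUES (?, ?, ?)",
                   [(1, "hello", 10), (2, "world", 20), (3, "not followed", 30)])

    # One query, no giant IN list: the follow list already lives in the database.
    rows = db.execute("""
        SELECT t.id, t.tweet, t.user_id
        FROM tweets t
        JOIN follows f ON f.followee_id = t.user_id
        WHERE f.follower_id = ?
        ORDER BY t.created_at DESC
        LIMIT 30
    """, (123,)).fetchall()
    print(rows)  # tweets from followed users 1 and 2 only, newest first
    ```

    Whether that join stays fast at Twitter scale is a separate question, which is what the rest of this thread digs into.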

    Additionally, a website where a single account can follow 70k others probably has a billion members in total. Even if 10% of those make a tweet every day, the table grows by 100 million records each day.
    How will you query a table that huge even if it has indexes?
    I can remove older records and store them elsewhere every 10 days, but still it's a billion records to scan per page load. And that's the minimum.

    My question: How are these guys doing it?

    I'm literally blank here. So any ideas are welcome for discussion.
    I believe that DP has brilliant minds and this is something we can discuss.
    Please post.
    Thanks
     
    JEET, Aug 19, 2015 IP
  2. Dayvi

    Dayvi Member

    Messages:
    41
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    33
    #2
    When it comes to data this big, you hire a team of developers with knowledge of different areas and let them work it out.

    All that the people here can give you is a brief overview of possibilities.
     
    Dayvi, Aug 19, 2015 IP
  3. billzo

    billzo Well-Known Member

    Messages:
    961
    Likes Received:
    278
    Best Answers:
    15
    Trophy Points:
    113
    #3
    Facebook and Twitter are multi-billion dollar corporations and can hire the best teams of engineers they can find and develop their own software in-house. Not to mention they can afford to have their own data centers, as Facebook does.

    I will give you a link to an informative article about how FriendFeed (it no longer exists) used MySQL to create tables holding its own indexes, overcoming MySQL's shortcomings and increasing performance.

    backchannel.org/blog/friendfeed-schemaless-mysql

    Some of it is clever and some of it is quite simple.

    You can also look at CouchDB, MongoDB, and other NoSQL solutions and see if you can use them in whole or in part.

    Something that large I would assume would be distributed among many servers. Could each user have their own table?
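    The core trick from that FriendFeed article, roughly: store each record as an opaque blob keyed by ID, and maintain your own separate index tables by hand instead of relying on the engine's. A toy sketch of the idea, with SQLite standing in for MySQL and all names made up:

    ```python
    # Toy "schemaless" storage: entities are opaque JSON blobs; the
    # application maintains its own index table on the side.
    import json
    import sqlite3
    import uuid

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE entities   (id TEXT PRIMARY KEY, body TEXT);  -- opaque blob
        CREATE TABLE index_user (user_id INTEGER, entity_id TEXT); -- hand-built index
    """)

    def save_tweet(user_id, text):
        eid = uuid.uuid4().hex
        db.execute("INSERT INTO entities VALUES (?, ?)",
                   (eid, json.dumps({"user_id": user_id, "text": text})))
        # Application code, not the database, keeps the index in sync.
        db.execute("INSERT INTO index_user VALUES (?, ?)", (user_id, eid))
        return eid

    def tweets_by_user(user_id):
        rows = db.execute("""SELECT e.body FROM index_user i
                             JOIN entities e ON e.id = i.entity_id
                             WHERE i.user_id = ?""", (user_id,))
        return [json.loads(body) for (body,) in rows]

    save_tweet(1, "hello")
    save_tweet(2, "world")
    print(tweets_by_user(1))  # only user 1's tweets come back
    ```

    The win in the real article was that index tables could be built, dropped, and sharded independently of the data itself.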
     
    billzo, Aug 19, 2015 IP
  4. JEET

    JEET Notable Member

    Messages:
    3,825
    Likes Received:
    502
    Best Answers:
    19
    Trophy Points:
    265
    #4
    Yes, of course they are all billion dollar organizations, but I think Facebook was started by a student at Harvard. Not sure.
    Google started out as a search engine built by two Stanford PhD students.
    This is why I said that we can discuss possibilities. We are not making another Facebook. :)

    I thought about individual tables too, but then the problem is how many queries I would need to run to fetch the latest updates from 70k members.
    I cannot join them because there is no common field joining the data.
    They would all need to be in one single table.

    I just checked some openings at Twitter and it looks like they are using SQL and Java. Some Python too.
    Sure, Java is much, much faster, but crunching through billions of records?
    I'll read the backchannel.org/blog/friendfeed-schemaless-mysql article. Thanks for that.
     
    JEET, Aug 19, 2015 IP
  5. PoPSiCLe

    PoPSiCLe Illustrious Member

    Messages:
    4,623
    Likes Received:
    725
    Best Answers:
    152
    Trophy Points:
    470
    #5
    Twitter has a follow limit for non-verified users at around 2,000 accounts, so you'll never reach that 70k monstrosity.

    You'll have to remember that Twitter (and Facebook) use a completely different setup than Joe Normal. Distributed data centers, cluster based databases, and so forth.

    When I log in to Twitter, I'll get a list of all the latest tweets by people I follow. Does that have to be a query across tables? Nope. It can just as well be a query against a temp table with just what I need to see in it. Also, AFAIK, Twitter uses websockets to push out new tweets, hence no huge extra queries.

    There's probably some rather clever DB schema behind Twitter, by all means, but the problem isn't as big as you make it out to be.

    Besides, only around 28% of Twitter users are active, and many rarely tweet (while others tweet a lot). The DB is growing constantly, but maybe not by as much as you think.
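    That "temp table with just what I need" idea is usually called fan-out on write: when someone tweets, push the tweet ID into each follower's precomputed timeline, so reading a home page is a plain list lookup instead of a giant query. A rough sketch with in-memory structures standing in for the real storage (all names are made up):

    ```python
    # Fan-out on write: writes do the work, reads are trivial lookups.
    from collections import defaultdict, deque

    TIMELINE_MAX = 800  # assumed cap: keep only the newest entries per reader

    followers = defaultdict(set)  # author id -> set of follower ids
    timelines = defaultdict(lambda: deque(maxlen=TIMELINE_MAX))  # reader -> newest-first ids

    def follow(reader, author):
        followers[author].add(reader)

    def post_tweet(author, tweet_id):
        # Push the new tweet into every follower's precomputed timeline.
        for reader in followers[author]:
            timelines[reader].appendleft(tweet_id)

    def home_timeline(reader, n=30):
        # Reading the home page never scans the tweets table at all.
        return list(timelines[reader])[:n]

    follow(123, 1)
    follow(123, 2)
    post_tweet(1, "t1")
    post_tweet(2, "t2")
    print(home_timeline(123))  # newest first
    ```

    The trade-off is exactly the one raised later in this thread: every tweet now costs one write per follower, which is why celebrity accounts with millions of followers need special handling.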
     
    PoPSiCLe, Aug 19, 2015 IP
  6. ThePHPMaster

    ThePHPMaster Well-Known Member

    Messages:
    737
    Likes Received:
    52
    Best Answers:
    33
    Trophy Points:
    150
    #6
    Speaking from experience - I had to design and implement an application that serves millions of users a minute - it is not hard at all.

    I can tell you straight out that the language it is designed in doesn't matter. It can be done in PHP or C++, you will end up with the same issues/bottlenecks.

    First issue would be how to store the data coming in and process it. Second issue would be how to deliver the data.

    Both can be solved by many different types of technology. For example, you can have a load balancing cluster (many people are starting to use Amazon) like this:

    [image: load-balancing cluster diagram]

    This means regardless of how many users access the system, you can serve them. Amazon will make it easier to construct new instances as needed.
    • You can store things in memory using cache (Memcached/Redis/etc..) and duplicate that data across all servers with many available sync systems.
    • You can implement a Messaging system (Rabbit MQ, Zero MQ, etc..).
    Storing and processing the data is another story, which can only be solved by your business needs. Some companies use MySQL to store data, which is fine for processing millions of records if correctly set up. Other companies might decide on using Hadoop, or MySQL itself, to archive old data and keep only the fresh data that is needed locally.

    At the end you surely will end up using at least a few technologies based on your needs.

    Nothing is too big once you start breaking it down into smaller parts.
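    The caching mentioned above is typically the cache-aside pattern: check memory first, fall back to the database on a miss, then populate the cache so the next request is cheap. A sketch with a plain dict standing in for Memcached/Redis (function and variable names are hypothetical):

    ```python
    # Cache-aside: the database is only touched on a cache miss or expiry.
    import time

    cache = {}        # stands in for Memcached/Redis
    CACHE_TTL = 60    # assumed time-to-live, in seconds
    db_reads = 0      # counter just to show the cache working

    def fetch_timeline_from_db(user_id):
        global db_reads
        db_reads += 1
        return [f"tweet-for-{user_id}"]  # placeholder for the real query

    def get_timeline(user_id):
        entry = cache.get(user_id)
        if entry is not None and entry[1] > time.time():
            return entry[0]                        # cache hit: no DB touched
        data = fetch_timeline_from_db(user_id)     # cache miss: hit the DB once
        cache[user_id] = (data, time.time() + CACHE_TTL)
        return data

    get_timeline(42)
    get_timeline(42)
    print(db_reads)  # only the first call reached the "database"
    ```

    In production the dict would be a shared Memcached/Redis cluster, so every web node benefits from every other node's misses.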
     
    ThePHPMaster, Aug 19, 2015 IP
    ketting00 likes this.
  7. JEET

    JEET Notable Member

    Messages:
    3,825
    Likes Received:
    502
    Best Answers:
    19
    Trophy Points:
    265
    #7
    Thanks PoPSiCLe and ThePHPMaster
    PoPSiCLe, I think the 2000 limit automatically goes away once you have 2000 people following you.
    But even with 2000 usernames in the query, it becomes a very long query...
    Distributing the data across multiple data centers is a good option, but that won't work well on websites like FB or Twitter, because we follow members from all around the world.
    Suppose the data of London users is in a datacenter in the UK, and I, being in the USA, am following someone from London.
    When I log in, the query would have to first find out the countries, then request data from each datacenter. This would take even longer.
    This would certainly work for a search engine like Google, or on Amazon, where they show you results based on your location.

    Temp tables also might not work, because then the script would have to update two or more tables instead of one each time any member makes a tweet or posts on the wall.
    The key to this has to be in the indexes...

    My problem is how to send a query so large to the database.

    select id, post, username from table where username in ('abc','xyz') limit 10

    I can't even imagine having 2000 usernames in that list.
    What will be the correct query if I use something like this:

    select id, post, username from table where username in ( select username from userTable where friendID='123' ) limit 10

    Is that a correct query, or...
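    For what it's worth, the `IN (subquery)` form is valid SQL; a quick check of that exact shape against SQLite (table and column names taken from the post, data made up):

    ```python
    # Verifying the IN-with-subquery form: the name list never appears
    # in the SQL text, the inner SELECT supplies it.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE posts     (id INTEGER PRIMARY KEY, post TEXT, username TEXT);
        CREATE TABLE userTable (friendID TEXT, username TEXT);
    """)
    db.executemany("INSERT INTO userTable VALUES (?, ?)",
                   [("123", "abc"), ("123", "xyz")])
    db.executemany("INSERT INTO posts (post, username) VALUES (?, ?)",
                   [("p1", "abc"), ("p2", "zzz")])

    rows = db.execute("""
        SELECT id, post, username FROM posts
        WHERE username IN (SELECT username FROM userTable WHERE friendID = ?)
        LIMIT 10
    """, ("123",)).fetchall()
    print(rows)  # only posts by users that friendID 123 follows
    ```

    Whether the optimizer turns that into something efficient at scale is engine-dependent; a join on user IDs rather than username strings is the usual recommendation.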
    Thanks
     
    JEET, Aug 20, 2015 IP
  8. deathshadow

    deathshadow Acclaimed Member

    Messages:
    9,732
    Likes Received:
    1,998
    Best Answers:
    253
    Trophy Points:
    515
    #8
    You know that load balancing @ThePHPMaster mentioned? Those big companies have cargo containers chock full of servers to balance to, but there's something else that can let one distribute the job around to hardware better, and that's task division.

    When you have a single server for SQL, the most common thing you hear is to "try as much as possible to use just one query". This is because single task single order is often most efficient for a single thread and single disk interface; hell let's face it, most standalone SQL servers spend 99% of their time with their thumb up their backside waiting for the hard disk.

    BUT, what if you have several THOUSAND servers all realtime mirroring the same data? Suddenly doing fifty separate 'in' like:

    
    	SELECT id, date
    	FROM posts
    	WHERE userId = ?
    	ORDER BY date DESC
    	LIMIT 10
    
    Code (markup):
    ... and run each of them in its own THREAD with a different user ID? Let's say you had a thousand different user IDs you were going to look for -- what's going to be faster, running one massive query with over a thousand values, or running a thousand smaller queries concurrently across a thousand different servers all holding the same data?

    Then once you have all the posts.id you would splice together ALL your result sets sorted by date (since it's a 10 limit no need to get fancy, additive bubble drop would be fastest and use the least memory as counter-intuitive as that sounds). Once you have that final ten from all ID's sorted by date, you run queries off those ID's to pull the message's content.
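    That splice step can be sketched like so: each shard hands back its own already-sorted `LIMIT 10`, and a k-way merge takes the global top ten without re-sorting everything. Python's `heapq.merge` here for illustration, shard data made up:

    ```python
    # K-way merge of per-shard "LIMIT 10" results, each already newest-first.
    # Merging k sorted lists and stopping early beats concatenating and sorting.
    import heapq

    shard_results = [
        [(50, "a1"), (30, "a2"), (10, "a3")],  # shard A: (date, post_id), newest first
        [(40, "b1"), (20, "b2")],              # shard B
        [(60, "c1"), (5,  "c2")],              # shard C
    ]

    # reverse=True merges descending inputs into one descending stream.
    merged = heapq.merge(*shard_results, key=lambda row: row[0], reverse=True)
    top10 = [post_id for _, post_id in list(merged)[:10]]
    print(top10)  # global newest-first order across all shards
    ```

    Only after this merge do you go back and fetch the message bodies for those ten IDs, exactly as described above.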

    ... and that's one BIG thing: do NOT waste time pulling information you should already know more than once. In your example you had it looking for username and pulling username. String comparisons are slow; use the user's ID instead. Really the user's NAME has no damned business wasting space in the post database, that's what a user database is for. Some people would try to do a "join" on that -- DON'T!!! Waste of RAM, waste of time, it just makes the result set bigger.

    In that same way, do NOT pull the full data for the result before you actually know the results. This too may seem counter-intuitive at first but when running multiple query sets where you're going to merge the result for something like sort by date, wait to pull the actual data for that record (message content for example) until you have the ID's. Once again those too can be pulled as multiple separate queries and if you hook it via something like:

    
    // start one worker thread per post (pthreads-style; data handed to each thread)
    for ($t = 0; $t < $postCount; $t++) $thread[$t]->start($postList[$t]);
    for ($t = 0; $t < $postCount; $t++) {
    	$thread[$t]->join(); // wait in sequence so output appears in the desired order
    	template_outputMessage($thread[$t]->messageData);
    }
    Code (markup):
    The ::join() will wait for that thread to be finished before moving on to letting the template output the result. It is possible for each thread to finish out of sequence, so we want to join() in sequence to wait so our output is in the desired order.


    Mind you, even in PHP going multithreaded is pretty complex:
    http://php.net/manual/en/class.thread.php

    ... and requires a slightly different mindset to leverage properly -- but splitting smaller queries off into their own threads or running multiple queries concurrently can leverage multiple SQL mirrors better, leverage multiple processors better, and be night-and-day on performance.
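    The same fan-out idea, sketched with a thread pool instead of raw threads (Python for illustration; the per-user query is stubbed out and all names are made up):

    ```python
    # One small query per followed user, issued concurrently via a thread pool.
    from concurrent.futures import ThreadPoolExecutor

    def fetch_latest(user_id):
        # Stand-in for "SELECT id, date FROM posts WHERE userId = ? ... LIMIT 10"
        # against one mirror; real code would route each call to a different server.
        return [(user_id * 10 + i, f"post-{user_id}-{i}") for i in range(2)]

    user_ids = [1, 2, 3]
    with ThreadPoolExecutor(max_workers=len(user_ids)) as pool:
        # pool.map preserves input order, like join()-ing threads in sequence.
        per_user = list(pool.map(fetch_latest, user_ids))

    all_rows = [row for rows in per_user for row in rows]
    print(len(all_rows))  # one flattened result set, ready for the date merge
    ```

    On a single box this buys little, but pointed at many mirrored database servers each worker's query lands on different hardware, which is the whole point being made above.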

    It's why a number of concepts like "JOIN" can actually be SLOWER if you've got the hardware to leverage.

    TECHNICALLY the above methodology DOES consume more processing power, and on a single standalone server it would be WAY inefficient, but it scales better the more separate servers you can leverage, becoming far, FAR faster and more efficient than a single massive query and result set once you've got the hardware.

    Load balancing, task division, multi-threading. Of course, since PHP threads wrap POSIX threads, when you're sitting there waiting for the joined async thread to finish executing, calling ->join() gives the kernel free rein to go work on something else, better leveraging multi-threading even on a single core... which treads into concepts like cooperative vs. preemptive multitasking and ideas like "why can't we do both?"

    I've used a few RTOS where they were so strict about timeslicing that they didn't allow the next slice to start until the current one's time allotment ran out EVEN if the current slice was finished. Great for precise timing of events, terrible for fully leveraging the hardware's time.
     
    deathshadow, Aug 20, 2015 IP
  9. JEET

    JEET Notable Member

    Messages:
    3,825
    Likes Received:
    502
    Best Answers:
    19
    Trophy Points:
    265
    #9
    Hi @deathshadow
    That was extremely helpful! Answered a lot of my questions.
    Although those companies are big now, they were small once. One datacenter, one server.
    What do you think would be the best approach in that case?
    One big query, or many small queries, each waiting for the server to free up?
    If one big query, then how do I include all 2000 usernames in that one query?
    Thanks
     
    JEET, Aug 20, 2015 IP
  10. qwikad.com

    qwikad.com Illustrious Member Affiliate Manager

    Messages:
    7,151
    Likes Received:
    1,656
    Best Answers:
    29
    Trophy Points:
    475
  11. JEET

    JEET Notable Member

    Messages:
    3,825
    Likes Received:
    502
    Best Answers:
    19
    Trophy Points:
    265
    #11
    Thanks @qwikad.com
    I'm checking it right now.
    Thanks
     
    JEET, Aug 20, 2015 IP
  12. deathshadow

    deathshadow Acclaimed Member

    Messages:
    9,732
    Likes Received:
    1,998
    Best Answers:
    253
    Trophy Points:
    515
    #12
    ... and points out another really good detail -- it's a piece of advice I was given three decades ago that still crops up in the occasional article on good practices.

    If your tools or libraries aren't letting you do what you want how you want it done, CHANGE THEM! If that doesn't work, get off your tuchas and build your own!

    Look at that list -- they have their own fork of mysql, their own fork of memcached, their own fork of a LZ style compression engine... and when that wasn't enough they built their own custom database engines for things like storing historical graphs, creating shards (what I was talking about with distributed tasks), custom database specific to the timeformat data of their setup...

    .. and sometimes it does come down to that too -- creating a unique system specific to YOUR data, instead of trying to use a generic "one size fits all" engine like SQL. If you have ONE type of query ALWAYS passing the same type of data and ALWAYS returning the same type of data, a generic engine's extra overhead can take anywhere from two to twenty times longer than a nice tight bit of optimized code -- even if that custom code is in a ridiculously inefficient language like Scala. (though that might hinge on how much cache is allocated post JIT compilation)

    Far, FAR too often someone here will ask a question like "How do I build my own PHP site from scratch" and some dipshit will respond with "Oh just use WordPress" -- or "just use CodeIgniter" -- just like how halfwits try to use jQuery and Bootcrap to make up for not knowing enough about HTML or CSS.

    The problem being that this typically prevents the person in question from learning the underlying processes required to expand and scale the project if need be. Then a year down the road, IF they actually see success (which, given the disasters vomited up with such sleazy train wrecks of ineptitude, well...) and it's time to scale, they end up either stuck needing more hardware than they can actually afford, or having to toss the entire project in the can and start over.

    It's a classic problem when planning ahead or moving forward with an existing project; don't let the limitations of your tools, libraries or frameworks hold you BACK -- it's why frameworks that don't work like the underlying language does (MVC on PHP, jQuery, etc) often hobble you moving forward, no matter how much you may have been deluded into THINKING it saved you time at the start.

    Again, avoid the credit mentality; paying more later for something you can't afford now is NEVER a rational business decision... and again as I said previously, don't get too attached to any one tool or any one way of doing something. Learn the technology, not the tool -- learn the language, not some stupid framework sitting on top of it. Then you'll have the knowledge and understanding to deal with WHEN that day comes and you come across a need that the 'tool' doesn't provide.
     
    Last edited: Aug 20, 2015
    deathshadow, Aug 20, 2015 IP
    mytechiestuff likes this.
  13. ihebfarhat

    ihebfarhat Peon

    Messages:
    2
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    1
    #13
    The answer is simple: Facebook and Twitter have multiple servers that can handle all these requests.
     
    ihebfarhat, Aug 25, 2015 IP
  14. PoPSiCLe

    PoPSiCLe Illustrious Member

    Messages:
    4,623
    Likes Received:
    725
    Best Answers:
    152
    Trophy Points:
    470
    #14
    There's a LITTLE bit more to it than just throwing more servers at the problem. But then I guess you're just trying to up your post count.
     
    PoPSiCLe, Aug 25, 2015 IP
  15. NetStar

    NetStar Notable Member

    Messages:
    2,471
    Likes Received:
    541
    Best Answers:
    21
    Trophy Points:
    245
    #15
    Lots of BS answers here... #1 those web sites, or I should say infrastructures, have thousands of servers and utilize hardware and software to balance the traffic. #2 They heavily use advanced cache methods so they are not always calling from the database. #3 They use advanced algorithms to show you ONLY the tweets/statuses of users you've interacted with or may be interested in. They really aren't showing you 70k updates from your "friends" but only a small fraction. #4 They most certainly have the budget to make resource-heavy requests. #5 They have an "elite" staff of nerds and geeks who get paid well to work around the clock to ensure their servers/software/algorithms/whatever make sense and work well with one another.

    That's the easy non-technical answer. I'm sure the technical answer would leave you even more confused. More importantly... if you launch a web site with similar functionality, you most likely will never have to worry about these issues alone. Once you get to the point where you aren't able to independently solve these problems, you should have more than enough traffic to find investors or generate enough money to hire these nerds and geeks (they are even nerdier than deathshadow and popsicle combined).
     
    NetStar, Aug 27, 2015 IP
  16. ketting00

    ketting00 Well-Known Member

    Messages:
    772
    Likes Received:
    27
    Best Answers:
    3
    Trophy Points:
    128
    #16
    Interesting thread.

    So basically, non-technically, plain PHP and MySQL can handle this provided that you've managed load balancing well.
    My question is: do they have a specific load balancer that handles requests specifically coming from, say, France?
    Twitter seems to show posts from all the people you've followed from all over the world.
     
    ketting00, Aug 28, 2015 IP
  17. PoPSiCLe

    PoPSiCLe Illustrious Member

    Messages:
    4,623
    Likes Received:
    725
    Best Answers:
    152
    Trophy Points:
    470
    #17
    There isn't that much PHP in Twitter anymore, I think - weren't they running on Node.js or something like that?

    Anywho - caching, pulling the relevant data for each user etc., not to mention having localized duplication servers serving local content, should alleviate quite a few of the hurdles. Twitter, for instance, does not apply a filter (at least not for normal users) - you get all the tweets, from all you are following, all the time. This might of course be different if you have several tens of thousands of followed people, but I don't think so. Although, most of the huge Twitter accounts don't necessarily follow thousands of people, they just have thousands of people follow them (the most popular celeb accounts, for instance). So it's one to many, not many to one, most of the time.
     
    PoPSiCLe, Aug 28, 2015 IP