So yesterday we dived into my theory of how Google works, but today we get to see the inner workings of how Facebook serves a picture. I came across a very interesting presentation that shows what kind of technology Facebook uses: how they customize their own kernels and file systems, and how they use CDNs (Content Distribution Networks), caching, etc. to improve speed. For starters, I’ll have to explain how the internet works before I can get around to explaining one of the roles CDNs play.
“The internet is a bunch of interconnecting tubes”. Although this doesn’t fully do the internet justice, I can see how it might make sense to others. The internet is really a series of interconnected computers. You have thousands and thousands of computers connected to each other all across the world. The interaction between computers generally consists of exchanges between a server and a client. The further the server is from the client, the more computers the data has to pass through to arrive at the client.
A CDN is a network of computers that are generally well distributed across the region(s) it serves. These distributed computers cache (save) the information that is frequently requested and act as a server for that information. This saves a client computer from having to wait for the data to come all the way from that super far away server. Obviously, there are other uses besides speed, such as sparing the system that generated the content from having to regenerate the same information a second time.
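To make the CDN idea concrete, here is a minimal sketch of my own, not anything taken from the presentation. The names `Origin` and `EdgeNode`, the two regions, and the path `/photo.jpg` are all made up for illustration. It shows that however many clients in a region ask for the same picture, the far away server only gets bothered once per region.

```python
class Origin:
    """Stands in for the far away server that generates the content."""

    def __init__(self):
        self.requests_served = 0

    def fetch(self, path):
        self.requests_served += 1
        return f"content of {path}"


class EdgeNode:
    """One CDN machine serving a single region (hypothetical)."""

    def __init__(self, region, origin):
        self.region = region
        self.origin = origin
        self.saved = {}

    def get(self, path):
        if path not in self.saved:
            # First request in this region: make the long trip once.
            self.saved[path] = self.origin.fetch(path)
        return self.saved[path]


origin = Origin()
edges = {region: EdgeNode(region, origin) for region in ("europe", "asia")}

# A thousand European clients and one Asian client ask for the same photo.
for _ in range(1000):
    edges["europe"].get("/photo.jpg")
edges["asia"].get("/photo.jpg")

print(origin.requests_served)  # prints 2: one trip per region
```

A real CDN also has to decide which edge node is closest to each client and when a saved copy is too old to serve, both of which this sketch ignores.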
In Facebook’s image serving system, every picture gets cached at three levels, according to the lecture: once at the CDN level, once in memcache, and once by MySQL (although later on the lecturer says two). Here the most important reason for caching is to avoid disk reads and MySQL requests. If a request matches something in the cache, the cache simply returns the copy it holds in memory, bypassing any disk reads or MySQL queries. If the requested information isn’t in the cache, the server has to perform a disk read or a MySQL request, which on a heavily trafficked system can be the difference between a split second and 5 seconds. In this case, when the image isn’t cached, the server hits the “NetApp”, which I visualize as a massive central database, to request the file’s location. That location is then used to retrieve the requested file. The file is sent back to the user through the same pipeline, and a copy is also stored in the cache along the way.
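The lookup flow described above can be sketched roughly like this. This is my own reading of it, not Facebook’s code: the function `fetch_photo`, the two tier names, the photo id and the file path are all hypothetical, and I have gone with two cache tiers since the lecture is not consistent about the count. Plain dictionaries stand in for the caches, the location lookup and the disk.

```python
def fetch_photo(photo_id, tiers, locate, read_file):
    """Return (data, source) for a photo, checking each cache tier in order.

    tiers: list of (name, dict) pairs, fastest first.
    locate: function mapping photo_id to a file location (the slow lookup).
    read_file: function reading the bytes at that location (the disk read).
    """
    for index, (name, cache) in enumerate(tiers):
        if photo_id in cache:
            data = cache[photo_id]
            # Copy the hit into the faster tiers that missed.
            for _, faster in tiers[:index]:
                faster[photo_id] = data
            return data, name

    # Every tier missed: look up the location, then read the file.
    location = locate(photo_id)
    data = read_file(location)
    # On the way back to the user, leave a copy in every cache tier.
    for _, cache in tiers:
        cache[photo_id] = data
    return data, "storage"


# Hypothetical stand-ins for the location lookup and the disk.
locations = {42: "/vol7/42.jpg"}
disk = {"/vol7/42.jpg": b"jpeg bytes"}

tiers = [("cdn", {}), ("memcache", {})]
first = fetch_photo(42, tiers, locations.__getitem__, disk.__getitem__)
second = fetch_photo(42, tiers, locations.__getitem__, disk.__getitem__)
print(first[1], second[1])  # prints: storage cdn
```

The first request pays for the location lookup and the disk read; the second one is answered by the fastest tier and never touches storage.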
Their cache system uses a “most accessed, last out” policy. This means that the more often an image gets accessed, the longer it stays in the cache, which makes sense: popular images are the ones most likely to be requested again.
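The lecture doesn’t spell out how this policy is implemented, so the following is only a guess at the general idea. It counts accesses per image and, when the cache is full, throws out the least accessed one, which is usually called least frequently used (LFU) eviction. The class name `FrequencyCache` and the file names are made up.

```python
class FrequencyCache:
    """Tiny cache that evicts the least accessed entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = {}
        self.hits = {}

    def get(self, key):
        if key not in self.items:
            return None
        self.hits[key] += 1
        return self.items[key]

    def put(self, key, value):
        if key not in self.items and len(self.items) >= self.capacity:
            # Evict the entry with the fewest accesses so far.
            coldest = min(self.hits, key=self.hits.get)
            del self.items[coldest]
            del self.hits[coldest]
        self.items[key] = value
        self.hits.setdefault(key, 0)


cache = FrequencyCache(capacity=2)
cache.put("popular.jpg", b"a")
cache.put("rare.jpg", b"b")
for _ in range(5):
    cache.get("popular.jpg")
cache.put("new.jpg", b"c")  # cache is full, so something must go

print(sorted(cache.items))  # prints ['new.jpg', 'popular.jpg']
```

Scanning every counter on eviction is fine for a toy like this but too slow for a cache holding millions of images; a production system would need a smarter structure, and probably some way to let old access counts fade.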
The lecture also goes into how they created their own file system and kernel, and why they needed to.
It was a very interesting lecture, and I recommend that my readers check it out.
http://static.flowgram.com/p2.html#2qi3k8eicrfgkv