[This post is my answer to a question asked on Quora. To see the original question, click the link at the bottom of the post.]
The most difficult part of this program is the 15-minute limit on processing all the screenshots.
I was able to render a full (top-to-bottom) screenshot of a webpage in 4.285 seconds on my MacBook Air (Core i7, 2 GHz, 8 GB RAM). At that rate, rendering and saving 10,000 screenshots one after another would take my machine about 714 minutes.
The only way (in 2015) to achieve this is to parallelize across as many cores as are available per machine and distribute the workload across numerous machines. To see why, run the numbers: 10,000 pages at 4.285 seconds each is roughly 42,850 seconds of work, so to finish inside a 900-second (15-minute) window you need at least ⌈42,850 / 900⌉ = 48 renders running in parallel, and in practice more to absorb slow pages. Currently, the largest available machine on Amazon Web Services (AWS) has 36 cores and 244 GiB RAM. I'm not going to try to quantify how much "faster" this is than my machine, but I don't think it alone is fast enough to process the images in the time you've allowed.
Since you tagged this question with ‘Ruby (programming language)’, I’m assuming you’d like to know how it could be done in Ruby.
The 'webshot' gem makes screenshot capture very easy. It depends, however, on a tool called 'PhantomJS', which is basically a web browser without a user interface, primarily used for testing websites and webapps. See the README for webshot for instructions on how to install it and PhantomJS: https://github.com/vitalie/webshot
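For reference, the installation boils down to two commands (assuming you're on OS X with Homebrew; check the webshot README and the PhantomJS download page for other platforms):

```shell
# Install the Ruby gem
gem install webshot

# Install PhantomJS so it's on your PATH (OS X / Homebrew)
brew install phantomjs
```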
Once you have the gem and dependencies installed, writing the Ruby is fairly straightforward. Read in your website URLs to capture, and provide them to webshot along with the appropriate dimensions of the screenshot (yes, you can take full-page screenshots). By default, webshot will write the screenshot file to disk after it has been captured. My suggestion would be to upload the file to S3 (or similar) so that all your screenshots are stored in one central location, since you will be running this program on many separate machines, each with its own disk.
Now, let’s assume you have a working version of your screenshot program on your own computer. The next step would be to deploy it to cloud servers which can share the workload by chipping away at the list of 10,000 websites at the same time. It may not be the most cost-effective, but the easiest way to do this would be on a Platform-as-a-Service (PaaS) like Heroku, where you can push your code once and make copies of the same machine setup, each running your program. You will need to run a sample set of URLs on this system, adding worker servers until you find the number you need to meet the 15-minute requirement. One note: you will need to include a buildpack for PhantomJS if you use Heroku (stomita/heroku-buildpack-phantomjs).
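The Heroku setup is roughly the following (app name and worker count are illustrative; the `Procfile` line assumes your script is called `screenshot_worker.rb`):

```shell
# Procfile (one line) tells Heroku how to run each worker dyno:
#   worker: bundle exec ruby screenshot_worker.rb

heroku create
heroku buildpacks:add heroku/ruby
heroku buildpacks:add https://github.com/stomita/heroku-buildpack-phantomjs

git push heroku master

# Scale up until a sample run fits the 15-minute window
heroku ps:scale worker=10
```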
The missing piece is how to supply the 10,000 website URLs to these machines so that each takes a share of the workload without duplication: no machine should grab the same URL as another, which would waste time and resources. This can be managed using a queue like RabbitMQ (https://www.rabbitmq.com/). It would sit in front of the screenshot workers, which would each read URLs, one by one, from the queue until all the website screenshots have been captured. An added benefit is that you can provide RabbitMQ the list of URLs dynamically via some interface (iOS app, website form, etc.), rather than a static list that you might upload alongside your code. Most PaaS (including Heroku) have click-to-install add-ons like RabbitMQ, making it simple to deploy alongside your app.
See more answers from the original thread on Quora: https://www.quora.com/If-you-were-asked-to-write-a-script-that-should-take-screenshots-of-10-000-websites-and-it-should-process-in-15-minutes-how-would-you-do-it