From CleanPosts


Grab a copy of a website

wget -w9 -r --random-wait -l3 -np -E https://rosettacode.org/wiki/Rosetta_Code

My blog is just a diary, with very few comments from outside, so to make a copy of it I can just load all the posts into my browser at once and save the whole thing to disk. But the Elephant Bar is a blog with about 70 comments per day. I've been posting on the Elephant Bar for about three years, and I wanted to retrieve all my comments. But there are more than 2,000 articles, and to get to the comments you have to follow links, and I didn't want to get Mouse Finger. So I had Linux do it for me with the following command in a terminal session:

wget -nc -w 3 -r --random-wait -l 2 -np -E --domains=2164th.blogspot.com http://2164th.blogspot.com

-nc means no clobber. It means if wget has already downloaded a page, it won't download it again.

-w 3 means wait three seconds between downloads, so I don't hammer the target website. That allows the boys at the Elephant Bar to keep posting.

-r means "recursive"...that means I want wget to surf to any links it finds and download those too.

--random-wait varies the three-second delay randomly (anywhere from half to one and a half times the -w value, so 1.5 to 4.5 seconds here) so the server on the other end doesn't think I'm a robot doing this, which I am.

-l 2 means I just want to surf to a depth of 2. That way I don't probe too deeply which would fill up my hard drive real quick. I wanted the comments, but I don't want to follow any links made in those comments. Otherwise I'd be downloading the whole internet.

-np means "no parent": wget never climbs above the directory it started in. I just want to stay in the Elephant Bar, not go up to Blogger itself and start downloading everyone else's blog.

-E means adjust the extension: pages that are really HTML but have funky names ending in ".asp" get saved with ".html" tacked on the end.

--domains=2164th.blogspot.com limits my search to the Elephant Bar, so I don't follow anything listed in the blogroll. And the last bit is the URL of the Elephant Bar itself.
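
If the one-letter switches are hard to keep straight, the same options can be spelled out with wget's long names. This is only a sketch: mirror_args is a made-up helper name, and --adjust-extension assumes a reasonably recent wget (older versions call it --html-extension). The function just prints the arguments, one per line, so nothing touches the network until you hand them to wget yourself.

```shell
#!/bin/sh
# Print the long-option equivalents of: -nc -w 3 -r --random-wait -l 2 -np -E
# mirror_args is a hypothetical helper name, not part of wget.
mirror_args() {
    domain=$1
    printf '%s\n' \
        --no-clobber \
        --wait=3 \
        --random-wait \
        --recursive \
        --level=2 \
        --no-parent \
        --adjust-extension \
        "--domains=$domain" \
        "http://$domain"
}

# To actually run the download (this hits the network):
#   wget $(mirror_args 2164th.blogspot.com)
```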

Wget ran for about 24 hours and finally finished up. I now have a mirror of the EB blog on my hard drive, complete with all the comments, which I want to filter to get just my comments and repost them here on my blog. Fresh off that success, I'm stealing the entire website at www.textfiles.com. Linux is wunnerful.
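
The filtering step could start with something like the sketch below, which lists every mirrored page that mentions a commenter's name. The helper name pages_mentioning, the directory, and the name you search for are all placeholders, and it assumes the pages were saved with a ".html" ending (which -E should take care of). A plain text match will also catch pages where somebody else merely quotes the name, so treat the result as a first pass.

```shell
#!/bin/sh
# List mirrored .html pages that contain a given commenter name.
# pages_mentioning is a hypothetical helper; both arguments are placeholders.
pages_mentioning() {
    mirror_dir=$1
    name=$2
    # -l: print only the names of matching files
    # -F: treat the name as a literal string, not a regular expression
    find "$mirror_dir" -type f -name '*.html' -exec grep -lF -- "$name" {} +
}

# Example (directory name as wget would create it for this mirror):
#   pages_mentioning 2164th.blogspot.com "Your Commenter Name"
```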
