Cache
- Depressing
- Posts: 1989
- Joined: 2008.09.28 (01:10)
- NUMA Profile: http://nmaps.net/user/UniverseZero
- Steam: www.steamcommunity.com/id/universezero/
- MBTI Type: ENTJ
- Location: The City of Sails, The Land of the Long White Cloud
- Contact:
- Lifer
- Posts: 1066
- Joined: 2008.09.26 (18:37)
- NUMA Profile: http://nmaps.net/user/EdoI
- MBTI Type: INTJ
- Location: Zenica, Bosnia and Herzegovina
- Global Mod
- Posts: 1416
- Joined: 2008.09.26 (05:35)
- NUMA Profile: http://nmaps.net/user/scythe33
- MBTI Type: ENTP
- Location: 09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
wget -r forums.therealn.com
or somesuch. Suki probably knows the command. I don't.
- Retrofuturist
- Posts: 3131
- Joined: 2008.09.19 (06:55)
- MBTI Type: ENTP
- Location: California, USA
- Contact:
scythe33 wrote:
wget -r forums.therealn.com
or somesuch. Suki probably knows the command. I don't.

All I did was write a spider in Python.
wget would probably download only the first page of everything, or maybe just the stylesheet if all of the post data is private (since the page would load all of the posts based on which tags you put in the URL... or whatever those & doodads are called; I don't know PHP).
Trying it anyway, expecting hilarious results...
...
Holy fuggin' crap, it worked. I mean, I hope you like reading URLs and are prepared to see a royal lot of "you need to log in for this" pages, but it's actually downloading every page of every topic. Amazing.
wget -r http://forum.therealn.com

- Semimember
- Posts: 22
- Joined: 2008.09.28 (14:06)
- Depressing
- Posts: 1977
- Joined: 2008.09.26 (06:46)
- NUMA Profile: http://nmaps.net/user/rennaT
- MBTI Type: ISTJ
- Location: Trenton, Ontario, Canada
- Contact:
To run that command, UZ, you'll need either a Linux distribution or Cygwin. There's an excellent guide to setting up Cygwin here.

Quit messing around with my maple syrup, câlice; if I see you around here again I'll call the police, and you'll see what kind of lousy poutine you get in prison, tabarnak
- Lifer
- Posts: 1099
- Joined: 2008.09.26 (21:35)
- NUMA Profile: http://nmaps.net/user/smartalco
- MBTI Type: INTJ

Tycho: "I don't know why people ever, ever try to stop nerds from doing things. It's really the most incredible waste of time."
Adam Savage: "I reject your reality and substitute my own!"
- Bayking
- Posts: 315
- Joined: 2008.10.01 (20:26)
- NUMA Profile: http://nmaps.net/user/exuberance
- Location: Guelph, Ontario, Canada
smartalco wrote:
wait, wait, shouldn't there only be a handful of actual physical pages? How the fuck is it getting the whole forum when 95% of the pages are just the same viewtopic.php page with random variables after it? Is it actually parsing each page, looking for links, and then downloading the page that the link results in?

I'm guessing it does something like this (I don't actually know, but this would be the simplest, though not the most efficient, solution):
IF address starts with http://forum.therealn.com AND page has not yet been seen THEN
    save page
    repeat this process with every link on the page
END IF
So basically it just clicks every link, ignoring it if it is not part of therealn.com (to prevent it from attempting to download the entire interblag) or if it has seen it before (to prevent an infinite loop). Oh, and it probably ignores everything after a hash symbol (#), so that you don't get the same page at different positions, but you do get a different page for each different set of variables before the hash (topic and thread).
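For anyone who wants to poke at the idea, here is a minimal Python sketch of the pseudocode above. It is only a guess at the general technique, not how wget is actually implemented. The names `LinkCollector`, `crawl`, `fetch`, and `prefix` are made up for illustration, and it uses an explicit to-do list instead of recursion so that a big forum doesn't hit the recursion limit.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, prefix):
    """Return {url: html} for every page reachable from start_url under prefix.

    fetch is any callable that maps a URL to its HTML text (an assumption
    for this sketch; in real use it could wrap urllib.request.urlopen).
    """
    seen = set()
    pages = {}
    todo = [start_url]
    while todo:
        # Drop everything after '#': same page, different scroll position.
        url, _fragment = urldefrag(todo.pop())
        # Skip off-site links and pages we have already saved.
        if not url.startswith(prefix) or url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html  # "save page"
        collector = LinkCollector()
        collector.feed(html)
        # "repeat this process with every link on the page"
        todo.extend(urljoin(url, link) for link in collector.links)
    return pages
```

Passing `fetch` in as a parameter means you can try the loop against a dictionary of fake pages before pointing it at a real site.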
Random Thought: If you were to download the entire internet, (a) how many years would it take at various constant download speeds and (b) how much space would you need?
Comic Activity-O-Meter: (how often I'm updating my comic)
NOTE: If I just add a bunch of comics in one day, but plan on going back to normal after that, I probably won't update the status.
+ Dead: Canceled. Done. Maybe you'll get a random comic like once a year, but it's pretty much done.
- Zombie (Dead/Comatose): The comic is probably done with regular updates forever, but I'll probably still add something once in a blue moon. It's still POSSIBLE that I'll raise the status up, but not very likely. Maybe I'll have a comicplosion for like a week, then go back to being dead
+ Comatose: Complete stand-by. No (or very few) updates for some amount of time, but the comic's far from being over
- <AFK> (Comatose/Loitering): Stand-by, but you might possibly count on a few updates once and a while. Again, this is temporary
+ Loitering: Like comatose, but for short amounts of time.
- Turtling (Loitering/Semi-Active): Really slooooww updates
+ Semi-Active: One every 2 weeks...ish?
- Quasi-Active (Semi-Active/Active): Averaging about 2 comics every 3 weeks
+ Active: Loosely defined status, but about a weekly update
- Over-Active (Active/Power-leveling): About 2 comics a week
+ Power-leveling: About 3 comics a week. Possibly on a schedule, possibly not
- Über-Epic (Power-leveling/COMICPLOSION!!): In some cases, this may actually mean updates more frequently than COMICPLOSION!!, but I'm defining this level as a non-organized comic rush, kind of like a few days after my comic started
+ COMICPLOSION!!: Daily updates for a minimum of 5 days (since the daily updates started. It remains at this status until the 5, 7, whatever days are done)

"Science without religion is lame. Religion without science is blind." ~Albert Einstein
My N+ Vector Sprite Sheet ::: My Caption Contest ::: My Comic :::Puzzles of the Exuberant ::: DEFEND YOUR NINJA: THE FLASH GAME (Release Date TBA)

Exüberance on WoW

- Retrofuturist
- Posts: 3131
- Joined: 2008.09.19 (06:55)
- MBTI Type: ENTP
- Location: California, USA
- Contact:
smartalco wrote:
wait, wait, shouldn't there only be a handful of actual physical pages? How the fuck is it getting the whole forum when 95% of the pages are just the same viewtopic.php page with random variables after it? Is it actually parsing each page, looking for links, and then downloading the page that the link results in?

Sounds like you thought exactly what I thought would happen, but yeah, it does actually appear to follow on-site links.
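A small Python sketch of why "the same viewtopic.php page with random variables after it" still counts as many pages to a crawler: the script path is shared, but the query string is part of the URL, so each combination is a separate address to fetch. The two topic URLs and the parameter names `f` and `t` are assumptions for illustration, not taken from the real forum.

```python
from urllib.parse import urlsplit, parse_qs


def describe(url):
    """Split a URL into its script path and its query parameters."""
    parts = urlsplit(url)
    return parts.path, parse_qs(parts.query)


# Hypothetical topic URLs; the parameter names are assumptions.
a = "http://forum.therealn.com/viewtopic.php?f=3&t=10"
b = "http://forum.therealn.com/viewtopic.php?f=3&t=11"

print(describe(a))  # same path as b ...
print(describe(b))  # ... but a different value for t
print(a == b)       # the crawler compares whole URLs, so these differ
```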

- Lifer
- Posts: 1099
- Joined: 2008.09.26 (21:35)
- NUMA Profile: http://nmaps.net/user/smartalco
- MBTI Type: INTJ
Exüberance wrote:
Random Thought: If you were to download the entire internet, (a) how many years would it take at various constant download speeds and (b) how much space would you need?

At the rate the pipes can supply data, you would never finish downloading the internet, as content is being created faster than you could download it (and this will continue to be true as internet speeds increase, as the rate of content creation will also increase).
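To put the "you would never finish" argument into numbers, here is a rough Python sketch. It assumes a fixed starting size and constant download and growth rates, which is a big simplification, and the function name and every figure in it are made up for illustration rather than real measurements.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_download(size_bytes, download_rate, growth_rate=0.0):
    """Years needed to fetch size_bytes of existing content.

    Both rates are in bytes per second. Returns None when new content
    appears at least as fast as it can be downloaded, because the
    backlog then never shrinks.
    """
    net_rate = download_rate - growth_rate
    if net_rate <= 0:
        return None
    return size_bytes / net_rate / SECONDS_PER_YEAR


# Purely illustrative numbers: 1e15 bytes of content, various line speeds.
for rate in (1e5, 1e6, 1e7):
    print(rate, years_to_download(1e15, rate))

# Growth outpacing the download: never finishes.
print(years_to_download(1e15, 1e6, growth_rate=2e6))
```

The only thing that matters is the difference between the two rates; a faster connection helps only for as long as it stays ahead of content creation.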

-
- The number of Electoral College votes needed to be President of the US.
- Posts: 282
- Joined: 2008.10.07 (04:17)
- NUMA Profile: http://nmaps.net/user/Fraxtil
- MBTI Type: INTJ
- Location: Arizona, USA
- Contact:
Exüberance wrote:
Random Thought: If you were to download the entire internet, (a) how many years would it take at various constant download speeds and (b) how much space would you need?

Much of the content on the Internet is dynamic; it wouldn't really be possible to download it all (imagine downloading every search query page on every search engine).
- Bayking
- Posts: 315
- Joined: 2008.10.01 (20:26)
- NUMA Profile: http://nmaps.net/user/exuberance
- Location: Guelph, Ontario, Canada
Oh yeah.... way to kill a thought experiment.
I guess what I'm wondering is how much space is currently taken up by everything on the internet (as in the filesize of each webpage and its components, so for dynamic pages it's the filesize of the code, not each possible webpage you could download)
That would be like the uber1337 version of a jelly-bean contest except it would be impossible to actually figure out the answer :( that's no fun. I'm not even going to attempt to guess because even on a logarithmic scale I'd probably be way off.

- Lifer
- Posts: 1099
- Joined: 2008.09.26 (21:35)
- NUMA Profile: http://nmaps.net/user/smartalco
- MBTI Type: INTJ

- Albany, New York
- Posts: 521
- Joined: 2008.09.28 (02:00)
- MBTI Type: INTJ
- Location: Inner SE Portland, OR
- Contact:
Exüberance wrote:
Oh yeah.... way to kill a thought experiment.
I guess what I'm wondering is how much space is currently taken up by everything on the internet (as in the filesize of each webpage and its components, so for dynamic pages it's the filesize of the code, not each possible webpage you could download)
That would be like the uber1337 version of a jelly-bean contest except it would be impossible to actually figure out the answer :( that's no fun. I'm not even going to attempt to guess because even on a logarithmic scale I'd probably be way off.

Keep in mind that Google and other search engines do, in many ways, keep a local copy of the internet. Of course, search engines only deal in HTTP, and even then only in some of it: pages can forbid search engines from indexing them via robots.txt or meta tags, and even beyond that there's the section of the internet often referred to as the 'dark web' which, for various reasons, is inaccessible to search engines. A much higher percentage of the content out there is 'dark' than you might think.
If we look at the scripts that generate webpages and ignore things outside of HTTP(S), I'd imagine it's really quite small. The bulk of the information on the web is stored in databases of various sorts; the scripts only provide an interface to those databases.
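The robots.txt part can be sketched with Python's standard `urllib.robotparser` module. The rules and URLs below are invented for illustration; a real crawler would download the site's own robots.txt instead of using a hard-coded string, and honoring it is voluntary on the crawler's side.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, not taken from any real site.
RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(url, agent="*"):
    """True if the rules above let this user agent fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/notes.html"))
```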

Website! Photography! Robots! Facebook!
The latest computers from Japan can also perform magical operations.
- Global Mod
- Posts: 1416
- Joined: 2008.09.26 (05:35)
- NUMA Profile: http://nmaps.net/user/scythe33
- MBTI Type: ENTP
- Location: 09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0