Cache

Talk about computers, hardware, applications, and consumer electronics.
Depressing
Posts: 1989
Joined: 2008.09.28 (01:10)
NUMA Profile: http://nmaps.net/user/UniverseZero
Steam: www.steamcommunity.com/id/universezero/
MBTI Type: ENTJ
Location: The City of Sails, The Land of the Long White Cloud
Contact:

Postby Universezero » 2009.05.01 (09:00)

Most holidays I usually go up to my beach-house, and there's no Internet there. So, I was wondering if it's possible to cache/download the entire RealN Forums as an archive to look through while I'm there.

Lifer
Posts: 1066
Joined: 2008.09.26 (18:37)
NUMA Profile: http://nmaps.net/user/EdoI
MBTI Type: INTJ
Location: Zenica, Bosnia and Herzegovina

Postby EdoI » 2009.05.01 (15:22)

Try this.

Global Mod
Posts: 1416
Joined: 2008.09.26 (05:35)
NUMA Profile: http://nmaps.net/user/scythe33
MBTI Type: ENTP
Location: 09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0

Postby scythe » 2009.05.01 (17:25)

wget -r forums.therealn.com
or somesuch. Suki probably knows the command. I don't.
As soon as we wish to be happier, we are no longer happy.

Retrofuturist
Posts: 3131
Joined: 2008.09.19 (06:55)
MBTI Type: ENTP
Location: California, USA
Contact:

Postby t̷s͢uk̕a͡t͜ư » 2009.05.01 (19:40)

scythe33 wrote:wget -r forums.therealn.com
or somesuch. Suki probably knows the command. I don't.
All I did was write a spider in Python.

wget would probably download only the first page of everything, or maybe just the stylesheet if all of the post data is private (since the page would load all of the posts based on which tags you put in the URL... or whatever those & doodads are called; I don't know PHP).

Trying it anyway, expecting hilarious results...

...
Holy fuggin' crap, it worked. I mean, I hope you like reading URLs and are prepared to see a royal lot of "you need to log in for this" pages, but it's actually downloading every page of every topic. Amazing.

wget -r http://forum.therealn.com
Good call, scythe. o_O
[spoiler="you know i always joked that it would be scary as hell to run into DMX in a dark ally, but secretly when i say 'DMX' i really mean 'Tsukatu'." -kai]"... and when i say 'scary as hell' i really mean 'tight pink shirt'." -kai[/spoiler][/i]


Semimember
Posts: 22
Joined: 2008.09.28 (14:06)

Postby ZZ9 » 2009.05.02 (12:05)

If you add the -k switch, it'll change all of the links into local URLs.
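
For anyone who would rather script it, here is a minimal Python sketch that only assembles the wget command from this thread with the -k switch added. The helper names build_wget_command and mirror are made up for illustration; --recursive and --convert-links are the long spellings of -r and -k, and the sketch assumes wget is installed and on the PATH.

```python
import subprocess
from typing import List


def build_wget_command(url: str, convert_links: bool = True) -> List[str]:
    """Assemble the argument list for a recursive wget download."""
    command = ["wget", "--recursive"]  # same as -r
    if convert_links:
        command.append("--convert-links")  # same as -k: rewrite links for local viewing
    command.append(url)
    return command


def mirror(url: str) -> None:
    """Run wget on the URL; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_command(url), check=True)
```

Calling mirror("http://forum.therealn.com") would then run the same download tsukatu tried, with links converted so the pages can be browsed offline.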

Depressing
Posts: 1977
Joined: 2008.09.26 (06:46)
NUMA Profile: http://nmaps.net/user/rennaT
MBTI Type: ISTJ
Location: Trenton, Ontario, Canada
Contact:

Postby Tanner » 2009.05.03 (12:34)

The problem with asking Linux nerds for help is that they often get so involved in finding a solution that they exceed the abilities of the person asking for help.

To run that command, UZ, you'll need either a Linux distribution or Cygwin. There's an excellent guide to setting up Cygwin here.
Last edited by Tanner on 2009.05.05 (02:58), edited 1 time in total.
'rret donc d'niaser 'vec mon sirop d'erable, calis, si j't'r'vois icitte j'pellerais la police, tu l'veras l'criss de poutine de cul t'auras en prison, tabarnak

Lifer
Posts: 1099
Joined: 2008.09.26 (21:35)
NUMA Profile: http://nmaps.net/user/smartalco
MBTI Type: INTJ

Postby smartalco » 2009.05.04 (23:53)

wait, wait, shouldn't there only be a handful of actual physical pages? How the fuck is it getting the whole forum when 95% of the pages are just the same viewtopic.php page with random variables after it? Is it actually parsing each page, looking for links, and then downloading the page that the link results in?
Tycho: "I don't know why people ever, ever try to stop nerds from doing things. It's really the most incredible waste of time."
Adam Savage: "I reject your reality and substitute my own!"

Bayking
Posts: 315
Joined: 2008.10.01 (20:26)
NUMA Profile: http://nmaps.net/user/exuberance
Location: Guelph, Ontario, Canada

Postby Exüberance » 2009.05.05 (00:09)

smartalco wrote:wait, wait, shouldn't there only be a handful of actual physical pages? How the fuck is it getting the whole forum when 95% of the pages are just the same viewtopic.php page with random variables after it? Is it actually parsing each page, looking for links, and then downloading the page that the link results in?
I'm guessing it does something like this (I don't actually know, but this would be the simplest, though not the most efficient, solution):

IF address starts with http://forum.therealn.com AND page has not yet been seen THEN
    save page
    repeat this process with every link on the page
END IF

So basically it just clicks every link, ignoring it if it is not part of therealn.com (to prevent it from attempting to download the entire interblag) or if it has seen it before (to prevent an infinite loop). Oh, and it probably ignores everything after a hash symbol (#) so that you don't get the same page at different positions, but you do get a different page for different variables before the hash (topic and thread).
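
The guess above can be sketched in Python roughly like this. It is not how wget is actually implemented, just the same idea under a few assumptions: the names LinkExtractor, crawl and FAKE_SITE are invented for the example, a queue replaces the "repeat this process" recursion, and the pages come from a small made-up dictionary instead of the network so the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url: str, prefix: str, fetch: Callable[[str], str]) -> Dict[str, str]:
    """Save every page reachable from start_url whose address starts with prefix."""
    saved = {}
    queue = deque([start_url])
    while queue:
        url, _fragment = urldefrag(queue.popleft())  # ignore everything after '#'
        if not url.startswith(prefix) or url in saved:
            continue  # off-site link, or a page we have already seen
        html = fetch(url)
        saved[url] = html  # "save page"
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            queue.append(urljoin(url, href))  # resolve relative links against this page
    return saved


# A made-up three-page "forum" standing in for real HTTP requests.
FAKE_SITE = {
    "http://forum.example/index.php":
        '<a href="viewtopic.php?t=1">one</a> <a href="http://elsewhere.example/">out</a>',
    "http://forum.example/viewtopic.php?t=1":
        '<a href="index.php#top">back</a> <a href="viewtopic.php?t=2">two</a>',
    "http://forum.example/viewtopic.php?t=2":
        '<a href="viewtopic.php?t=1#p5">loop</a>',
}

pages = crawl("http://forum.example/index.php", "http://forum.example/", FAKE_SITE.__getitem__)
```

Even though every topic is the same viewtopic.php script, each distinct query string counts as its own address, so both topics get saved once, the off-site link is skipped, and the #top and #p5 links do not produce duplicates.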

Random Thought: If you were to download the entire internet, (a) how many years would it take at various constant download speeds and (b) how much space would you need?
ExüberNewsFeed: Exuberance is mostly <AFF> (Away From Forums) for a while, though I may still participate in epic contests/threads. When I return, I shall bring several comic updates (enough to finish season 1) and hopefully 1 or 2 games- at least one of which is N-related
Comic Activity-O-Meter: (how often I'm updating my comic)
(Click here to see what each level and half-level means in terms of updates per time period)

NOTE: If I just add a bunch of comics in one day, but plan on going back to normal after that, I probably won't update the status.
+ Dead: Canceled. Done. Maybe you'll get a random comic like once a year, but it's pretty much done.
- Zombie (Dead/Comatose): The comic is probably done regular updates forever, but I'll probably still add something once in a blue moon. It's still POSSIBLE, that I'll raise the status up, but not very likely. Maybe I'll have a comicplosion for like a week, then go back to being dead
+ Comatose: Complete stand-by. No (or very few) updates for some amount of time, but the comic's far from being over
- <AFK> (Comatose/Loitering): Stand-by, but you might possibly count on a few updates once and a while. Again, this is temporary
+ Loitering: Like comatose, but for short amount of times.
- Turtling (Loitering/Semi-Active): Really slooooww updates
+ Semi-Active: One every 2 weeks...ish?
- Quasi-Active (Semi-Active/Active): Averaging about 2 comics every 3 weeks
+ Active: Loosely defined status, but about a weekly update
- Over-Active (Active/Power-leveling): About 2 comics a week
+ Power-leveling: About 3 comics a week. Possible a schedule, possibly not
- Über-Epic (Power-leveling/COMICPLOSION!!): In some cases, this may actually be mean updates more frequently than COMICPLOSION!!, but I'm defining this level as a non-organized comic rush, kind of like a few days after my comic started
+ COMICPLOSION!!: Daily updates for a minimum of 5 days (since the daily updates started. It remains at this status until the 5, 7, whatever days are done)

Image
"Science without religion is lame. Religion without science is blind." ~Albert Einstein
My N+ Vector Sprite Sheet ::: My Caption Contest ::: My Comic :::Puzzles of the Exuberant ::: DEFEND YOUR NINJA: THE FLASH GAME (Release Date TBA)
Image
Exüberance on WoW
Image
Maps in the Fernat Epic (so far): (meh, let's put this in a spoiler too. My sig's gettin too big. I'm such a packrat :p)

Nmaps.netNmaps.net


Retrofuturist
Posts: 3131
Joined: 2008.09.19 (06:55)
MBTI Type: ENTP
Location: California, USA
Contact:

Postby t̷s͢uk̕a͡t͜ư » 2009.05.05 (01:00)

smartalco wrote:wait, wait, shouldn't there only be a handful of actual physical pages? How the fuck is it getting the whole forum when 95% of the pages are just the same viewtopic.php page with random variables after it? Is it actually parsing each page, looking for links, and then downloading the page that the link results in?
Sounds like you thought exactly what I thought would happen, but yeah, it does actually appear to follow on-site links.


Lifer
Posts: 1099
Joined: 2008.09.26 (21:35)
NUMA Profile: http://nmaps.net/user/smartalco
MBTI Type: INTJ

Postby smartalco » 2009.05.05 (21:26)

Exüberance wrote:Random Thought: If you were to download the entire internet, (a) how many years would it take at various constant download speeds and (b) how much space would you need?
At the rate the pipes can supply data, you would never finish downloading the internet, as content is being created faster than you could download it (and this will continue to be true as internet speeds increase, as the rate of content creation will also increase).
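
As a rough back-of-the-envelope sketch of that argument in Python: if the content has some size, grows at a constant rate, and you download at a constant rate, you only ever catch up when the download rate is higher than the growth rate. The function name and the constant-rate model are assumptions made for illustration, and no real figure for the size of the internet is implied; plug in whatever numbers you want to play with.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_download(size_bytes, download_bps, growth_bps=0.0):
    """Years needed to catch up with the content, or None if you never would.

    size_bytes is the amount of data that exists at the start; download_bps and
    growth_bps are constant rates in bits per second.
    """
    net_bps = download_bps - growth_bps  # how fast the backlog actually shrinks
    if net_bps <= 0:
        return None  # content is created at least as fast as it is downloaded
    return size_bytes * 8 / net_bps / SECONDS_PER_YEAR
```

With growth_bps left at zero this answers part (a) of the random thought for any size and speed you choose; with growth_bps at or above download_bps it returns None, which is the "you would never finish" case.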

The number of Electoral College votes needed to be President of the US.
Posts: 282
Joined: 2008.10.07 (04:17)
NUMA Profile: http://nmaps.net/user/Fraxtil
MBTI Type: INTJ
Location: Arizona, USA
Contact:

Postby Fraxtil » 2009.05.06 (02:05)

Exüberance wrote:Random Thought: If you were to download the entire internet, (a) how many years would it take at various constant download speeds and (b) how much space would you need?
Much of the content on the Internet is dynamic; it wouldn't really be possible to download it all (imagine downloading every search query page on every search engine).

Bayking
Posts: 315
Joined: 2008.10.01 (20:26)
NUMA Profile: http://nmaps.net/user/exuberance
Location: Guelph, Ontario, Canada

Postby Exüberance » 2009.05.06 (19:41)

Oh yeah.... way to kill a thought experiment.

I guess what I'm wondering is how much space is currently taken up by everything on the internet (as in the file size of each webpage and its components, so for dynamic pages it's the file size of the code, not each possible webpage you could download).


That would be like the uber1337 version of a jelly-bean contest except it would be impossible to actually figure out the answer :( that's no fun. I'm not even going to attempt to guess because even on a logarithmic scale I'd probably be way off.


Lifer
Posts: 1099
Joined: 2008.09.26 (21:35)
NUMA Profile: http://nmaps.net/user/smartalco
MBTI Type: INTJ

Postby smartalco » 2009.05.07 (15:19)

You aren't allowed to keep that avatar if you just give up.

Albany, New York
Posts: 521
Joined: 2008.09.28 (02:00)
MBTI Type: INTJ
Location: Inner SE Portland, OR
Contact:

Postby jean-luc » 2009.05.16 (20:06)

Exüberance wrote:Oh yeah.... way to kill a thought experiment.

I guess what I'm wondering is how much space is currently taken up by everything on the internet (as in the file size of each webpage and its components, so for dynamic pages it's the file size of the code, not each possible webpage you could download).


That would be like the uber1337 version of a jelly-bean contest except it would be impossible to actually figure out the answer :( that's no fun. I'm not even going to attempt to guess because even on a logarithmic scale I'd probably be way off.
Keep in mind that Google and other search engines do, in many ways, keep a local copy of the internet. Of course, search engines only deal in HTTP, and even then only in some of it: pages can forbid search engines from indexing them via robots.txt or meta tags, and even beyond that there's the section of the internet often referred to as the 'deep web' which, for various reasons, is inaccessible to search engines. A much higher percentage of the content out there is 'deep' than you might think.

If we look at the scripts that generate webpages and ignore things outside of HTTP(S), I'd imagine it's really quite small. The bulk of the information on the web is stored in databases of various sorts; the scripts only provide an interface to those databases.
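
To make the robots.txt part concrete, here is a small Python sketch using the standard library's urllib.robotparser. The robots.txt text, the ExampleBot user agent and the forum.example URLs are all invented for the example; a real crawler would fetch the site's own /robots.txt instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that asks every crawler to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)


topic_ok = allowed(ROBOTS_TXT, "ExampleBot", "http://forum.example/viewtopic.php?t=1")
private_ok = allowed(ROBOTS_TXT, "ExampleBot", "http://forum.example/private/notes.html")
```

A well-behaved crawler runs a check like this before each request, which is one reason a search engine's copy of a site can be missing pages that a logged-in visitor sees.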
-- I might be stupid, but that's a risk we're going to have to take. --
Website! Photography! Robots! Facebook!
The latest computers from Japan can also perform magical operations.

Global Mod
Posts: 1416
Joined: 2008.09.26 (05:35)
NUMA Profile: http://nmaps.net/user/scythe33
MBTI Type: ENTP
Location: 09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0

Postby scythe » 2009.05.16 (20:48)

As soon as we wish to be happier, we are no longer happy.

