Smart Cache Manual
Chapter 1 - Introduction
1.1 About this manual
This manual was converted from the Smart Cache English homepage to
debiandoc-sgml format, which makes it possible to generate many output
formats from one source. After the conversion, Radim Kolar extended the manual
into its current form and merged it with the translated Czech documentation,
which is no longer maintained.
English is not my native language; if you see any errors, just ignore them or
mail me.
1.2 When all things began
When all things began, the Word already was. The Word dwelt with God, and what God was, the
Word was. The Word, then, was with God at the beginning, and through him all
things came to be; no single thing was created without him. All that came to
be was alive with his life, and that life was the light of men. The light
shines on in the dark, and the darkness has never mastered it.
New Testament, The Gospel according to John, The coming of Christ.
1.3 Smart Cache is born
After leaving my job, I started to use a modem connection to the Internet. It
was slow, but the biggest problem for me was the quite high prices paid to the
monopoly Czech telecommunication company SPT Telecom. I found that I needed a
useful tool that would allow me to browse WWW pages off line.
I tried several methods (see Other off line browsing solutions, Section 1.4)
to achieve this goal, but all of them have some limitations and I found them
totally unusable for me. These programs are not bad; they are just not optimal
for what I want.
1.4 Other off line browsing solutions
This section was written in 1998, with the exception of the wwwoffle entry.
- IBM Internet Connection Server 4.0
-
This is a WWW server with a built-in proxy cache. The proxy cache uses a simple
CERN-like directory structure, so it was easy to find cached files. The proxy
cache also has a switch for off line mode, in which it returns only cached
pages. The biggest problem with this server was that it is based on the
original CERN HTTP daemon, which was not thread-safe. IBM ported this daemon to
OS/2, but they did not care about this and did not implement any locking
mechanism to protect thread-sensitive data or to synchronise threads. The
server complained very often about locked .cacheinfo files, and loaded objects
were not stored on disk. After some time IBM released a new version, 4.1. This
version introduced HTTP/1.1 support in the WWW server and the proxy cache. The
WWW server worked, with some occasional crashes, but the proxy cache was
totally broken. I never managed to get it running; they probably did not test
this part of their product. After some time IBM abandoned this server and
recommended that ICS users upgrade to Lotus Domino.
- Mailing pages to myself in Netscape
-
I found that Netscape Navigator can email an entire web page. So I started
mailing interesting pages to myself and browsing them via the Sent Mail
folder. This worked quite well; off line browsing was possible (even with
embedded pictures). But Netscape does not save pictures into the Sent Mail
folder. It saves them only in its internal disk cache, so once the pictures
expired from the cache, I could no longer see them.
- Using Netscape's internal disk cache
-
The Netscape browser has a persistent disk cache. This disk cache can cache
web objects between sessions, and there are a couple of programs, called
Netscape disk cache explorers, which allow the user to browse off line via
Netscape's cache. But this also has several limitations:
-
Netscape does not cache web pages without a 'Last-Modified' HTTP header. In
fact it does store them: the pages are written to the disk, but they are never
read back and Netscape deletes them on exit, so they are lost. This is the
biggest problem, because nowadays many web pages are generated on the fly by
the WWW server, so you end up with only images in the cache.
-
The cache becomes very slow when it grows to 30-40 MB. This is not a problem in
the UNIX version of Netscape, but the OS/2 and Windows versions have it.
-
All information about the cache is stored in one file, index.db. When this
file gets corrupted (not so uncommon), you lose everything.
-
Garbage collection is very strange. I set the disk cache to 50 MB; when it
grows over 50 MB, Netscape deletes nearly all files and leaves only 10 MB in
the cache. Too bad.
- Using Microsoft Internet Explorer's disk cache
-
I could not believe what I saw. This was much worse than Netscape. MSIE 4 is
stupid and caches even a badly downloaded (too short) file. This badly
downloaded file is displayed as if it were good, and if you request a reload
of that bad file, it is not reloaded, only checked via an If-Modified-Since
request. If you want to remove this bad file from the cache, you must clear
the entire cache. No MSIE, thank you.
- Using web grabbers
-
This looks very promising, but there are some problems:
-
A web grabber downloads what it wants, so it normally downloads many useless
pages and not the pages which you want to see.
-
A web grabber has very few configuration options. Even a very good program,
such as wget, is stupid in this respect. This does not apply to my newly
developed web downloader with the working name loader.
-
The biggest problem is refreshing web pages. You have only three choices:
refresh all (this normally downloads the entire set of pages again), never
refresh, or refresh manually via the WWW browser and Save as...
- Using Lotus Notes/Domino
-
Lotus Notes can work with HTML documents the same way as it does with its
normal document database. You may use any Notes features, such as Agents or
Scripts, on WWW documents. This is very good for writing Internet or Intranet
applications, but not the best solution for normal browsing. The built-in WWW
browser is very limited, even when compared to the old Netscape 2. It downloads
only one WWW object at a time, so a web page with many pictures takes very
long to load. Notes also requires too many system resources, and you cannot
run it on a 486 computer with only 20 MB of RAM.
- WWW Offline Explorer (wwwoffle)
-
This program does basically the same thing for offline browsing as Smart Cache,
but with some nice additional features (like HTML changing). It is written in
C and is available only for Unixes, but a Windows testing version also exists.
I ran webbench on both (SC and wwwoffle). With a small cache (about 10 MB) the
results are similar: wwwoffle is about 8% faster, and in the same benchmark as
used in Smart Cache Performance, Section 9.1, it reaches 984 pages/min on
Linux. With a large cache, SC is much faster, because wwwoffle uses just one
root directory level (SC uses 2) and no www directory level; you end up with
very large directories, which are very slow to search (at least on my
machine). WWWOffle's history is also recorded as symlinks in a special
directory, which creates one symlink for each visited URL. WWWOffle does not
support old HTTP/0.9 clients. Files stored by wwwoffle have the HTTP headers
inside and use long hashed filenames, but if you use the HTML interface, the
cache contents can be browsed.
Summary: If you have a very large network (300+ users) or if you want Squid,
get Squid. If you don't like SC, try wwwoffle; it will do a good job too.
Another well-known proxy cache is Apache 1.3. It is a good emergency solution
if you need a proxy cache quickly, since Apache is installed everywhere.
1.5 Smart Cache design goals
After considering the Other off line browsing solutions, Section 1.4, I
decided to write my own program that would solve all of these problems. I
wrote down the following design notes:
-
Perfect off line browsing support. The user must not see any difference
between on line and off line browsing.
-
Implement it as a proxy cache. It will be independent of the browser used and
fully transparent to the user.
-
Use a CERN-HTTP-like directory structure for the proxy cache (not hashed as in
Apache, Squid or WWWOffle) so that cached objects are easy to locate. Try to
use real file names like index.html and not obscure hashed ones like
Q3E4R2T342XCV3F42G3H2323.
-
For performance reasons, implement 2 swap directory levels (idea from Squid).
-
For easy file access, do not store HTTP headers inside cached objects. When I
tried to extract binary files (pictures) from CERN, Squid or Apache caches
with a text editor that claims to support binary files (ViM; I was too lazy to
write a special program), it still failed and the file got corrupted.
-
Do not store all received HTTP headers, just the important ones.
-
Program must be fully portable. I want to use it on OS/2, Linux and Windows.
-
The cache must be able to cache everything that other caches don't. I don't
want to write a well-behaved cache that respects the headers webmasters use to
gain more hits. Modem lines are slow. In fact, after writing Smart Cache I was
surprised how much faster browsing can be if we cache more than usual and kill
some advertising banners. This really makes a difference!
-
The program must allow blocking of unwanted URLs. Yes, for killing advertising
banners.
-
Program must remain fast and simple.
-
Extremely configurable and tunable garbage collection. I can't accept the
all-or-nothing design used in other caches. I want to control what stays on my
disk and for how long.
-
Possibility to continue downloading an object even if the user presses STOP in
the browser (idea from Squid).
-
The program must be robust and the possibility of data loss must be minimised.
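To make the directory-structure goals above more concrete, here is a minimal
sketch in Python of how a URL could be mapped to a readable cache file under
two swap directory levels. It is only an illustration and not the actual Smart
Cache layout: the function name cache_path, the root directory store, the
CRC32 hash and the 16 directories per level are all assumptions made for this
example.

```python
import zlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit

SWAP_DIRS = 16  # assumed number of directories per swap level


def cache_path(url: str, root: str = "store") -> PurePosixPath:
    """Map a URL to a readable cache path under two swap directory levels.

    Hypothetical layout: root/LL/MM/host/real/path/file
    The port and the query string are ignored to keep the sketch short.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "unknown").lower()

    # Hash only the host name, so all objects of one site stay together
    # and only the two swap levels are "obscure".
    h = zlib.crc32(host.encode("utf-8"))
    level1 = h % SWAP_DIRS
    level2 = (h // SWAP_DIRS) % SWAP_DIRS

    # Keep the real path segments; drop empty, "." and ".." segments so the
    # result can never point outside the cache root.
    segments = [s for s in parts.path.split("/") if s not in ("", ".", "..")]
    # A directory URL (or an empty path) gets a default file name.
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")

    return PurePosixPath(root, f"{level1:02d}", f"{level2:02d}", host, *segments)


if __name__ == "__main__":
    print(cache_path("http://www.example.com/pics/logo.gif"))
    print(cache_path("http://www.example.com/"))
```

Because only the two leading directory names come from a hash, a cached
picture can still be found by host name and file name and copied out directly,
while no single directory has to hold every site in the cache.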
Smart Cache Manual
0.84
Radim Kolar hsn@cybermail.net