A New Approach to Web Archiving

At the Marriott Library, we’ve recently begun looking into what it would take to archive websites that are important to the University. During some research into this area, I came across the proceedings of the 2009 International Web Archiving Workshop (IWAW).

An interesting project under way in France may change the way web archiving is approached. At University P. and M. Curie in Paris, researchers are developing a web crawler that will not only detect changes to a website but also distinguish unimportant changes (rotating ads on a page, etc.) from changes that matter to the page’s content. If successful, this could greatly improve the efficiency of web archiving, because digital archives would no longer be gumming up bandwidth and storage space with needless data.

This project is taking place in conjunction with the French National Audio-Visual Institute (INA), which would like to archive French television and radio station websites. The visual component of the pages, not just their content, is very important to the project.

According to the workshop proceedings, the project idea is to “use a visual page analysis to assign importance to web pages parts, according to their relative location. In other words, page versions are restructured according to their visual representation. Detecting changes on such restructured page versions gives relevant information for understanding the dynamics of the web sites. A web page can be partitioned into multiple segments or blocks and, often, the blocks in a page have a different importance. In fact, different regions inside a web page have different importance weights according to their location, area size, content, etc. Typically, the most important information is on the center of a page, advertisement is on the header or on the left side and copyright is on the footer. Once the page is segmented, then a relative importance must be assigned to each block…Comparing two pages based on their visual representation is semantically more informative than with their HTML representation.”
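To make the idea in that passage a little more concrete, here is a rough sketch in Python. It is not the researchers' algorithm. The Block structure, the importance formula (area times closeness to the page center), and the sample pages are all my own assumptions, chosen only to show how weighting blocks by location and size lets a crawler tell an ad swap from a real content change.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One visual segment of a rendered page (hypothetical structure)."""
    name: str
    x: float       # left edge, as a fraction of page width (0..1)
    y: float       # top edge, as a fraction of page height (0..1)
    width: float   # fraction of page width
    height: float  # fraction of page height
    text: str


def importance(block: Block) -> float:
    """Toy weight (an assumption): big blocks near the page center score higher."""
    cx = block.x + block.width / 2
    cy = block.y + block.height / 2
    distance = math.hypot(cx - 0.5, cy - 0.5)
    max_distance = math.sqrt(0.5)  # distance from the center to a corner
    centrality = max(0.0, 1.0 - distance / max_distance)
    return block.width * block.height * centrality


def change_score(old: list[Block], new: list[Block]) -> float:
    """Share of the new page's total importance held by blocks whose text changed."""
    old_text = {b.name: b.text for b in old}
    total = sum(importance(b) for b in new)
    if total == 0:
        return 0.0
    changed = sum(importance(b) for b in new if old_text.get(b.name) != b.text)
    return changed / total


def make_page(ad: str, story: str) -> list[Block]:
    """Build a made-up three-block page: header ad, main story, footer."""
    return [
        Block("header_ad", 0.0, 0.0, 1.0, 0.1, ad),
        Block("main", 0.1, 0.15, 0.8, 0.7, story),
        Block("footer", 0.0, 0.9, 1.0, 0.1, "Copyright notice"),
    ]


old_page = make_page("Ad A", "Original story")
ad_swap = make_page("Ad B", "Original story")
story_update = make_page("Ad A", "Updated story")

print("ad swap:", round(change_score(old_page, ad_swap), 3))
print("story update:", round(change_score(old_page, story_update), 3))
```

With these made-up weights, the ad swap produces a score close to zero and the story update a score close to one, so a crawler could skip re-archiving the first and capture the second. A real system would get its blocks from a rendered page rather than hand-written coordinates.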

The presenters sum up the main concept and their hoped-for contributions to the world of web archiving as follows:

• A novel web archiving approach that combines three concepts: visual page analysis (or segmentation), visual change detection and importance of web page’s blocks.

• An extension of an existing visual segmentation model to describe the whole visual aspect of the web page.

• An adequate change detection algorithm that computes changes between visual layout structures of web pages with a reasonable complexity in time.

• A method to evaluate the importance of changes occurred between consecutive versions of documents.

• An implementation of our approach and some experiments to demonstrate its feasibility.

It will be interesting to follow this project as it reaches its conclusion and to see how its results affect current web archiving players as well as fellow research endeavors like the Memento Project.

You can read about this project in much more technical detail at the IWAW website (unless it’s been taken down and hasn’t been properly archived).

