WARC becomes ISO standard

It's official. The International Internet Preservation Consortium has announced the publication of the WARC file format as an ISO standard.

For those who are unfamiliar with web archiving, the WARC format provides a standard way to structure, manage and store resources collected from the web and elsewhere. It is an extension of the ARC format, which has been used since 1996 to store files harvested from the web.

WARC allows the recording of HTTP request headers and arbitrary metadata, the allocation of an identifier for every contained file, the management of duplicates and of migrated records, and the segmentation of records. WARC files are intended to store every type of digital content, whether retrieved over HTTP or by another protocol.
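To make the idea of a record with its own identifier and metadata a little more concrete, here is a rough sketch of how a single WARC-style record might be serialized. It is an illustration only, not a conformant writer: the function name `build_warc_record` is made up for this post, the field names follow my reading of the WARC 1.0 specification, and things like compression, digests and segmentation are left out entirely.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(warc_type, payload, target_uri=None,
                      content_type="application/octet-stream"):
    """Serialize one WARC-style record (hypothetical helper, sketch only).

    Layout assumed here: a version line, named header fields, a blank
    line, the content block, then two CRLFs closing the record.
    """
    fields = [
        ("WARC-Type", warc_type),
        # Every record gets its own identifier, here a random UUID URN.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
    ]
    if target_uri is not None:
        fields.append(("WARC-Target-URI", target_uri))
    fields.append(("Content-Type", content_type))
    # The container only needs the block length, not knowledge of its type.
    fields.append(("Content-Length", str(len(payload))))

    header = b"WARC/1.0" + CRLF
    for name, value in fields:
        header += f"{name}: {value}".encode("utf-8") + CRLF
    return header + CRLF + payload + CRLF + CRLF


# Records are simply concatenated, so one file can carry many objects.
archive = (
    build_warc_record("resource", b"hello", target_uri="http://example.org/")
    + build_warc_record("metadata", b"note: test crawl", content_type="text/plain")
)
```

The payload is treated as opaque bytes, which reflects the requirement that the container know as little as possible about the objects it carries.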

The demand for a container format such as WARC arose many years ago, when heritage organizations started searching for appropriate ways to collect and keep track of web content using web-scale tools such as web crawlers.

At the same time, cultural organizations were concerned with the requirement to archive very large numbers of born-digital and digitized files. What was needed was a container format that allows a single file to carry, simply and safely, a very large number of constituent data objects (of unrestricted type, including many binary types) for the purposes of storage, management, and exchange. Another requirement was that the container should need only minimal knowledge of the nature of the objects it holds.

The standardization of WARC is the first step toward facilitating interoperability across institutions charged with preserving our cultural heritage on the Internet. Hopefully, this move will promote the use of WARC beyond the web archiving community.
