
Datasets

File prefix Description Download
FTadhoc100

Ten datasets derived from the TREC ad hoc track (TRECs 6, 7 and 8), using the Financial Times (1991-94) archive, as used by Cribbin (2010) to evaluate topic clustering in various spatializations. Each dataset comprises 100 documents and one topic. Topic list: T319, T321, T343, T353, T354, T372, T390, T404, T416, T449. Follow this link for full topic definitions. The datasets provided here do not contain any of the original source document text. The source text can be obtained from NIST by following this link.

Term-document matrices

CHTML file

FTinteractive

Six datasets derived from the TREC interactive track (TRECs 6, 7 and 8), using the Financial Times (1991-94) archive, as used by Cribbin (2010) to evaluate topic clustering in various spatializations. The datasets comprise corpora of various sizes, each defined by an interactive topic and its respective aspects (sub-topics). Topic list: T307i, T347i, T352i, T387i, T408i, T446i. Follow T6i, T7i and T8i for full topic definitions. The datasets provided here do not contain any of the original source document text. The source text can be obtained from NIST by following this link.

 

Term-document matrices

 

CHTML file

 

 

Term-document matrices are tab-delimited files of TF-IDF weights in which terms are represented by rows and documents by columns. The first element in each row is a string identifying the term. The first numeric column is the vector for document #1, the second for document #2, and so on. Note that these are local document IDs; the original archive DOCNO (e.g. FT911-3410) is required to retrieve the source text from the TREC CD. I will create look-up tables mapping local IDs to DOCNOs and add these to the download area soon.
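
As an illustration of the matrix layout described above, the following minimal Python sketch reads a tab-delimited term-document matrix and extracts the vector for one local document ID. It assumes that the file has no header row and that every row has the same number of numeric columns, neither of which is stated here. The function names and the two-term sample are hypothetical and are not part of the download.

```python
import csv
import io


def load_term_document_matrix(handle):
    """Parse a tab-delimited matrix: one term per row, one document per column.

    Assumption: there is no header row, and the first field of each row
    is the term string, followed by one weight per document.
    """
    terms = []
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        terms.append(fields[0])
        # Ignore empty fields, e.g. those produced by a trailing tab.
        rows.append([float(x) for x in fields[1:] if x != ""])
    return terms, rows


def document_vector(rows, local_id):
    """Return the weights of one document, given its 1-based local ID."""
    return [row[local_id - 1] for row in rows]


# Hypothetical sample: two terms (rows) by three documents (columns).
sample = "bank\t0.5\t0.0\t1.2\nloan\t0.0\t0.7\t0.3\n"
terms, rows = load_term_document_matrix(io.StringIO(sample))
doc2 = document_vector(rows, 2)  # weights of document #2 for each term
```

To read a real file, pass an open file object in place of the `io.StringIO` sample, for example `open(path, newline="")`, where `path` points to one of the downloaded matrices.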

 

CHTML is a custom format used by the author to prepare data prior to performing cluster hypothesis tests on spatializations. A readme file explaining this format is provided in the CHTML archive associated with each dataset.

 

Any report of experiments using these datasets should cite the following paper: Cribbin, T. (2010). Visualising the structure of document search results: a comparison of graph theoretic approaches. Information Visualization, 9(2), 83-97.

 

More datasets and software to come here in due course...

 

Last updated on 24/05/2010.