Change Detection 2026 guidelines
This page describes the specific guidelines for the 2026 track. Please read this entire page fully; this is a new task, and there are some novel features that may not be obvious.
Bring questions to the TREC Slack #trec-change-detection-2026 channel.
Task description
The document collection is a chronologically-ordered set of web news articles. The task proceeds day-by-day through the collection.
A topic represents an information need that the user is tracking over time. In the context of that topic, the user has a set of questions that they use to organize the information coming in.
On each day, your system will return a ranked list of questions that are relevant on that day. A question is relevant if there are relevant documents that inform the question on that day. Your system should only return relevant questions, and may decide to not rank any questions for that day.
For every ranked question, your system will return a ranked list of at most 100 documents predicted to be relevant to the question. "Relevant" here means that the document contains information that is at least responsive to the question and is hopefully useful in answering the question. If there are no relevant documents for the question on that day, the question should be left off the ranking.
Your system can additionally propose new questions and include them in its question ranking starting on the day they are proposed.
On most days, there may be nothing of relevance to the topic. Systems should only return questions predicted to be relevant, and must decide where to cut off their ranking. (Missing questions will be assumed to have zero-length document rankings for that day.) In fact, it is safe to assume there is very little, if any, actionable traffic on any given day.
Data
We will use the English portion of the RAGTIME1 dataset, available from HuggingFace Datasets. RAGTIME1 has 1M documents, with 1000 documents on each day of the collection.
A document in RAGTIME is a JSON object:
{
  "id": "0e4ae416-c558-4741-9e4a-4988869bcd74_2154817",
  "text": "Means, Franco lead Orioles to 5-2 win over Tigers\n\nDETROIT (AP) — John Means struck out six in six strong innings, Maikel Franco homered and the Baltimore Orioles beat the Detroit Tigers 5-2 on Saturday night.\n\nMeans (5-3) gave up one run on four hits and recorded his first victory since May 5....",
  "url": "https://www.timesunion.com/sports/article/Means-Franco-lead-Orioles-to-5-2-win-over-Tigers-16355042.php",
  "date": "2021-08-01T02:08:10.000Z"
}
The collection is timestamp-ordered. Participants must process the collection in order, one day at a time, and must not allow their systems to access documents from the "future".
Participant systems must output their decisions on each day's documents that day, before receiving documents from subsequent days. Likewise, systems produce output only for that day's documents, never for any prior day. Systems should process all topics for day N before proceeding to day N+1. Everything happens in distinct one-day chunks.
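To make the protocol concrete, here is a minimal sketch of the required loop in Python. The dataset path, load_topics, and process_topic_day are placeholders for your own data location and system logic, not part of the track infrastructure:

from collections import defaultdict
from datasets import load_dataset

docs = load_dataset("ragtime-1", split="train")  # placeholder dataset path

# Group the collection by the 10-character date prefix of each document.
by_day = defaultdict(list)
for doc in docs:
    by_day[doc["date"][:10]].append(doc)  # "2021-08-01T02:08:10.000Z" -> "2021-08-01"

topics = load_topics("topics.jsonl")  # see the Topics section below

for day in sorted(by_day):  # strictly chronological
    for topic in topics:  # all topics for day N...
        process_topic_day(topic, day, by_day[day])  # your system's per-day decisions
    # ...before any document from day N+1 is seen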
We have developed a toolkit that you can use to build your system so that the track protocol is followed and your output file is correctly formatted. While the toolkit is implemented in Python, it uses an OpenAPI-based REST server, so you can communicate with it from any language you like. The toolkit includes a simple example system that demonstrates its use and a few development topics; the scoring script is not yet included.
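For example, a client's exchange with the toolkit server might look like the sketch below. The endpoint paths here are hypothetical; consult the toolkit's OpenAPI specification for the actual routes:

import requests

BASE = "http://localhost:8000"  # wherever you run the toolkit server

# Hypothetical routes; check the server's OpenAPI spec for the real ones.
day_docs = requests.get(f"{BASE}/documents/next-day").json()
decisions = {"2021-08-01": []}  # per-day output shaped as in "Output format" below
requests.post(f"{BASE}/results", json=decisions)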
Topics
The topics are in JSONL format (here and below, JSONL is pretty-printed for clarity):
{
  "tid": "a topic identifier",
  "label": "a label for the topic, similar to a title",
  "narrative": "a description of the information need in paragraph form. The narrative describes the topic succinctly, as one might tell it to someone conversationally. It is not intended to completely specify the information need.",
  "questions": [
    {
      "qid": "a question identifier",
      "question": "the question",
      "rel_docs": [
        "document-id-1",
        "document-id-2",
        ...
      ]
    },
    { ... }
  ]
}
All assessor-created analytic questions are present in the topic description, so they are known to the system at the start of the task. Additionally, each question has a small number of example documents (its rel_docs list). Systems may use those documents when they arrive in the document stream, but not before. They are not guaranteed to be the first relevant documents in the collection.
A small number of development topics, with all relevant documents identified during topic development, will be released with the API tool.
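Reading the topic file is ordinary JSONL handling, as in this sketch ("topics.jsonl" is a placeholder filename):

import json

def load_topics(path):
    # One JSON object per line; blank lines are skipped.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

topics = load_topics("topics.jsonl")
for topic in topics:
    qids = {q["qid"] for q in topic["questions"]}
    print(topic["tid"], topic["label"], len(qids), "questions")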
Output format
The format for a run is a JSONL file: a run metadata line followed by one line per topic. The run metadata line will have several required fields (TBD). Both the run metadata line and the following per-topic lines may contain extra information in an extra slot. This allows you to include your own metadata about your run or about topic processing in the output file itself.
{ "runtag": "nist_cd_1",
"models": [ "llama47", "gpt-pi", "gemini-foo-preview" ],
"description": "a free text description of yout run",
"extra": { "other metadata": "as you like" }
},
{
  "topic": "a topic identifier",
  "extra": { "the run may contain": "other metadata here as you like" },
  "results": {
    "2021-08-01": [
      {
        "qid": "q_1",
        "question-rank": 0,
        "question-text": "This is a question from the topic",
        "doc-ranking": [
          { "doc_id": "doc-id-1", "score": 0.9876 },
          { "doc_id": "doc-id-2", "score": 0.9870 },
          ...
        ],
        "extra": { "you can include": "extra metadata about your question response, like a score, or other info useful to you." }
      },
      {
        "qid": "nist_cd_1_q_1",
        "question-rank": 1,
        "question-text": "This is a new analytic question proposed by my system",
        "doc-ranking": [ ... ]
      },
      ...
    ],
    "2021-08-02": ...
  }
}
Note that the TREC submission system enforces limits on the runtag: it must be 20 characters or less and can only include letters, numbers, hyphens, periods (not as the first character), and underscores. It must also be unique. Traditionally runtags start with your team name or an abbreviation thereof, for example "nist-cd1".
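A quick sanity check for these constraints might look like this (a sketch; the submission system's exact validation may differ):

import re

# At most 20 characters; letters, digits, hyphens, underscores, and
# periods, with no leading period.
RUNTAG_RE = re.compile(r"^[A-Za-z0-9_-][A-Za-z0-9._-]{0,19}$")

assert RUNTAG_RE.match("nist-cd1")
assert not RUNTAG_RE.match(".bad-first-char")
assert not RUNTAG_RE.match("this-runtag-is-far-too-long")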
The extra entries can contain anything you like. You can use these blocks to store additional metadata about your run or that specific topic. If your system is agentic, for example, you might include a representation of the agent interaction in the per-topic extra block.
The results block within a topic is an object with one entry per date in the collection. If a date in the collection is not present in the run, it is interpreted as meaning that all known questions are tied at rank 0.
Each date entry is a list with one entry per question. The list should only include questions your system decides have relevant information to return. A question that is missing that day is assumed to have had an empty ranked list of documents for that day.
The question identifier qid is either the qid of a question in the topic file, or an identifier that starts with your runtag if the question is newly proposed by your system. Proposed questions are considered to be part of the set of questions for that topic from that day forward.
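For example, a system with runtag nist_cd_1 might mint identifiers for its proposed questions like this (the numbering scheme after the runtag is up to you):

runtag = "nist_cd_1"
new_qid = f"{runtag}_q_1"  # must start with the runtag; the suffix is your choice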
The question-rank is an integer greater than or equal to zero. Ties between questions will be broken arbitrarily.
The doc-ranking is a list of 1 to 100 (doc_id, score) pairs, interpreted as a ranked list of documents that are relevant to that question. The score is a double-precision floating-point number. The documents must be from that day: the 10-character date prefix of the document's datestamp must match the date of this results entry. Systems should only return documents with relevant information, and should truncate their ranking at the point where they believe there are no more relevant documents.
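A minimal check of these rules might look like the following sketch, where doc_dates is an assumed doc_id-to-datestamp lookup you would build from the collection:

def check_doc_ranking(day, doc_ranking, doc_dates):
    # 1 to 100 entries, each drawn from the same day as this results entry.
    assert 1 <= len(doc_ranking) <= 100
    for entry in doc_ranking:
        assert doc_dates[entry["doc_id"]][:10] == day, entry["doc_id"]

check_doc_ranking(
    "2021-08-01",
    [{"doc_id": "doc-id-1", "score": 0.9876}],
    {"doc-id-1": "2021-08-01T02:08:10.000Z"},
)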
Evaluation
Performance will be based on the daily ranking of questions, and the rankings of documents for each question.
The metrics will be the variants of NDCG, AP, RBP, and RR described in "Quit While Ahead: Evaluating Truncated Rankings" by Fei Liu, Alistair Moffat, Timothy Baldwin, and Xiuzhen Zhang (SIGIR 2016). Notionally, these metrics append a sentinel document to the end of your ranking with a gain level related to the maximum "gain recall" of the ranking. The result is that when there are no relevant items, you get credit for returning nothing, and returning irrelevant items past the last relevant retrieved decreases the score.
Questions will be judged for relevance on a given day on the following scale:
- Not relevant that day (gain = 0)
- Has some relevant information that day, but nothing critical (gain = 1)
- Has some important information (gain = 5)
- Has information of critical importance that day (gain = 10)
Documents will be judged on a similar scale:
- Not relevant that day (gain = 0)
- Is "on topic" for the question, but not important (gain = 1)
- Is moderately important; needs to be noted but is not critical (gain = 5)
- Contains a vital update about this question today (gain = 10)
All metrics will use gain values, including AP, as described in the paper.
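The sketch below illustrates the sentinel idea only; it is not the official scorer, and the actual sentinel gain follows the paper's "gain recall" definition rather than the fixed value used here:

import math

def sentinel_dcg(gains, sentinel_gain):
    # Append the sentinel, then compute an ordinary gain-based DCG.
    ranked = gains + [sentinel_gain]
    return sum(g / math.log2(i + 2) for i, g in enumerate(ranked))

# On a day with nothing relevant, returning nothing earns the sentinel's
# full, undiscounted credit...
print(sentinel_dcg([], sentinel_gain=10))         # 10.0
# ...while padding the ranking with irrelevant items pushes the sentinel
# deeper and lowers the score.
print(sentinel_dcg([0, 0, 0], sentinel_gain=10))  # ~4.31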
Average performance will be reported per question, averaged over all days in the collection. Question scores will be averaged to compute a single score for a topic, and topic scores will be averaged to compute a single score for the run, but it is expected that the per-question scores will be most useful for analyzing your system.
The scoring script has not yet been released.
Timeline
- Topics release: May 15
- Runs due: July 26, AoE, at Evalbase
- Scores released: Oct 15