Publications
Selected publications by categories in reversed chronological order. Generated by jekyll-scholar. For a more comprehensive list, see my Google Scholar.
2024
- DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities. Thong Nguyen, Shubham Chatterjee, Sean MacAvaney, and 3 more authors. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024
Learned Sparse Retrieval (LSR) models use vocabularies from pre-trained transformers, which often split entities into nonsensical fragments. Splitting entities diminishes retrieval accuracy and limits the model’s ability to incorporate up-to-date world knowledge not included in the training data. In this work, we enhance the LSR vocabulary with Wikipedia concepts and entities, enabling the model to resolve ambiguities more effectively and stay current with evolving knowledge. Central to our approach is a Dynamic Vocabulary (DyVo) head, which leverages existing entity embeddings and an entity retrieval component that identifies entities relevant to a query or document. We use the DyVo head to generate entity weights, which are then merged with word piece weights to create joint representations for efficient indexing and retrieval using an inverted index. In experiments across three entity-rich document ranking datasets, the resulting DyVo model substantially outperforms several state-of-the-art baselines.
@inproceedings{nguyen-etal-2024-dyvo, title = {{D}y{V}o: Dynamic Vocabularies for Learned Sparse Retrieval with Entities}, author = {Nguyen, Thong and Chatterjee, Shubham and MacAvaney, Sean and Mackie, Iain and Dalton, Jeff and Yates, Andrew}, booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2024}, address = {Miami, Florida, USA}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.emnlp-main.45}, pages = {767--783}, }
- Adaptive Latent Entity Expansion for Document Retrieval. Iain Mackie, Sean MacAvaney, Shubham Chatterjee, and 1 more author. In Proceedings of the First Knowledge-Enhanced Information Retrieval workshop, Nov 2024
Despite considerable progress in neural relevance ranking techniques, search engines still struggle to process complex queries effectively, both in terms of precision and recall. Sparse and dense Pseudo-Relevance Feedback (PRF) approaches have the potential to overcome limitations in recall, but are only effective with high precision in the top ranks. In this work, we tackle the problem of search over complex queries using three complementary techniques. First, we demonstrate that applying a strong neural re-ranker before sparse or dense PRF can improve the retrieval effectiveness by 5-8%. This improvement in PRF effectiveness can be attributed directly to improving the precision of the feedback set. Second, we propose an enhanced expansion model, Latent Entity Expansion (LEE), which applies fine-grained word and entity-based relevance modelling incorporating localized features. Specifically, we find that including both words and entities for expansion achieves a further 2-8% improvement in NDCG. Our analysis also demonstrates that LEE is largely robust to its parameters across datasets and performs well on entity-centric queries. Third, we include an “adaptive” component in the retrieval process, which iteratively refines the re-ranking pool during scoring using the expansion model and avoids re-ranking additional documents. We find that this combination of techniques achieves the best NDCG, MAP and R@1000 results on the TREC Robust 2004 and CODEC document datasets, demonstrating a significant advancement in expansion effectiveness.
@inproceedings{mackie2024adaptive, title = {Adaptive Latent Entity Expansion for Document Retrieval}, author = {Mackie, Iain and MacAvaney, Sean and Chatterjee, Shubham and Dalton, Jeffrey}, year = {2024}, booktitle = {Proceedings of the First Knowledge-Enhanced Information Retrieval workshop}, series = {ECIR '24}, url = {https://arxiv.org/abs/2306.17082v2}, eprint = {2306.17082}, archiveprefix = {arXiv}, primaryclass = {cs.IR}, }
- Doing Personal LAPS: LLM-Augmented Dialogue Construction for Personalized Multi-Session Conversational Search. Hideaki Joko, Shubham Chatterjee, Andrew Ramsay, and 3 more authors. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington DC, USA, Nov 2024
The future of conversational agents will provide users with personalized information responses. However, a significant challenge in developing models is the lack of large-scale dialogue datasets that span multiple sessions and reflect real-world user preferences. Previous approaches rely on experts in a wizard-of-oz setup that is difficult to scale, particularly for personalized tasks. Our method, LAPS, addresses this by using large language models (LLMs) to guide a single human worker in generating personalized dialogues. This method has proven to speed up the creation process and improve quality. LAPS can collect large-scale, human-written, multi-session, and multi-domain conversations, including extracting user preferences. When compared to existing datasets, LAPS-produced conversations are as natural and diverse as expert-created ones, which stands in contrast to fully synthetic methods. The collected dataset is suited to train preference extraction and personalized response generation. Our results show that responses generated explicitly using extracted preferences better match users’ actual preferences, highlighting the value of using extracted preferences over simple dialogue history. Overall, LAPS introduces a new method to leverage LLMs to create realistic personalized conversational data more efficiently and effectively than previous methods.
@inproceedings{10.1145/3626772.3657815, author = {Joko, Hideaki and Chatterjee, Shubham and Ramsay, Andrew and de Vries, Arjen P. and Dalton, Jeff and Hasibi, Faegheh}, title = {Doing Personal LAPS: LLM-Augmented Dialogue Construction for Personalized Multi-Session Conversational Search}, year = {2024}, isbn = {9798400704314}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3626772.3657815}, doi = {10.1145/3626772.3657815}, booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval}, pages = {796–806}, numpages = {11}, keywords = {conversational search, dialogue collection, personalization}, location = {Washington DC, USA}, series = {SIGIR '24}, }
- TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants. Mohammad Aliannejadi, Zahra Abbasiantaeb, Shubham Chatterjee, and 2 more authors. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington DC, USA, Nov 2024
Conversational information seeking has evolved rapidly in the last few years with the development of Large Language Models (LLMs), providing the basis for interpreting and responding in a naturalistic manner to user requests. The extended TREC Interactive Knowledge Assistance Track (iKAT) collection aims to enable researchers to test and evaluate their Conversational Search Agent (CSA). The collection contains a set of 36 personalized dialogues over 20 different topics each coupled with a Personal Text Knowledge Base (PTKB) that defines the bespoke user personas. A total of 344 turns with approximately 26,000 passages are provided as assessments on relevance, as well as additional assessments on generated responses over four key dimensions: relevance, completeness, groundedness, and naturalness. The collection challenges CSAs to efficiently navigate diverse personal contexts, elicit pertinent persona information, and employ context for relevant conversations. The integration of a PTKB and the emphasis on decisional search tasks contribute to the uniqueness of this test collection, making it an essential benchmark for advancing research in conversational and interactive knowledge assistants.
@inproceedings{10.1145/3626772.3657860, author = {Aliannejadi, Mohammad and Abbasiantaeb, Zahra and Chatterjee, Shubham and Dalton, Jeffrey and Azzopardi, Leif}, title = {TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants}, year = {2024}, isbn = {9798400704314}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3626772.3657860}, doi = {10.1145/3626772.3657860}, booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval}, pages = {819–829}, numpages = {11}, keywords = {conversational information seeking, conversational search agents, evaluation, test collection}, location = {Washington DC, USA}, series = {SIGIR '24}, }
- DREQ: Document Re-ranking Using Entity-Based Query Understanding. Shubham Chatterjee, Iain Mackie, and Jeff Dalton. In Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part I, Glasgow, United Kingdom, Nov 2024
While entity-oriented neural IR models have advanced significantly, they often overlook a key nuance: the varying degrees of influence individual entities within a document have on its overall relevance. Addressing this gap, we present DREQ, an entity-oriented dense document re-ranking model. Uniquely, we emphasize the query-relevant entities within a document’s representation while simultaneously attenuating the less relevant ones, thus obtaining a query-specific entity-centric document representation. We then combine this entity-centric document representation with the text-centric representation of the document to obtain a “hybrid” representation of the document. We learn a relevance score for the document using this hybrid representation. Using four large-scale benchmarks, we show that DREQ outperforms state-of-the-art neural and non-neural re-ranking methods, highlighting the effectiveness of our entity-oriented representation approach.
@inproceedings{10.1007/978-3-031-56027-9_13, author = {Chatterjee, Shubham and Mackie, Iain and Dalton, Jeff}, title = {DREQ: Document Re-ranking Using Entity-Based Query Understanding}, year = {2024}, isbn = {978-3-031-56026-2}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg}, url = {https://doi.org/10.1007/978-3-031-56027-9_13}, doi = {10.1007/978-3-031-56027-9_13}, booktitle = {Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part I}, pages = {210–229}, numpages = {20}, location = {Glasgow, United Kingdom}, }
2023
- Generative Relevance Feedback with Large Language Models. Iain Mackie, Shubham Chatterjee, and Jeffrey Dalton. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, Nov 2023
Current query expansion models use pseudo-relevance feedback to improve first-pass retrieval effectiveness; however, this fails when the initial results are not relevant. Instead of building a language model from retrieved results, we propose Generative Relevance Feedback (GRF) that builds probabilistic feedback models from long-form text generated from Large Language Models. We study effective methods for generating text by varying the zero-shot generation subtasks: queries, entities, facts, news articles, documents, and essays. We evaluate GRF on document retrieval benchmarks covering a diverse set of queries and document collections, and the results show that GRF methods significantly outperform previous PRF methods. Specifically, we improve MAP by 5-19% and NDCG@10 by 17-24% compared to RM3 expansion, and achieve state-of-the-art recall across all datasets.
@inproceedings{mackie2023grf, author = {Mackie, Iain and Chatterjee, Shubham and Dalton, Jeffrey}, title = {Generative Relevance Feedback with Large Language Models}, year = {2023}, isbn = {9781450394086}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3539618.3591992}, doi = {10.1145/3539618.3591992}, booktitle = {Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval}, pages = {2026–2031}, numpages = {6}, keywords = {document retrieval, pseudo-relevance feedback, text generation}, location = {Taipei, Taiwan}, series = {SIGIR '23}, }
- Generative and Pseudo-Relevant Feedback for Sparse, Dense and Learned Sparse Retrieval. Iain Mackie, Shubham Chatterjee, and Jeffrey Dalton. In Proceedings of the Workshop on Large Language Models’ Interpretation and Trustworthiness (LLMIT), Nov 2023
Pseudo-relevance feedback (PRF) is a classical approach to address lexical mismatch by enriching the query using first-pass retrieval. Moreover, recent work on generative-relevance feedback (GRF) shows that query expansion models using text generated from large language models can improve sparse retrieval without depending on first-pass retrieval effectiveness. This work extends GRF to dense and learned sparse retrieval paradigms with experiments over six standard document ranking benchmarks. We find that GRF improves over comparable PRF techniques by around 10% on both precision and recall-oriented measures. Nonetheless, query analysis shows that GRF and PRF have contrasting benefits, with GRF providing external context not present in first-pass retrieval, whereas PRF grounds the query to the information contained within the target corpus. Thus, we propose combining generative and pseudo-relevance feedback ranking signals to achieve the benefits of both feedback classes, which significantly increases recall over PRF methods on 95% of experiments.
@inproceedings{mackie2023generativeand, title = {Generative and Pseudo-Relevant Feedback for Sparse, Dense and Learned Sparse Retrieval}, author = {Mackie, Iain and Chatterjee, Shubham and Dalton, Jeffrey}, year = {2023}, publisher = {CEUR}, booktitle = {Proceedings of the Workshop on Large Language Models' Interpretation and Trustworthiness (LLMIT)}, series = {CIKM '23}, url = {https://arxiv.org/abs/2305.07477}, eprint = {2305.07477}, archiveprefix = {arXiv}, primaryclass = {cs.IR}, }
- GRM: Generative Relevance Modeling Using Relevance-Aware Sample Estimation for Document Retrieval. Iain Mackie, Ivan Sekulic, Shubham Chatterjee, and 2 more authors. arXiv preprint arXiv:2306.09938, Nov 2023
Recent studies show that Generative Relevance Feedback (GRF), using text generated by Large Language Models (LLMs), can enhance the effectiveness of query expansion. However, LLMs can generate irrelevant information that harms retrieval effectiveness. To address this, we propose Generative Relevance Modeling (GRM) that uses Relevance-Aware Sample Estimation (RASE) for more accurate weighting of expansion terms. Specifically, we identify similar real documents for each generated document and use a neural re-ranker to estimate their relevance. Experiments on three standard document ranking benchmarks show that GRM improves MAP by 6-9% and R@1k by 2-4%, surpassing previous methods.
@inproceedings{mackie2023grm, title = {GRM: Generative Relevance Modeling Using Relevance-Aware Sample Estimation for Document Retrieval}, author = {Mackie, Iain and Sekulic, Ivan and Chatterjee, Shubham and Dalton, Jeffrey and Crestani, Fabio}, year = {2023}, publisher = {arXiv, https://arxiv.org/abs/2306.09938}, url = {https://arxiv.org/abs/2306.09938}, eprint = {2306.09938}, archiveprefix = {arXiv}, primaryclass = {cs.IR}, }
- Neural Entity Context Models. Pooja Oza, Shubham Chatterjee, and Laura Dietz. In Proceedings of the 12th International Joint Conference on Knowledge Graphs, Nov 2023
A prevalent approach in entity-oriented systems involves retrieving relevant entities by harnessing knowledge graph embeddings. These embeddings encode entity information in the context of the knowledge graph and are static in nature. Our goal is to generate entity embeddings that capture what renders them relevant for the query. This differs from entity embeddings constructed with static resources, for example, E-BERT. Previously, Dalton et al. demonstrated the benefits obtained with the Entity Context Model, a pseudo-relevance feedback approach based on entity links in relevant contexts. In this work, we reinvent the Entity Context Model (ECM) for neural graph networks and incorporate pre-trained embeddings. We introduce three entity ranking models based on fundamental principles of ECM: (1) GAN, (2) Simple Graph Relevance Networks, and (3) Graph Relevance Networks. GAN and Graph Relevance Networks are the graph neural variants of ECM that employ an attention mechanism and relevance information of the relevant context, respectively, to ascertain entity relevance. Our experiments demonstrate that our neural variants of the ECM model significantly outperform the state-of-the-art BERT-ER by more than 14% and exceed the performance of systems that use knowledge graph embeddings by over 101%. Notably, our findings reveal that leveraging the relevance of the relevant context is more effective at identifying relevant entities than the attention mechanism. To evaluate the efficacy of the models, we conduct experiments on two standard benchmark datasets, DBpediaV2 and TREC Complex Answer Retrieval. To aid reproducibility, our code and data are available at https://github.com/TREMA-UNH/neural-entity-context-models
@inproceedings{osti_10473615, title = {Neural Entity Context Models}, url = {https://par.nsf.gov/biblio/10473615}, year = {2023}, publisher = {Association for Computing Machinery}, booktitle = {Proceedings of the 12th International Joint Conference on Knowledge Graphs}, series = {IJCKG '23}, author = {Oza, Pooja and Chatterjee, Shubham and Dietz, Laura}, }
2022
- Predicting Guiding Entities for Entity Aspect Linking. Shubham Chatterjee and Laura Dietz. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management, Atlanta, GA, USA, Nov 2022
Entity linking can disambiguate mentions of an entity in text. However, there are many different aspects of an entity that could be discussed but are not differentiable by entity links, for example, the entity "oyster" in the context of "food" or "ecosystems". Entity aspect linking provides such fine-grained explicit semantics for entity links by identifying the most relevant aspect of an entity in the given context. We propose a novel entity aspect linking approach that outperforms several neural and non-neural baselines on a large-scale entity aspect linking test collection. Our approach uses a supervised neural entity ranking system to predict relevant entities for the context. These entities are then used to guide the system to the correct aspect.
@inproceedings{chatterjee2022predicting, author = {Chatterjee, Shubham and Dietz, Laura}, title = {Predicting Guiding Entities for Entity Aspect Linking}, year = {2022}, isbn = {9781450392365}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3511808.3557671}, doi = {10.1145/3511808.3557671}, booktitle = {Proceedings of the 31st ACM International Conference on Information and Knowledge Management}, pages = {3848–3852}, numpages = {5}, keywords = {document similarity, entity aspect linking, entity ranking}, location = {Atlanta, GA, USA}, series = {CIKM '22}, }
- Wikimarks: Harvesting Relevance Benchmarks from Wikipedia. Laura Dietz, Shubham Chatterjee, Connor Lennox, and 3 more authors. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, Nov 2022
We provide a resource for automatically harvesting relevance benchmarks from Wikipedia – which we refer to as "Wikimarks" to differentiate them from manually created benchmarks. Unlike simulated benchmarks, they are based on manual annotations of Wikipedia authors. Studies on the TREC Complex Answer Retrieval track demonstrated that leaderboards under Wikimarks and manually annotated benchmarks are very similar. Because of their availability, Wikimarks can fill an important need for Information Retrieval research. We provide a meta-resource to harvest Wikimarks for several information retrieval tasks across different languages: paragraph retrieval, entity ranking, query-specific clustering, outline prediction, relevant entity linking, and many more. In addition, we provide example Wikimarks for English, Simple English, and Japanese derived from the 01/01/2022 Wikipedia dump. Resource available: https://trema-unh.github.io/wikimarks/
@inproceedings{10.1145/3477495.3531731, author = {Dietz, Laura and Chatterjee, Shubham and Lennox, Connor and Kashyapi, Sumanta and Oza, Pooja and Gamari, Ben}, title = {Wikimarks: Harvesting Relevance Benchmarks from Wikipedia}, year = {2022}, isbn = {9781450387323}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3477495.3531731}, doi = {10.1145/3477495.3531731}, booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval}, pages = {3003–3012}, numpages = {10}, keywords = {test collections, relevant entity linking, query-specific clustering}, location = {Madrid, Spain}, series = {SIGIR '22}, }
- BERT-ER: Query-specific BERT Entity Representations for Entity Ranking. Shubham Chatterjee and Laura Dietz. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, Nov 2022
Entity-oriented search systems often learn vector representations of entities via the introductory paragraph from the Wikipedia page of the entity. As such representations are the same for every query, our hypothesis is that the representations are not ideal for IR tasks. In this work, we present BERT Entity Representations (BERT-ER) which are query-specific vector representations of entities obtained from text that describes how an entity is relevant for a query. Using BERT-ER in a downstream entity ranking system, we achieve a performance improvement of 13-42% (Mean Average Precision) over a system that uses the BERT embedding of the introductory paragraph from Wikipedia on two large-scale test collections. Our approach also outperforms entity ranking systems using entity embeddings from Wikipedia2Vec, ERNIE, and E-BERT. We show that our entity ranking system using BERT-ER can increase precision at the top of the ranking by promoting relevant entities to the top. With this work, we release our BERT models and query-specific entity embeddings fine-tuned for the entity ranking task.
@inproceedings{10.1145/3477495.3531944, author = {Chatterjee, Shubham and Dietz, Laura}, title = {BERT-ER: Query-specific BERT Entity Representations for Entity Ranking}, year = {2022}, isbn = {9781450387323}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3477495.3531944}, doi = {10.1145/3477495.3531944}, booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval}, pages = {1466–1477}, numpages = {12}, keywords = {query-specific entity representations, entity ranking, bert}, location = {Madrid, Spain}, series = {SIGIR '22}, }
2021
- Entity Retrieval Using Fine-Grained Entity Aspects. Shubham Chatterjee and Laura Dietz. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, Nov 2021
Using entity aspect links, we improve upon the current state-of-the-art in entity retrieval. Entity retrieval is the task of retrieving relevant entities for search queries, such as "Antibiotic Use In Livestock". Entity aspect linking is a new technique to refine the semantic information of entity links. For example, while passages relevant to the query above may mention the entity "USA", there are many aspects of the USA of which only a few, such as "USA/Agriculture", are relevant for this query. By using entity aspect links that indicate which aspect of an entity is being referred to in the context of the query, we obtain more specific relevance indicators for entities. We show that our approach improves upon all baseline methods, including the current state-of-the-art, using a standard entity retrieval test collection. With this work, we release a large collection of entity-aspect-links for a large TREC corpus.
@inproceedings{chatterjee2021entity, author = {Chatterjee, Shubham and Dietz, Laura}, title = {Entity Retrieval Using Fine-Grained Entity Aspects}, year = {2021}, isbn = {9781450380379}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3404835.3463035}, doi = {10.1145/3404835.3463035}, booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval}, pages = {1662–1666}, numpages = {5}, keywords = {entity aspects, entity ranking, learning-to-rank}, location = {Virtual Event, Canada}, series = {SIGIR '21}, }
2019
- Why does this Entity matter? Support Passage Retrieval for Entity Retrieval. Shubham Chatterjee and Laura Dietz. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, Santa Clara, CA, USA, Nov 2019
Our goal is to complement an entity ranking with human-readable explanations of how those retrieved entities are connected to the information need. While related to the problem of support passage retrieval, in this paper, we explore two underutilized indicators of relevance: contextual entities and entity salience. The effectiveness of the indicators is studied within a supervised learning-to-rank framework on a dataset from TREC Complex Answer Retrieval. We find that salience is a useful indicator, but it is often not applicable. In contrast, although performance improvements are obtained by using contextual entities, using contextual words still outperforms contextual entities.
@inproceedings{10.1145/3341981.3344243, author = {Chatterjee, Shubham and Dietz, Laura}, title = {Why does this Entity matter? Support Passage Retrieval for Entity Retrieval}, year = {2019}, isbn = {9781450368810}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3341981.3344243}, doi = {10.1145/3341981.3344243}, booktitle = {Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval}, pages = {221–224}, numpages = {4}, keywords = {joint query-entity-passage features, entity salience, entity context neighbors, entity context document}, location = {Santa Clara, CA, USA}, series = {ICTIR '19}, }