A seed list page is a web page that lists terms or topics relevant to a particular subject or field. It provides a starting point for further research by identifying the key terms and concepts related to that topic.
Seed list pages can be created manually by experts in a field, or they can be generated automatically by software programs that crawl the web and identify pages that contain lists of terms related to a given topic. Once a seed list page has been created, it can be used as input for text analytics algorithms that extract information from unstructured text data. For example, a seed list page containing a list of medical terms could be used as input for a text analytics algorithm that extracts information about medical conditions from free-text patient records.
Seed list pages are sometimes also known as gazetteers, authority lists, or reference lists.
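The medical-terms example above can be sketched as simple dictionary (gazetteer) matching. This is a minimal illustration, not a production extractor: the seed terms, the sample record, and the function names build_matcher and extract_terms are all assumptions made for the example.

```python
import re

# Hypothetical seed list of medical conditions (illustrative only).
SEED_TERMS = ["heart attack", "diabetes", "asthma", "hypertension"]


def build_matcher(terms):
    """Compile one case-insensitive regex matching any seed term as a whole phrase."""
    # Longest terms first so longer phrases win over shorter overlapping ones.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def extract_terms(text, matcher):
    """Return the seed terms found in text, lowercased, in order of appearance."""
    return [m.group(0).lower() for m in matcher.finditer(text)]


matcher = build_matcher(SEED_TERMS)
record = "Patient with a history of hypertension and Asthma, admitted after a heart attack."
found = extract_terms(record, matcher)
print(found)  # ['hypertension', 'asthma', 'heart attack']
```

Exact phrase matching like this misses misspellings, abbreviations, and negations ("no history of asthma"), so a real pipeline would likely add normalization and context handling on top of it.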
Seed list page and metadata page
A seed list page is similar to a metadata page in that both are reference pages associated with a particular subject, field, or dataset. However, a metadata page describes the structure and content of data, whereas a seed list page provides a starting point for further research on a topic by identifying its key terms and concepts.
For example, a metadata page for a dataset containing medical records might include information about the fields in the dataset, such as patient age, gender, and diagnosis. A seed list page for the same dataset might include a list of medical conditions mentioned in the dataset.
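The contrast can be made concrete with two small data structures. Both are hypothetical: the field names, types, and conditions below are invented for illustration and do not describe any real dataset or schema format.

```python
# Hypothetical metadata for a medical-records dataset: it describes the
# structure of the data (field names and types), not its vocabulary.
metadata = {
    "fields": [
        {"name": "patient_age", "type": "integer"},
        {"name": "gender", "type": "string"},
        {"name": "diagnosis", "type": "string"},
    ]
}

# Hypothetical seed list for the same dataset: terms expected to appear
# in the content of the records.
seed_list = ["asthma", "diabetes", "hypertension"]

field_names = [field["name"] for field in metadata["fields"]]
print(field_names)  # ['patient_age', 'gender', 'diagnosis']
print(seed_list)    # ['asthma', 'diabetes', 'hypertension']
```

The metadata names the "diagnosis" field, while the seed list names values that might appear in it.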
Benefits of a seed list page
Using a seed list page as input for text analytics algorithms has several benefits.
First, a seed list page supplies domain vocabulary that makes it easier to extract information from unstructured text data. For example, consider a dataset containing free-text patient records. If the goal is to extract the diagnoses mentioned in the records, this is difficult without some prior knowledge of medical terminology. A seed list page containing a list of medical conditions tells the algorithm which terms to look for, making it easier to identify and extract this information.
Second, a seed list page can provide context for understanding the meaning of terms in unstructured text data. Consider the term “heart attack”. This term can mean different things in different contexts: a medical emergency, or a figurative expression of shock, as in “I nearly had a heart attack”. A seed list page that includes “heart attack” along with other terms related to cardiac conditions provides context that helps to disambiguate the term, because a medical mention is more likely to appear near those related terms.
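One simple way to use related seed terms as context is to count how many of them appear near the ambiguous term. The sketch below assumes a hypothetical set of cardiac terms, an arbitrary 60-character window, and an illustrative function name, context_score; it is a heuristic, not an established disambiguation method.

```python
import re

# Hypothetical cardiac seed terms that act as context for "heart attack".
CARDIAC_CONTEXT = {
    "chest pain", "cardiac", "angina", "ecg", "troponin", "myocardial infarction",
}


def context_score(text, target, context_terms, window=60):
    """Count context terms within `window` characters of the first mention of target.

    Returns None if target does not occur in text.
    """
    lowered = text.lower()
    pos = lowered.find(target)
    if pos == -1:
        return None
    # Slice a character window around the first occurrence of the target.
    start = max(0, pos - window)
    end = pos + len(target) + window
    snippet = lowered[start:end]
    return sum(
        1
        for term in context_terms
        if re.search(r"\b" + re.escape(term) + r"\b", snippet)
    )


clinical = "ECG changes and raised troponin confirm a heart attack; chest pain began at 9am."
casual = "I nearly had a heart attack when I saw the phone bill."

print(context_score(clinical, "heart attack", CARDIAC_CONTEXT))  # 3
print(context_score(casual, "heart attack", CARDIAC_CONTEXT))    # 0
```

A higher score suggests a medical sense. Any threshold for deciding between senses would need to be chosen and validated on real data.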
Finally, a seed list page can help improve the accuracy of information extraction algorithms by serving as a source of ground truth data. For example, if an algorithm is designed to extract diagnoses from free-text patient records, its output can be tested against a set of records that has been manually annotated with diagnosis information, with the seed list supplying the reference terms for those annotations. The results of this testing can then be used to fine-tune the algorithm and improve its accuracy.
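Testing an extractor against manually annotated records is commonly summarized with precision and recall. The sketch below assumes both the annotations and the extractor output are available as mappings from a record id to a set of diagnosis strings; the record ids, terms, and function name are hypothetical.

```python
def precision_recall(predicted, gold):
    """Compare extracted diagnoses with manual annotations.

    Both arguments map a record id to a set of diagnosis strings.
    Only records present in `gold` are scored.
    """
    tp = fp = fn = 0
    for record_id, gold_terms in gold.items():
        pred_terms = predicted.get(record_id, set())
        tp += len(pred_terms & gold_terms)  # extracted and annotated
        fp += len(pred_terms - gold_terms)  # extracted but not annotated
        fn += len(gold_terms - pred_terms)  # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical manual annotations and extractor output.
gold = {"r1": {"asthma", "hypertension"}, "r2": {"diabetes"}}
predicted = {"r1": {"asthma"}, "r2": {"diabetes", "asthma"}}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

In this toy data, two of the three extracted terms are correct and two of the three annotated terms are found, so precision and recall are both 2/3. Tracking these figures while adjusting the seed list or the matching rules shows whether a change helps.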