Sampling in statistics is the process of selecting a subset of items from a population for analysis.
Sampling is often used in text analytics to select a representative sample of texts for analysis, based on some criterion (e.g., randomly, by topic, by geographical region). This is done to make inferences about the population of texts as a whole.
In some cases, sampling may be used to generate a representative corpus of texts for a specific purpose (e.g., for training a machine learning model). In other cases, it may be used to select a subset of texts for human annotation.
Sampling vs. Other Similar Terms
The term sampling is sometimes conflated with related terms such as selection bias and representativeness. These terms have distinct meanings in statistics and should not be confused.
Selection bias refers to the bias that can occur when items are selected for inclusion in a sample (e.g., if the selection criterion is not random). Representativeness refers to the degree to which a sample accurately reflects the population from which it was drawn.
Both selection bias and representativeness are important considerations in text analytics, but they describe properties of a sample or of how it was drawn; they are not the same thing as sampling itself.
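One way to make "the degree to which a sample reflects the population" concrete is to compare how topics are distributed in the sample and in the population. The sketch below is only an illustration under assumptions: the topic labels are made up, and total variation distance is one of several measures that could be used here, not one the text prescribes.

```python
from collections import Counter

def topic_shares(topics):
    """Return the share of each topic label in a list of labels."""
    counts = Counter(topics)
    total = len(topics)
    return {topic: count / total for topic, count in counts.items()}

def total_variation_distance(sample_topics, population_topics):
    """0.0 means identical topic shares; 1.0 means no overlap at all."""
    p = topic_shares(sample_topics)
    q = topic_shares(population_topics)
    all_topics = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in all_topics)

# Hypothetical population: 50% sports, 30% politics, 20% science.
population_topics = ["sports"] * 50 + ["politics"] * 30 + ["science"] * 20

balanced_sample = ["sports"] * 5 + ["politics"] * 3 + ["science"] * 2
skewed_sample = ["sports"] * 10  # drawn from a single topic only

balanced_distance = total_variation_distance(balanced_sample, population_topics)
skewed_distance = total_variation_distance(skewed_sample, population_topics)
```

A distance near zero suggests the sample mirrors the population on this one attribute; it says nothing about attributes that were not measured.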
Types of Sampling
There are many sampling methods, but they fall into two broad categories: probability-based sampling and non-probability-based sampling.
Probability-based sampling is a type of sampling where each item in the population has a known, non-zero chance of being selected for inclusion in the sample. This is often done using random sampling, where items are selected for the sample purely by chance.
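As a minimal sketch of probability-based sampling, the following draws a simple random sample without replacement using Python's standard library. The corpus of document identifiers, the sample size, and the function name are hypothetical and chosen only for illustration.

```python
import random

def simple_random_sample(texts, k, seed=0):
    """Draw k texts without replacement; each text has the same chance k / len(texts)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(texts, k)

# Hypothetical corpus of 1,000 document identifiers.
corpus = [f"doc_{i}" for i in range(1000)]

sample = simple_random_sample(corpus, k=50)

# The inclusion probability is known and non-zero for every document.
inclusion_probability = 50 / len(corpus)
```

Because every document's chance of selection is known, estimates computed on the sample can be related back to the population with standard statistical tools.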
Non-probability-based sampling is a type of sampling where the selection of items for inclusion in the sample is not based on known probabilities. This means that some items in the population may have zero chance of being selected, while others may be far more likely to be selected than the rest, and these chances generally cannot be quantified.
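For contrast, here is a small sketch of one non-probability approach, often called convenience sampling, where the analyst simply takes whatever texts come first. The corpus layout (sports texts stored before politics texts) is an assumption made up for this example.

```python
def convenience_sample(texts, k):
    """Take the first k texts; anything after position k can never be selected."""
    return texts[:k]

# Hypothetical corpus of (identifier, topic) pairs, stored topic by topic.
corpus = [(f"doc_{i}", "sports" if i < 100 else "politics") for i in range(1000)]

sample = convenience_sample(corpus, 100)
topics_in_sample = {topic for _, topic in sample}
```

Here the 900 politics texts have zero chance of selection, so the sample contains only sports texts even though they make up a tenth of the corpus.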
Downsides of Sampling
One downside of sampling is that it can introduce bias into the results of text analytics. This is because the selection of items for inclusion in the sample may not be random. For example, if a sample of texts is selected based on topic, then the results of the text analytics may be biased towards that topic.
Another downside of sampling is that it can make inferences about the population of texts as a whole less reliable. Even if a sample of texts is selected randomly, the results of the text analytics may not be representative of the population as a whole, because any single sample differs from the population by chance, especially when the sample is small.
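The effect of chance variation can be sketched with a small simulation. It assumes a hypothetical population of 10,000 texts, 30% of which carry some label of interest (encoded as 1), and measures how much the estimated share varies across repeated random samples of two sizes. All numbers and names are illustrative assumptions.

```python
import random
import statistics

def estimate_spread(population, sample_size, repeats=200, seed=0):
    """Standard deviation of the estimated share across repeated random samples."""
    rng = random.Random(seed)  # fixed seed so the simulation is reproducible
    estimates = []
    for _ in range(repeats):
        sample = rng.sample(population, sample_size)
        estimates.append(sum(sample) / sample_size)
    return statistics.pstdev(estimates)

# Hypothetical population: 3,000 labelled texts (1) and 7,000 unlabelled texts (0).
population = [1] * 3000 + [0] * 7000

spread_small = estimate_spread(population, sample_size=10)
spread_large = estimate_spread(population, sample_size=1000)
```

The spread for samples of 10 texts should come out much larger than for samples of 1,000, which is why small random samples can still give a misleading picture of the population.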