A document is a digital or physical record containing text that can be analyzed to extract insights. This could be anything from a social media post to a financial report.
When considering text analytics, it’s important to note that there are two types of data that can be extracted from documents: structured and unstructured. Structured data is information that follows a pre-defined schema and can be easily parsed and organized for machine processing, such as tabular data. Unstructured data, on the other hand, has no pre-defined structure and must be interpreted before a machine can make use of it. This includes things like natural language text, images, and audio.
While both structured and unstructured data can be found in documents, the vast majority of text analytics methods are designed to work with unstructured data. This is because structured data can usually be queried directly, whereas extracting insights from unstructured data is generally more complex and requires more sophisticated algorithms.
So, what exactly can you do with text analytics on documents? There are a variety of applications, but some of the most common include:
- Extracting named entities (such as people, places, and organizations)
- Identifying key topics and themes
- Detecting sentiment (positive, negative, or neutral)
- Classifying document types
- And much more!
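As a rough illustration of the "key topics" item in the list above, here is a minimal sketch that counts the most frequent terms in a piece of text. It uses only the Python standard library; the function name `top_terms`, the tiny stop-word list, and the sample text are all made up for this example, and real topic extraction typically relies on far more capable methods.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "by", "for", "on"}

def top_terms(text, n=3):
    """Return the n most frequent terms in text, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [term for term, _ in counts.most_common(n)]

# Made-up sample document.
report = (
    "Revenue grew in the third quarter. Revenue growth was driven "
    "by cloud sales, and cloud margins improved."
)
print(top_terms(report, n=2))
```

Raw term counts are only a crude proxy for topics, but they show the basic idea: turn unstructured text into something that can be counted and compared.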
Storing documents
Documents are often stored in a database or a file system. There are many different formats that documents can be stored in, but some of the most common include PDF, Word, Excel, and JSON.
When choosing a storage format for documents, it’s important to consider the size of the document, the type of data it contains (structured or unstructured), and how it will be accessed (by humans or by machines).
For example, if you’re looking to store a large number of documents that will be primarily accessed by machines, then you might want to consider storing them in a JSON format. This is because JSON holds only the content and its fields, without layout or styling information, so files are often smaller than the equivalent PDF or Word document, and they can be parsed by virtually any programming language.
On the other hand, if you’re looking to store a smaller number of documents that will be primarily accessed by humans, then you might want to consider storing them in a PDF or Word format. This is because these formats preserve layout and formatting, which typically makes them easier for humans to read and understand.
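To make the JSON option concrete, here is a minimal sketch of saving a document to a JSON file and reading it back using Python's standard `json` module. The field names (`id`, `source`, `text`) and the helper names `save_document` and `load_document` are assumptions chosen for illustration, not a required schema.

```python
import json
import os
import tempfile

# Made-up example document; the field names are assumptions for this sketch.
document = {
    "id": "doc-001",
    "source": "social_media",
    "text": "Loving the new release, great work!",
}

def save_document(doc, directory):
    """Write one document to <directory>/<id>.json and return the file path."""
    path = os.path.join(directory, f"{doc['id']}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(doc, f, ensure_ascii=False)
    return path

def load_document(path):
    """Read a document back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Use a temporary directory so the example cleans up after itself.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_document(document, tmp)
    restored = load_document(saved_path)

print(restored["text"])
```

Because the text sits in a named field, a program can read it directly without first having to extract it from a page layout.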
Retrieving documents
There are many different ways to retrieve documents, but some of the most common include using a search engine (such as Google or Bing), browsing through a directory or file system, or accessing a database.
When retrieving documents, it’s important to consider who will access the data (humans or machines) and what type of data you’re looking for (structured or unstructured).
For example, if you’re looking to retrieve a large number of documents that contain unstructured data, then you might want to consider using a search engine. This is because search engines are designed to index and rank large quantities of unstructured data.
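The indexing step that search engines rely on can be sketched with a toy inverted index, which maps each term to the set of documents containing it. This is a simplified illustration using only the standard library: the names `tokenize`, `build_index`, and `search` and the sample documents are invented for the example, and the sketch only matches documents without ranking them, which real search engines also do.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(matches)

# Made-up sample documents.
docs = {
    "post-1": "Great quarterly report from the finance team",
    "post-2": "The finance report was late",
    "post-3": "Great weather today",
}
index = build_index(docs)
print(search(index, "finance report"))
```

Building the index once up front means each query only has to look up a few terms rather than rescan every document.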
On the other hand, if you’re looking to retrieve a specific set of documents that contain structured data, then you might want to consider accessing a database. This is because databases are designed to store structured data, at small or large scale, and to answer precise queries over its fields.
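For the database route, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The `documents` table, its columns, and the sample rows are assumptions made up for this example.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, doc_type TEXT, author TEXT, year INTEGER)"
)

# Made-up structured metadata about three documents.
conn.executemany(
    "INSERT INTO documents (doc_type, author, year) VALUES (?, ?, ?)",
    [
        ("report", "Ana", 2023),
        ("memo", "Ben", 2023),
        ("report", "Chen", 2022),
    ],
)

# Retrieve exactly the documents whose fields match the query.
rows = conn.execute(
    "SELECT id, author FROM documents "
    "WHERE doc_type = ? AND year = ? ORDER BY id",
    ("report", 2023),
).fetchall()
conn.close()

print(rows)
```

Unlike a keyword search, a query like this returns an exact answer: every row that satisfies the conditions and nothing else.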