Review the metadata of the target document.
The best place to start with any document is its metadata: the data about the document itself. From it, you can find the document’s author, creation date, last modified date, last modifying author, and more. This information lets you put the document in context with other pertinent events and may even point you to potential witnesses.
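As a rough illustration of what reviewing metadata can look like in practice, the sketch below reads the core properties of a Microsoft Office file (.docx, .xlsx, or .pptx). It assumes the file follows the Office Open XML layout, where the properties are stored in a part named docProps/core.xml inside a ZIP container. The function name read_core_metadata and the output field names are illustrative choices, not part of any standard tool, and other formats (PDF, legacy .doc) would need a different approach.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of an Office Open XML file.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output field name (our own label) -> element that holds the value.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
    "revision": "cp:revision",
}


def read_core_metadata(path):
    """Return a dict of core properties; fields missing from the file are None."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    return {name: root.findtext(tag, namespaces=NS) for name, tag in FIELDS.items()}
```

Calling read_core_metadata on a produced file returns the author, last modifying author, and timestamps as strings, which can then be compared against other events in the case. Keep in mind that these fields are recorded by the authoring software and can be edited or stripped, so they are a starting point rather than proof.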
Find multiple versions across a production.
Using the metadata, you can then identify other versions of a document and put them in chronological order by their modified and/or creation dates. Additional versions may be readily available in the same location as your original document or as email attachments, as with the Manafort case. But other versions might be harder to locate. If a company maintains the document you seek, there are likely backup versions of that document that you could request. And if that fails, the next section discusses how big data methods for text analysis can be used to find similar documents in the production based on their content, not just their title and metadata.
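Putting versions in chronological order is straightforward once the timestamps are in hand. The sketch below assumes each version is described by a small record with hypothetical keys name, created, and modified, and that the timestamps are ISO 8601 strings (the form Office files typically use). Timestamps without a time zone are treated as UTC here, which is an assumption you would want to confirm for your own production.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' means UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assumption: timestamps with no time zone are treated as UTC so that
    # all values can be compared with one another.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def order_versions(versions):
    """Sort version records by modified date, breaking ties by creation date."""
    return sorted(
        versions,
        key=lambda v: (parse_timestamp(v["modified"]), parse_timestamp(v["created"])),
    )
```

Sorting on the modified date first reflects the order in which edits were saved, while the creation date helps separate copies that were last saved at the same moment.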
Compare text using big data methods for text analysis.
To identify or compare multiple versions of a document, we can analyze how the content and structure of the target document compare to others in the production. We do that by first building a term-document matrix that measures how often each word appears in each document. We then use that matrix to calculate the similarity of these versions using a metric such as cosine similarity, with similar documents having higher scores. Other metrics, such as the Levenshtein distance, can account for high similarity between words even if they are not identical due to typos, added or missing punctuation, or issues with optical character recognition (OCR) from PDFs. Word embeddings, which are numerical vector representations of words, can also be used to measure the similarity of different words based on whether they tend to appear in similar contexts. For example, the representations for “king” and “queen” are fairly similar. As such, word embeddings can be useful for capturing paraphrasing, analogies, or synonyms/antonyms. Sentence embeddings and document embeddings follow the same principles as word embeddings, but for sentences and for longer passages such as paragraphs or whole documents, respectively.
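The first two techniques described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the tokenizer is a simple lowercase word splitter, the matrix uses raw word counts with no weighting, and the function names and sample texts are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_document_matrix(docs):
    """Map {name: text} to (vocabulary, {name: word-count vector})."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    return vocab, {name: [c[term] for term in vocab] for name, c in counts.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical document texts, for illustration only.
    docs = {
        "draft_v1": "The payment was wired to the consultant in March.",
        "draft_v2": "The payment was wired to the consultant in April.",
        "agenda": "Quarterly board meeting agenda and minutes.",
    }
    _, matrix = term_document_matrix(docs)
    for name in ("draft_v2", "agenda"):
        print(name, round(cosine_similarity(matrix["draft_v1"], matrix[name]), 3))
    print(levenshtein("payment", "paymnet"))
```

In this setup, the two drafts score much closer to 1 than the unrelated agenda does, and the edit distance flags "paymnet" as a near match for "payment". The same cosine_similarity function can be applied to embedding vectors, although producing those vectors requires a pretrained model that is outside the scope of this sketch.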
Gryphon Strategies can help you with complicated document and text analysis.
With nearly 30 years of experience, Gryphon Strategies is a leader in complex investigations and recently expanded its investigatory offerings to include Data Mining & Analytics. Our data team can help draft litigation holds, defend document requests, identify alternative datasets to use, and make sense of the data once you get it. Contact Lacey Keller, our Managing Director for Data Mining & Analytics, at (914) 730-9063 or email@example.com to evaluate your case and make recommendations. Read more here.