Forgers Usually Don’t Show Their Work – Detecting Altered Documents Using Data

Feb 26, 2018

Last week, federal prosecutors announced new charges against Paul Manafort for hiding millions of dollars in assets overseas. Prosecutors found evidence of an altered statement because of a lucky break: Manafort emailed his business associate Rick Gates asking him to convert a PDF statement into a Word document so that Manafort could edit it. Once Manafort finished editing the statement, he sent the document back to Gates to turn it back into a PDF.

If you are not fortunate enough to have a paper trail showing the forgeries or alterations, what are your options? In this more likely scenario, you can use data science techniques to pull content and information from documents and track the differences between them.

Review the metadata of the target document.

The best place to start with any document is its metadata, which is the data about the document itself. With this information, you will be able to find the document’s author, creation date, last modified date, last modifying author, and more. This information allows you to put the document in context with other pertinent events or even find potential witnesses.
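As an illustration of the metadata fields described above, here is a minimal Python sketch that reads them from a Word (.docx) file. It relies on the fact that a .docx file is a ZIP package that normally stores these properties in docProps/core.xml. The function name read_docx_metadata and the output labels are hypothetical choices for this example, and other formats such as PDF store their metadata differently and would need a different approach.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml inside a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output label (our own naming) -> element name in core.xml.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_docx_metadata(source):
    """Return core properties from a .docx path or file-like object.

    Fields that are absent from the file are returned as None.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    metadata = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NS)
        metadata[label] = element.text if element is not None else None
    return metadata


# Example call (the file name is a placeholder):
# print(read_docx_metadata("statement.docx"))
```

Keep in mind that metadata can be edited or stripped, so treat it as a lead to corroborate with other evidence rather than as proof on its own.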

Find multiple versions across a production.

Using the metadata, you can then identify other versions of a document and put them in time order by their modified and/or creation dates. Additional versions may be readily available in the same location as your original document or as email attachments, as in the Manafort case. But other versions might be harder to locate. If a company maintains the document you seek, there are likely backup versions of that document that you could request. And if that fails, the next section discusses how big data methods for text analysis can be used to find similar documents in the production based on their content, not just their title and metadata.
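Putting versions in time order can be sketched in a few lines of Python. This example assumes each version is represented by a dictionary with optional "modified" and "created" timestamps in ISO 8601 form (the form used in Word metadata); the record layout, file names, and function names are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as '2018-02-20T14:05:00Z'.

    Returns None when the value is missing or empty.
    """
    if not value:
        return None
    # Older Python versions do not accept a trailing 'Z', so spell out UTC.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def order_versions(versions):
    """Sort records oldest first by modified date, falling back to created date.

    Records with no usable date are placed at the end.
    """
    def sort_key(record):
        stamp = (parse_timestamp(record.get("modified"))
                 or parse_timestamp(record.get("created")))
        return (stamp is None, stamp.timestamp() if stamp else 0.0)

    return sorted(versions, key=sort_key)


versions = [
    {"name": "statement_v3.docx", "modified": "2018-02-20T14:05:00Z"},
    {"name": "statement_v1.docx", "created": "2018-01-15T09:00:00Z"},
    {"name": "statement_unknown.docx"},
]
for record in order_versions(versions):
    print(record["name"])
```

Here the undated file is listed last so that it stands out for manual review instead of silently appearing first in the timeline.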

Compare text using big data methods for text analysis.

To identify or compare multiple versions of a document, we can analyze how the content and structure of the target document compare to others in the production. We do that by first building a term-document matrix that records how often each word appears in each document. We then use that matrix to calculate the similarity of these versions using a metric such as cosine similarity, with similar documents having higher scores. Other metrics, such as the Levenshtein distance, which counts the character edits needed to turn one string into another, can recognize words as near matches even when they are not identical because of typos, added or missing punctuation, or issues with optical character recognition (OCR) from PDFs.

Word embeddings, which are mathematical representations of words, can also be used for measuring the similarity of different words based on whether they are known to appear in similar contexts. For example, the representations for “king” and “queen” are fairly similar. As such, word embeddings can be useful for capturing paraphrasing, analogies, or synonyms/antonyms. Sentence embeddings and document embeddings follow the same principles as word embeddings, but for sentences and for paragraphs or whole documents, respectively.
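To make the term-document matrix, cosine similarity, and Levenshtein distance concrete, here is a minimal sketch in plain Python. The tokenizer, function names, and sample sentences are assumptions made for this illustration; a real production would more likely rely on an established text analysis library and more careful text cleaning.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_document_matrix(documents):
    """Return (vocabulary, rows), with one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute if different
            ))
        previous = current
    return previous[-1]


docs = [
    "Net income for the year was two million dollars",
    "Net income for the year was four million dollars",
    "Please send the signed lease to the landlord",
]
vocabulary, rows = term_document_matrix(docs)
print(cosine_similarity(rows[0], rows[1]))  # near-identical versions score high
print(cosine_similarity(rows[0], rows[2]))  # unrelated document scores low
print(levenshtein("statement", "staternent"))  # OCR reading "m" as "rn"
```

A high cosine score flags a candidate version; the documents still need a line-by-line comparison to find what actually changed, such as "two" becoming "four" in the sample above.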

Gryphon Strategies can help you with complicated document and text analysis.

With nearly 30 years of experience, Gryphon Strategies is a leader in complex investigations and recently expanded its investigatory offerings to include Data Mining & Analytics. Our data team can help draft litigation holds, defend document requests, identify alternative datasets to use, and make sense of the data once you get it. To have your case evaluated and receive recommendations, contact Lacey Keller, our Managing Director for Data Mining & Analytics, at (914) 730-9063 or lkeller@gryphon-strategies.com.