In the digital age, the most useful tools for collecting intelligence are no longer a magnifying glass and a tape recorder, but a mouse and a keyboard. Open source intelligence made headlines last spring with the arrest of the Golden State Killer, after California police matched DNA found at a crime scene with the DNA of a relative who had uploaded genetic information to a public genealogy website. Although DNA matching is a well-established method for identifying criminals, using a public genealogy database marked a fundamentally new approach for investigators. Here we discuss how open and crowd sourced information is increasingly informing the casework of seasoned prosecutors and investigators.
Open Source Means Publicly Accessible Information
Open source is any information or tool made readily available to the public. We think of it foremost as online data or databases for public consumption, such as public record filings published by government entities. But open source can also mean information that was created following a call to action (often referred to as “crowd sourcing”) or free software provided by developers. Even large, well-known software companies like Google and Microsoft have made some of their software accessible to the public. Open source has evolved to the point where the originally published data or software is continually improved and updated by its end users. Below is a short list of noteworthy examples of open source, some of which you may know by name:
Open Source Examples You May (or May Not) Know
- Waze (crowd sourced traffic conditions)
- OwnCloud (file storage)
- FileZilla (FTP tool)
- Google’s TensorFlow (machine learning)
- Python (coding language)
Unlocking Sources of Information on the Internet
When people refer to open data, they often think of large repositories of public data. Governments have invested in making their data more accessible to the public amid pressure for increased accountability and transparency. Companies like Socrata increase the public’s access to big government data through a straightforward, streamlined platform. Starting a search with the Socrata site of a government entity (e.g., opendata.cityofnewyork.us) often leads to an immediate trove of useful information. Furthermore, corporations and non-profit organizations are increasingly publishing their information publicly for data competitions hosted on platforms like Kaggle.
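As a rough illustration of what pulling data from a Socrata-backed portal can look like, the sketch below builds a request URL for Socrata's SODA API, which serves datasets at `/resource/<dataset-id>.json` and accepts query parameters such as `$limit` and `$where`. The domain `data.example.gov`, the dataset ID `abcd-1234`, and the `borough` column are hypothetical placeholders, not real resources; substitute the values shown on the portal you are searching.

```python
from urllib.parse import urlencode


def soda_url(domain, dataset_id, limit=5, where=None):
    """Build a SODA API URL that returns rows of a dataset as JSON.

    domain and dataset_id are supplied by the caller; the values used in
    the example below are hypothetical placeholders.
    """
    params = {"$limit": limit}
    if where:
        # $where takes a filter expression, e.g. "borough = 'BRONX'"
        params["$where"] = where
    # safe="$" keeps the parameter names readable instead of "%24limit"
    query = urlencode(params, safe="$")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


if __name__ == "__main__":
    # Hypothetical portal, dataset ID, and column name
    print(soda_url("data.example.gov", "abcd-1234", limit=10,
                   where="borough = 'BRONX'"))
```

The function only constructs the URL; opening it in a browser or passing it to an HTTP client is left to the reader, and each portal documents its own dataset IDs and column names.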
In addition to companies like Socrata, coders have created platforms to promote open source software development. Examples include Stack Overflow and GitHub, websites where programmers readily share, edit, and comment on each other’s code. Many open datasets are hosted on websites maintained by independent organizations, and users on GitHub even maintain a crowd-sourced list of public datasets across a wide range of disciplines. It’s a great example of how open source software fosters a community for developers that benefits both the public and private industry.
Crowd sourced information also contributes significantly to the world of open data. The obvious places to look for crowd sourced information are typically the most popular ones, such as social media platforms, websites such as Wikipedia, and polls created by news outlets. If there is a dataset you want that does not exist, consider asking the internet to provide you with that information through crowd sourcing. Finally, cloud-based software providers, such as Amazon Web Services, provide teams that can help you create datasets by manually entering data that may not exist natively in digital form.
Best Practices with Open Source Intelligence
Though police tracked down the Golden State Killer, intelligence forces are not the only people capable of conducting large scale open source intelligence. The very crux of the methodology lies in data being available to everyone. But accessibility does not guarantee usefulness, and it can be all too easy to get trapped in the lattice of the world wide web when hunting for information. While the concept of leveraging publicly available data is simple, best practices are more complex than typing words into Google.
Knowing when and how to use each of the internet’s vast array of resources is critical. Narrowing searches with operators (quotation marks to match exact phrases, a minus sign before a word to exclude it, or a site restriction to limit sources) makes search engine results more relevant. Using a variety of search terms, along with tools like the Wayback Machine (which offers access to archived versions of web pages), improves the odds of a successful investigation. But take caution when casting a wider net: be sure that you’re not accidentally blowing your cover when conducting an online investigation. Accessing a subject’s website directly or improperly engaging on social media can reveal your identity to your subject, which may not be something you want to do. Note that incognito mode keeps your activity out of your own browser’s history but does not hide your identity from anyone else.
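To make the search shortcuts concrete, here is a minimal sketch that assembles a query string from exact phrases, excluded words, and a site restriction, and that builds a lookup URL for the Wayback Machine's availability endpoint (`https://archive.org/wayback/available`). The function names and the example terms are our own illustrations; exact operator support varies by search engine.

```python
from urllib.parse import urlencode


def build_query(phrases=(), terms=(), exclude=(), site=None):
    """Combine common search operators into one query string."""
    parts = [f'"{p}"' for p in phrases]      # quotation marks: exact phrase
    parts += list(terms)                     # ordinary keywords
    parts += [f"-{w}" for w in exclude]      # leading minus: omit this word
    if site:
        parts.append(f"site:{site}")         # limit results to one site
    return " ".join(parts)


def wayback_lookup_url(page, timestamp=None):
    """Build a URL asking the Wayback Machine for the closest archived copy.

    timestamp is an optional YYYYMMDD string; omit it to ask for the
    most recent capture.
    """
    params = {"url": page}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


if __name__ == "__main__":
    print(build_query(phrases=["golden state killer"], terms=["arrest"],
                      exclude=["podcast"], site="ca.gov"))
    print(wayback_lookup_url("example.com", "20180425"))
```

Neither function contacts the subject's website: the first only prepares text to paste into a search engine, and the second points at the archive rather than at the live page.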
Appropriately citing the source of open data or code is not only good ethics, but also protects your credibility. If you’ve used data or code that later changes or contains some sort of error, you need some way of referencing when and where you obtained that information or code. When borrowing code snippets, it is best practice to cite the URL where you found the code as a comment in your own code. When using public data, be sure to cite the source of the information and how and when you obtained it.
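One way this can look in practice is sketched below: a borrowed helper carries its source URL and retrieval date in a comment, and a small function formats a citation for a public dataset. The URLs, dates, dataset title, and the `cite_dataset` helper are hypothetical examples of the habit, not a required format.

```python
# Adapted from: https://example.com/snippets/collapse-whitespace
#   (hypothetical URL, shown only to illustrate the citation comment)
# Retrieved: 2018-09-01. Changes from the original: added a docstring.
def collapse_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())


def cite_dataset(title, publisher, url, retrieved_on, method):
    """Format a one-line citation recording where, when, and how data was obtained."""
    return f"{title}. {publisher}. {url} (retrieved {retrieved_on} via {method})."


if __name__ == "__main__":
    # All values below are placeholders
    print(cite_dataset("Building Permits", "City of Example",
                       "https://data.example.gov/permits",
                       "2018-09-01", "CSV export"))
```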
Finally, it is important not only to cite, but also to preserve openly sourced data to protect yourself if it ever changes or is removed. The internet is an ever-evolving landscape that can change in an instant. Upon collecting online intelligence in any form, you should take steps to preserve your process and findings, such as documenting the URL and the date on which data was obtained and even taking screenshots of the information. Because sources are regularly changed or removed, collecting and preserving intelligence as soon as possible reduces the risk that information will no longer be available when you need it. Remember that sources are also regularly updated, meaning that data that did not exist at the beginning of your investigation might be available later. You should periodically check for changes in information throughout the course of an investigation to identify and note any significant changes.
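A simple way to document what was collected and to spot later changes is to store a small record alongside each saved copy. The sketch below, using only Python's standard library, records the URL, a UTC retrieval time, and a SHA-256 fingerprint of the saved content, then compares two records to see whether the source changed. The record layout and function names are our own assumptions, and the sketch does not replace screenshots or any preservation procedure your case requires.

```python
import hashlib
from datetime import datetime, timezone


def make_snapshot(url, content, retrieved_at=None):
    """Describe one saved copy of a source.

    content is the raw bytes you saved (a downloaded page, file, or export).
    """
    retrieved_at = retrieved_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "retrieved_at": retrieved_at.isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),  # content fingerprint
        "size_bytes": len(content),
    }


def has_changed(earlier, later):
    """Return True if two snapshots of the same source differ in content."""
    return earlier["sha256"] != later["sha256"]


if __name__ == "__main__":
    # Placeholder bytes standing in for two downloads of the same page
    first = make_snapshot("https://example.com/record", b"original text")
    second = make_snapshot("https://example.com/record", b"edited text")
    print(first)
    print("changed:", has_changed(first, second))
```

Re-running the comparison at intervals during an investigation gives a dated trail of when a source was first seen to differ from the preserved copy.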
Gryphon Strategies Can Help Your Open Source Investigation
With nearly 30 years of experience, Gryphon Strategies is a leader in complex investigations and recently expanded its investigatory offerings to include Data Mining & Analytics. Our data team can help you maximize open source tools and resources to retrieve new data or interpret and leverage data you already have. Contact Lacey Keller, our Managing Director for Data Mining & Analytics, at (914) 730-9063 or firstname.lastname@example.org, to evaluate your case and make recommendations. Read more about our Data Mining & Analytics capacity here.