English | 2016 | ISBN: 978-1-78528-340-6 | 130 Pages | PDF, EPUB | 10 MB
Python is an easy-to-learn, extensible programming language that allows any manner of secret agent to work with a variety of data. Agents from beginners to seasoned veterans will benefit from Python’s simplicity and sophistication. The standard library provides numerous packages that move beyond simple beginner missions. The Python ecosystem of related packages and libraries supports deep information processing.
This book will guide you through the process of upgrading your Python-based toolset for intelligence gathering, analysis, and communication. You’ll explore the ways Python is used to analyze web logs to discover the trails of activities that can be found in web and database servers. You’ll also see how to use Python to discover details of a social network by looking at the data available from social networking websites.
Finally, you’ll see how to extract history from PDF files, which opens up new sources of data, and you’ll learn about the ways you can gather data using an Arduino-based sensor device.
What You Will Learn
- Upgrade Python to the latest version and discover its latest and greatest tools
- Use Python libraries to extract data from log files that are designed more for people to read than for automated analysis
- Summarize log files and extract meaningful information
- Gather data from social networking sites and leverage your experience of analyzing log files to summarize the data you find
- Extract text and images from social networking sites
- Parse the complex and confusing data structures in a PDF file to extract meaningful text that we can analyze
- Connect small, intelligent devices to our computer to use them as remote sensors
- Use Python to analyze measurements from sensors to calibrate them and use sensors efficiently
Table of Contents
Chapter 1. New Missions – New Tools
Chapter 2. Tracks, Trails, and Logs
Chapter 3. Following the Social Network
Chapter 4. Dredging up History
Chapter 5. Data Collection Gadgets
The PostScript language describes the look of a page of text. It doesn’t require that the text on that page is provided in any coherent order. This is different from HTML or XML, where tags can be removed from the HTML source and a sensible plain text document can be recovered.
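To make the HTML side of this contrast concrete, here is a minimal sketch that drops the tags from a small HTML fragment and keeps the text in source order. It uses the standard library's `html.parser` module; the `TextExtractor` class, the `html_to_text` function, and the sample markup are illustrative names and data, not taken from the book.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(source):
    """Return the text content of an HTML string, in source order."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    return " ".join(parser.chunks)


# Illustrative sample markup (an assumption, not from the book).
sample = (
    "<html><body><h1>Mission Brief</h1>"
    "<p>Meet at <b>dawn</b> by the pier.</p></body></html>"
)
plain_text = html_to_text(sample)
print(plain_text)
```

Because HTML text appears in reading order in the source, removing the tags is enough to get a sensible result. No comparable shortcut is guaranteed for PostScript or PDF.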
In PostScript, text is commingled with page layout commands in such a way that the underlying sequence of characters, words, sentences and paragraphs can be lost. Pragmatically, complete obfuscation of a page is rare. Many documents occupy a middle ground where the content is difficult to parse. One common quirk is out-of-place headline text; we have to use the coordinates on the page to deduce where it belongs in the text.
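As a rough illustration of using page coordinates to recover reading order, the following sketch sorts text fragments top to bottom and then left to right. The `(x, y, text)` tuples are hypothetical data, not the output of any particular PDF library, and the sketch assumes the default PDF coordinate system, where the origin is at the bottom-left corner and `y` grows upward. The `line_height` bucketing is a crude assumption that would need tuning for real documents.

```python
from itertools import groupby

# Hypothetical fragments as (x, y, text), listed in the jumbled order
# in which a page description might emit them.
fragments = [
    (72.0, 650.0, "Agents gather"),
    (72.0, 700.0, "Field Report"),
    (160.0, 650.0, "data daily."),
    (72.0, 620.0, "End of page."),
]


def reading_order(fragments, line_height=12.0):
    """Group fragments into lines by y position, then order each line by x."""

    def line_key(fragment):
        _, y, _ = fragment
        # Larger y is higher on the page, so negate to sort top-down.
        return -round(y / line_height)

    ordered = sorted(fragments, key=lambda f: (line_key(f), f[0]))
    lines = []
    for _, group in groupby(ordered, key=line_key):
        lines.append(" ".join(text for _, _, text in group))
    return lines


page_lines = reading_order(fragments)
for line in page_lines:
    print(line)
```

With this sample data, the headline emitted second ends up first because it sits highest on the page. Real pages with multiple columns, rotated text, or fragments that straddle a bucket boundary need more careful handling than this.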
PDF can be abused too. In the most extreme cases, people will print some content, scan the pages, and build a PDF around the scanned images. This kind of document will display and print nicely, but it defies simple analysis because it contains pictures of text rather than text. Optical character recognition (OCR) is required to deal with this. That is beyond our scope, since the algorithms can be very complex.
Here’s a typical document that contains mountains of useful data. However, it’s hard to access because it’s locked up in a PDF. The title is the Compendium of Federal Data Sources to Support Health Workforce Analysis April 2013.
Agents interested in industrial espionage—particularly about the workforce—would need to understand the various sources in this document.
Some agents agree that governments (and industry) use PDFs entirely to provide data in a “see-but-don’t-touch” mode. We can only leverage the data through expensive, error-prone, manual operations. We can’t easily reach out and touch the data digitally to do deeper analysis.