CHAPTER 1: THE SURFACE WEB

The Surface Web is the most open and permissive of the three layers of cyberspace. Easily reachable from the most basic computer or mobile phone, the Surface Web is something that almost everyone in the Western world, and growing numbers of people in the developing world, is becoming intimately familiar with. It is the backbone of everyday services such as email, web browsing, entertainment and commerce of all descriptions. With such a broad set of online resources available and such ease of access, the Surface Web is almost always the starting point of any OSINT project. It also contains huge pools of data that are valuable to the investigator, and the central challenge in using the Surface Web effectively is often locating the important pieces of information within the forest of irrelevant babble.

The core challenge in developing a practitioner’s skill on the Surface Web lies not in demonstrating the usefulness of this layer of cyberspace (that is self-evident) but in highlighting to the investigator new possibilities for exploiting it.

Exercise: conduct a search

Before you continue further, conduct a short piece of research into the Nigerian terrorist group Boko Haram. Spend five minutes researching the group using the Internet in any way that you see fit.

Having run the preceding exercise several hundred times, I would hazard a guess that you did the following: used your computer’s default web browser (Internet Explorer for Windows, Safari for Mac users), used the Google site for your home country domain (.co.uk, .ca, etc.) as your search engine, entered a couple of search phrases, read mostly articles from Wikipedia and mainstream news sites, and made no attempt to hide your identity while on the Internet.

Although these steps are all logical, and they are where most OSINT investigations start, they are also where most OSINT investigations stop. Too often the OSINT part of an investigation is declared ‘complete’ after the preceding steps are taken. The remainder of this chapter is about expanding your investigative repertoire and imparting an understanding of why you need to do so.

Consider for a moment…

Does the Internet look the same from every angle? In other words, are people in Russia looking at the same Internet as people in the UK? This question is explored in more depth later in this chapter, in the Cyber Geography section.

Web browsers – the first steps

A web browser is the generic term for the class of software that is used in conjunction with a search engine to browse the Internet. Web browsers matter both as the starting point for the practical sections of this book and to OSINT professional practice in general: they are the ‘nuts and bolts’ foundation that supports the remainder of this book.

Typically, operating systems10 such as the Microsoft Windows family and those loaded onto Macs come bundled with web browsers such as Internet Explorer and Safari. Although these web browsers are perfectly serviceable for the needs of an everyday web user, they are inadequate for the OSINT practitioner due to their lack of functionality and extendibility.

For the OSINT professional, having knowledge of just two non-standard web browsers can vastly expand investigative possibilities. This is due to the fact that certain pieces of software, called plugins, can be added to web browsers and make a huge difference to the insight that can be derived from a website as well as adding to the speed, efficiency and robustness of the results of an investigation.

Although new web browsers are coming onto the market almost every day, the author recommends that the reader use Google Chrome and Mozilla Firefox. Both products can be downloaded for free, and installing them involves a few clicks on the relevant pop-up boxes.

Be warned!

Be very careful when installing any kind of software sourced from the Internet onto your computer, as many apparently legitimate downloads are in fact malware delivery vehicles. If you are unsure how to differentiate legitimate from malicious software then consult your IT department or a knowledgeable colleague.

The reasons for choosing these two web browsers are as follows:

Flexibility

‘Tabbed browsing’ is the term used to describe the functionality within web browsers that allows multiple pages to be open within one web browsing window (or ‘pane’ to use the correct term). New sub-windows are opened by clicking on the areas shown in Figure 1.


Figure 1: Tabbed browsing in Chrome and Firefox

Although tabbed browsing is not unique to Firefox or Chrome, this functionality allows the investigator to have multiple pages open at any one time. This may seem a relatively trivial addition to an individual’s OSINT skill set, but tabbed browsing is sometimes a departure from the way many older Internet users are accustomed to browsing the web. The benefit of mastering tabbed browsing across two separate web browsers is that multiple investigative threads can be followed and cross-referenced by the investigator at any one moment. In the highly visual environment of the Internet this approach can prove invaluable, especially if combined with a dual monitor display.

Extendibility with Add-Ons

Chrome and especially Firefox can have their functionality hugely extended by the addition of small pieces of software called add-ons.

Installing add-ons is easy: simply load the web browser you wish to install an add-on into, and then navigate to the online resource for that particular browser:

   Chrome: https://chrome.google.com/webstore/category/extensions

   Firefox: https://addons.mozilla.org/

Once an appropriate online resource for the browser has loaded you can then add new add-ons directly from there11.

There are thousands of available add-ons for both Firefox and Chrome and although most are irrelevant to OSINT professional practice, a few can make a difference within an investigation. As Firefox has been around far longer than Chrome, there are more useful add-ons for the OSINT practitioner for this platform. Some useful add-ons are listed next (all add-ons can be found by Googling the term ‘Firefox add-on’ plus the name of the add-on):

Table 1: Firefox plugins quick reference table

(The add-ons covered, with download links in footnotes 12 to 22, are Flagfox, Email Extractor, TranslateThis, Directions with Google Maps, Resurrect Pages, MementoFox, Easy YouTube Video Download, Abduction, Save Text to File, Search by Image (by Google) and Carrot2.)

One point of caution to make regarding add-ons is that the more that are added to a web browser, the slower the browser will run. With a handful of add-ons the slowdown in browser performance is negligible; however, with ten or more add-ons running the debilitating effect on browser speed becomes obvious. The solution to this issue is to toggle individual add-ons on and off depending on the needs of the investigator. In Firefox this is done via the Tools – Add-ons menu, which brings up the control panel shown in Figure 2.


Figure 2: Extension (add-ons) control panel in Firefox

Clicking the Disable button will temporarily remove that add-on from the system-processing load of the Firefox browser. Obviously the add-on can be easily reactivated by clicking the Enable button when the user wishes to use that specific software tool again.

By their very nature add-ons are not mainstream pieces of software. Lone, mostly unpaid software developers are often the authors, and many have limited time and resources to support their products. This means that add-ons periodically break and become outdated as web technologies change and the developers fail to keep pace with these developments. As such, close management of these pieces of software (installing updates, removing hopelessly broken add-ons) is just an unfortunate part of using add-ons. However, the benefits of add-ons outweigh the management overhead.

Speed

If Google’s Chrome browser excels at one thing, it’s speed. Chrome has been designed for the modern generation of Internet users who were weaned on broadband Internet connections and expect lightning-fast page loading.23 Although increasing Internet speeds may have led to an overall decline in attention spans, the benefits of rapid page loading that Chrome provides to the OSINT practitioner are undeniable.

Additionally, Chrome has an extremely useful way of visualising text searches within a specific web page, a function activated by pressing Ctrl+F in Windows or Cmd+F on a Mac and then entering the desired search phrase (results shown in Figure 3).


Figure 3: Search result flags in Chrome (highlighted in red)

This is particularly useful for interrogating web pages for specific search terms that can often prove elusive within reams of text. This feature of Chrome can be a huge time-saver when it is not obvious how a web page is related to a search term. By using the search, the investigator can quickly locate where a term such as a name, phone number or email address occurs within a body of text, rapidly contextualise the data presented within the page and decide whether that particular page is relevant to the investigation or not.
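For bulk work the same idea can be scripted. The following minimal Python sketch (the URL and the search term are placeholders, not taken from a real case) fetches a page and prints a snippet of context around every occurrence of a term:

    import re
    import urllib.request

    url = "https://www.example.com/"    # placeholder page of interest
    term = "Boko Haram"                 # placeholder search term

    # Fetch the raw page text (a fuller script would also strip the HTML tags).
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

    # Print 60 characters of context around each case-insensitive match.
    for match in re.finditer(re.escape(term), html, re.IGNORECASE):
        start = max(match.start() - 60, 0)
        print(html[start:match.end() + 60].replace("\n", " "))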

Chapter key learning point: investigate outside the box

The core message of this chapter is that the skilled OSINT investigator should break away from the constraints of viewing one web page within one web browser. Although you will almost certainly use the Internet for personal reasons, using the Internet for professional purposes demands a different approach. By using the technologies and approaches outlined within this chapter, you can pivot agilely through an information space as new investigative opportunities present themselves.

Combining Chrome’s tabbed browsing, fast page loading and visual search capabilities allows the skilled OSINT practitioner to examine hundreds of web pages almost concurrently, discarding the irrelevant and archiving the useful as the search progresses.

Search engines

Recall the exercise at the start of this chapter and the author’s criticism of the exclusive use of Google as a tool for OSINT research. The reader may wonder where the author’s opinion on this issue comes from. Try the next exercise before we explore this point further:

Exercise: spot the difference

Open a web browser (either Firefox or Chrome) and conduct an image search24 for the individual ‘Stewart Bertram’ in Google, Yahoo and Bing, placing the results next to one another using tabbed browsing. Then spot the difference between the results presented by each search engine.

The result that you will no doubt be viewing after completing the exercise is three pages of images. Although many of the results are identical across the three search engines, as you scroll down each of the three results pages you will see an increasing divergence between the images that the search engines have produced, with large numbers of images that are unique to a specific search engine.

The image results vary between the search engines due to the fact that each search engine is drawing from different sources of data when presenting its results.

Side knowledge: search engine indexing

Have you ever wondered how search engines generate their results? A large part of the process involves software agents autonomously surfing the Internet and trying to understand what the core themes of a particular website are. This process is called indexing and allows search engines such as Google to match search strings from a user to relevant web pages. Each search engine uses different indexing algorithms; as such, the same search yields different results when used across a number of search engines. Without an index, users would have to know the exact address (or Uniform Resource Locator (URL)) of every site they wished to visit. The true power of an index is that it allows search engines to match human search terms such as ‘shopping in London’ to an appropriate list of websites for that search term.
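A toy illustration of indexing, assuming a handful of pre-fetched pages rather than a live crawler, might look like the following Python sketch:

    # Minimal sketch of an inverted index: map each word to the pages containing it.
    pages = {
        "site-a.example": "shopping and markets in london",
        "site-b.example": "london travel guide",
        "site-c.example": "fishing in scotland",
    }

    index = {}
    for url, text in pages.items():
        for word in set(text.split()):
            index.setdefault(word, set()).add(url)

    # A query returns the pages containing every search term (the default AND
    # behaviour described later in this chapter).
    query = "shopping london".split()
    print(set.intersection(*(index.get(word, set()) for word in query)))   # {'site-a.example'}

Real engines add ranking, synonym handling and constant re-crawling on top of this basic structure, which is one reason no two indexes are ever quite the same.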

The obvious question to ask is how different are search results across the main search engine providers? According to a joint study carried out by the search engine provider Dogpile, Queensland University of Technology and the Pennsylvania State University, nearly 90 percent25 of search results were unique to one of the four major search engine providers. Based on these statistics and the tangible proof provided by the ‘Spot the difference’ exercise, it is hopefully clear why the author advocates using more than one search engine to conduct OSINT. In plain terms: if you are doing OSINT research using just Google and are not finding the result that you are looking for, you are only examining a tiny proportion of the available sources.

Key learning point: Google is not the Internet

Contrary to popular belief, the index of Google does not contain the address of every Internet site on the web. In fact no one really knows how many websites are actually on the Internet. Although Google’s database is huge (possibly the largest database on Earth), within the vast space that is the Internet even this database may in fact be tiny.

Search engines such as Google and Bing can be described as ‘single-source intelligence’ and using single sources of intelligence within the context of an investigation is generally considered to be a bad thing by professionals.

So if that is the problem, what is the solution? Enter meta search engines…

Search engines – meta search engines

Simply put, meta search engines are search engines that query other search engines and aggregate these search results. Take a few moments to review the list of meta search engines shown next; they can be reached by Googling the name, or alternatively a link is provided to the service as a footnote.

Table 2: Meta search engine quick reference table

(The engines covered are Zuula, Dogpile, Polymeta, iSeek, Cluuz and Carrot2; links are given in footnotes 26 to 31.)

Although the preceding list may prove interesting to the busy investigator, the list is in effect the digital equivalent of a ‘flash of a stocking top’ and is unlikely to fully demonstrate the benefits of these tools to OSINT professional practice. The following points step beyond a superficial presentation of the features of the tools by delving deeper into the benefits each can bring to an investigation:

•    Throwing a wide net: for the OSINT practitioner the most basic benefit of meta search engines over single-source tools is that they throw the widest possible investigative net to trawl the web for data of interest (the aggregation step these tools perform is sketched after this list). The Dogpile tool is a meta search engine that queries Google and Bing to obtain its results32, and is a viable replacement for the default Google tool used during the casual searching that occurs at the start of any OSINT project. Zuula is similar to the ‘no frills’ Dogpile but differs in that it queries nine separate search engines for its results.33 Zuula also provides a ‘Recent Search’ function that saves a user’s previous searches to a left-aligned toolbar. This can prove handy when exploring a large field of search results as the investigator is free to explore multiple angles within the data, safe in the knowledge that they can easily return to a more relevant search term at any point. The investigator should consider Dogpile as a ‘day-to-day’ search engine and Zuula as a tool to use when an investigation has run completely dry.

•    Concept mapping: taking a step back from a straight explanation of tools and technique, it’s worth considering what a website is actually trying to do. Every website is, at its core, trying to communicate information to the reader. Whereas conventional search engines tend to present only a description of what information is contained within a site, some meta search engines have software tools that attempt to present the core themes of the site to the user. Polymeta is a meta search engine with a unique ‘Concept Map’ feature (available on Windows operating systems only, with up-to-date Java settings). The benefit of using a concept map within an OSINT task is that the investigator is presented with a conceptual map of the subject they are researching. Not only does this generate new investigative leads, but it also allows the user to assimilate new data and visualise the cornerstones of an investigation much more quickly than when limiting oneself to single-source search engines.
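As referenced above, the short Python sketch below (using made-up result lists rather than live queries to any real engine) illustrates the merge-and-deduplicate step that a meta search engine such as Dogpile performs once it has collected results from the engines it fronts:

    # Hypothetical result lists, as if returned by two different search engines.
    engine_one = ["https://a.example", "https://b.example", "https://c.example"]
    engine_two = ["https://b.example", "https://d.example"]

    # Merge the lists, preserving order and removing duplicates -- the core
    # aggregation step of any meta search engine.
    merged, seen = [], set()
    for url in engine_one + engine_two:
        if url not in seen:
            seen.add(url)
            merged.append(url)

    print(merged)   # a, b and c from the first engine, plus d from the second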

The preceding theory and tools are ideal for the start of an OSINT project, when the investigator’s effort is focused on ‘picking up the scent’ and conceptualising the key issues surrounding a case. Of course, all cases develop over time and the initial rapid sense-making gives way to a more measured approach to an investigation as the need for precise details becomes apparent. The following meta search tools are particularly useful once an investigation has reached this stage:

Side knowledge: accuracy of Entity Extraction tools

You may notice a few odd entries in the Entity Extraction list within iSeek, e.g. places within people categories and so on. These small errors should in no way discredit the tool. Entity Extraction is a subfield of artificial intelligence and is an extremely difficult process to implement, especially at the lightning-fast speed of web browsing. As such, minor errors are to be expected and accepted by the OSINT practitioner.

Granular refinement of investigative details: iSeek is very useful within the later stages of an investigation due to the powerful Entity Extraction engine integrated into the software. Entity Extraction is a process by which a computer program examines a body of text within a document set, finds the main themes and then creates bookmarks across the document set according to defined categories such as date, place, person and so on. The result is that the user can easily filter the documents according to the data within the categories that the software extracts, e.g. Stewart Bertram (category: people), Canada (category: countries). This in itself is a useful feature for an investigator. However, the really clever part of the iSeek Entity Extraction process is that there is no predefined list of items that the tool works from when placing document text within each category. Entity Extraction works by identifying significant elements within the text and then examining the context in which they are used, to determine the appropriate categorisation. As an example of how this process works, take the two phrases “the jaguar eats meat” and “the Jaguar eats gas.” In both cases the categorisation of the Jaguar as either an animal or a car comes from the context provided by the remainder of the sentence. The power of iSeek’s Entity Extraction tool is that it provides a huge number of granular facts along with a map of how they fit together, which can be used to great effect by the skilled investigator to advance any OSINT project.
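iSeek’s engine is proprietary, but the general technique is easy to demonstrate. The sketch below uses the open source spaCy library and its small English model (installed separately with ‘python -m spacy download en_core_web_sm’) purely as an illustration of context-driven entity extraction; it is not the mechanism iSeek itself uses:

    import spacy

    # Load a small English pipeline that includes a named entity recogniser.
    nlp = spacy.load("en_core_web_sm")

    text = ("Boko Haram carried out attacks in Maiduguri, Nigeria, "
            "in April 2014, according to several reports.")

    # The model decides from context whether each span is a person, place,
    # organisation, date and so on -- the same idea as iSeek's categories.
    for entity in nlp(text).ents:
        print(entity.text, "->", entity.label_)

    # Typical output (model-dependent):
    #   Boko Haram -> ORG
    #   Maiduguri -> GPE
    #   Nigeria -> GPE
    #   April 2014 -> DATE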

Boko Haram

As a demonstration of this technique, try running Boko Haram through iSeek and then combining the results from different categories (e.g. the phrase Boko Haram with a place and a date) within a Google search. Do you feel this approach yields any better (i.e. more insightful and precise) results than a straight Google search?

Continuing the theme of granularity, Cluuz is another unique meta search engine that is worthy of inclusion within the OSINT practitioner’s toolbox. What separates Cluuz from the rest of the meta search engine pack are the integrated tools that allow the software not only to conduct Entity Extraction but also to visualise how these entities map together (example shown in Figure 4).


Figure 4: Cluuz network diagram

A visualisation such as the preceding one is created by clicking on the small network diagram that is placed next to the search result in the main window.

The Cluuz tool can be particularly useful when the investigator has a clear idea of a website or well-defined entity (the exact spelling of the name of a person of interest, a company name and so on) that is linked to a case. Focusing the Cluuz tool on a specific website and then exploring the data using the social network graphs that the software generates can clarify facts and clearly identify the sources of the case data. Additionally, the results that Cluuz presents to the investigator, if used correctly, are excellent at separating data on a specific individual from the mass of generic data generated by a search on a common name (this technique is explored in more detail in Chapter 2: Deep Web).
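Investigators who want to reproduce this style of output from their own notes can do so with a general-purpose graph library; the sketch below uses the open source networkx package (an assumption for illustration, not something Cluuz exposes) to record and query links between entities:

    import networkx as nx

    # Hypothetical entity links noted while reviewing pages on a subject.
    graph = nx.Graph()
    graph.add_edge("Boko Haram", "Maiduguri")
    graph.add_edge("Boko Haram", "Nigeria")
    graph.add_edge("Maiduguri", "Nigeria")

    # Entities directly connected to the subject of interest.
    print(sorted(graph.neighbors("Boko Haram")))   # ['Maiduguri', 'Nigeria']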

Returning to the immediate usefulness of Cluuz, the tool can be particularly valuable during the initial phases of an investigation that has an individual website as the starting point. Clicking on the Advanced tab reveals two new search boxes; the box labelled Site(s)? allows the user to input the address of a specific website (shown in Figure 5).


Figure 5: Site search revealed by clicking on the Advanced tab

When the search is run a network diagram is generated for the specific site entered into the Site(s)? box. This can prove incredibly useful in increasing the velocity of the initial stages of an investigation.

In conclusion, the main focus of the early-stage tools is to expand the search area of an investigation as widely as possible. Effectively using the mid- to late-stage tools requires the investigator to pay closer attention to the outputs of the analytical tools integrated into the search engines (iSeek’s Entity Extraction list and Cluuz’s social network graphs) than to the search results that the engines return. This is usually a change in approach to OSINT for many investigators, new and old alike. However, at this point the classic saying “don’t look at your finger, look at the moon” becomes appropriate. The finger is an obvious analogy for a list of search results, and the moon is the equivalent of the enhanced understanding that meta search tools can provide.

Top tip: images and meta search engines

All the meta search engines are extremely poor at image searching and it would appear that most simply pull from the Google image search results. When looking for specific images, say of a person, investigators should go individually to the ‘big three’ search engines (Google, Yahoo and Bing) and view image search results separately within each search engine.

You might have noticed that we have not covered the Carrot2 Cluster search engine yet, partly because the author wants to save the best for last and partly because it fits particularly well with the next theoretical section.

Cyber geography

One of the four key points outlined in the introduction was the concept of cyber geography. The author would posit that this idea is a crucial theoretical concept that any skilled OSINT practitioner needs to master.

To illustrate the effect and importance of cyber geography, examine the two images shown next in Figure 6:


Figure 6: Cyber Geography in action

Both images are screenshots of the results of a Google image search for the phrase ‘Tiananmen Square’ that were taken on the twenty-fifth anniversary of the massacre that occurred in the square in 198934. The one on the left shows the results of the search conducted via Google.co.uk, and the one on the right shows the results of the same search conducted using Google.cn (the Chinese-specific Google). The obvious difference between the two results pages is possibly the sharpest demonstration of how cyber geography affects the search results the user sees. For the ‘on the ground’ OSINT practitioner the practical effect of cyber geography is that an investigator can miss huge volumes of relevant data if he is looking in the wrong Internet region.

A more granular explanation of why cyber geography so drastically affects the results that search engines generate comes from how the structure of the Internet is designed and how search engines are used by mainstream users.

In the early days of the Internet, or ARPANET as it was called then, all website addresses were compiled into a central list and periodically distributed to all network hosts. As the Internet grew this system quickly became unwieldy and a number of technical solutions were developed to make the rapidly expanding Internet more user friendly. The Domain Name System (more commonly, DNS) became the equivalent of a phone book for the Internet, translating human-readable addresses into the machine-readable addresses, called Internet Protocol (IP) addresses, that computers use, e.g. www.microsoft.com (human readable), 134.170.184.133 (machine-readable IP address). An essential component of the DNS system is the subcategories called Top Level Domains (TLDs), which always appear as the final part of a human-readable web address, i.e. .com, .co.uk and so on. The most obvious way of categorising TLDs within the DNS records was to allocate one TLD to each country, which created the current cyber geography that we know today. To expand their user base to people in as many countries as possible, search engine providers were quick to develop specialist indexes and search interfaces for each TLD, e.g. Google.ca is a specialist index for the Canadian TLD space.
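The name-to-address lookup itself is a one-line operation in most scripting languages. A minimal Python sketch, using only the standard library and www.microsoft.com purely as an example hostname:

    import socket

    # Ask the operating system's DNS resolver to translate a human-readable
    # name into a machine-readable IP address.
    hostname = "www.microsoft.com"
    print(hostname, "resolves to", socket.gethostbyname(hostname))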

Side knowledge: ‘the Internet’ is not a single entity

The term ‘Internet’ implies that the Internet is one single technology, administered by a centralised command and control body. An important thing for the OSINT practitioner to understand is that the Internet is in fact a conglomerate of many technologies, and that the DNS system and the various search engines are two separate technologies. The critical point is that just because a search engine index has no record of a website does not mean that the site does not exist. Indeed, it is possible that the majority of the Internet is not listed on any search engine at all.

Of course, how people use search engines fundamentally affects how they are designed, and research has shown that about 50% of searches on Google are for information about services in the same geographic location as the user. Search engine results are ordered according to the physical geography of the user, e.g. a London-based user searching for ‘curry restaurants’ will be shown London-based results first, with results on Mumbai-based restaurants many pages back in the search results. As the majority of search engine providers generate their income via advertising, and as the majority of users select search engines based on the relevance of the results, it is obvious why companies such as Google and Yahoo structure their indexing process so heavily around physical geography. Although for the everyday user this situation is fine, the effects of cyber geography can severely constrain an investigator.

Although the theory behind cyber geography is involved, thankfully the solution to the issue is relatively simple. Due to the fact that many of the main search providers have created engines specifically configured for various geographic regions, it’s merely a case of finding the appropriate search engine for that region via a search within the provider’s main .com site, e.g. a search in Google.com for the phrase ‘Google Canada’ returns Google.ca, the Google engine specifically designed for the Canadian market.

Top tip: No Country Redirect

When conducting searches across multiple cyber geographies, Google has an annoying tendency to return users to their ‘home’ domain after a few searches. This can be extremely frustrating if one is engaged in an intensive region-specific OSINT project. The way around this issue is the No Country Redirect command: type the characters /ncr after the search engine’s address in the browser’s URL bar and press Enter, e.g. Google.ca/ncr. This simple command locks that pane of the browser to the selected search engine.

There are of course 196 countries in the world (counting Taiwan), and a still greater number of TLDs, not all of which are allocated to a country (.aq is the domain for Antarctica!). All of these TLDs potentially hold useful data on a research subject, and this complex cyber topography gives rise to the obvious question of ‘which TLD should an OSINT practitioner start an investigation in?’ This question dovetails nicely with the functionality of the Carrot2 meta search engine.

The designers of Carrot2 describe it as a ‘clustering engine’, by which they mean that it attempts to present to the user an easily readable, aggregated view of the data within a search. The clustering feature of Carrot2 is activated by clicking on the Circles and FoamTree tabs within the browser (shown in Figure 7):


Figure 7: Folders, Circles and FoamTree tabs in Carrot2 (Circles selected)

The way to interpret the slightly hallucinogenic output of Carrot2 shown in Figure 7 is to use each of the coloured segments to navigate the associated document sets. At this point take a moment to review the various categories of data (people, places and so on) presented by the wheel in Figure 7. Which is the most useful category of data presented with this search?

I believe the most useful category of data that Carrot2 presents is the geography associated with a search topic, as this gives the shrewd investigator a clear indication of which TLD they should look within in relation to a search. Taking this overt focus on cyber geography a step further, Carrot2 can be easily configured to display websites clustered by TLD. This feature is activated by clicking FoamTree, then More Options, and then selecting the ‘By URL’ option in the ‘Cluster with’ drop-down box. When properly configured this generates results similar to Figure 8:


Figure 8: Cluster by URL results in a Carrot2 search

This result is most useful to the investigator when focusing on the TLDs shown around the outside of the display. These results represent the hidden nooks of the Internet, and often hold within them data that would be overlooked by mainstream search engines due to their proclivity to concentrate on the .org and .com domains. Combining Carrot2’s ability to highlight important TLDs with a good foreign language search engine can quickly advance any OSINT task that spans multiple geographies and linguistic spheres.
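The underlying grouping step is simple to reproduce against a list of result URLs gathered earlier in a case. The sketch below, with hypothetical URLs, buckets results by the final label of each hostname (a crude approximation of a TLD that does not handle two-part suffixes such as .co.uk):

    from urllib.parse import urlparse

    # Hypothetical search results gathered earlier in an investigation.
    urls = [
        "https://www.example.com/report",
        "https://news.example.ng/armed-groups",
        "https://blog.example.ru/post/12",
        "https://www.example.org/archive",
    ]

    clusters = {}
    for url in urls:
        tld = urlparse(url).hostname.rsplit(".", 1)[-1]   # crude TLD extraction
        clusters.setdefault(tld, []).append(url)

    for tld, members in sorted(clusters.items()):
        print(tld, "->", members)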

Cyber geography is one of the simplest but most often overlooked features of OSINT tradecraft. With hindsight, once the theoretical foundations and practical workarounds are explained, cyber geography seems obvious; but more than any other point examined within this book, cyber geography pervades almost every aspect of OSINT professional practice. Ignore it at your peril…

(Slightly below) the Surface Web

So far this chapter has concentrated on harvesting data in ways that the average web user will be relatively familiar with. However, there is a huge pool of data sitting slightly beneath websites and files that can only be accessed with specialist knowledge and tools. This section examines some of the tools used to pull back the covers on data held on the Surface Web.

Metadata (subsurface data)

Metadata is data that describes other types of data. For example, the .docx file extension tells a computer’s operating system that the file is a Word document and should be opened within the Microsoft Office program. Although each individual file contains many hundreds and even thousands of points of metadata, certain fields are more useful to the investigator than others. The three most important metadata fields to be aware of are:

1. File creation date: this is typically interpreted as the date when the file was originally created by the user. Be warned! The date can be tampered with, and inadvertently altered by both the user and the actions of services on the computer system such as antivirus programs. As such, the file creation date should only be used as intelligence to advance an investigation as opposed to hard evidence35. Having said that, this piece of metadata can be very useful in constructing a timeline of events by mapping file content against the creation dates held within the metadata.

2. File author: typically, the username used to log in to the operating system of the computer that initially created the file is recorded within the metadata of that file. This can prove insightful if the username reflects the name of the person whose login created the file.

3. Latitude and longitude where the file was created (for image files only): most modern camera devices, including phones and tablets, will typically have some kind of GPS system built in. The latitude and longitude where a photograph was taken is stamped into the metadata of the image. Clearly this can be of huge benefit to an investigation and has proven so in extreme cases involving kidnap and ransom. One point to note is that not all images have this metadata within them, particularly if the image has been uploaded to a social media platform, which usually deliberately strips metadata from uploaded files.

Although the preceding metadata only gives three points of information on a file, if the investigator has the ability to get creative with this data then there are a number of ways that these data points can be effectively used. For example, file creation dates can be used to challenge witness testimony within an interview context, and location information attached to images can tangibly link events to locations.

Extracting metadata from a digital artifact is surprisingly easy, with the creation date and author information available by right-clicking on a file icon and then clicking on the Properties option. The latitude and longitude within images are slightly more difficult to extract and require specialist software. Luckily this software is freely available online from resources such as http://regex.info/exif.cgi.
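For investigators comfortable with a little scripting, the same information can be pulled out locally. The following minimal sketch assumes the third-party Pillow imaging library is installed and that the photograph (photo.jpg is a placeholder path) still carries its EXIF block; it prints any GPS tags found, along with the original capture date and author fields if present:

    from PIL import Image
    from PIL.ExifTags import TAGS, GPSTAGS

    path = "photo.jpg"          # placeholder path to an image of interest
    exif = Image.open(path)._getexif() or {}

    for tag_id, value in exif.items():
        name = TAGS.get(tag_id, tag_id)
        if name == "GPSInfo":
            # The GPS block is itself a dictionary of numbered sub-tags.
            for gps_id, gps_value in value.items():
                print(GPSTAGS.get(gps_id, gps_id), gps_value)
        elif name in ("DateTimeOriginal", "Artist"):
            print(name, value)

Remember that many social media platforms strip this data on upload, so an empty result does not mean the original photograph never contained it.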

FOCA search tools

FOCA is an exceptionally powerful desktop search tool that leverages the advanced search syntax of Google, Yahoo and Bing to ‘scrape’ a target website for a huge range of file types. The program works by enumerating a website and searching for file paths to documents. Once it has identified these documents the user is free to download them to their own machine. The power of FOCA lies in a number of areas: firstly, it rapidly automates the repetitive task of entering complex search syntax for each file type into the three search engines FOCA utilises. Secondly, it assists in the mass extraction of metadata from the documents that it recovers. Thirdly and most importantly, FOCA has the ability to recover documents from a website that have no explicit download links on the site’s pages. This almost magical ability, combined with the program’s other functions, creates a powerful tool for the investigator that can rapidly increase the scale, scope and speed of an investigation. As of writing this book, FOCA is freely available from downloadcrew.co.uk/article/22211-foca_free.
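FOCA itself is a Windows desktop application, but the first step it automates can be approximated in a few lines. The sketch below assumes the third-party requests and beautifulsoup4 packages and a placeholder target site, and lists links to common document types found on a single page; FOCA goes much further, using search engine syntax to find files across an entire domain:

    import requests
    from bs4 import BeautifulSoup

    target = "https://www.example.com/"        # placeholder website of interest
    doc_types = (".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx")

    # Fetch one page and collect every link that points at a document.
    soup = BeautifulSoup(requests.get(target, timeout=10).text, "html.parser")
    for link in soup.find_all("a", href=True):
        if link["href"].lower().endswith(doc_types):
            print(link["href"])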

Specialist search syntax

The majority of, if not all, search engines let users specify specialist syntax to refine their results. The incentive for the investigator to use specialist syntax is that, used effectively, it reveals unique and insightful results that would not otherwise surface from merely entering search terms into a search engine.

The basic foundation for all search syntax is the AND, OR and NOT principles borrowed from a branch of discrete mathematics called Set Theory. These three basic operators are the basis upon which search engines match queries to results:

1. AND – returns results that contain all elements of the submitted search terms, e.g. the search Fish Chips would only return pages that contained both Fish and Chips (note: there is no need to explicitly specify the AND operator to most search engines as this is used by default within search strings. Additionally, search engines strip conjunctions such as and, but and then from search strings as they are far too common in language to effectively match a search to a result).

2. OR – returns results that contain any of the search terms, e.g. Fish OR Chips would return pages that mentioned only the distinct concepts of Fish or Chips as well as pages that mentioned both (note: the OR operator must be explicitly entered into the search engine with a capitalised OR between search terms).

3. NOT – one of the stranger mathematical concepts, NOT specifies a concept that should not be included in a set of search results. For example, Fish NOT Chips would return pages that contained the Fish concept with no mention of the Chips concept (note: NOT is specified by the minus sign attached directly to the excluded term, so Fish NOT Chips would be entered into Google as Fish -Chips).

Although AND is used by default, OR and NOT can provide significant benefits during an investigation. OR throws the investigative net as wide as possible and NOT can be useful in removing concepts that are dominating a search result group and potentially obscuring useful data, e.g. Employees Microsoft -"Bill Gates" would show you who worked for Microsoft free from the company’s most famous employee.
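These operators are only text, so queries can also be assembled programmatically and handed to a browser. A minimal sketch, with placeholder search terms, that builds such a query and prints a ready-made search URL:

    from urllib.parse import quote_plus

    # Cast a wide net with OR, then exclude a dominating concept with the minus sign.
    query = '(Employees OR Staff) Microsoft -"Bill Gates"'

    # URL-encode the query so it can be dropped straight into a search URL.
    print("https://www.google.com/search?q=" + quote_plus(query))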

AND, OR and NOT are the most basic specialist search syntax. However, they effectively introduce the concept of search syntax and what can be achieved with these techniques. There are a number of books that are devoted solely to search syntax, or ‘Google Kung-Fu’ to use the correct computer hacker terminology. The author recommends the reader read Google Hacking36, which despite being nearly eight years old is still an excellent starting point to this niche of OSINT research. The author would also like to highlight a number of specialist search syntax strings and place them within the context of OSINT that the rest of this book focuses on. Listed below are five examples of specialist search syntax that, although not commonly known, are vital to building a more effective OSINT toolkit for the individual investigator.

AND as default

It often surprises investigators that the default setting for almost all search engines is the AND operator. The result is that if the investigator simply submits text as a search string with no additional syntax, the returned results will be extremely limited. Combining this approach with using Google as the sole source of information further reduces the investigator’s chances of success.

Table 3: Key specialist search syntax (placeholders such as xx.xxx.xxx.x and www.asite.com stand in for the investigator’s own input data)

Search Syntax | Search Engine | Explanation
ip: xx.xxx.xxx.x | Bing (unique) | Lists all websites hosted on an IP address. Very useful, as it can reveal sites not listed on any search engine index
site: www.asite.com/ | Bing (unique) | Lists all pages on a specific website. Great for making sure that you have examined all pages on a site of interest
search term domain: www.asite.com | Bing, with similar functionality in other search engines | Looks for something specific within a defined website
search term | Google, with similar functionality in other search engines | Prevents the spelling corrector from automatically correcting searches. Useful for unusual searches
search term: image | Yahoo (unique) | Returns pages with the search term in an image. Good for advanced people stalking

One point to note about these searches is that many of them only work in a specific search engine, which continues the earlier theme that there is no one perfect solution for OSINT and that a blend of tools is necessary to achieve maximum effect.

Specialist web investigation tools

Most OSINT tools are similar to the FOCA tools discussed earlier in that although they allow faster access to data, they do not in themselves give access to any more data than if the investigator conducted a search or action manually. The following three OSINT tools buck this trend and give access to unique data that can be critical in advancing an investigation:

1. DomainTools: whenever a domain for a Surface Web site is registered, the registrant is required to provide a set of details including the contact name, email address, telephone number and physical address of the owner of the site. This data is collectively known as WHOIS data and is publicly available (surprisingly) via services such as DomainTools; a minimal scripted WHOIS lookup is also sketched after this list. In addition to a straight WHOIS lookup, DomainTools provides a host of other services that can prove useful for the investigator:

a. Reverse WHOIS lookup – find all sites that have commonalities within their WHOIS registration data. As registrants tend to reuse registration details, this is an incredible tool for expanding the scope of an investigation as it finds links between otherwise disparate websites.

b. DomainTools monitors – tracks activity on specific domains, e.g. changes in WHOIS and so on, which is useful for tracking changes in large constellations of suspicious sites.

c. Screenshots – see how a site used to be in the past.

d. Reverse MX – somewhat more technical, but in essence websites are often configured to share an email service. The MX record points multiple domains at the same email server and is another way to link suspicious domains.

2. web.archive.org: within this site is hosted the Wayback Machine (WBM), a tool that gives access to web.archive.org’s historical record of the Internet. This incredible project seeks to archive as much of the Internet as possible, and pieces of autonomous software called spiders run constantly, collecting snapshots of websites on behalf of web.archive.org. The archive stretches back to 1996 and users are free to browse these sites via the WBM interface. One point to note is that the websites stored within web.archive.org are no longer hosted on the original web spaces that they were designed for; as such, any WHOIS searches via DomainTools will return registration information for web.archive.org and not the original site. Despite this, the WBM is an incredibly useful tool for viewing how sites have developed over time, and as the sites are stored on web.archive.org’s servers (and hence effectively out of the control of the original authors) any incriminating information is stored for posterity within the archive as well.

3. Google Earth: often overlooked as just another tool, Google Earth and the imagery it contains can provide critical insight on a location if used correctly. Telltale indicators such as children’s toys and types of industrial equipment can provide key indicators with regard to the use of a geographic location and its occupants. One point to note is that Google Earth is not a real-time monitoring tool and the imagery within the service is dated. Just how dated can be seen from the small date/time stamp displayed on each Google Earth image, which indicates when the imagery was originally captured.
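As referenced in the DomainTools entry above, the underlying WHOIS protocol is simple enough to query directly. A minimal sketch, assuming a .com domain (other TLDs use different WHOIS servers) and using microsoft.com purely as an example:

    import socket

    domain = "microsoft.com"                       # example .com domain
    server = "whois.verisign-grs.com"              # registry WHOIS server for .com

    # The WHOIS protocol (RFC 3912): open TCP port 43, send the query, read the reply.
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall((domain + "\r\n").encode("ascii"))
        response = b""
        while chunk := sock.recv(4096):
            response += chunk

    print(response.decode("utf-8", errors="ignore"))

Commercial services such as DomainTools add the reverse lookups, monitoring and history described above, which a raw protocol query cannot provide.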

Suggested search and the knowledge of the crowd

Have you noticed when using Google that when you type in the start of a search term Google makes suggestions via a drop-down box of what it thinks you might be searching for?


Figure 9: Google suggested search example

This is called ‘suggested search’ and is one of the most useful features of Google and other search engines. Of course the Google algorithm, just like all other computational constructs, has no idea what you mean when you type in a search term; however, it does know that other users have searched for similar things that bear a relation to the initial part of your search string. For example, Google suggests ‘James Bond’ when a user types in ‘James’ as ‘Bond’ is the most commonly associated term for ‘James’.

This effect is created by Google data mining hundreds of thousands of search terms to identify the relationships between them. This is in effect pattern matching on a grand scale, and this neat technical trick taps into what sociologists have called the ‘knowledge of the crowd’. As well as being an impressive technical party trick, suggested search can be useful for researchers due to its ability to highlight entities important to a case that the investigator might not have come across as part of their own investigation.
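Suggested searches can also be harvested in bulk via Google’s unofficial autocomplete endpoint. The sketch below relies on that undocumented interface (it is not a supported API and may change or stop working at any time), so treat it as an illustration only:

    import json
    import urllib.parse
    import urllib.request

    term = "boko haram"     # seed phrase of interest
    url = ("https://suggestqueries.google.com/complete/search?client=firefox&q="
           + urllib.parse.quote_plus(term))

    # The endpoint returns JSON of the form [query, [suggestion, suggestion, ...]].
    with urllib.request.urlopen(url, timeout=10) as reply:
        suggestions = json.loads(reply.read().decode("utf-8", "replace"))[1]

    for suggestion in suggestions:
        print(suggestion)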

The right to be forgotten

Having one’s name associated with undesirable entities and concepts such as ‘fraudster’ is not a happy situation to be in and can seriously damage an individual’s standing in today’s information society. Negative association has become such a serious issue that many individuals have taken legal action against service providers such as Google to remove negative suggested search results37. People legally challenging suggested search results is part of a wider developing concept called the ‘right to be forgotten’, which at its core proposes that an individual should have the right to remove personal information from the Internet as they age.

Take, for example, the search phrase ‘Boko Haram’. Common suggested searches that Google returns include generic phrases such as ‘terrorist’ and ‘atrocity’; however, there are often terms suggested such as ‘Maiduguri’, the town in northern Nigeria that the group’s geographic centre of gravity revolves around. The important point is that although the researcher may not have known that Maiduguri was associated with Boko Haram, others did, and they searched for articles using both entities. By querying the suggested search database in Google the investigator has the ability to vastly speed up their investigation by tapping into the knowledge that has been generated by others.

People are strange

Records within the suggested search database are not created by a few casual searches; instead, many thousands of repetitions of a search phrase are required to create a suggested search. The majority of these suggestions make complete sense; however, some of them are decidedly strange. Try entering the phrase ‘sometimes I like to lie’ and see what comes back. Thousands of people must have entered the resulting phrase as a search term for this strange suggestion to be created…

As with just about everything within this book there is a specialist tool for querying the suggested search database in Google: http://soovle.com/.

Soovle has the added benefit of searching the Wikipedia, Amazon, Yahoo, Bing, YouTube and Answers suggested search datasets as well.

Conclusion

The Surface Web is the most easily accessible and permissive layer of the Internet and will most likely be where most researchers spend the majority of their time. The Surface Web also provides the most variety for the application of novel tools and techniques, hence the length of this chapter compared to the following chapters in this book.

10 An operating system is the piece of software that runs the hardware of your computer. In plain terms, it’s the thing that you see on the screen when you boot up your computer.

11 If further clarification of how to integrate add-ons into either Firefox or Chrome is required then judicious Googling will display many ‘how to’ instruction manuals on the subject. Additionally, the online video service YouTube lists several hundred videos discussing the subject with onscreen walkthroughs.

12 https://addons.mozilla.org/en-US/firefox/addon/flagfox/?src=search

13 https://addons.mozilla.org/en-US/firefox/addon/email-extractor/?src=search

14 https://addons.mozilla.org/en-US/firefox/addon/translate-this

15 https://addons.mozilla.org/en-US/firefox/addon/directions-with-google-maps/?src=search

16 https://addons.mozilla.org/en-US/firefox/addon/resurrect-pages

17 https://addons.mozilla.org/en-US/firefox/addon/mementofox

18 https://addons.mozilla.org/en-US/firefox/addon/easy-youtube-video-download/?src=search

19 https://addons.mozilla.org/en-US/firefox/addon/abduction

20 https://addons.mozilla.org/en-US/firefox/addon/save-text-to-file/?src=search

21 https://addons.mozilla.org/en-US/firefox/addon/search-by-image-by-google

22 https://addons.mozilla.org/en-US/firefox/addon/carrot2/?src=userprofile

23 I remember during the era of dial-up modems when I could make a cup of coffee while a page loaded – happy days…

24 The image search button is located just beneath the search bar in Google.

25 Sol. (9 Dec 2008). Dogpile comes out at the top of the pile. Downloaded from http://federatedsearchblog.com/2007/12/09/dogpile-comes-out-at-the-top-of-the-pile

26 www.zuula.com

27 www.dogpile.com

28 www.polymeta.com

29 www.iseek.com

30 www.cluuz.com

31 http://search.carrot2.org/stable/search

32 Dogpile has queried Bing in the past and may do so again in the future.

33 As of writing this book, many of the search engine connections to Zuula are broken. The site experiences periodic outages and the status of the development for this project is unknown.

34 If you are interested in seeing exactly which websites are censored in which countries, take a look at Herdict.org, a site that tracks Internet censorship.

35 There are computer forensic techniques that can be used to enter the file creation date and other dates associated with files into an evidence chain. However, this is a highly specialist skill that is outside the scope of this book.

36 Long, J., Temmingh, R. and Petkov, P. (2007). Google Hacking. Syngress.

37 https://reportingproject.net/occrp/index.php/en/ccwatch/cc-watch-briefs/2571-hong-kong-man-sues-google-over-search-suggestions
