Deep Dive

This section explains how Skirnir's main functions work by going through the program code.

Surface Crawl

Boolean Request

A Boolean request is a type of query used to retrieve information based on certain conditions or criteria. It uses Boolean logic, which involves the use of logical operators such as AND, OR and NOT to combine conditions.

We use Boolean requests during crawling to refine results, keeping only profiles on the selected social networks. Most internet search engines (Google, DuckDuckGo, Bing, etc.) support them.

Boolean and search operators used in Skirnir:

  • AND => Combines conditions, all must be true.

  • OR => Combines conditions, at least one must be true.

  • site: => Restricts search to specific site/domain.

  • inurl: => Searches for specific term in URL.

  • NOT => Excludes specific term or condition.
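As a minimal sketch, the operators listed above could be combined into a query string as follows. The function name `build_profile_query` and its parameters are illustrative and are not taken from Skirnir's code.

```python
def build_profile_query(full_name: str, site: str, url_fragment: str = "") -> str:
    """Combine an exact-phrase name with a site: filter and an optional inurl: filter."""
    parts = [f'"{full_name}"', f"site:{site}"]
    if url_fragment:
        parts.append(f"inurl:{url_fragment}")
    # Every condition must hold, so the parts are joined with AND.
    return " AND ".join(parts)


query = build_profile_query("John Doe", "linkedin.com", "/in")
print(query)
```

Quoting the name makes the search engine treat it as an exact phrase, while `site:` and `inurl:` narrow the results to profile pages.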

Example

This query only returns LinkedIn profiles linked to John Doe.

"John Doe" AND site:linkedin.com AND inurl:/in

Surface Crawl Query

The program currently has four queries for surface crawling, each tailored to a specific purpose: basic crawl, nickname crawl, first name nickname variation crawl, and keyword search.

We use four queries because search engines limit query length: a request cannot exceed 32 criteria (words). Rather than sending one long request, we split it into smaller ones that each stay within this limit.
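The splitting idea can be sketched as follows. This is only an illustration: the helper names are hypothetical, and counting every whitespace-separated token (operators included) as one term is a conservative assumption, since the exact way a search engine counts terms is not specified here.

```python
MAX_TERMS = 32  # limit quoted above; how terms are counted is an assumption


def count_terms(query: str) -> int:
    """Conservative count: every whitespace-separated token is one term."""
    return len(query.split())


def build_query(full_name: str, sites: list[str]) -> str:
    site_group = " OR ".join(f"site:{s}" for s in sites)
    return f'"{full_name}" AND ({site_group})'


def split_queries(full_name: str, sites: list[str], max_terms: int = MAX_TERMS) -> list[str]:
    """Greedily pack sites into queries that each stay within max_terms."""
    queries, batch = [], []
    for site in sites:
        candidate = batch + [site]
        if batch and count_terms(build_query(full_name, candidate)) > max_terms:
            # Adding this site would exceed the limit: close the current query.
            queries.append(build_query(full_name, batch))
            batch = [site]
        else:
            batch = candidate
    if batch:
        queries.append(build_query(full_name, batch))
    return queries
```

With a two-word name and this counting rule, each query can hold up to 15 sites, so a list of 20 sites yields two queries.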

To be more resilient against anti-crawling we use two methods (plus another that depends on the user):

  1. Pause between requests: A sleep call between requests spaces out the queries. This mimics human behavior and helps avoid triggering anti-crawling measures by giving the impression of natural browsing patterns.

  2. Use of user agents: Alternate between different user agents for each query. This makes it more difficult for Google to identify and block browsing activity, as it appears to come from different devices or browsers. We also use mobile user agents, which are blocked less often than desktop user agents.

  3. Use of proxy: You can import your personal proxy into the program. Using a proxy is the most effective way of preventing the program from being blocked during crawling.
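The three measures above can be sketched with the Python standard library. This is not Skirnir's actual code: the function names, the user-agent strings, and the pause range are assumptions chosen for illustration.

```python
import random
import time
import urllib.request
from typing import Iterable, Iterator, Optional, Tuple

# Illustrative user-agent strings (one desktop, one mobile); not Skirnir's actual list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
]


def build_request(url: str) -> urllib.request.Request:
    """Attach a randomly chosen user agent to the request."""
    return urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})


def build_opener(proxy_url: Optional[str] = None) -> urllib.request.OpenerDirector:
    """Route traffic through the proxy when one is supplied."""
    if proxy_url:
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        return urllib.request.build_opener(handler)
    return urllib.request.build_opener()


def crawl(urls: Iterable[str], proxy_url: Optional[str] = None,
          min_pause: float = 2.0, max_pause: float = 5.0) -> Iterator[Tuple[str, int]]:
    """Fetch each URL in turn, sleeping a random interval between requests."""
    opener = build_opener(proxy_url)
    for url in urls:
        with opener.open(build_request(url), timeout=10) as response:
            status = response.status
        yield url, status
        # Random pause (assumed range) so requests are not evenly spaced.
        time.sleep(random.uniform(min_pause, max_pause))
```

A random pause is used here rather than a fixed one so that the request timing looks less regular; whether Skirnir uses a fixed or random delay is not stated above.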

Details of the Surface Crawl queries

Basic Crawling

Performs a basic crawl of the internet to find profiles on the selected social networks for the given first name and last name.

Example

"John Doe" AND (site:linkedin.com OR site:instagram.com OR site:facebook.com)

Nickname Crawling

Conducts a crawl for the given nickname. The first name and last name are kept in the query to improve the results.

Example

("DarkSasuke" OR "John Doe") AND (site:linkedin.com OR site:instagram.com)

First name nickname variation crawl

Executes a crawl aimed at identifying potential nicknames associated with the provided first name, utilizing a CSV database for nickname matching.

(("Johnny" OR "JohnJohn") AND "Doe") AND (site:linkedin.com OR site:facebook.com)

Keyword Crawling

Launches a crawl driven by user-provided keywords, which can range from single words to more complex boolean queries.

"John Doe" AND (site:linkedin.com) AND NOT unknown

with keyword = NOT unknown
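The four queries described above can be reproduced with a few small helpers. This is a sketch under our own naming; the function names and signatures are hypothetical and do not come from Skirnir's source.

```python
def quote(term: str) -> str:
    """Wrap a term in double quotes so it is searched as an exact phrase."""
    return f'"{term}"'


def site_filter(sites: list[str]) -> str:
    return "(" + " OR ".join(f"site:{s}" for s in sites) + ")"


def basic_query(first: str, last: str, sites: list[str]) -> str:
    return quote(f"{first} {last}") + " AND " + site_filter(sites)


def nickname_query(nickname: str, first: str, last: str, sites: list[str]) -> str:
    full_name = quote(f"{first} {last}")
    return f"({quote(nickname)} OR {full_name}) AND {site_filter(sites)}"


def variation_query(variations: list[str], last: str, sites: list[str]) -> str:
    names = " OR ".join(quote(v) for v in variations)
    return f"(({names}) AND {quote(last)}) AND {site_filter(sites)}"


def keyword_query(first: str, last: str, sites: list[str], keyword: str) -> str:
    # The keyword is appended verbatim, so it may itself contain operators.
    return basic_query(first, last, sites) + " AND " + keyword


print(basic_query("John", "Doe", ["linkedin.com", "instagram.com", "facebook.com"]))
print(nickname_query("DarkSasuke", "John", "Doe", ["linkedin.com", "instagram.com"]))
print(variation_query(["Johnny", "JohnJohn"], "Doe", ["linkedin.com", "facebook.com"]))
print(keyword_query("John", "Doe", ["linkedin.com"], "NOT unknown"))
```

Each call prints the corresponding example query shown in this section.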

Deep Crawl

Aliases Generation

Skirnir generates possible aliases from the given parameters (firstname, lastname, birthday, nickname, alias). The size of the generated aliases can be limited with the field "limit the size of the generated aliases" in the main window.

The generation process applies several operations to create custom aliases:

  • First character only

  • Vowel removal

  • Various delimiters ("-" "_" "." " ")

The generation also handles the birthday, the alias/nickname, and compound first names and last names.

For example, let's use the following parameters:

  • Firstname: John

  • Lastname: Doe

  • Birthday: 31/10/2020

We can get results such as:

JhnD3110
JohnDoe-2020
J.D.31-10-2020
John_D_31
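A simplified sketch of such a generator is shown below. It only covers the first name, last name, and birthday, and it is an assumption about how the operations could be combined, not Skirnir's actual algorithm; all names are hypothetical.

```python
from itertools import product
from typing import Optional

DELIMITERS = ["", "-", "_", ".", " "]


def name_variants(name: str) -> set[str]:
    """Full name, first character only, and name with vowels removed."""
    no_vowels = "".join(c for c in name if c.lower() not in "aeiou")
    return {v for v in (name, name[:1], no_vowels) if v}


def date_variants(birthday: str) -> set[str]:
    """Birthday fragments; the input is expected as DD/MM/YYYY."""
    day, month, year = birthday.split("/")
    variants = {day, year, day + month, day + month + year}
    # Full date written with each non-empty delimiter, e.g. 31-10-2020.
    variants.update(d.join((day, month, year)) for d in DELIMITERS if d)
    return variants


def generate_aliases(first: str, last: str, birthday: str,
                     max_length: Optional[int] = None) -> set[str]:
    aliases = set()
    combos = product(name_variants(first), DELIMITERS,
                     name_variants(last), DELIMITERS, date_variants(birthday))
    for f, d1, l, d2, date in combos:
        alias = f + d1 + l + d2 + date
        # Optional size limit, mirroring the field in the main window.
        if max_length is None or len(alias) <= max_length:
            aliases.add(alias)
    return aliases


aliases = generate_aliases("John", "Doe", "31/10/2020")
print(len(aliases), "JhnD3110" in aliases)
```

Under these assumptions the four sample results above are all part of the generated set, along with many other combinations.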

Mapping nickname and firstname

In the code, we utilize a file to associate first names with possible nicknames. For instance, for a name like "John Doe," we may want to try variations like "JohnJohn," "Johnny," or "Jon." This nickname matching mechanism is employed specifically when both a first name and a last name are provided as parameters for the program during a crawl.

You're welcome to contribute additional nickname mappings to the "mapping_nicknames_names" file by opening a pull request on the project's GitHub repository.
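The lookup can be sketched as follows. The layout of the mapping file is not described here, so this example assumes one row per first name followed by its nicknames; the sample rows and function names are illustrative only.

```python
import csv
import io

# Assumed layout: first name in the first column, nicknames in the following ones.
SAMPLE_CSV = """john,Johnny,JohnJohn,Jon
william,Will,Bill,Billy
"""


def load_nickname_map(csv_file) -> dict[str, list[str]]:
    """Read rows of the form first_name,nickname1,nickname2,... into a dict."""
    mapping: dict[str, list[str]] = {}
    for row in csv.reader(csv_file):
        if not row:
            continue
        first_name, *nicknames = (cell.strip() for cell in row)
        mapping.setdefault(first_name.lower(), []).extend(n for n in nicknames if n)
    return mapping


def nicknames_for(first_name: str, mapping: dict[str, list[str]]) -> list[str]:
    """Case-insensitive lookup; unknown first names yield an empty list."""
    return mapping.get(first_name.lower(), [])


mapping = load_nickname_map(io.StringIO(SAMPLE_CSV))
print(nicknames_for("John", mapping))
```

The returned nicknames can then be combined with the last name to build the first name nickname variation query.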

Request, website viewers

We use mirror websites to check whether the previously generated nicknames exist on the selected social networks.

To prevent potential blocking, we implement two mechanisms:

  • Random selection of a mirror website from a predefined list associated with the social network.

  • Utilization of a proxy if a proxy file is provided to the program.

Only the existing profiles are then displayed in the result window.
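A minimal sketch of this check is given below. The mirror URLs are placeholders on the reserved `.example` domain, the function names are hypothetical, and treating an HTTP 200 response as "profile exists" is an assumption; real mirrors may need a different test.

```python
import random
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterable, Optional

# Hypothetical mirror URL templates per social network (placeholders only).
MIRRORS = {
    "instagram": [
        "https://mirror-one.example/{alias}",
        "https://mirror-two.example/u/{alias}",
    ],
}


def http_status(url: str, proxy_url: Optional[str] = None) -> int:
    """Return the HTTP status code for url, optionally through a proxy."""
    handlers = []
    if proxy_url:
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 when the profile page is missing


def profile_exists(network: str, alias: str,
                   fetch_status: Callable[..., int] = http_status,
                   proxy_url: Optional[str] = None) -> bool:
    # Pick a random mirror so requests are spread across several sites.
    template = random.choice(MIRRORS[network])
    url = template.format(alias=urllib.parse.quote(alias))
    return fetch_status(url, proxy_url) == 200  # assumption: 200 means the profile exists


def existing_profiles(network: str, aliases: Iterable[str],
                      fetch_status: Callable[..., int] = http_status,
                      proxy_url: Optional[str] = None) -> list[str]:
    """Keep only the aliases for which a profile was found."""
    return [a for a in aliases if profile_exists(network, a, fetch_status, proxy_url)]
```

Passing the status function as a parameter keeps the selection logic testable without network access.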

Scoring Result

After surface or deep crawling, the results are ranked by relevance according to this matrix:

When crawling relies on an alias alone (by selecting the alias option exclusively), results are ranked by comparing the URL with the provided alias using the Jaro-Winkler distance. This metric measures how dissimilar two character strings are and gives more weight to the beginning of the strings.

For example, let's use the alias "Johnny75" and compare it with these URLs:

socialnetwork.com/Johnni75
socialnetwork.com/Johnny85
socialnetwork.com/Johnny75

Jaro-Winkler gives extra weight to a shared prefix, here the beginning of the alias "Johnny75". Therefore:

socialnetwork.com/Johnny75: matches the alias exactly.
socialnetwork.com/Johnny85: shares the prefix "Johnny" with the alias but differs in one digit.
socialnetwork.com/Johnni75: similar to the alias, but the difference falls inside the "Johnny" segment.

Based on this comparison, the URLs are displayed in the following order:

socialnetwork.com/Johnny75
socialnetwork.com/Johnny85
socialnetwork.com/Johnni75
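For illustration, here is a textbook implementation of the metric together with a small ranking helper. It computes the Jaro-Winkler similarity (1.0 means identical), so a higher score corresponds to a smaller distance. Comparing only the last path segment of the URL with the alias, and the function names themselves, are assumptions and may differ from what Skirnir does.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix (capped)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def rank_urls(alias: str, urls: list[str]) -> list[str]:
    """Sort URLs by similarity between the alias and the last path segment."""
    def score(url: str) -> float:
        handle = url.rstrip("/").rsplit("/", 1)[-1]
        return jaro_winkler(alias, handle)
    return sorted(urls, key=score, reverse=True)


urls = ["socialnetwork.com/Johnni75", "socialnetwork.com/Johnny85", "socialnetwork.com/Johnny75"]
print(rank_urls("Johnny75", urls)[0])
```

The exact match scores 1.0 and is ranked first. With the standard four-character cap on the prefix bonus used in this sketch, the two near matches both score 0.95, so their relative order depends on how ties are broken; Skirnir's own implementation may weigh them differently.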
