
How Search Engines Find and Rank Your Pages

To many, Google seems like magic. It indexes the entire Internet and returns the most relevant results in a flash. How does it know what we want to see? Understanding how Google accomplishes such a feat isn’t hard to wrap your head around. First, you have to learn the technology it uses. Once you grasp that, you’ll know how websites get “indexed,” and you’ll understand how to build your website so that it’s readable by Googlebot (and other search engine robots). You’ll be able to speak Google’s language, and that will help you generate more traffic to your site.

How Ranking Works

Website ranking starts with an army of robots (or, more precisely, computer software). Google and other search engines send millions of “bots” to “crawl” the Internet’s websites. When a bot crawls a website, it finds and stores hundreds of data points that help determine:

  1. what your website is about, and
  2. the types of information people can find there.

All of this information that the bots have collected then gets “indexed,” or filed in Google’s massive filing cabinet. When a user searches on Google or Bing, the search engine rifles through this massive filing cabinet to serve the most relevant results. If you type in “cat videos,” for example, Google goes to its cat video filing tab and shows you what has been indexed there.

So how do you make sure that your webpage about cat videos shows up in the results? Google uses more than 200 “ranking factors” – tiny pieces of information on your webpage that tell Google what your webpage is all about. These include things like keywords, titles, URLs, and PageRank. All of these factors matter (some more than others), and when you use them strategically, you help the army of robots know what your website offers.

What Is Crawling?

The phrase “army of robots” might create an alarming mental picture. In reality, this “army of bots” is a network of supercomputers that fetch information from billions of web pages. These computers scan your webpage and take notes about what’s included. This is known as “crawling,” and all the information that’s collected then gets “indexed.” Using machine learning, Google and Bing tell their bots which pages to index, how often to index them, and how many pages from your site should be indexed. When a crawler visits your website, you can help it understand what it’s looking at.

In fact, there are two files that you can include in your website’s code that help bots crawl your site:

Robots.txt – a file that tells Googlebot and other crawlers what to do: which pages they may index and, optionally, how they should treat links. You can set crawl delays – to prevent a bot from crawling your site too quickly and slowing down your server – or you can shut a particular bot out of your site entirely (e.g. allowing Google and Bing, but disallowing Yahoo).
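For example, a robots.txt file along these lines would slow one crawler down and shut another out entirely. This is only a sketch: the names are the crawlers’ standard user-agent strings (Yahoo’s crawler identifies itself as “Slurp”), and not every crawler honors the Crawl-delay directive – Googlebot, for one, ignores it.

User-agent: Bingbot
Crawl-delay: 10

User-agent: Slurp
Disallow: /

User-agent: *
Allow: /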

Search engines use bots to determine whether a website provides the product or information that somebody is searching for. If a website passes the test, it becomes indexed for that topic. When a website is indexed, it shows up as a search result.

A simple example is a search for deep dish pizza in Chicago. The results of this search will be websites like Yelp and the websites of restaurants that offer the popular dish. Google’s bots have already indexed those restaurants for selling the product being searched for: deep dish pizza in Chicago. That’s why the search results appear almost immediately. So what does it mean for a search engine bot to crawl a webpage?

These bots search millions of web pages to direct the consumer to the right website, a process called crawling. Google’s web crawling bots scour the web for new pages to add to Google’s index. They also check back with web pages that have been updated so they can re-index them.

What happens if a webpage isn’t ready to be indexed because it’s still in development? There is a way to block bots from crawling an unfinished website so that it doesn’t end up poorly or inaccurately indexed.

How To Block Bots From Crawling A Website: Robots.txt

For a bot to find a website, it first has to have access. Website owners put a robots.txt file on their server to steer bots and crawlers toward, or away from, certain sections of their website. There are several other reasons to use a robots.txt file to keep bots from crawling a website or page (a sample file follows this list):

  1. To block content from search engines. Duplicated content, private content, admin sections, and pages under development are blocked so bots can’t access and index them.
  2. To stop bots from crawling advertisements. Bots can read paid links and advertisements and accidentally index a site for that content. Since these links don’t always pertain to the website’s niche, a website owner wants to tell the bots that they belong to an advertiser.
  3. To ensure that the only bots allowed on the site are reputable bots. A non-reputable bot is a bot created by hackers that looks for websites that answer specific search queries. Unlike Google’s bots, these bots steal the information they find and exploit out-of-date software and plugins to hack the website. If non-reputable bots continuously visit a website, they can also drastically decrease page speed.
  4. To block access for one of the above reasons when you don’t have immediate access to the web server itself.
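Putting those reasons into practice, a robots.txt file that blocks private sections and shuts out a misbehaving bot might look something like the sketch below. The paths and the “BadBot” name are hypothetical placeholders; substitute your own.

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /dev/

Keep in mind, though, that a non-reputable bot can simply ignore these rules, a point we come back to below.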

When a website is ready to be indexed, there is a way to create a virtual map that makes it easier for the bots to read the content. This virtual map is called a sitemap, and it translates the information on a website into an easy-to-read format for a bot.

The simplest version of a robots.txt file is:

User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml

It’s important to note that the URL of your sitemap is included in the robots.txt file; many websites fail to do this. This version allows all crawlers and bots to crawl every page on your website.

To be clear, a robots.txt file is only a guide for crawlers: it points them toward the content you want indexed. Crawlers and bots can ignore your robots.txt file, and they are not obliged to do as it suggests. So you should always ensure your website is secure.

Sitemap.xml – Like the robots.txt file, your sitemap is a helpful tool for web crawlers. It tells the bot about the organisation of your web pages.

A sitemap can also incorporate metadata – tiny bits of data that help the crawler understand what it’s crawling. Sitemaps are particularly important for large, multi-page websites, for new sites without many external links pointing to them, and for sites with lots of rich media.

It can be difficult for a search engine bot to read a webpage with hundreds of products, let alone stay up to date with its content. A sitemap condenses that information into a structured format called XML. This format lists trackable URLs and lets the bots know how often to check back for new content.

Each <url> entry in a sitemap holds the information for a single webpage. For example, an entry can point to a landing page, a products page, or a page with a blog article. The sitemap is submitted to search engines, which will be explained in further detail later.

Here is what a sitemap looks like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2018-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

The <urlset>, <url>, and <loc> tags must be in every sitemap. The <lastmod>, <changefreq>, and <priority> tags are optional; they can provide more information to the search engine bot, but they aren’t necessary. In other words, a sitemap doesn’t have to include the date of the last modification, how often the content will update, or a priority level.

It could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
</urlset>

Since sitemaps help the bots understand website content faster, it is best to provide as much information as possible. You also don’t have to submit each of a website’s sitemaps individually.

You can submit two or more sitemaps at once with a sitemap index file, which lists the location (and, optionally, the last modification date) of each sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2018-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2018-01-01</lastmod>
  </sitemap>
</sitemapindex>

The only restriction here is that each sitemap file can contain at most 50,000 URLs. The example above is a sitemap index file that points to two sitemaps.
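As a rough sketch of how you might stay under that limit programmatically, the following Python script splits a list of URLs into sitemap files of at most 50,000 entries each and writes a matching sitemap index. The file names and base URL are hypothetical placeholders, and a production version should also XML-escape the URLs.

from datetime import date

MAX_URLS = 50000  # maximum number of URLs allowed per sitemap file

def write_sitemaps(urls, base_url="http://www.example.com"):
    """Write one sitemap per chunk of 50,000 URLs, plus a sitemap index."""
    today = date.today().isoformat()
    sitemap_names = []
    # Write each chunk of up to 50,000 URLs into its own sitemap file.
    for i in range(0, len(urls), MAX_URLS):
        name = f"sitemap{i // MAX_URLS + 1}.xml"
        sitemap_names.append(name)
        with open(name, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in urls[i:i + MAX_URLS]:
                f.write(f"  <url>\n    <loc>{url}</loc>\n    <lastmod>{today}</lastmod>\n  </url>\n")
            f.write("</urlset>\n")
    # Write the index file that points search engines at every sitemap.
    with open("sitemap_index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in sitemap_names:
            f.write(f"  <sitemap>\n    <loc>{base_url}/{name}</loc>\n    <lastmod>{today}</lastmod>\n  </sitemap>\n")
        f.write("</sitemapindex>\n")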

Sitemaps help answer the questions that search engine bots ask when they crawl a webpage. To further help search engine bots index content – especially updated content – there are several important factors to focus on.

