Tag: search engines

Sitemap.xml files: what they are for, how to use them, and how to bypass “Too many URLs” error and size limits

Table of contents

  1. What are Sitemaps
  2. What are the restrictions for sitemap files
  3. How can you compress a sitemap file
  4. Can I use multiple sitemaps?
  5. What is the structure of sitemap files
  6. How to generate sitemap files
  7. How to Import a Sitemap into Google Search Console
  8. Sitemap.xml file status “Couldn't fetch”
  9. Is it necessary to use the sitemap.xml file?
  10. What to do if the sitemap contains an error. How to remove a sitemap file from Google Search Console

What are Sitemaps

Sitemaps are XML-formatted files that list the URLs of your site's pages. You submit them to the Google search engine so that it can quickly discover and index those pages.

What are the restrictions for sitemap files

  1. The file size should not be more than 50 MB (measured uncompressed)
  2. There can be no more than 50,000 links in one file
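Both limits can be checked locally before uploading a file. The following is a minimal sketch, assuming the file uses the standard sitemap namespace; the function name `check_sitemap_limits` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

MAX_BYTES = 50 * 1024 * 1024   # 50 MB, measured on the uncompressed file
MAX_URLS = 50_000
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def check_sitemap_limits(xml_bytes: bytes) -> list[str]:
    """Return a list of limit violations (an empty list means the file fits)."""
    problems = []
    if len(xml_bytes) > MAX_BYTES:
        problems.append(f"file is {len(xml_bytes)} bytes, limit is {MAX_BYTES}")
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{NS}url"))   # count the <url> entries
    if url_count > MAX_URLS:
        problems.append(f"file lists {url_count} URLs, limit is {MAX_URLS}")
    return problems
```

Usage: read the file with `open("sitemap.xml", "rb").read()` and pass the bytes to the function.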

How can you compress a sitemap file

In addition to the simple text format with XML markup, the file can be compressed into a .gz archive. In this case, the file size decreases dramatically because text files compress very well. For example, my 25 MB file was compressed into a 500 KB file.

To do this, it is enough to compress the original sitemap.xml file into .gz format. As a link in Google Search Console, you need to specify the path to the archive, for example: https://site.net/sitemap.xml.gz

If your web browser downloads the https://site.net/sitemap.xml.gz file to your computer instead of showing its content, as it does for sitemap.xml, this is normal. Either way, Google Search Console will be able to process the file.
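The compression step can be done with any gzip tool. Here is a minimal sketch using Python's standard library; the helper name `gzip_sitemap` and the file name are assumptions for the example.

```python
import gzip
import shutil
from pathlib import Path


def gzip_sitemap(source: str) -> Path:
    """Compress `source` into a sibling file with '.gz' appended; return its path."""
    src = Path(source)
    dst = src.with_name(src.name + ".gz")        # sitemap.xml -> sitemap.xml.gz
    with src.open("rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)          # stream, so large files are fine
    return dst
```

Calling `gzip_sitemap("sitemap.xml")` leaves the original file in place and writes sitemap.xml.gz next to it.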

Can I use multiple sitemaps?

For each site or domain resource, you can create multiple Sitemaps and import them all into Google Search Console – this is not only allowed, but also recommended by Google itself for sitemaps that are too large.

If there are many Sitemap files, a complete list of them can be collected in a separate file, called a “Sitemap Index File”. An example of the content of such a sitemap.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<sitemap>
		<loc>https://site.net/sitemaps/sitemap_1.xml</loc>
	</sitemap>
	<sitemap>
		<loc>https://site.net/sitemaps/sitemap_2.xml</loc>
	</sitemap>
	<sitemap>
		<loc>https://site.net/sitemaps/sitemap_3.xml</loc>
	</sitemap>
</sitemapindex>

After that, it is enough to import this main file into Google Search Console.

The rest of the sitemaps listed in the main index file will automatically be imported into the Google Search Console.

To see them, click on the file name. You will see a list of imported Sitemaps.

You need to wait until these files are processed and their status changes to “Success”.
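A long list of URLs can be split into files of at most 50,000 entries, with an index file generated for them. This is a rough sketch under assumed names: the URL pattern, the sitemaps/sitemap_N.xml naming, and both helper functions are illustrative only.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000


def split_urls(urls, chunk_size=MAX_URLS):
    """Split a list of URLs into chunks that each fit into one sitemap file."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]


def build_sitemap_index(sitemap_urls):
    """Return the XML text of a Sitemap Index File listing `sitemap_urls`."""
    root = ET.Element("sitemapindex", xmlns=NS)
    for url in sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = url
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# 120,000 hypothetical page URLs end up in three sitemap files
chunks = split_urls([f"https://site.net/?p={n}" for n in range(1, 120_001)])
index_xml = build_sitemap_index(
    f"https://site.net/sitemaps/sitemap_{i}.xml" for i in range(1, len(chunks) + 1)
)
```

Each chunk would then be written to its own sitemap file, and only the index file needs to be submitted.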

What is the structure of sitemap files

Sitemap files have the following structure:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>https://domain.site.net/?p=1</loc>
		<lastmod>2022-10-08T14:14:27+00:00</lastmod>
		<changefreq>monthly</changefreq>
		<priority>0.8</priority>
	</url>
	<url>
		<loc>https://domain.site.net/?p=2</loc>
		<lastmod>2022-10-08T14:14:27+00:00</lastmod>
		<changefreq>monthly</changefreq>
		<priority>0.8</priority>
	</url>
	<url>
		<loc>https://domain.site.net/?p=3</loc>
		<lastmod>2022-10-08T14:14:27+00:00</lastmod>
		<changefreq>monthly</changefreq>
		<priority>0.8</priority>
	</url>
</urlset>

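Each entry in the example above consists of four elements. Only the URL is required, and Google states that it ignores the frequency and priority values: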

  1. URL
  2. Date of last modification
  3. Frequency of modification (e.g. monthly)
  4. A priority

How to generate sitemap files

If you are using WordPress, then the easiest way is to install a sitemap plugin.

If there is no sitemap plugin for your site engine, then it is quite easy to generate it yourself, since it is just a text file with XML markup.
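As an illustration of generating the file yourself, here is a minimal sketch with Python's standard library. The page list is made up, and only the URL and the last modification date are written out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified_datetime) pairs."""
    root = ET.Element("urlset", xmlns=NS)
    for url, modified in pages:
        entry = ET.SubElement(root, "url")
        ET.SubElement(entry, "loc").text = url   # & and < are escaped on output
        # Timezone-aware datetimes give e.g. 2022-10-08T14:14:27+00:00
        ET.SubElement(entry, "lastmod").text = modified.isoformat(timespec="seconds")
    ET.indent(root, space="\t")                  # requires Python 3.9+
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


pages = [
    ("https://domain.site.net/?p=1",
     datetime(2022, 10, 8, 14, 14, 27, tzinfo=timezone.utc)),
    ("https://domain.site.net/?p=2&lang=en",
     datetime(2022, 10, 8, 14, 14, 27, tzinfo=timezone.utc)),
]
print(build_sitemap(pages))
```

Building the tree with an XML library rather than by string concatenation means that special characters in URLs, such as &, are escaped automatically.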

How to Import a Sitemap into Google Search Console

Go to Google Search Console, select the site you want to submit the Sitemap for, open the Sitemaps report, and enter the URL of the Sitemap.

Sitemap.xml file status “Couldn't fetch”

At first, the status of the sitemap.xml file may be shown as “Couldn't fetch”. This message can appear even if everything is fine with the sitemap.xml file. You just need to wait a little.

In this case the message does not mean that there is a problem with the sitemap.xml file; its turn to be analyzed simply has not come yet.

A little later, the status of the file will change to “Success”, and the report will show how many URLs were discovered through this file.

Later still, you can view the indexing report for the links from the sitemap.xml file.

Is it necessary to use the sitemap.xml file?

In fact, I don't usually use a sitemap.xml file. I add articles to most sites manually and, in my opinion, the sitemap.xml file is not particularly needed, since pages on such sites are indexed very quickly.

But if you're unhappy with your site's indexing speed, or need to quickly report a large number of URLs to be indexed, then try using sitemap.xml files.

What to do if the sitemap contains an error. How to remove a sitemap file from Google Search Console

If, after Google tries to process the Sitemap, you find that it contains errors (for example, an incorrect date format or broken links), you do not have to wait for the next crawl.

You can delete a Sitemap from Google Search Console and add it again right away. After that, quite quickly (within a few minutes), Google will check the Sitemap file again.

To remove a Sitemap file from Google Search Console, click on it. On the page that opens, in the upper right corner, find the button with three horizontal dots. Click it and select “Remove sitemap”.

After that, the Sitemap file will be deleted. Once you have corrected the errors in it, you can immediately re-add the Sitemap file with the same or a different URL.
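Before re-adding the file, errors such as the incorrect date format mentioned above can be caught locally. The following is a simplified sketch, not a full validator: it only checks that the XML is well-formed, that each URL is absolute, and that each date looks like YYYY-MM-DD with an optional time and time zone. The function name is an assumption.

```python
import re
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
# Simplified W3C datetime check: a full date, optionally followed by a time
# and a time zone (year-only and year-month forms are rejected here)
LASTMOD_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?$"
)


def find_sitemap_errors(xml_bytes):
    """Return a list of human-readable problems found in a sitemap."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    errors = []
    for number, entry in enumerate(root.findall(f"{NS}url"), start=1):
        loc = (entry.findtext(f"{NS}loc") or "").strip()
        if not loc.startswith(("http://", "https://")):
            errors.append(f"entry {number}: missing or relative <loc>")
        lastmod = entry.findtext(f"{NS}lastmod")
        if lastmod is not None and not LASTMOD_RE.match(lastmod.strip()):
            errors.append(f"entry {number}: bad <lastmod> {lastmod.strip()!r}")
    return errors
```

Broken links cannot be detected this way; that would require actually requesting each URL.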

How to prevent search engines from indexing only the main page of the site

To prevent search engines from indexing only the main page, while allowing indexing of all other pages, you can use several approaches, depending on the characteristics of a particular site.

1. Using the robots.txt file

If the main page has its own address (usually index.php, index.html, index.htm, main.html and so on), and opening a link like w-e-b.site/ redirects to that address, for example to w-e-b.site/index.htm, then you can use a robots.txt file with something like the following content:

User-agent: *
Disallow: /index.php
Disallow: /index.html
Disallow: /index.htm
Disallow: /main.html

In fact, using an explicit name for the main page is the exception rather than the rule. So let's look at other options.

You can use the following approach:

  1. Deny site-wide access with the “Disallow” directive.
  2. Then allow the indexing of the entire site using the “Allow” directive, except for the main page.

Sample robots.txt file:

User-agent: *
Allow: /?p=
Disallow: /

Placing the “Allow” directive before “Disallow” is the safe order: Google applies the most specific (longest) matching rule regardless of order, but some simpler parsers apply the first rule that matches. The “Allow” directive permits all pages with a URL like “?p=”, and the “Disallow” directive blocks all pages. As a result, indexing of the entire site (including the main page) is prohibited, except for pages with an address like “?p=”.

Let's look at the result of checking two URLs:

  • https://suay.ru/ (main page) – indexing is prohibited
  • https://suay.ru/?p=790#6 (article page) – indexing allowed

In the screenshot, number 1 marks the contents of the robots.txt file, number 2 is the URL being checked, and number 3 is the result of the check.
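The same check can be approximated offline. The sketch below is a simplified model of the “longest matching rule wins” evaluation, using plain prefix matching with no wildcard support; it is not Google's actual parser. The rules are written with a leading slash, which is an assumption about how the file above should be read.

```python
from urllib.parse import urlsplit

# (directive, path prefix) pairs for "User-agent: *"
RULES = [("allow", "/?p="), ("disallow", "/")]


def is_allowed(url, rules=RULES):
    """Longest matching prefix wins; on a tie "allow" wins; no match means allowed."""
    parts = urlsplit(url)
    # Rules are matched against path + query; the #fragment is ignored
    target = (parts.path or "/") + ("?" + parts.query if parts.query else "")
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not target.startswith(prefix):
            continue
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


for url in ("https://suay.ru/", "https://suay.ru/?p=790#6"):
    print(url, "->", "allowed" if is_allowed(url) else "blocked")
```

With these rules the main page is reported as blocked and the article page as allowed, matching the two results listed above.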

2. Using the robots meta tag

If your site consists of separate static files, then add the robots meta tag to the HTML code of the main page file:

<meta name="robots" content="noindex,nofollow">

3. With .htaccess and mod_rewrite

Using .htaccess and mod_rewrite, you can block access to a specific file as follows:

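RewriteEngine On
# Apply the rule only to crawlers whose User-Agent contains Google or Yandex
RewriteCond %{HTTP_USER_AGENT} Google [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Yandex [NC]
# Answer 403 Forbidden for the index files in this directory
RewriteRule ^index\.(php|html?)$ - [F]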

Please note that when you open a link like https://w-e-b.site/ (that is, without specifying the name of the main page), the web server still requests a specific file internally, for example index.php, index.htm or index.html. Therefore, this method of blocking access (and, accordingly, indexing) works even if the main page of your site opens without a specific file name in the URL, as is usually the case.

What does “Programmable Search Engine revenue sharing changes start April 30th, 2021” mean?

The other day I received this letter from AdSense:

Programmable Search Engine revenue sharing changes start April 30th, 2021
Beginning April 30th, 2021, we will discontinue revenue sharing on the following search engines:
The public URL - a link provided for a Google hosted public page for your engine that hosts both the search box and the search results
The Google hosted layout, in which only the search results are displayed on a Google-hosted webpage
Search engines that fall into the two above categories will continue to show ads, but no revenue will be shared.
What does this mean for me?
To continue sharing revenue, you’ll need to use the Search Element deployed on your own site, which will continue to allow for monetization.
What should I do next?
Please review the developer site for more information.

Did you understand any of that? I did not, and the text is even harder to follow in the localized version.

The bottom line is that in the search results on your site, ads will still appear, but Google will take all the money for this ad.

“The Google hosted layout”: what is it? I suspect it is the overlay on top of the site in which the search results are shown, or am I wrong?

The links lead to pages with technical documentation, which does not increase the clarity of the question – does this mean that the "Search engine" blocks made in the AdSense dashboard will no longer make money for the site owners? Or does this not apply to these blocks?

Related article: Search engine ad: why nothing was found and why it doesn’t show ads

From the documentation and from the AdSense dashboard, links lead to https://cse.google.com/cse/all, sometimes to https://programmablesearchengine.google.com/cse – both show the same thing.

The string “Programmable Search Engine” from the letter and the subdomain programmablesearchengine.google.com hint that these are related things.

On the Programmable Search Engine settings tab, you can see the line “Edition - Standard with revenue sharing”.

And you can also see “Public URL” there – yeah, the first paragraph of the letter “The public URL” refers to this, that is, if you search on a page like https://cse.google.com/cse?cx=d95930401ffbc147a, then there will be advertising, but you will not be given money for it.

If you click on the “Get code” button, then the code is approximately the following:

<script async src="https://cse.google.com/cse.js?cx=d95930401ffbc147a"></script>
<div class="gcse-search"></div>

Does this apply to “The Google hosted layout, in which only the search results are displayed on a Google-hosted webpage” or not?!

Technically, an overlay could very well be a “Google-hosted webpage”, that is, a page loaded from Google that is displayed on top of your site. Even an ordinary search block on a site, from a technical point of view, can be a page hosted on Google, which was asynchronously loaded and displayed on your site.

In general, the sent letter is a clear example of how not to make notifications, since nothing is clear from such messages anyway.

Or vice versa – this is an example of how to make a notification, if you want no one to understand anything …

In the end, I managed to figure out what “Google-hosted” means:

The search box is placed on one of your webpages. The search results are displayed on a Google-hosted webpage, which can be opened either in the same window or in a new window.

This means that the upcoming changes will not affect the Search engine blocks created in the AdSense dashboard.

So, they will no longer share the profit if:

  1. Both the search form and the results are placed on a Google page
  2. The search form is placed on your site, and the search results on a Google page

Search engine ad: why nothing was found and why it doesn’t show ads

The ability to set up a custom Google search on your site, including one that displays ads, has been available for at least a decade. But now there is a special block in AdSense called “Search engine”.

What is the profitability of search pages

Approximately 1 person out of 100 visitors to the site will search for something on it. About 1 out of 100 people looking for something will click on the ad. On my sites, search results pages bring in several times less than regular pages. That is, Programmable Search will generate some tangible income only if you have really large volumes of traffic.
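To put these two ratios together: multiplied, they give roughly one ad click per 10,000 visitors. A tiny sketch of the arithmetic follows, with both rates taken from the estimate above; real rates vary from site to site.

```python
def expected_ad_clicks(visitors, search_rate=0.01, ad_click_rate=0.01):
    """Estimate ad clicks on search result pages for a number of visitors."""
    return visitors * search_rate * ad_click_rate


for visitors in (10_000, 100_000, 1_000_000):
    clicks = expected_ad_clicks(visitors)
    print(f"{visitors:>9,} visitors -> about {clicks:.0f} ad clicks")
```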

But this search has other advantages as well:

  • you can set up a search for several sites at once: if a user enters a query that is answered not on this site but on another of your sites, they will see that page in the search results
  • using search, you can promote pages (make them appear in any results), for example, pages with CPA offers

In any case, the site needs a search. Google's search engine is very good, and it brings in some earnings on top.

Using it could not be easier: create a block, add the code to a widget on the site, and you're done! Many years ago, setting up search meant receiving two snippets of code, one for the form and one for the results; this is no longer needed. By default, the results are shown directly in the widget itself.

How to create a Google site search form

In AdSense, go to Ads → Overview → By ad unit. And select “Search engine”.

On the page that opens:

  1. Enter the name.
  2. List the sites you want to search (one on each line).
  3. Enter a search string to see examples of results.
  4. Click the Create button.

When entering a list of sites, the following help is given:

Specify a list of sites to search, one per line. You can add any of the following:

  • Individual pages: www.example.com/page.html
  • Entire site: www.example.com/*
  • Parts of site: www.example.com/docs/* or www.example.com/docs/
  • Entire domain: *.example.com

If you follow these tips, you might think that you must specify “*.suay.site” to enter a domain, but in fact “suay.site” is accepted as well. (Strictly speaking, neither option returns results right away; how to fix that is described below.)

Everything is ready – copy the code and paste it into the website widget.

We check the search on the site and… nothing was found.

Why nothing was found in the Search engine ad unit

Go back to the ad units overview and press the edit button.

The Programmable Search Engine editor will open.

Scroll down the page that opens until you see the “Sites to search” section.

We click on each site and instead of “Include just this specific page or URL pattern I have entered”, switch to “Include all pages whose address contains this URL”.

There is no need to change the code – it remains the same.

We check again: now everything works.

Why does the "Search engine" block not show ads?

To answer this question, you need to go to Setup → Ads, there you will see:

Note: in order to ensure a high quality experience for our advertisers, we are reviewing Programmable Search Engine ad traffic quality. It may take several weeks for revenue sharing to begin.

That is, there may be no advertisements for the first few weeks – nothing can be done, you have to wait.

By the way, check that “Search Engine Monetization” is enabled in the same place.

How to Change the Design of the AdSense Search Engine Ad Block

By default, search results are shown below the input form, stretching into a long “sausage”. You can change that. To do this, on the “Programmable Search Engine” edit page, go to the “Look and feel” tab and select the desired search results design.

On Suay.site, I chose the “Overlay” option, that is, search results are shown in a large area that overlaps the page content.

When adding sites or changing the design, the ad unit code does not need to be changed.

How to promote pages through website search

On the Programmable Search Edit page, go to the Search Features tab and move the Enable promotions slider to the On position.

Add the pages you want to promote.
