What happens if <base href…> is set with a double slash?

Submitted by 寵の児 on 2019-12-21 13:23:10

Question


I would like to understand how to handle a <base href="" /> value in my web crawler, so I tested several combinations in the major browsers and found a behavior with double slashes that I don't understand.

If you don't want to read everything, jump to the results of tests D and E. A demonstration of all tests:
http://gutt.it/basehref.php

Step by step, my test results when requesting http://example.com/images.html:

A - Multiple base href

<html>
<head>
<base target="_blank" />
<base href="http://example.com/images/" />
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

  • only the first <base> with href counts
  • a source starting with / targets the root
  • ../ goes one folder up

B - Without trailing slash

<html>
<head>
<base href="http://example.com/images" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

  • <base href> ignores everything after the last slash so http://example.com/images becomes http://example.com/
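
The same effect can be reproduced outside the browser with the standard `URL` constructor (available in Node.js and modern browsers). A minimal sketch; the variable names are only illustrative:

```javascript
// Base without a trailing slash: "images" is the last path segment and gets replaced.
const withoutSlash = new URL("image.jpg", "http://example.com/images").href;
// Base with a trailing slash: the last segment is empty, so "images/" is kept.
const withSlash = new URL("image.jpg", "http://example.com/images/").href;

console.log(withoutSlash); // http://example.com/image.jpg
console.log(withSlash);    // http://example.com/images/image.jpg
```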

C - How it should be

<html>
<head>
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

  • Same result as in test B, as expected

D - Double Slash

<html>
<head>
<base href="http://example.com/images//" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>

E - Double Slash with whitespace

<html>
<head>
<base href="http://example.com/images/ /" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>

Neither is a "valid" URL, but both are real results from my web crawler. Please explain what happened in D and E so that ../image.jpg could be found, and why the whitespace makes a difference.

Just for your information:

  • <base href="http://example.com//" /> is the same as Test C
  • <base href="http://example.com/ /" /> is completely different. Only ../image.jpg is found
  • <base href="a/" /> finds only /images/image.jpg

Answer 1:


The behavior of base is explained in the HTML spec:

The base element allows authors to specify the document base URL for the purposes of resolving relative URLs.

As shown in your test A, if there are multiple base elements with href, the document base URL is taken from the first one.

Resolving relative URLs is done this way:

Apply the URL parser to url, with base as the base URL, with encoding as the encoding.

The URL parsing algorithm is defined in the URL spec.

It's too complex to be explained here in detail. But basically, this is what happens:

  • A relative URL starting with / is calculated with respect to base URL's host.
  • Otherwise, the relative URL is calculated with respect to base URL's last directory.
  • Be aware that if the base path doesn't end with /, the last part will be a file, not a directory.
  • ./ is the current directory
  • ../ goes one directory up

(Probably, "directory" and "file" are not the proper terminology in URLs)
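
The rules above can be checked with the standard `URL` constructor, which applies the URL spec's parser. The sketch below resolves the six test sources against the base from test A; the names `base`, `sources` and `resolved` are only illustrative.

```javascript
const base = "http://example.com/images/";
const sources = [
  "/images/image.jpg", // starts with /: resolved against the host
  "image.jpg",         // resolved against the last directory of the base
  "./image.jpg",       // ./ is the current directory
  "images/image.jpg",
  "/image.jpg",
  "../image.jpg"       // ../ goes one directory up
];

// Map each relative source to its absolute URL.
const resolved = {};
for (const src of sources) {
  resolved[src] = new URL(src, base).href;
  console.log(src + " -> " + resolved[src]);
}
// /images/image.jpg -> http://example.com/images/image.jpg
// image.jpg         -> http://example.com/images/image.jpg
// ./image.jpg       -> http://example.com/images/image.jpg
// images/image.jpg  -> http://example.com/images/images/image.jpg
// /image.jpg        -> http://example.com/image.jpg
// ../image.jpg      -> http://example.com/image.jpg
```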

Some examples:

  • http://example.com/images/a/./ is http://example.com/images/a/
  • http://example.com/images/a/../ is http://example.com/images/
  • http://example.com/images//./ is http://example.com/images//
  • http://example.com/images//../ is http://example.com/images/
  • http://example.com/images/./ is http://example.com/images/
  • http://example.com/images/../ is http://example.com/
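
These examples can be reproduced by letting the `URL` constructor normalize each string. Note how the empty segment between the two slashes counts as a directory of its own, so ../ only removes that empty segment. A small sketch (the `examples` and `normalized` names are only illustrative):

```javascript
const examples = [
  "http://example.com/images/a/./",
  "http://example.com/images/a/../",
  "http://example.com/images//./",
  "http://example.com/images//../",
  "http://example.com/images/./",
  "http://example.com/images/../"
];

// Parsing removes the . and .. segments but keeps empty segments (//).
const normalized = examples.map(function (u) { return new URL(u).href; });
normalized.forEach(function (u, i) { console.log(examples[i] + " is " + u); });
```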

Note that, in most cases, the server will treat // like /. As said by @poncha,

Unless you're using some kind of URL rewriting (in which case the rewriting rules may be affected by the number of slashes), the uri maps to a path on disk, but in (most?) modern operating systems (Linux/Unix, Windows), multiple path separators in a row do not have any special meaning, so /path/to/foo and /path//to////foo would eventually map to the same file.

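However, / / is in general not treated like //: the space is a real character in the path (the URL parser percent-encodes it as %20), so it names a directory of its own.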
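
Applied to tests D and E, a sketch with the `URL` constructor shows why ../image.jpg is found in both cases while image.jpg is only found in D. The helper `resolveAgainst` is a hypothetical name used for illustration.

```javascript
const relatives = ["image.jpg", "images/image.jpg", "../image.jpg"];

// Resolve every relative URL against the given base href.
function resolveAgainst(baseHref) {
  return relatives.map(function (r) { return new URL(r, baseHref).href; });
}

const testD = resolveAgainst("http://example.com/images//");
const testE = resolveAgainst("http://example.com/images/ /");

console.log(testD);
// [ 'http://example.com/images//image.jpg',        (found if the server treats // like /)
//   'http://example.com/images//images/image.jpg',
//   'http://example.com/images/image.jpg' ]        (../ removed only the empty segment)
console.log(testE);
// [ 'http://example.com/images/%20/image.jpg',     (the directory "%20" does not exist)
//   'http://example.com/images/%20/images/image.jpg',
//   'http://example.com/images/image.jpg' ]        (../ removed the "%20" segment)
```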

You can use the following snippet to resolve your list of relative URLs to absolute ones:

var bases = [
  "http://example.com/images/",
  "http://example.com/images",
  "http://example.com/",
  "http://example.com/images//",
  "http://example.com/images/ /"
];
var urls = [
  "/images/image.jpg",
  "image.jpg",
  "./image.jpg",
  "images/image.jpg",
  "/image.jpg",
  "../image.jpg"
];
function newEl(type, contents) {
  var el = document.createElement(type);
  if(!contents) return el;
  if(!(contents instanceof Array))
    contents = [contents];
  for(var i=0; i<contents.length; ++i)
    if(typeof contents[i] == 'string')
      el.appendChild(document.createTextNode(contents[i]))
    else if(typeof contents[i] == 'object') // contents[i] instanceof Node
      el.appendChild(contents[i])
  return el;
}
function emoticon(str) {
  return {
    'http://example.com/images/image.jpg': 'good',
    'http://example.com/images//image.jpg': 'neutral'
  }[str] || 'bad';
}
var base = document.createElement('base'),
    a = document.createElement('a'),
    output = document.createElement('ul'),
    head = document.getElementsByTagName('head')[0];
head.insertBefore(base, head.firstChild);
for(var i=0; i<bases.length; ++i) {
  base.href = bases[i];
  var test = newEl('li', [
    'Test ' + (i+1) + ': ',
    newEl('span', bases[i])
  ]);
  test.className = 'test';
  var testItems = newEl('ul');
  testItems.className = 'test-items';
  for(var j=0; j<urls.length; ++j) {
    a.href = urls[j];
    var absURL = a.cloneNode(false).href;
      /* Stupid old IE requires cloning
         https://stackoverflow.com/a/24437713/1529630 */
    var testItem = newEl('li', [
      newEl('span', urls[j]),
      ' → ',
      newEl('span', absURL)
    ]);
    testItem.className = 'test-item ' + emoticon(absURL);
    testItems.appendChild(testItem);
  }
  test.appendChild(testItems);
  output.appendChild(test);
}
document.body.appendChild(output);

Stylesheet for the snippet above:

span {
  background: #eef;
}
.test-items {
  display: table;
  border-spacing: .13em;
  padding-left: 1.1em;
  margin-bottom: .3em;
}
.test-item {
  display: table-row;
  position: relative;
  list-style: none;
}
.test-item > span {
  display: table-cell;
}
.test-item:before {
  display: inline-block;
  width: 1.1em;
  height: 1.1em;
  line-height: 1em;
  text-align: center;
  border-radius: 50%;
  margin-right: .4em;
  position: absolute;
  left: -1.1em;
  top: 0;
}
.good:before {
  content: ':)';
  background: #0f0;
}
.neutral:before {
  content: ':|';
  background: #ff0;
}
.bad:before {
  content: ':(';
  background: #f00;
}

You can also play with this snippet:

var resolveURL = (function() {
  var base = document.createElement('base'),
      a = document.createElement('a'),
      head = document.getElementsByTagName('head')[0];
  return function(url, baseurl) {
    if(base) {
      base.href = baseurl;
      head.insertBefore(base, head.firstChild);
    }
    a.href = url;
    var abs = a.cloneNode(false).href;
    /* Stupid old IE requires cloning
       https://stackoverflow.com/a/24437713/1529630 */
    if(base)
      head.removeChild(base);
    return abs;
  };
})();
var base = document.getElementById('base'),
    url = document.getElementById('url'),
    abs = document.getElementById('absolute');
base.onpropertychange = url.onpropertychange = function() {
  if (event.propertyName == "value")
    update()
};
(base.oninput = url.oninput = update)();
function update() {
  abs.value = resolveURL(url.value, base.value);
}

Stylesheet for this snippet:

label {
  display: block;
  margin: 1em 0;
}
input {
  width: 100%;
}

Markup for this snippet:

<label>
  Base url:
  <input id="base" value="http://example.com/images//foo////bar/baz"
         placeholder="Enter your base url here" />
</label>
<label>
  URL to be resolved:
  <input id="url" value="./a/b/../c"
         placeholder="Enter your URL here">
</label>
<label>
  Resulting url:
  <input id="absolute" readonly>
</label>
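
For a crawler that runs outside a browser, a DOM-free variant of the resolver above might look like the sketch below. It assumes, following my reading of the HTML spec, that a relative <base href> (such as "a/") is itself resolved against the document's URL first, and that an unparsable href is ignored. The name `resolveForCrawler` and its parameters are hypothetical.

```javascript
// url:         the relative or absolute URL found in the page
// baseHref:    value of the first <base href>, or null if there is none
// documentURL: absolute URL the page was fetched from
function resolveForCrawler(url, baseHref, documentURL) {
  let base = documentURL;
  if (baseHref !== null && baseHref !== undefined) {
    try {
      // A relative base href is resolved against the document URL.
      base = new URL(baseHref, documentURL).href;
    } catch (e) {
      // Unparsable base href: keep the document URL as the base (assumption).
    }
  }
  return new URL(url, base).href; // throws if url itself cannot be parsed
}

const page = "http://example.com/images.html";
console.log(resolveForCrawler("image.jpg", "a/", page));         // http://example.com/a/image.jpg
console.log(resolveForCrawler("/images/image.jpg", "a/", page)); // http://example.com/images/image.jpg
console.log(resolveForCrawler("image.jpg", null, page));         // http://example.com/image.jpg
```

With the base "a/", only the source starting with / still points at the existing image, which matches the last observation in the question.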


Source: https://stackoverflow.com/questions/29122106/what-happens-if-base-href-is-set-with-a-double-slash
