Question
I would like to understand how to handle a <base href="" /> value in my web crawler, so I tested several combinations in the major browsers and finally found a double-slash case that I don't understand.
If you don't want to read everything, jump to the results of tests D and E. Demonstration of all tests:
http://gutt.it/basehref.php
Step by step, here are my test results when calling http://example.com/images.html:
A - Multiple base href
<html>
<head>
<base target="_blank" />
<base href="http://example.com/images/" />
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>
Conclusion
- only the first <base> with href counts
- a source starting with / targets the root
- ../ goes one folder up
B - Without trailing slash
<html>
<head>
<base href="http://example.com/images" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>
Conclusion
- <base href> ignores everything after the last slash, so http://example.com/images becomes http://example.com/
C - How it should be
<html>
<head>
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>
Conclusion
- Same result as in Test B, as expected
D - Double Slash
<html>
<head>
<base href="http://example.com/images//" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>
E - Double Slash with whitespace
<html>
<head>
<base href="http://example.com/images/ /" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>
Neither is a "valid" URL, but both are real results from my web crawler. Please explain what happened in D and E so that ../image.jpg could be found, and why the whitespace makes a difference.
Just for your interest:
- <base href="http://example.com//" /> is the same as Test C
- <base href="http://example.com/ /" /> is completely different: only ../image.jpg is found
- <base href="a/" /> finds only /images/image.jpg
Answer 1:
The behavior of base is explained in the HTML spec:
The base element allows authors to specify the document base URL for the purposes of resolving relative URLs.
As shown in your test A, if there are multiple base elements with href, the document base URL is taken from the first one.
Resolving relative URLs is done this way:
Apply the URL parser to url, with base as the base URL, with encoding as the encoding.
The URL parsing algorithm is defined in the URL spec.
It's too complex to explain here in detail, but basically this is what happens:
- A relative URL starting with / is resolved with respect to the base URL's host.
- Otherwise, the relative URL is resolved with respect to the base URL's last directory.
- Be aware that if the base path doesn't end with /, its last part is treated as a file, not a directory.
- ./ is the current directory.
- ../ goes one directory up.
("Directory" and "file" are probably not the proper terms for URLs.)
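These rules can be checked outside a page with the standard URL constructor, which runs the URL spec's parser. This is a minimal sketch, assuming an environment that provides a global URL (current browsers, Node.js); the helper name resolveAll is made up for illustration:

```javascript
// Sources taken from the question's tests.
const sources = [
  "/images/image.jpg",
  "image.jpg",
  "./image.jpg",
  "images/image.jpg",
  "/image.jpg",
  "../image.jpg"
];

// Hypothetical helper: resolve every source against one base URL.
function resolveAll(base) {
  const result = {};
  for (const src of sources) {
    result[src] = new URL(src, base).href;
  }
  return result;
}

// Base path ends with "/": "images" is the last directory.
const withSlash = resolveAll("http://example.com/images/");
// Base path does not end with "/": "images" is treated as a file and dropped.
const withoutSlash = resolveAll("http://example.com/images");

console.log(withSlash);
console.log(withoutSlash);
```

With the trailing slash, image.jpg resolves to http://example.com/images/image.jpg; without it, the same source resolves to http://example.com/image.jpg, which matches tests A and B.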
Some examples:
- http://example.com/images/a/./ is http://example.com/images/a/
- http://example.com/images/a/../ is http://example.com/images/
- http://example.com/images//./ is http://example.com/images//
- http://example.com/images//../ is http://example.com/images/
- http://example.com/images/./ is http://example.com/images/
- http://example.com/images/../ is http://example.com/
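The examples above can be reproduced with the URL constructor, since parsing an absolute URL also removes its ./ and ../ segments. This is a small sketch, assuming a global URL is available; the variable names are only for illustration:

```javascript
// The six example URLs, each still containing a "./" or "../" segment.
const examples = [
  "http://example.com/images/a/./",
  "http://example.com/images/a/../",
  "http://example.com/images//./",
  "http://example.com/images//../",
  "http://example.com/images/./",
  "http://example.com/images/../"
];

// Parsing normalizes the path; href returns the serialized result.
const normalized = examples.map(function (u) {
  return new URL(u).href;
});

examples.forEach(function (u, i) {
  console.log(u + " is " + normalized[i]);
});
```

Note that the empty segment between the two slashes in images// survives ./ but is consumed by ../, exactly like a named directory would be.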
Note that, in most cases, // behaves like /. As @poncha said:
Unless you're using some kind of URL rewriting (in which case the rewriting rules may be affected by the number of slashes), the uri maps to a path on disk, but in (most?) modern operating systems (Linux/Unix, Windows), multiple path separators in a row do not have any special meaning, so /path/to/foo and /path//to////foo would eventually map to the same file.
However, in general / / (with whitespace between the slashes) won't become //: the whitespace forms a path segment of its own.
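To see what happens in tests D and E, the two bases can be resolved with the URL constructor. This is a hedged sketch, assuming a global URL is available; the variable names are illustrative, and whether a server answers /images//image.jpg like /images/image.jpg depends on the server, not on the parser:

```javascript
const doubleSlash = "http://example.com/images//";  // base of test D
const withSpace = "http://example.com/images/ /";   // base of test E

// D: the empty segment between the two slashes counts as a directory,
// so a plain file name lands behind the double slash.
const dPlain = new URL("image.jpg", doubleSlash).href;
// → http://example.com/images//image.jpg

// D: "../" removes that empty segment and leaves the real images directory.
const dUp = new URL("../image.jpg", doubleSlash).href;
// → http://example.com/images/image.jpg

// E: the space is a directory of its own (serialized as %20),
// so a plain file name ends up inside it.
const ePlain = new URL("image.jpg", withSpace).href;
// → http://example.com/images/%20/image.jpg

// E: "../" leaves the space directory again.
const eUp = new URL("../image.jpg", withSpace).href;
// → http://example.com/images/image.jpg

console.log(dPlain, dUp, ePlain, eUp);
```

In both tests ../image.jpg ends up at http://example.com/images/image.jpg, which is why it was found. The plain image.jpg only works in D, and only because the server treats // like /.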
You can use the following snippet to resolve your list of relative URLs to absolute ones:
var bases = [
  "http://example.com/images/",
  "http://example.com/images",
  "http://example.com/",
  "http://example.com/images//",
  "http://example.com/images/ /"
];
var urls = [
  "/images/image.jpg",
  "image.jpg",
  "./image.jpg",
  "images/image.jpg",
  "/image.jpg",
  "../image.jpg"
];
function newEl(type, contents) {
  var el = document.createElement(type);
  if(!contents) return el;
  if(!(contents instanceof Array))
    contents = [contents];
  for(var i=0; i<contents.length; ++i)
    if(typeof contents[i] == 'string')
      el.appendChild(document.createTextNode(contents[i]));
    else if(typeof contents[i] == 'object') // contents[i] instanceof Node
      el.appendChild(contents[i]);
  return el;
}
function emoticon(str) {
  return {
    'http://example.com/images/image.jpg': 'good',
    'http://example.com/images//image.jpg': 'neutral'
  }[str] || 'bad';
}
var base = document.createElement('base'),
    a = document.createElement('a'),
    output = document.createElement('ul'),
    head = document.getElementsByTagName('head')[0];
head.insertBefore(base, head.firstChild);
for(var i=0; i<bases.length; ++i) {
  base.href = bases[i];
  var test = newEl('li', [
    'Test ' + (i+1) + ': ',
    newEl('span', bases[i])
  ]);
  test.className = 'test';
  var testItems = newEl('ul');
  testItems.className = 'test-items';
  for(var j=0; j<urls.length; ++j) {
    a.href = urls[j];
    var absURL = a.cloneNode(false).href;
    /* Stupid old IE requires cloning
       https://stackoverflow.com/a/24437713/1529630 */
    var testItem = newEl('li', [
      newEl('span', urls[j]),
      ' → ',
      newEl('span', absURL)
    ]);
    testItem.className = 'test-item ' + emoticon(absURL);
    testItems.appendChild(testItem);
  }
  test.appendChild(testItems);
  output.appendChild(test);
}
document.body.appendChild(output);
span {
  background: #eef;
}
.test-items {
  display: table;
  border-spacing: .13em;
  padding-left: 1.1em;
  margin-bottom: .3em;
}
.test-item {
  display: table-row;
  position: relative;
  list-style: none;
}
.test-item > span {
  display: table-cell;
}
.test-item:before {
  display: inline-block;
  width: 1.1em;
  height: 1.1em;
  line-height: 1em;
  text-align: center;
  border-radius: 50%;
  margin-right: .4em;
  position: absolute;
  left: -1.1em;
  top: 0;
}
.good:before {
  content: ':)';
  background: #0f0;
}
.neutral:before {
  content: ':|';
  background: #ff0;
}
.bad:before {
  content: ':(';
  background: #f00;
}
You can also play with this snippet:
var resolveURL = (function() {
  var base = document.createElement('base'),
      a = document.createElement('a'),
      head = document.getElementsByTagName('head')[0];
  return function(url, baseurl) {
    if(base) {
      base.href = baseurl;
      head.insertBefore(base, head.firstChild);
    }
    a.href = url;
    var abs = a.cloneNode(false).href;
    /* Stupid old IE requires cloning
       https://stackoverflow.com/a/24437713/1529630 */
    if(base)
      head.removeChild(base);
    return abs;
  };
})();
var base = document.getElementById('base'),
    url = document.getElementById('url'),
    abs = document.getElementById('absolute');
base.onpropertychange = url.onpropertychange = function() {
  if (event.propertyName == "value")
    update();
};
(base.oninput = url.oninput = update)();
function update() {
  abs.value = resolveURL(url.value, base.value);
}
label {
  display: block;
  margin: 1em 0;
}
input {
  width: 100%;
}
<label>
  Base url:
  <input id="base" value="http://example.com/images//foo////bar/baz"
         placeholder="Enter your base url here" />
</label>
<label>
  URL to be resolved:
  <input id="url" value="./a/b/../c"
         placeholder="Enter your URL here">
</label>
<label>
  Resulting url:
  <input id="absolute" readonly>
</label>
Source: https://stackoverflow.com/questions/29122106/what-happens-if-base-href-is-set-with-a-double-slash