Reference: This is a self-answered question. It was meant to share the knowledge, Q&A style.
How do I detect the ty
My answer, because I could make neither ohaal's nor transilvlad's work, is:
function detect_newline_type($content) {
$arr = array_count_values(
explode(
' ',
preg_replace(
'/[^\r\n]*(\r\n|\n|\r)/',
'\1 ',
$content
)
)
);
arsort($arr);
return key($arr);
}
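As a quick sanity check, the function can be run against small hand-made samples. This is only a sketch: the sample strings are made up for illustration, and the function body is repeated so the block runs on its own.

```php
<?php
// Same function as above, repeated so this block is self-contained.
function detect_newline_type($content) {
    $arr = array_count_values(
        explode(
            ' ',
            preg_replace('/[^\r\n]*(\r\n|\n|\r)/', '\1 ', $content)
        )
    );
    arsort($arr);
    return key($arr);
}

// Made-up samples, one per newline style.
$windows = "line one\r\nline two\r\nline three\r\n";
$unix    = "line one\nline two\nline three\n";
$oldMac  = "line one\rline two\rline three\r";

// bin2hex() makes the invisible characters readable.
echo bin2hex(detect_newline_type($windows)), PHP_EOL; // 0d0a
echo bin2hex(detect_newline_type($unix)), PHP_EOL;    // 0a
echo bin2hex(detect_newline_type($oldMac)), PHP_EOL;  // 0d
```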
The general idea in both proposed solutions is good, but implementation details hinder the usefulness of those answers.
Indeed, the point of this function is to return the kind of newline used in a file, and that newline can be either one or two characters long.
This alone renders the use of str_split() incorrect. The only way to cut the tokens correctly is to use a function that splits a string into pieces of variable length, based on character detection instead. That is where explode() comes into play.
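To make the difference concrete, here is a small sketch comparing the two functions on the newlines of a Windows-style text. The two input strings are written by hand for illustration: one is the raw run of newline characters, the other has a space marker after each newline.

```php
<?php
// Newlines taken from a three-line CR+LF text, written out by hand.
$run    = "\r\n\r\n\r\n";     // raw run of newline characters
$marked = "\r\n \r\n \r\n ";  // same newlines, each followed by a space marker

// str_split() cuts fixed-size pieces, so every CR+LF pair is torn apart.
$chars = str_split($run);

// explode() cuts at the marker, so each token keeps its full length.
$tokens = explode(' ', $marked);

echo count($chars), PHP_EOL;              // 6
echo count($tokens), PHP_EOL;             // 4 (three newlines plus a trailing empty string)
echo bin2hex($tokens[0]), PHP_EOL;        // 0d0a
```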
But to give useful markers to explode, it is necessary to replace the right characters, in the right amount, by the right match. And most of the magic happens in the regular expression.
Three points have to be considered:
1. Using .* as suggested by ohaal will not work. While it is true that . will not match newline characters, on a system where \r is not a newline character, or part of a newline character, . will match it incorrectly (reminder: we are detecting newlines because they could be different from the ones on our system. Otherwise there is no point).
2. Replacing /[^\r\n]*/ with anything will "work" to make the text vanish, but will be an issue as soon as we want to have a separator (since we remove all characters but the newlines, any character that isn't a newline will be a valid separator).
3. Hence the idea to create a match with the newline, and use a backreference to that match in the replacement.
If you care just about LF/CRs here is a method I wrote. No need to treat all possible cases of files you'll never ever see.
/**
* @param string $path
* @param string $format real or human_readable
* @return false|string
* @author Sorin-Iulian Trimbitas
*/
public static function getLineBreak(string $path, $format = 'real')
{
// Hopefully my idea is ok, the rest of the stuff from the internet doesn't seem to work ok in some cases
// 1. Take the first line of the CSV
$file = new \SplFileObject($path);
$line = $file->getCurrentLine();
// Do we have an empty line?
if (mb_strlen($line) == 1) {
// Try the next line
$file->next();
$line = $file->getCurrentLine();
if (mb_strlen($line) == 1) {
// Give up
return false;
}
}
// What do we have at its end?
$last_char = mb_substr($line, -1);
$penultimate_char = mb_substr($line, -2, 1);
if ($last_char == "\n" || $last_char == "\r") {
$real_format = $last_char;
if ($penultimate_char == "\n" || $penultimate_char == "\r") {
$real_format = $penultimate_char.$real_format;
}
if ($format == 'real') {
return $real_format;
}
return str_replace(["\n", "\r"], ['LF', 'CR'], $real_format);
}
return false;
}
Wouldn't it be easier to just replace everything except new lines using regex?
The dot matches a single character, without caring what that character is. The only exceptions are newline characters.
With that in mind, we do some magic:
$string = 'some string with new lines';
$newlines = preg_replace('/.*/', '', $string);
// $newlines is now filled with new lines, we only need one
$newline = substr($newlines, 0, 1);
Not sure if we can trust regex to do all this, but I don't have anything to test with.
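For anyone who wants to test it, here is a minimal check of the snippet against a Windows-style string. It assumes PCRE treats only LF as a newline, which the (*LF) prefix forces so that the outcome does not depend on how PCRE was built. The sample string is made up.

```php
<?php
// Windows-style sample text: every line ends with CR+LF.
$string = "first\r\nsecond\r\n";

// (*LF) pins PCRE's newline convention to LF for this pattern.
$newlines = preg_replace('/(*LF).*/', '', $string);

// The "\r" bytes were matched by the dot and removed along with the text.
echo bin2hex($newlines), PHP_EOL;  // 0a0a

// Taking one character therefore reports LF, not CR+LF.
$newline = substr($newlines, 0, 1);
echo bin2hex($newline), PHP_EOL;   // 0a
```

Under that assumption the carriage returns are lost, so a CR+LF file would be reported as LF.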
The answers already given here provide enough information. The following code (based on those answers) might help even more:
/**
Newline characters in different Operating Systems
The names given to the different sequences are:
============================================================================================
NewL   Chars        Name      Description
-----  -----------  --------  ------------------------------------------------------------------
LF     0x0A         UNIX      Apple OSX, UNIX, Linux
CR     0x0D         TRS80     Commodore, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, etc
LFCR   0x0A 0x0D    ACORN     Acorn BBC and RISC OS spooled text output.
CRLF   0x0D 0x0A    WINDOWS   Microsoft Windows, DEC TOPS-10, RT-11 and most other early non-Unix
                              and non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC DOS, etc.), OS/2
-----  -----------  --------  ------------------------------------------------------------------
*/
const EOL_UNIX = 'lf'; // Code: \n
const EOL_TRS80 = 'cr'; // Code: \r
const EOL_ACORN = 'lfcr'; // Code: \n \r
const EOL_WINDOWS = 'crlf'; // Code: \r \n
Then use the following code in a static utility class (named Util below) to detect the EOL:
/**
Detects the end-of-line character of a string.
@param string $str The string to check.
@param string $key [io] Name of the detected eol key.
@return string The detected EOL, or default one.
*/
public static function detectEOL($str, &$key) {
static $eols = array(
Util::EOL_ACORN => "\n\r", // 0x0A - 0x0D - acorn BBC
Util::EOL_WINDOWS => "\r\n", // 0x0D - 0x0A - Windows, DOS OS/2
Util::EOL_UNIX => "\n", // 0x0A - - Unix, OSX
Util::EOL_TRS80 => "\r", // 0x0D - - Apple ][, TRS80
);
$key = "";
$curCount = 0;
$curEol = '';
foreach($eols as $k => $eol) {
if( ($count = substr_count($str, $eol)) > $curCount) {
$curCount = $count;
$curEol = $eol;
$key = $k;
}
}
return $curEol;
} // detectEOL
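The order of the entries in $eols matters, because a CR+LF text contains exactly as many "\n" and "\r" as "\r\n". The sketch below (a made-up sample string, with the same keys and the same strict > comparison) shows the counts and why the two-character sequence wins the tie.

```php
<?php
// Made-up Windows-style sample.
$sample = "first\r\nsecond\r\nthird\r\n";

// Same order as in detectEOL(): two-character sequences first.
$eols = array(
    'lfcr' => "\n\r",
    'crlf' => "\r\n",
    'lf'   => "\n",
    'cr'   => "\r",
);

$counts = array();
$best = '';
$bestCount = 0;
foreach ($eols as $k => $eol) {
    $counts[$k] = substr_count($sample, $eol);
    // Strict ">" keeps the earliest entry when several counts are equal.
    if ($counts[$k] > $bestCount) {
        $bestCount = $counts[$k];
        $best = $k;
    }
}

print_r($counts); // lfcr => 0, crlf => 3, lf => 3, cr => 3
echo $best, PHP_EOL; // crlf
```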
and then for a file:
/**
Detects the EOL of a file by checking the first line.
@param string $fileName File to be tested (full pathname).
@return string|false false on failure, otherwise the detected key: one of 'cr', 'lf', 'crlf', 'lfcr'.
@uses detectEOL
*/
public static function detectFileEOL($fileName) {
if (!file_exists($fileName)) {
return false;
}
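// Open the file and read its first line
$handle = @fopen($fileName, "r");
if ($handle === false) {
return false;
}
$line = fgets($handle);
fclose($handle);
if ($line === false) {
// Empty or unreadable file
return false;
}
$key = "";
<Your-Class-Name>::detectEOL($line, $key);
return $key;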
} // detectFileEOL
Replace <Your-Class-Name> with the name of your implementation class (all members are static).
/**
* Detects the end-of-line character of a string.
* @param string $str The string to check.
* @param string $default Default EOL (if not detected).
* @return string The detected EOL, or default one.
*/
function detectEol($str, $default=''){
// Escapes need double quotes; the \u{...} form requires PHP 7.0+ and yields UTF-8 bytes.
static $eols = array(
"\u{000D}\u{000A}", // [UNICODE] CR+LF: CR (U+000D) followed by LF (U+000A)
"\u{000A}", // [UNICODE] LF: Line Feed, U+000A
"\u{000B}", // [UNICODE] VT: Vertical Tab, U+000B
"\u{000C}", // [UNICODE] FF: Form Feed, U+000C
"\u{000D}", // [UNICODE] CR: Carriage Return, U+000D
"\u{0085}", // [UNICODE] NEL: Next Line, U+0085
"\u{2028}", // [UNICODE] LS: Line Separator, U+2028
"\u{2029}", // [UNICODE] PS: Paragraph Separator, U+2029
"\x0D\x0A", // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
"\x0A\x0D", // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output.
"\x0A", // [ASCII] LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
"\x0D", // [ASCII] CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
"\x1E", // [ASCII] RS: QNX (pre-POSIX)
//"\x76", // [?????] NEWLINE: ZX80, ZX81 [DEPRECATED]
"\x15", // [EBCDIC] NEL: OS/390, OS/400
);
$cur_cnt = 0;
$cur_eol = $default;
foreach($eols as $eol){
if(($count = substr_count($str, $eol)) > $cur_cnt){
$cur_cnt = $count;
$cur_eol = $eol;
}
}
return $cur_eol;
}
Notes:
mb_detect_eol() (multibyte) and detect_eol()
Based on ohaal's answer.
This can return one or two characters for the EOL, such as LF or CR+LF.
$eols = array_count_values(str_split(preg_replace("/[^\r\n]/", "", $string)));
$eola = array_keys($eols, max($eols));
$eol = implode("", $eola);
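A minimal way to try this out is to wrap the three lines in a function. The name detect_eol_simple and the sample strings are hypothetical, chosen only for this sketch.

```php
<?php
// Hypothetical wrapper; the body is the three-line snippet above.
function detect_eol_simple(string $string): string
{
    // Count how often each newline character occurs.
    $eols = array_count_values(str_split(preg_replace("/[^\r\n]/", "", $string)));
    // Keep the character(s) with the highest count, in order of first appearance.
    $eola = array_keys($eols, max($eols));
    return implode("", $eola);
}

echo bin2hex(detect_eol_simple("one\r\ntwo\r\n")), PHP_EOL; // 0d0a
echo bin2hex(detect_eol_simple("one\ntwo\n")), PHP_EOL;     // 0a
echo bin2hex(detect_eol_simple("one\rtwo\r")), PHP_EOL;     // 0d
```

CR+LF is reported as two characters only because "\r" and "\n" occur equally often. The snippet also does not guard against input without any line break, so callers may want to check for that first.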