I'm trying to write a short method that parses a string and extracts the first word, and I have been looking for the best way to do this.
I assume I would use s
You could also use StringTokenizer: http://download.oracle.com/javase/6/docs/api/java/util/StringTokenizer.html
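As a rough sketch of the StringTokenizer approach (the class and method names here are made up for illustration, and returning an empty string when there is no token is my own assumption about what you'd want):

```java
import java.util.StringTokenizer;

final class TokenizerFirstWord {
    // Returns the first whitespace-delimited token, or "" if there is none.
    static String firstWord(String text) {
        // The one-argument constructor uses the default delimiters:
        // space, tab, newline, carriage return and form feed.
        StringTokenizer tokens = new StringTokenizer(text);
        return tokens.hasMoreTokens() ? tokens.nextToken() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstWord("  the quick brown fox"));
    }
}
```

Leading delimiters are skipped, so the padded input above still yields "the". Note that the Javadoc describes StringTokenizer as a legacy class and recommends split or java.util.regex for new code.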
For those who are searching for a Kotlin solution:
val delimiter = " "
val mFullname = "Mahendra Rajdhami"
val greetingName = mFullname.substringBefore(delimiter) // "Mahendra"
I know this question has been answered already, but for those still searching, here is another solution that fits on one line. It uses split but keeps only the first element.
String test = "123_456";
String value = test.split("_")[0];
System.out.println(value);
The output will show:
123
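The second parameter of the split method is optional. If specified, it caps the number of pieces returned: with a limit of N, the target string is split at most N - 1 times, and the last piece holds the unsplit remainder.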
For example:
String mystring = "the quick brown fox";
String[] arr = mystring.split(" ", 2);
String firstWord = arr[0]; // the
String theRest = arr[1];   // quick brown fox
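One thing to watch with the snippet above: if the string contains no space, the array has only one element and arr[1] throws an ArrayIndexOutOfBoundsException. A possible guarded version follows; the helper name, the trim() call and the choice of "\\s+" as delimiter are my assumptions, not part of the original answer:

```java
final class SplitFirstWord {
    // Returns a two-element array: { first word, remainder }.
    // The remainder is "" when the input holds a single word or is blank.
    static String[] firstAndRest(String text) {
        // trim() drops leading whitespace so the first piece is not empty;
        // the limit of 2 means the pattern is applied at most once.
        String[] parts = text.trim().split("\\s+", 2);
        String first = parts[0];
        String rest = parts.length > 1 ? parts[1] : "";
        return new String[] { first, rest };
    }

    public static void main(String[] args) {
        String[] result = firstAndRest("the quick brown fox");
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```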
Alternatively, you could use the substring method of String.
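A minimal sketch of the substring route, assuming a single space character is the delimiter (the class and method names are only illustrative):

```java
final class SubstringFirstWord {
    // Returns everything before the first space, or the whole
    // (trimmed) string when it contains no space at all.
    static String firstWord(String text) {
        String trimmed = text.trim();
        int space = trimmed.indexOf(' ');
        return space == -1 ? trimmed : trimmed.substring(0, space);
    }

    public static void main(String[] args) {
        System.out.println(firstWord("the quick brown fox"));
    }
}
```

This avoids regular expressions and the intermediate array, at the cost of only recognising a plain space as the separator.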
None of these answers appears to define what the OP might mean by a "word". As others have already said, a "word boundary" may be a comma, and certainly can't be counted on to be a space, or even "white space" (i.e. also tabs, newlines, etc.)
At the simplest, I'd say the word has to consist of any Unicode letters, and any digits. Even this may not be right: a String may not qualify as a word if it contains numbers, or starts with a number. Furthermore, what about hyphens, or apostrophes, of which there are presumably several variants in the whole of Unicode? All sorts of discussions of this kind and many others will apply not just to English but to all other languages, including non-human language, scientific notation, etc. It's a big topic.
But a start might be this (NB written in Groovy):
String givenString = "one two9 thr0ee four"
// String givenString = "oňňÜÐæne;:tŵo9===tĥr0eè? four!"
// String givenString = "mouse"
// String givenString = "&&^^^%"
String[] substrings = givenString.split( '[^\\p{L}^\\d]+' )
println "substrings |$substrings|"
println "first word |${substrings[0]}|"
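This works OK for the first, second and third givenString values. For "&&^^^%" it says that the first "word" is a zero-length string, and the second is "^^^". The "^^^" appears because the second ^ inside the character class is a literal caret rather than a negation, so carets count as word characters; dropping it (giving [^\p{L}\d]+) would make them delimiters too. The leading zero-length token is String.split's way of saying "your given String starts not with a token but a delimiter".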
NB in regex \p{L} means "any Unicode letter". The parameter of String.split is of course what defines the "delimiter pattern"... i.e. a clump of characters which separates tokens.
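In plain Java, one way to sidestep the leading-empty-token quirk is to match the word itself instead of splitting on delimiters. The sketch below assumes the same working definition of a "word" as above (a run of Unicode letters and decimal digits); the class name and the empty-string fallback are my own choices:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexFirstWord {
    // One or more Unicode letters (\p{L}) or decimal digits (\p{Nd}).
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Returns the first run of letters/digits, or "" if there is none.
    static String firstWord(String text) {
        Matcher matcher = WORD.matcher(text);
        return matcher.find() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstWord("one two9 thr0ee four"));
        System.out.println("|" + firstWord("&&^^^%") + "|");
    }
}
```

Because find() skips any leading punctuation, an input made up only of delimiters simply yields an empty string rather than a surprising token.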
NB2 Performance issues are irrelevant for a discussion like this, and almost certainly for all contexts.
NB3 My first port of call was Apache Commons' StringUtils class. They are likely to have the most effective and best engineered solutions for this sort of thing. But nothing jumped out... https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html ... although something of use may be lurking there.
You could use a Scanner: http://download.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html
The scanner can also use delimiters other than whitespace. This example reads several items in from a string:
String input = "1 fish 2 fish red fish blue fish";
Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
System.out.println(s.nextInt());
System.out.println(s.nextInt());
System.out.println(s.next());
System.out.println(s.next());
s.close();
prints the following output:
1
2
red
blue
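Applied to the original question, the default whitespace delimiter is already enough to pull out the first word. A small sketch (the wrapper class and the empty-string fallback are assumptions on my part):

```java
import java.util.Scanner;

final class ScannerFirstWord {
    // Returns the first whitespace-delimited token, or "" if there is none.
    static String firstWord(String text) {
        // try-with-resources closes the Scanner when the method returns.
        try (Scanner scanner = new Scanner(text)) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        System.out.println(firstWord("the quick brown fox"));
    }
}
```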