Java: Count Number Of Word Occurrence In String

Republished by Plato

Introduction

Counting the number of word occurrences in a string is a fairly easy task, but there are several ways to do it. You also have to account for the efficiency of each method, since you’ll typically reach for automated tools precisely when manual work isn’t practical, i.e. when the search space is large.

In this guide, you’ll learn how to count the number of word occurrences in a string in Java:

String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
String targetWord = "wants";

We’ll search for the number of occurrences of the targetWord, using String.split(), Collections.frequency() and Regular Expressions.

Count Word Occurrences in String with String.split()

The simplest way to count the occurrences of a target word in a string is to split the string on each word and iterate through the resulting array, incrementing a wordCount on each match. Note that when a word has any kind of punctuation around it, such as wants. at the end of the sentence, the simple word-level split will correctly treat wants and wants. as separate words!
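
To make the punctuation issue concrete, here is a minimal sketch that splits the example sentence without any clean-up. The class name NaiveSplitDemo and the method countNaive() are illustrative names, not part of the original guide:

```java
import java.util.Arrays;

class NaiveSplitDemo {
    // Counts exact matches of target among the space-separated tokens of text,
    // without stripping punctuation first (hypothetical helper for illustration).
    static int countNaive(String text, String target) {
        int count = 0;
        for (String word : text.split(" ")) {
            if (word.equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        // The tokens still carry their punctuation, e.g. "chrome," and "wants."
        System.out.println(Arrays.toString(searchText.split(" ")));
        // Only the first "wants" is an exact match; the last token is "wants."
        System.out.println(countNaive(searchText, "wants")); // prints 1
    }
}
```

The count comes out as 1 rather than 2, because the final token keeps its trailing period.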

To work around this, you can easily remove all punctuation from the sentence before splitting it:

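// \p{Punct} matches ASCII punctuation; the backslash must be escaped inside a Java string literal
String[] words = searchText.replaceAll("\\p{Punct}", "").split(" ");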

int wordCount = 0;
for (int i=0; i < words.length; i++)
    if (words[i].equals(targetWord))
        wordCount++;
System.out.println(wordCount);

In the for loop, we simply iterate through the array, checking whether the element at each index is equal to the targetWord. If it is, we increment wordCount, which at the end of the execution prints:

2
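
The benchmark later in this guide calls a method named countOccurencesWithSplit() without showing its body. One plausible shape for it, as a sketch that simply wraps the snippet above (the class name SplitCounter and the exact signature are assumptions), might be:

```java
class SplitCounter {
    // Assumed helper: strips ASCII punctuation, splits on single spaces,
    // and counts the tokens that are exactly equal to targetWord.
    static int countOccurencesWithSplit(String searchText, String targetWord) {
        String[] words = searchText.replaceAll("\\p{Punct}", "").split(" ");
        int wordCount = 0;
        for (String word : words) {
            if (word.equals(targetWord)) {
                wordCount++;
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        System.out.println(countOccurencesWithSplit(searchText, "wants")); // prints 2
    }
}
```

The comparison is case-sensitive, so searching for "it" would not count the capitalized "It".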

Count Word Occurrences in String with Collections.frequency()

The Collections.frequency() method provides a much cleaner, higher-level implementation, which abstracts away a simple for loop and checks for both identity (whether an object is another object) and equality (whether an object is equal to another object, depending on the qualitative features of that object).

The frequency() method accepts a collection to search through (a list, in our case) and the target object, and works for all other objects as well, where the behavior depends on how the object itself implements equals(). In the case of strings, equals() checks the contents of the string:


searchText = searchText.replaceAll("\\p{Punct}", "");

int wordCount = Collections.frequency(Arrays.asList(searchText.split(" ")), targetWord);
System.out.println(wordCount);

Here, we’ve converted the array obtained from split() into a Java List (a fixed-size list backed by the array), using the asList() helper method of the Arrays class. The frequency() reduction operation returns an integer denoting the frequency of targetWord in the list, and results in:

2
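
As a sketch of how this could be packaged, the following wraps the snippet in a method named after the countOccurencesWithCollections() call used in the benchmark below (its body is not shown in this guide, so this is an assumption, as is the class name FrequencyCounter). It also illustrates that the match is based on equals(): a distinct String object with the same contents is counted too.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FrequencyCounter {
    // Assumed helper: strips ASCII punctuation, splits on spaces,
    // and lets Collections.frequency() count the equal elements.
    static int countOccurencesWithCollections(String searchText, String targetWord) {
        List<String> words = Arrays.asList(searchText.replaceAll("\\p{Punct}", "").split(" "));
        return Collections.frequency(words, targetWord);
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        System.out.println(countOccurencesWithCollections(searchText, "wants")); // prints 2

        // A separately allocated String with the same contents still matches,
        // because String.equals() compares contents rather than references.
        String sameContents = new String("wants");
        System.out.println(countOccurencesWithCollections(searchText, sameContents)); // prints 2
    }
}
```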

Word Occurrences in String with Matcher (Regular Expressions – RegEx)

Finally, you can use Regular Expressions to search for patterns, and count the number of matched patterns. Regular Expressions are made for this, so it’s a very natural fit for the task. In Java, the Pattern class is used to represent and compile Regular Expressions, and the Matcher class is used to find and match patterns.

Using RegEx, we can code the punctuation invariance into the expression itself, so there’s no need to externally format the string or remove punctuation, which is preferable for large texts where storing another altered version in memory might be expensive:

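// \b is a word boundary; (?!\w) asserts that no word character follows the match.
// String.format() is static, so the pattern string is passed as its first argument.
Pattern pattern = Pattern.compile(String.format("\\b%s(?!\\w)", targetWord));

// For targetWord = "wants", this is equivalent to Pattern.compile("\\bwants(?!\\w)")
Matcher matcher = pattern.matcher(searchText);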

int wordCount = 0;
while (matcher.find())
    wordCount++;

System.out.println(wordCount);

This also results in:

2
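
If the target word may contain characters that are special in a regular expression, it can help to escape it before building the pattern. The sketch below is one way to do that, using Pattern.quote() from the standard library; the class name RegexCounter is illustrative, and the method is named after the countOccurencesWithRegex() call in the benchmark below, whose body is not shown in this guide:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCounter {
    // Assumed helper: counts matches of targetWord that start at a word boundary
    // and are not followed by another word character.
    static int countOccurencesWithRegex(String searchText, String targetWord) {
        // Pattern.quote() makes the target match literally, even if it contains + or .
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(targetWord) + "(?!\\w)");
        Matcher matcher = pattern.matcher(searchText);
        int wordCount = 0;
        while (matcher.find()) {
            wordCount++;
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        System.out.println(countOccurencesWithRegex(searchText, "wants")); // prints 2
        System.out.println(countOccurencesWithRegex("I like c++ and c++.", "c++")); // prints 2
    }
}
```

Note that the original string is searched as-is, so no punctuation-free copy is created.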

Efficiency Benchmark

So, which is the most efficient? Let’s run a small benchmark:

int runs = 100000;

long start1 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithSplit(searchText, targetWord);
}

long end1 = System.currentTimeMillis();
System.out.println(String.format("Array split approach took: %s miliseconds", end1-start1));

long start2 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithCollections(searchText, targetWord);
}

long end2 = System.currentTimeMillis();
System.out.println(String.format("Collections.frequency() approach took: %s miliseconds", end2-start2));

long start3 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithRegex(searchText, targetWord);
}

long end3 = System.currentTimeMillis();
System.out.println(String.format("Regex approach took: %s miliseconds", end3-start3));

Each method will be run 100000 times (the higher the number, the lower the variance and the share of the results that is due to chance, thanks to the law of large numbers). Running this code results in:

Array split approach took: 152 miliseconds
Collections.frequency() approach took: 140 miliseconds
Regex approach took: 92 miliseconds
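
Timings taken this way can be influenced by JVM warm-up, so the numbers are best read as rough indications. As a hedged sketch, a small timing helper with an untimed warm-up phase could look like the following. TimingSketch, timeMillis() and countWithSplit() are hypothetical names, and the warm-up count of 10,000 is an arbitrary assumption; for serious measurements a dedicated harness such as JMH is the usual choice.

```java
import java.util.function.ToIntBiFunction;

class TimingSketch {
    // Hypothetical stand-in for any of the three counting methods.
    static int countWithSplit(String text, String word) {
        int count = 0;
        for (String token : text.replaceAll("\\p{Punct}", "").split(" ")) {
            if (token.equals(word)) {
                count++;
            }
        }
        return count;
    }

    // Runs the counter `runs` times after an untimed warm-up and returns elapsed milliseconds.
    static long timeMillis(ToIntBiFunction<String, String> counter,
                           String text, String word, int runs) {
        long sink = 0; // accumulate results so the calls are not trivially unused
        for (int i = 0; i < 10_000; i++) {
            sink += counter.applyAsInt(text, word); // warm-up, not timed
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink += counter.applyAsInt(text, word);
        }
        long elapsedNanos = System.nanoTime() - start;
        if (sink < 0) {
            System.out.println(sink); // never true for non-negative counts; keeps sink in use
        }
        return elapsedNanos / 1_000_000;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        long ms = timeMillis(TimingSketch::countWithSplit, searchText, "wants", 100_000);
        System.out.println("Array split approach took: " + ms + " milliseconds");
    }
}
```

The same helper can be reused for each method by passing a different method reference.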

However – what happens if we make the search more computationally expensive by making it larger? Let’s generate a synthetic sentence:

List<String> possibleWords = Arrays.asList("hello", "world ");
StringBuffer searchTextBuffer = new StringBuffer();

for (int i = 0; i < 100; i++) {
    searchTextBuffer.append(String.join(" ", possibleWords));
}
System.out.println(searchTextBuffer);

This creates a string with the contents:

hello world hello world hello world hello ...
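
To see how many matches this synthetic text contains, here is a self-contained sketch that builds the same string and counts both words with the split-based approach. The class and method names (SyntheticTextDemo, buildText(), count()) are illustrative, and StringBuilder is swapped in for StringBuffer since no synchronization is needed here:

```java
import java.util.Arrays;
import java.util.List;

class SyntheticTextDemo {
    // Builds "hello world " repeated the given number of times.
    static String buildText(int repetitions) {
        List<String> possibleWords = Arrays.asList("hello", "world ");
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < repetitions; i++) {
            builder.append(String.join(" ", possibleWords));
        }
        return builder.toString();
    }

    // Counts exact matches of target among the space-separated tokens.
    static int count(String text, String target) {
        int count = 0;
        for (String word : text.replaceAll("\\p{Punct}", "").split(" ")) {
            if (word.equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = buildText(100);
        System.out.println(count(text, "hello")); // prints 100
        System.out.println(count(text, "world")); // prints 100
    }
}
```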

Now, if we were to search for either “hello” or “world” – there’d be many more matches than the two from before. How do our methods do now in the benchmark?

Array split approach took: 606 miliseconds
Collections.frequency() approach took: 899 miliseconds
Regex approach took: 801 miliseconds

Now, array splitting comes out fastest! In general, benchmarks depend on various factors – such as the search space, the target word, etc. and your personal use case might be different from the benchmark.

Note: Try the methods on your own text, note the times, and pick the one that is the most efficient and elegant for you.

Conclusion

In this short guide, we’ve taken a look at how to count word occurrences for a target word, in a string in Java. We’ve started out by splitting the string and using a simple counter, followed by using the Collections helper class, and finally, using Regular Expressions.

In the end, we’ve benchmarked the methods, and noted that the performance isn’t linear, and depends on the search space. For longer input texts with many matches, splitting arrays seems to be the most performant. Try all three methods on your own, and pick the most performant one.

Timestamp: September 21, 2022 / October 8, 2022