International journals 2023
Issue 39 (2023), pages 101–124
Journal: Journal of Computer Science and Cybernetics

Big data processing attracts many researchers who aim to process large-scale datasets and extract useful information to support decision-making. One of the biggest challenges is querying large datasets, and it becomes even more complicated with similarity queries than with exact-match queries. The fuzzy join is a typical operation frequently used in similarity queries and big data analysis. There is currently very little research on this issue, which poses a significant barrier to improving the efficiency of query operations on big data. This study therefore surveys similarity algorithms for fuzzy joins, in which the values of the join key attributes may differ slightly, within a fuzzy threshold.
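To make the notion of a fuzzy threshold concrete, here is a minimal sketch of a fuzzy join in plain Python. It is not the Spark implementation evaluated in the study: the nested-loop strategy, the function names `levenshtein` and `fuzzy_join`, and the sample records are all assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_join(left, right, threshold):
    """Naive nested-loop join: keep pairs whose keys differ by at most
    `threshold` edits. Each input is a list of (key, value) tuples."""
    return [
        (lk, lv, rk, rv)
        for lk, lv in left
        for rk, rv in right
        if levenshtein(lk, rk) <= threshold
    ]


# Hypothetical sample data, not taken from the study.
left = [("Jon Smith", 1), ("Alice", 2)]
right = [("John Smith", "a"), ("Alicia", "b"), ("Bob", "c")]
matches = fuzzy_join(left, right, threshold=1)
```

An exact-match join on these keys would return nothing, whereas a threshold of 1 pairs "Jon Smith" with "John Smith". Raising the threshold admits more pairs, which is where the trade-off between richer output and false positives comes from.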
We analyze six similarity algorithms, namely Hamming, Levenshtein, LCS, Jaccard, Jaro, and Jaro-Winkler, and compare them on three criteria: output enrichment, false positives/negatives, and processing time. The fuzzy join experiments are implemented in Spark, a popular big data processing platform. The algorithms are divided into two groups for evaluation: group 1 (Hamming, Levenshtein, and LCS) and group 2 (Jaccard, Jaro, and Jaro-Winkler). In the former, Levenshtein has an advantage over the other two algorithms in terms of output enrichment and accuracy of the result set (false positives/negatives), with acceptable processing time. In the latter, Jaccard is the worst algorithm on all three criteria, while Jaro-Winkler yields greater output enrichment and higher accuracy in the result set. This overview of similarity algorithms will help users choose the most suitable algorithm for their problems.
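As a rough illustration of how some of these measures differ, the sketch below implements Hamming distance, LCS length, and Jaccard similarity in plain Python. The abstract does not say how the study tokenizes strings for Jaccard, so the character-bigram choice here is an assumption, as are the function names. Jaro and Jaro-Winkler are omitted for brevity.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, two-row DP."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity over sets of character n-grams
    (bigrams by default; this tokenization is an assumption)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to tokenize count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Note that Hamming only applies to keys of equal length, which limits how many pairs it can match, and that Jaccard ignores character order beyond the n-gram window. The distances would be compared against a maximum threshold in a join predicate, while Jaccard, being a similarity, would be compared against a minimum.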


