shacharmirkin.github.io

Shachar Mirkin’s page

I’m a data scientist / applied researcher with extensive hands-on experience in Natural Language Processing & Machine Learning, in both academia and industry, including 6 years as a data science team leader.

My PhD thesis addressed the use of context for natural language inference, and during my postdoc I mostly worked on multilingual tasks, notably machine translation. I’ve published multiple academic articles in top-tier venues, as well as several patents (Google Scholar).

Among the many tasks I’ve worked on over the years are machine translation, sentiment analysis, information extraction, object detection, location classification, automatic punctuation and text classification, across different languages and types of text, including spoken and user-generated data. I’ve addressed these tasks with various approaches, from rules through classic machine learning to deep learning, with real-time inference in production.

Currently, I’m mostly interested in developing practical ML solutions for real-world tasks, choosing the right solution for each product according to its specific characteristics and constraints.

My LinkedIn profile


Academic research interests

Computational argumentation and debating

Selected publications:

Datasets:

Personalized Machine Translation (PMT)

Machine Translation has advanced in recent years to produce better translations for clients’ specific domains, and sophisticated tools allow translators to obtain translations that follow their prior edits. We suggest that MT should be further personalized to the level of the end-user – the receiver or the author of the text – as is done in other applications. Language use is known to be influenced by personality traits as well as by demographic characteristics such as age or mother tongue; as a result, these traits of the author can be automatically identified from her texts. To provide the most faithful translation, and to allow user modeling based on translations, we posit that machine translation should be personalized. PMT for readers of the translations can take the reader’s translational preferences into account, as reflected, e.g., in complexity or style.

Selected publications:

Datasets:

Model-aware improvement of source translatability for MT

Some source texts are more difficult to translate than others. One way to handle such texts is to modify them prior to translation (known as pre-editing). A prominent but often overlooked factor is the translatability of the source with respect to the specific translation system and model being used. Our research aims to improve source translatability either automatically or through interactive tools that enable monolingual speakers of the source language to obtain better translations.

Selected publications:

Semantic inference / Textual entailment

Textual Entailment (TE) is a popular paradigm for modeling semantic inference. The core TE task, textual entailment recognition, is to determine whether the meaning of one text can be inferred (entailed) from another. My textual entailment research focused mostly on understanding entailment in context, dealing with either lexical ambiguity or discourse-based interpretation, but it also addressed the acquisition of lexical entailment relationships and the use of TE in various applications (e.g. SMT, as in several of the works above).
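As a toy illustration of the recognition task only (not any of the systems described here), the sketch below labels a text–hypothesis pair with a naive word-overlap heuristic; the threshold and the heuristic itself are illustrative assumptions, and it is precisely the context-sensitive cases that such baselines miss.

```python
def entails(text: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Toy TE baseline: predict entailment if most hypothesis words
    are covered by the text (illustrative heuristic only)."""
    t_words = set(text.lower().split())
    h_words = set(hypothesis.lower().split())
    coverage = len(h_words & t_words) / len(h_words)
    return coverage >= threshold

# Every hypothesis word appears in the text, so this pair is
# (correctly) judged as entailing under the crude heuristic:
print(entails("the company acquired the startup in 2010",
              "the company acquired the startup"))  # True
```

Real TE systems replace this heuristic with rich lexical, syntactic and contextual features, or with neural models trained on entailment data.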

Selected publications:

SMT domain adaptation

Data selection is a common technique for adapting statistical translation models to a specific domain, and it has been shown both to improve translation quality and to reduce model size. Selection typically relies on in-domain data – data from the same domain as the texts expected to be translated – choosing from a pool of parallel texts the sentence pairs most similar to it. Yet this approach risks limited coverage: necessary n-grams may appear in the pool only in sentences that are less similar to the in-domain data available in advance. Our research aims to bridge these two potentially conflicting considerations while producing compact translation models.
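A standard similarity-based selection criterion in this line of work is the cross-entropy difference of Moore and Lewis; the sketch below illustrates that generic criterion with toy add-one-smoothed unigram language models (the function names and the unigram simplification are my assumptions here, not the specific method of this research).

```python
import math
from collections import Counter

def unigram_logprob(sentence, counts, total, vocab):
    # Average add-one-smoothed unigram log-probability per word
    words = sentence.split()
    lp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return lp / max(len(words), 1)

def select(pool, in_domain, general, k):
    """Rank pool sentences by cross-entropy difference: prefer
    sentences likely under the in-domain LM and unlikely under
    the general LM (lower score = more in-domain-like)."""
    def lm(corpus):
        counts = Counter(w for s in corpus for w in s.split())
        return counts, sum(counts.values())
    in_c, in_t = lm(in_domain)
    gen_c, gen_t = lm(general)
    vocab = len(set(in_c) | set(gen_c))
    scored = [(-unigram_logprob(s, in_c, in_t, vocab)
               + unigram_logprob(s, gen_c, gen_t, vocab), s) for s in pool]
    return [s for _, s in sorted(scored)[:k]]
```

With, say, medical in-domain sentences and news as the general corpus, a pool sentence sharing medical vocabulary ranks ahead of a news-like one. The coverage risk noted above is visible here too: a pool sentence carrying a needed rare n-gram can still score poorly if its other words look out-of-domain.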

Selected publications:


Academic service

Program committee member / reviewer:

W-NUT 2022 // ARR April 2022 / ACL 2022 (ARR) // W-NUT 2021 // EMNLP 2021 // EACL 2021 // COLING 2020 // *SEM 2020 // EMNLP 2020 // ACL 2020 // LREC 2020 // W-NUT 2019 // ACL 2019 // NLP+CSS 2019 // COLING 2018 // ACL 2018 // NAACL 2018 // EMNLP 2017 // *SEM 2017 // ACL 2017 // Journal of Natural Language Engineering (JNLE) 2016 // COLING 2016 // LREC 2016 // EMNLP 2016 // *SEM 2016 // EMNLP 2015 // *SEM 2015 // CICLing 2015 // Journal of Language Resources and Evaluation (LREV) 2014 // EMNLP 2014 // COLING 2014 // WMT 2014 // LREC 2014 // WMT 2013 // Journal of Language Resources and Evaluation (LREV) 2013 // IJCNLP 2013 // *SEM 2013 // Journal of Computer Science and Technology (JCST) 2013 // WMT 2012 // EACL 2012 // LREC 2012 // ACM TIST Journal, Special Issue on Paraphrasing 2011 // EMNLP 2011 // TextInfer 2011 // COLING 2010 // EMNLP 2009 // AAAI 2008


Contact

Twitter

Facebook

LinkedIn

Mastodon