The CooccurrenceFieldSampler (CFS) was developed to sample from corpora for lexicographical data analysis. It works with corpora from different sources, text types, or years. With random (opportunistic) sentence sampling, corpora that contain different text types and text lengths per source often cannot be mixed well, because they usually differ in size and in topic weighting, among other things. The CFS was designed to solve this problem.
The CFS first calculates all co-occurrences for all tokens within sentences, separately for each source. The sources are then combined in a 1:1 mixture and the co-occurrences are recalculated for the entire data set. The tool then evaluates which co-occurrences disappear and which new ones emerge; the result is a set of quotas that control the random, sentence-by-sentence mixing of the corpora.
The end result is a sentence-based corpus that (A) strives to retain the maximum number of co-occurrences from all sources (as accurately as possible) and (B) minimises the rejection of corpus data.
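The comparison step described above can be illustrated with a small Python sketch. It is not the CFS implementation: the function names, the plain frequency threshold, and the way the 1:1 mixture is built (the same number of sentences from every source) are assumptions made for illustration, and the sketch ignores the significance test and the quota calculation entirely.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentences, min_frequency=1):
    """Return the unordered token pairs that share a sentence at least min_frequency times."""
    counts = Counter()
    for sentence in sentences:
        # Each distinct pair is counted once per sentence.
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_frequency}


def compare_mixture(sources, min_frequency=1):
    """Compare per-source co-occurrences with those of a 1:1 mixture (hypothetical helper)."""
    per_source = {name: cooccurrences(s, min_frequency) for name, s in sources.items()}
    # Assumed 1:1 mixture: take the same number of sentences from every source.
    n = min(len(s) for s in sources.values())
    mixed = [sentence for s in sources.values() for sentence in s[:n]]
    mixed_pairs = cooccurrences(mixed, min_frequency)
    before = set().union(*per_source.values())
    lost = before - mixed_pairs   # present in some source, gone in the mixture
    new = mixed_pairs - before    # only reach the threshold once sources are combined
    return lost, new


# Toy data, invented for this example.
sources = {
    "news": [["bank", "money", "loan"], ["bank", "money"], ["bank", "river"], ["bank", "river"]],
    "blog": [["bank", "loan"], ["river", "boat"]],
}
lost, new = compare_mixture(sources, min_frequency=2)
```

In this toy data the pair bank/river is lost because the larger source is cut down for the mixture, while bank/loan only reaches the threshold once both sources are combined. Differences of this kind are what the CFS turns into mixing quotas.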
To use the CFS tool, follow these steps:
- Unzip the ZIP file containing the necessary files.
- For Windows, Linux, and macOS, you will find precompiled binaries that run exclusively on x64 processors.
- If you are using a different processor type, such as ARM or ARM64, please use the Universal folder.
- Windows users should run "cfs.exe" directly.
- Linux and macOS users must first mark the cfs file as executable (for example with "chmod +x cfs").
- If using the Universal version, ensure the .NET 10.0 runtime is installed. You can then run the program with "dotnet cfs.dll".
- To display help information, use the --help parameter.
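The choice between the precompiled binaries and the Universal version can be sketched in Python as follows. This is only an illustration of the steps above: the function name `launch_command` and the set of x64 machine names are assumptions, and the folder in which the files sit after unzipping is not covered, so the commands are given without paths.

```python
import platform

# Names that platform.machine() commonly reports for x64 processors (assumption).
X64_NAMES = {"amd64", "x86_64", "x64"}


def launch_command(system=None, machine=None):
    """Return the command that starts CFS on the given (or current) platform."""
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    if machine not in X64_NAMES:
        # ARM, ARM64 and other processors: use the Universal version via .NET.
        return ["dotnet", "cfs.dll"]
    if system == "Windows":
        return ["cfs.exe"]
    if system in ("Linux", "Darwin"):
        # The cfs file must be marked as executable beforehand.
        return ["./cfs"]
    # Any other operating system: fall back to the Universal version.
    return ["dotnet", "cfs.dll"]
```

Appending "--help" to the returned list gives the command that displays the help information.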
Help/Parameters:
--from (Default: cec / recommended: cec) import file format (valid: cec, bnc, catma, clan, conll, cora, cwd, dewac, dta, folia, fln, korap, leipzig, xces, relannis, salt, json, sketch, speedy, tiger, tlv, treetagger, tsv, txm, weblicht)
--input (Default: input/) folder with input-files (mix per file)
--to (Default: cec / recommended: cec) export file format (valid: cec, catma, conll, cwd, csv, dta, folia, i5, korap, xces, plain, salt, json, sketch, speedy, tlv, tsv, treetagger, txm, weblicht)
--layer (Default: Wort) use this layer to calculate the co-occurrences
--output (Default: output.cec6) output file (every round and logfile)
--minFrequency (Default: 1 / recommended: 5) min. absolute frequency
--minSignificance (Default: 1.0 / recommended: 1.0) min. significance (poisson distribution)
--minChangeRate (Default: 0.1 / recommended: 0.1) min. change rate
--maxRounds (Default: 10 / recommended: 5) max. number of rounds
--help Display this help screen.
--version Display version information.
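As an example of how these parameters fit together, the following Python sketch assembles an argument list that uses the recommended values. The helper `build_cfs_args` is hypothetical, and it assumes that each option is passed as "--name value" separated by a space, which the help text does not state explicitly.

```python
def build_cfs_args(
    input_dir="input/",
    output="output.cec6",
    layer="Wort",
    source_format="cec",
    target_format="cec",
    min_frequency=5,        # recommended value (default: 1)
    min_significance=1.0,   # recommended value and default
    min_change_rate=0.1,    # recommended value and default
    max_rounds=5,           # recommended value (default: 10)
):
    """Build the CFS argument list; option names are taken from the help text."""
    return [
        "--from", source_format,
        "--input", input_dir,
        "--to", target_format,
        "--layer", layer,
        "--output", output,
        "--minFrequency", str(min_frequency),
        "--minSignificance", str(min_significance),
        "--minChangeRate", str(min_change_rate),
        "--maxRounds", str(max_rounds),
    ]


# Example for the Universal version; on x64 Windows the prefix would be ["cfs.exe"].
command = ["dotnet", "cfs.dll"] + build_cfs_args()
```

The resulting list could be passed to `subprocess.run(command, check=True)` from the folder that contains the CFS files.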
Supported corpus formats (input/output):
cec - CorpusExplorer Corpus (v6) - http://corpusexplorer.de
bnc - British National Corpus - http://www.natcorp.ox.ac.uk/
catma - CATMA (Computer assisted text markup and analysis) - https://catma.de/
clan - CLAN/CHILDES - https://talkbank.org/childes/
conll - CoNLL-U - https://universaldependencies.org/format.html
cora - CORA XML - https://cora.readthedocs.io/en/latest/coraxml/
cwd - IMS Open Corpus Workbench (CWB) - https://cwb.sourceforge.io/
dewac - deWaC (WaCky) - https://wacky.sslmit.unibo.it/doku.php?id=corpora
dta - DTA TCF-XML - https://www.deutschestextarchiv.de/download
folia - FoLiA XML - https://proycon.github.io/folia/
fln - Folker/OrthoNormal - https://exmaralda.org/de/folker-de/
korap - KorAP - http://korap.ids-mannheim.de/
leipzig - Wortschatz Leipzig - https://wortschatz.uni-leipzig.de/en/download/
xces - XCes XML - http://www.xces.org/ / https://www.cs.vassar.edu/CES/
relannis - relANNIS (ANNIS) - https://corpus-tools.org/annis/
salt - Salt - https://corpus-tools.org/archive-2015-2025/salt/
json - JSON - https://de.wikipedia.org/wiki/JSON
sketch - SketchEngine VERT - https://www.sketchengine.eu/glossary/vertical-file/
speedy - SPEEDy Annotation Editor - http://kups.ub.uni-koeln.de/id/eprint/55224
tiger - TiGER-XML - https://www.ims.uni-stuttgart.de/documents/ressourcen/werkzeuge/tigersearch/doc/html/TigerXML.html
tlv - TLV-XML
treetagger - TreeTagger - https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
tsv - Tab-separated values - https://en.wikipedia.org/wiki/Tab-separated_values
txm - TXM - https://txm.gitpages.huma-num.fr/textometrie/?lang=en
weblicht - Weblicht - https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/Main_Page.html
csv - Comma-separated values - https://en.wikipedia.org/wiki/Comma-separated_values
i5 - i5-XML - https://www.ids-mannheim.de/en/digspra/pb-s1/projects/corpus-development/ids-text-model/
plain - Plaintext - https://en.wikipedia.org/wiki/Plain_text