3.3 Programming challenge: indexing a file https://www.futurelearn.com/courses/functional-programming-erlang/3/steps/488124/
-module(index).
-export([create/1]).

%% Build an index for the named file: a map from each word
%% (uppercased, as a binary) to the list of line numbers on which
%% it occurs.
create(Name) ->
    {ok, File} = file:open(Name, [read]),
    create(File, 1, #{}).

%% Read the file line by line, folding every word of each line
%% into the accumulated map.
create(File, Line_number, Map) ->
    case io:get_line(File, "") of
        eof ->
            file:close(File),
            Map;
        Line ->
            create(File, Line_number + 1,
                   lists:foldl(
                     fun(<<>>, New_map) ->
                             %% Skip empty fragments left behind by
                             %% leading separators.
                             New_map;
                        (Word, New_map) ->
                             maps:update_with(
                               string:uppercase(Word),
                               fun(Lines) -> [Line_number | Lines] end,
                               [Line_number],
                               New_map)
                     end,
                     Map,
                     %% Split on runs of non-alphanumeric characters;
                     %% the original pattern "\s *|[[:^alnum:]]" left
                     %% empty fragments between adjacent separators.
                     re:split(Line, "[[:^alnum:]]+", [trim])))
    end.
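A quick usage sketch in the shell (the file name is just an example):

1> c(index).
{ok,index}
2> Index = index:create("example.txt").
3> maps:get(<<"THE">>, Index, []).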
Please excuse the delay. I couldn't get to work on the assignment during the last week. As it stands, this is a minimum viable product (“make it work”) with a lot of room for improvement.
I renamed get_file_contents/1 and get_all_lines/2 to create/1 and create/3, and modified them to read the file and compute the index in one go.

Questions
Is it possible to replace the lists:foldl in create/3 with a foreach?
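(My own partial answer: lists:foreach/2 exists only for its side effects and always returns ok, so it cannot thread the growing map through the iteration; a fold, or a mutable store such as an ETS table, seems necessary here.)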
Some quick notes on the work remaining ahead:
TODO
Removing all short words (e.g. words of length less than 3) and all common words (you'll have to think about how to define these).
For short words, modify the regexp string to also filter them out. For common words, provide a means to define an arbitrary collection of words to ignore when indexing; a possible filter is sketched below.
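A rough sketch of such a filter, under my own assumptions (Stop_words as a sets set of uppercased binary words; the names are illustrative):

%% Keep a word only if it is long enough and not in the stop-word
%% set, which might be built as:
%%   Stop_words = sets:from_list([<<"THE">>, <<"AND">>, <<"OF">>]).
keep_word(Word, Stop_words) ->
    byte_size(Word) >= 3 andalso
        not sets:is_element(Word, Stop_words).

This predicate could guard the maps:update_with/4 call inside the fold.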
Sorting the output so that the words occur in lexicographic order
Erlang already orders characters by their integer codes, which will come in handy; but further normalisation will be required if I take accents, other diacritic marks, and even related alphabets into account.
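A minimal sorting sketch, assuming the map produced by create/1 (sorted_index/1 is my own name for the helper):

%% Maps are unordered, so convert to {Word, Lines} pairs and sort.
%% The default term order compares binaries byte by byte, which is
%% lexicographic for ASCII. The line lists were accumulated in
%% reverse, so restore ascending order at the same time.
sorted_index(Map) ->
    lists:sort([{Word, lists:reverse(Lines)}
                || {Word, Lines} <- maps:to_list(Map)]).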
Normalising the words so that capitalised ("Foo") and non-capitalised versions ("foo") of a word are identified.
I am already uppercasing everything, so you might consider this done.
Normalising so that common endings, plurals etc. are identified.
This would be nice! I'd have to read up on the rules for English plurals. We'd need to distinguish, for example, "kiss" and "dress" as singular words and not as plurals of "kis" and "dres"; a naive attempt is sketched below.
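A naive, purely illustrative first cut (singularise/1 is a hypothetical helper; real English needs many more rules):

%% Drop one trailing "S" unless the word ends in "SS", so the
%% uppercased "KISS" and "DRESS" are left alone while "CATS"
%% becomes "CAT".
singularise(Word) when is_binary(Word) ->
    case lists:reverse(binary_to_list(Word)) of
        [$S, $S | _] -> Word;
        [$S | Rest]  -> list_to_binary(lists:reverse(Rest));
        _            -> Word
    end.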
(Harder) Thinking how you could make the data representation more efficient than the one you first chose. This might be efficient for lookup only, or for both creation and lookup.
This is starting to look more and more like a compression problem; one idea is sketched below.
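One possibility, as a sketch (ranges/1 is my own helper, and it assumes each word's line list has first been sorted and deduplicated, e.g. with lists:usort/1): collapse the list into {From, To} ranges.

%% [3,4,5,7] becomes [{3,5},{7,7}], which is more compact for
%% words that occur on runs of consecutive lines.
ranges([N | Ns]) -> ranges(Ns, N, N);
ranges([])       -> [].

ranges([N | Ns], From, To) when N =:= To + 1 ->
    ranges(Ns, From, N);
ranges([N | Ns], From, To) ->
    [{From, To} | ranges(Ns, N, N)];
ranges([], From, To) ->
    [{From, To}].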
Can you think of other ways that you might extend your solution?
An options list could select these behaviours, e.g. [{ignore, my_dictionary}, {min_word, 3}, distinguish_plurals, ignore_case, {order, order_fun}], etc.; a sketch of one such option follows.
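A hypothetical create/2 honouring one of those options (the option name and default are my assumptions, not a fixed API):

%% Build the index, then drop the words shorter than min_word.
create(Name, Options) ->
    Map = create(Name),
    Min = proplists:get_value(min_word, Options, 1),
    maps:filter(fun(Word, _Lines) -> byte_size(Word) >= Min end, Map).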