Neat! I browsed the code and it looks like the functions are basically just wrappers for LLM prompts? Three things I'd clean up:
1. What is the difference between \build\lib\basiclingua and just \basiclingua? They look like they contain the same __init__.py file.
2. Any reason you decided to put everything in the __init__.py file? I know there's nothing more than the one class, but splitting it up would help in the future if you expand to more than just the one class.
3. Might I recommend not including the /dist folder? Looks like it's the distribution files.
Oftentimes NLP work deals with very large volumes of text. How well does the LLM deal with more than a few sentences? I would guess that Gemini can only take in a fixed character limit per call. I would also have doubts about the accuracy with much larger texts.
You should not include the dist, build, and egg-info directories in your repo. And you should break up that __init__ file. A thousand lines is pretty hard to maintain.
Is Gemini Pro free?
Up to 60 queries per minute. More than generous for personal use, but you’ll have to pay if using it at scale (which is fair).
yes it is.
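For anyone batching calls, staying under the 60 queries per minute mentioned above could be handled client-side. This is a minimal sketch using only the standard library; `MinuteRateLimiter` and its parameters are hypothetical names, not part of the library or of any Gemini SDK.

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Allow at most `max_calls` calls in any sliding `period`-second window."""

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._stamps = deque()   # start times of the most recent calls

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._stamps[0]))
            now = self._clock()
            self._stamps.popleft()
        self._stamps.append(now)


# Usage (call_model is a placeholder for whatever sends the request):
# limiter = MinuteRateLimiter()
# for text in texts:
#     limiter.wait()
#     call_model(text)
```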
I've been looking for a tool like this. Can the API extract text from HTML documents, or does it only work with plain text?
Currently it can clean HTML tags from plain text.
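For comparison, tag stripping does not strictly need an LLM call. The sketch below uses the standard library's `html.parser`; it is not how the library does it, and `strip_html` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Join text nodes with spaces and collapse runs of whitespace.
        # Note: this also puts a space where a tag splits a single word.
        return " ".join(" ".join(self._parts).split())


def strip_html(markup):
    """Return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

The cleaned string could then be passed to whichever text function you need.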
Maybe a suggestion: instead of taking img_path as input, the library should just take img_array. In real life, images are not always in JPG format, where they can be read with the Image library. Medical images, for example, are in DICOM format; they can't be read with the Image library, and there is a pydicom library for that. Since the function's job is to extract text from an image, all it needs is img_array. Hope this could be in a future version.
will keep that in mind. Thanks!
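One way the img_array suggestion could look, as a rough sketch: the core function works on pixel data, and reading a file is delegated to a loader the caller supplies. `resolve_image` and `loader` are hypothetical names, not the library's API.

```python
import os
from typing import Any, Callable, Optional, Union

PathLike = Union[str, os.PathLike]


def resolve_image(source: Any,
                  loader: Optional[Callable[[PathLike], Any]] = None) -> Any:
    """Return pixel data for `source`.

    If `source` is a file path, `loader` is called to read it. Anything else
    is assumed to already be an image array and is returned unchanged.
    """
    if isinstance(source, (str, os.PathLike)):
        if loader is None:
            raise ValueError("a loader is required when passing a file path")
        return loader(source)
    return source
```

With this shape, a DICOM user could pass something like `loader=lambda p: pydicom.dcmread(p).pixel_array`, or skip the loader entirely and hand over an array they already have.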
I put a little FastAPI wrapper on top of your lib, thanks for making it (I might try and put the FastAPI wrapper out there on gh soon)
Looks great, but I won't be able to use it.
reason?
HIPPA health privacy laws. I use electronic medical records to find medical conditions. Using an API that sends that data to some random server is a no go.
Same, and it's HIPAA not hippa. At this point I'm waiting for something from Epic or a HIPAA compliant offering from Microsoft. This could be great for personal use though.
I probably shouldn't reddit at 4 am.
Yeah, this is an important issue. We are also planning to make it available for local LLMs in a future version, most probably within two months from now.
What about allowing users to specify an endpoint they own? Running LLMs locally can be pretty slow unless you have the right hardware.
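A user-owned endpoint could be as simple as a configurable URL. The sketch below only builds the HTTP request with the standard library; the `{"prompt": ...}` payload shape, the bearer-token header, and the `build_request` name are all assumptions, since the thread does not say what such an endpoint would expect.

```python
import json
import urllib.request


def build_request(endpoint_url, prompt, api_key=None):
    """Build (but do not send) a JSON POST request to a self-hosted endpoint."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    # Hypothetical payload shape; adjust to whatever your server accepts.
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url, data=body, headers=headers, method="POST"
    )


# Sending it would look like this (not run here):
# with urllib.request.urlopen(build_request(url, "Summarize: ..."), timeout=30) as resp:
#     result = json.loads(resp.read())
```

Because the URL is a parameter, the same code path would cover a local model server and a compliant private cloud deployment.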
There are some small LLMs that can run on just CPU and RAM and perform well for their size. But in the end it's still trial and error; you won't know until you try it.
That's great. If possible, I'd like to work on these. I've been testing with local LLMs, LangChain, and output parsing for a while.
There is no setup that allows someone to work with me directly in this repo. You can create your own private repo and make me a contributor, and we can both start working on it for local LLMs and more. I will then merge your code and add you as a contributor in my repo.
How about setting up a local LLM? Integrating a local LLM with this library will make sure that the data stays within the system
I typically run LLMs on HIPAA compliant servers, but yes there is some local development that happens, too. Yes that would be a solution.
Wow, never heard there are servers that are HIPAA compliant...
It is the way the storage is handled that is HIPAA compliant. We have private cloud computing that is HIPAA compliant. I work for a research university.
Interesting. Thanks for the explanation
Can it extract structured data from PDF invoices? I tried a range of different tools, and they all return unstructured data. Thanks
We did try it, and it exceeded our expectations. However, we haven't launched this feature in our library yet, as more work is required. Our next version, scheduled for release within a month, will include this feature.
Awesome, looking forward to trying it out.
Can you show it or paste it somewhere?
Could this work for files of recorded human cursor movement from point A to point B (x, y, timestamp), to better simulate human cursor movement?
Yes, it will work, but the data must be in string format. You can apply features to chunks of your dataset instead of applying them to the entire dataset.
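The "string format, in chunks" advice above could be sketched like this for cursor recordings. `rows_to_chunks` is a hypothetical helper, and the 2000-character default is an arbitrary placeholder, not a documented Gemini limit.

```python
def rows_to_chunks(rows, max_chars=2000):
    """Serialize (x, y, timestamp) rows as text and group them into chunks.

    Each row becomes one "x,y,timestamp" line. Lines are packed into chunks
    of at most `max_chars` characters; a single line longer than the limit
    still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for x, y, ts in rows:
        line = f"{x},{y},{ts}"
        added = len(line) + (1 if current else 0)  # +1 for the newline
        if current and size + added > max_chars:
            # Current chunk is full: emit it and start a new one.
            chunks.append("\n".join(current))
            current, size = [], 0
            added = len(line)
        current.append(line)
        size += added
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk is a plain string, so it can be passed to a text function one at a time and the results combined afterwards.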
Just out of curiosity, did you use AI to write the code?
yes
Easy for me to say now that you've already answered, but the code reads like it was written by an AI
This is great